Thursday, January 22, 2009

Spam, Ham, PHP, and Bayesian filtering

Most of the time I just want to google a packaged concept, find it, and employ it.

When I had the task of sorting free-form text answers into a set of categories, I first remembered a project to do just that, written 30 years ago in Snobol 4. Well, this time I've got the whole web, PHP, and MySQL. Then I thought of Bayesian filters (as used in spam filters).

Google showed me a simple implementation in PHP, and I started from there. Really, though, I didn't want a spam filter (a filter whose object is to identify stuff that you don't want). I wanted ham filters, to identify what I did want. The difference is only semantic, but with all the identifiers in the program giving the wrong voice, it became intolerable. I also wanted a gang of filters that operated in sync with each other.

The categories for the answers might be: Positive, Suggestion, Negative, Neutral, and Other (spam?). So I created a code (small integer) for each category and augmented the database tables to include this code. Now when you decide that an answer is Positive, you also decide that the answer is NOT any of the others. The operations of learning and unlearning (when you change a code) cycle through all the codes and insert the new answer as Ham or as Not Ham.
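The learn and unlearn cycle can be sketched roughly like this. It is in Python rather than PHP, and everything named here (HamFilter, learn, unlearn, recode, the whitespace tokenizer, and the in-memory counters standing in for the MySQL tables) is a hypothetical illustration, not the actual code.

```python
from collections import Counter

# Category codes as described in the post; the integer values are assumed.
CODES = {1: "Positive", 2: "Suggestion", 3: "Negative", 4: "Neutral", 5: "Other"}


class HamFilter:
    """Token counts for one category: answers that are Ham vs. Not Ham."""

    def __init__(self):
        self.ham = Counter()
        self.not_ham = Counter()
        self.ham_docs = 0
        self.not_ham_docs = 0

    def update(self, tokens, is_ham, sign):
        # sign is +1 to learn an answer and -1 to unlearn it.
        counts = self.ham if is_ham else self.not_ham
        for token in tokens:
            counts[token] += sign
        if is_ham:
            self.ham_docs += sign
        else:
            self.not_ham_docs += sign


def tokenize(text):
    # Deliberately naive: lowercase and split on whitespace.
    return text.lower().split()


def learn(filters, text, code, sign=1):
    # Cycle through every code: Ham for the chosen one, Not Ham for the rest.
    tokens = tokenize(text)
    for c, f in filters.items():
        f.update(tokens, is_ham=(c == code), sign=sign)


def unlearn(filters, text, code):
    # Reverse an earlier learn() call made with the same text and code.
    learn(filters, text, code, sign=-1)


def recode(filters, text, old_code, new_code):
    # Changing a code is an unlearn under the old code plus a learn under the new.
    unlearn(filters, text, old_code)
    learn(filters, text, new_code)


filters = {code: HamFilter() for code in CODES}
learn(filters, "Great service", 1)      # coded Positive
recode(filters, "Great service", 1, 3)  # a reviewer changes it to Negative
```

Keeping one filter per code means each filter only ever answers a yes/no question, which is what lets a plain two-class spam filter be reused unchanged.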

Consulting the filters to determine the coding for a new answer amounts to evaluating all the filters and taking the largest rating. If none of them produces a significant rating, I initially assign the answer to the last code. The web page allows users (marketers, they probably are) to view all answers in a given code (plus all the answers whose codes are unverified), see if any are miscoded, and if so, change them.
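The selection rule might look something like the following, again sketched in Python rather than PHP. The function name assign_code, the 0.9 cutoff for a "significant" rating, and the use of 5 as the last code are assumptions for illustration. The ratings are taken as given, one per filter, on a 0 to 1 scale.

```python
def assign_code(ratings, threshold=0.9, fallback_code=5):
    """Pick the code whose filter gave the largest rating.

    ratings maps each category code to that filter's rating (assumed 0..1).
    If even the best rating is below the threshold, fall back to the last code.
    """
    best_code = max(ratings, key=ratings.get)
    if ratings[best_code] < threshold:
        return fallback_code
    return best_code


# One filter is confident, so its code wins.
confident = assign_code({1: 0.95, 2: 0.20, 3: 0.10, 4: 0.30, 5: 0.05})

# No filter is confident, so the answer lands in the last code for review.
unsure = assign_code({1: 0.50, 2: 0.40, 3: 0.45, 4: 0.30, 5: 0.10})
```

Answers assigned this way are only provisional, which is why the review page matters: a person verifies or changes the code.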

Do you need this too? Give me a shout.

Friday, January 09, 2009

Target Advertising

Advertising in closed groups is difficult because the group is closed, because no one wants spam, and because advertising so often equals spam that no one wants advertising. Yet everyone also wants to know, when they want to know.

In this respect a portal is often the best choice: a single place where members of the group can go to find advertising. Well, isn't that just like the yellow pages books? Pretty much, except that access is limited by paying for the advertising. This paying question is one of the issues that the open community Craigslist.com deals with.

In order to provide a free portal for the small group, I have created the experimental site http://BuyFromBahais.com. It serves as a place where other Baha'i businesses and artisans may post ads about themselves with links to their own sites. It is searchable and able to serve, like the yellow pages, as the place to start looking for products or services that are provided by Baha'is.

Interesting challenges always occur in public-facing web sites. Spammers, both human and computer, arrive to see what fun they can have at your expense. Therefore it must be made difficult for this to happen, and there must be a remedy when it does. The ability for the community to flag offending ads allows many eyes to participate in the monitoring of the web site.

Users may upload an image along with each ad they post. In addition, each user may upload a banner ad that will be displayed at the top of the web site from time to time.

Initial users are being invited to post ads by virtue of other announcements that they have made. Thus testing will proceed with motivated users.