Creative Words: Spam, Ham, PHP, and Bayesian filtering

Most of the time I just want to google a package or concept, find it, and employ it.

When I had the task of sorting free-form text answers into a set of categories, I first remembered a project that did just that, written 30 years ago in Snobol 4. Well, this time I've got the whole web, PHP, and MySQL. Then I thought of Bayesian filters (as used in spam filters).

Google showed me a simple implementation in PHP, and I started from there. Really, though, I didn't want a spam filter (a filter whose object is to identify the stuff you don't want); I wanted ham filters that identify what I do want. The distinction is slightly semantic, but with every identifier in the program speaking in the wrong voice, it became intolerable. I also wanted a gang of filters that operated in sync with each other.

The categories for the answers might be: Positive, Suggestion, Negative, Neutral, and Other (spam?). So I created a code (small integer) for each category and augmented the database tables to include this code. Now when you decide that an answer is Positive, you also decide that the answer is NOT any of the others. The operations of learning and unlearning (when you change a code) cycle through all the codes and insert the new answer as Ham or as Not Ham.
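The learn and unlearn cycle described above can be sketched roughly as follows. This is a minimal illustration in Python, not the PHP code from the project: the class names, the `recode` helper, the whitespace tokenizer, and the specific integer codes are all assumptions made for the example. Each code gets its own filter, and every answer is fed to every filter, as Ham for the chosen code and as Not Ham for all the others.

```python
from collections import Counter

# Hypothetical category codes; the post only says they are small integers.
CODES = {1: "Positive", 2: "Suggestion", 3: "Negative", 4: "Neutral", 5: "Other"}


def tokenize(text):
    """Lowercase and split on whitespace; a real tokenizer would do more."""
    return text.lower().split()


class HamFilter:
    """One filter per code: word counts for Ham and Not Ham answers."""

    def __init__(self):
        self.ham = Counter()
        self.not_ham = Counter()

    def update(self, words, is_ham, sign):
        """Add (sign=+1) or remove (sign=-1) one answer's words."""
        target = self.ham if is_ham else self.not_ham
        for word in words:
            target[word] += sign
            if target[word] <= 0:
                del target[word]  # keep the table free of empty entries


class FilterGang:
    """A set of filters, one per code, kept in sync with each other."""

    def __init__(self, codes):
        self.filters = {code: HamFilter() for code in codes}

    def _apply(self, text, code, sign):
        words = tokenize(text)
        # Cycle through all the codes: Ham for the chosen one, Not Ham for the rest.
        for filter_code, ham_filter in self.filters.items():
            ham_filter.update(words, is_ham=(filter_code == code), sign=sign)

    def learn(self, text, code):
        self._apply(text, code, +1)

    def unlearn(self, text, code):
        self._apply(text, code, -1)

    def recode(self, text, old_code, new_code):
        """Changing a code means unlearning the old one and learning the new one."""
        self.unlearn(text, old_code)
        self.learn(text, new_code)


gang = FilterGang(CODES)
gang.learn("great service", 1)      # coded as Positive
gang.recode("great service", 1, 2)  # a user corrects it to Suggestion
```

Keeping the counts per filter, rather than one shared table, is what lets a single change of code ripple through the whole gang in one pass.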

Consulting the filters to determine the code for a new answer amounts to evaluating all the filters and taking the largest rating. If none of them produces a significant rating, I initially assign the answer to the last code. The web page lets users (marketers, probably) view all the answers in a given code, plus all the answers whose codes are unverified, see whether any are miscoded, and change them if so.
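The selection rule can be sketched in a few lines. Again this is Python rather than the project's PHP, and the function name, the ratings dictionary, and the 0.6 threshold are hypothetical: the post does not say how a "significant" rating was defined, so the cutoff here is only a placeholder.

```python
def pick_code(ratings, threshold, fallback_code):
    """Return the code with the largest rating, or fallback_code
    when no filter's rating reaches the threshold."""
    if not ratings:
        return fallback_code
    best_code = max(ratings, key=ratings.get)
    if ratings[best_code] < threshold:
        return fallback_code
    return best_code


# Hypothetical ratings from five filters, keyed by category code.
confident = {1: 0.91, 2: 0.12, 3: 0.05, 4: 0.30, 5: 0.02}
uncertain = {1: 0.41, 2: 0.38, 3: 0.22, 4: 0.35, 5: 0.10}

print(pick_code(confident, threshold=0.6, fallback_code=5))  # 1
print(pick_code(uncertain, threshold=0.6, fallback_code=5))  # 5
```

Answers that land in the fallback code this way are good candidates for the review page, since no filter was confident about them.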

Do you need this too? Give me a shout.


Thursday, January 22, 2009
