
Naïve Bayesian filtering might not work very well on this kind of text. It basically looks like a regular comment until you start recognizing that it always follows the same pattern. Your basic Bayesian classifier throws all of the words into a set before analyzing them, which loses all information about patterns and word order. The words are treated as "independent", which means that even though the template might generate the words "pretty worth bloggers content online" every time it uses the first template, the naïve Bayesian classifier will never figure that part out.
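To make that concrete, here's a minimal sketch of the bag-of-words step (the function name is mine, not from any particular filter). Two comments with the same words in a completely different order produce identical feature bags, so the classifier literally cannot tell them apart:

```python
from collections import Counter

def bag_of_words(text):
    # Tokenize and discard word order -- this is all a naive
    # Bayesian classifier sees of the comment.
    return Counter(text.lower().split())

a = bag_of_words("pretty worth bloggers content online")
b = bag_of_words("online content worth pretty bloggers")

# The two orderings are indistinguishable to the classifier.
print(a == b)  # True
```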

My suspicion is that existing Bayesian classifiers have pushed the spammers to develop more natural-seeming templates, like this one.



You can run a Bayesian filter over pairs (or even triplets) of words (although this could cause the probabilities to be a bit off, because pairs like "although this" and "this could" are not truly independent). The downside is that as you do this, you drastically increase the size of the model and the amount of training data required.
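A sketch of the pair/triplet idea: instead of feeding single words into the filter, you feed it n-grams. Each bigram becomes one "token" for the Bayesian model, so local word order survives (at the cost of a much larger vocabulary, as noted above):

```python
def ngrams(tokens, n=2):
    # Slide a window of length n over the token list, keeping the
    # local word order that a plain bag-of-words model discards.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "pretty worth bloggers content online".split()
print(ngrams(tokens))
# ['pretty worth', 'worth bloggers', 'bloggers content', 'content online']
print(len(ngrams(tokens, 3)))  # 3
```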

Bayesian filters can also take more than just the words in the text into account - for example, they can take the submitting IP address (or perhaps /24 or ASN) into account, or a spam classification from external sources.
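One common trick for this is to encode the non-textual signals as pseudo-tokens so they flow through the same Bayesian machinery as words. A hedged sketch (the `ip24:` / `ext_spam:` token names are made up for illustration):

```python
import ipaddress

def extra_features(ip, external_spam_score=None):
    # Encode the submitting IP's /24 network as a pseudo-token, so the
    # filter learns per-network spam probabilities exactly like it
    # learns per-word probabilities.
    net = ipaddress.ip_network(f"{ip}/24", strict=False)
    feats = [f"ip24:{net.network_address}"]
    if external_spam_score is not None:
        # Classification from an external source, bucketed into a token.
        feats.append(f"ext_spam:{external_spam_score}")
    return feats

print(extra_features("203.0.113.57"))  # ['ip24:203.0.113.0']
```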

There are certainly better methods that could be built for recognising unknown templates - a simple known-state Markov model would be sufficient for the cases where templates substitute one word at a time, and you could conceivably use an unsupervised learning algorithm to discover an unknown number of models from a large corpus of comments.
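For the one-word-at-a-time substitution case, the "known-state Markov model" degenerates into something very simple: a fixed chain of states where each state either emits one specific word or acts as a wildcard slot. A toy sketch of matching against one such known template (the `*` slot convention is mine):

```python
def matches_template(template, comment):
    # Each template position is a state: '*' states accept any single
    # word (the substituted slot); every other state must emit exactly
    # its own word. This is a degenerate Markov chain with one path.
    t, c = template.split(), comment.split()
    if len(t) != len(c):
        return False
    return all(ts == "*" or ts == cs for ts, cs in zip(t, c))

tpl = "pretty * worth it blogger content online"
print(matches_template(tpl, "pretty component worth it blogger content online"))  # True
print(matches_template(tpl, "a genuinely original comment"))  # False
```

Discovering the templates themselves, rather than checking known ones, is where the unsupervised approach mentioned above would come in.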


Something like a Hidden Markov Model might be better then? That way you can keep the information about word order.
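Even before reaching for a full HMM (with hidden states), just learning visible word-to-word transition counts from known spam captures the order information a bag-of-words model drops. A minimal sketch under that simplification:

```python
from collections import defaultdict

def train_transitions(corpus):
    # Count word-to-word transitions in known spam comments. A full
    # HMM would add hidden states and emission probabilities; even
    # these visible-state transition counts preserve word order.
    counts = defaultdict(lambda: defaultdict(int))
    for text in corpus:
        words = text.lower().split()
        for a, b in zip(words, words[1:]):
            counts[a][b] += 1
    return counts

spam = ["pretty worth bloggers content online",
        "pretty worth writers content online"]
trans = train_transitions(spam)
print(trans["pretty"]["worth"])  # 2
```

Template-generated comments would score suspiciously high under these transitions, since the fixed parts of the template repeat the same word sequences every time.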




