
Naïve Bayesian filtering might not work very well on this kind of text. It basically looks like a regular comment until you start recognizing that it always follows the same pattern. Your basic Bayesian classifier throws all of the words into a set before analyzing them, which loses all information about patterns and word order. The words are treated as "independent", which means that even though the template might generate the words "pretty worth bloggers content online" every time it uses the first template, the naïve Bayesian classifier will never figure that part out.
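To make that concrete, here's a minimal sketch of the bag-of-words step (the function name is mine, not from any particular filter). Two comments with the same words in a completely different order produce identical feature bags, so the classifier literally cannot tell them apart:

```python
from collections import Counter

def bag_of_words(text):
    # Tokenize and discard word order -- this is all a naive
    # Bayesian classifier sees of the comment.
    return Counter(text.lower().split())

a = bag_of_words("pretty worth bloggers content online")
b = bag_of_words("online content worth pretty bloggers")

# The two orderings are indistinguishable to the classifier.
print(a == b)  # True
```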

My suspicion is that existing Bayesian classifiers have pushed the spammers to develop more natural-seeming templates, like this one.



You can run a Bayesian filter over pairs (or even triplets) of words (although this could cause the probabilities to be a bit off, because pairs like "although this" and "this could" are not truly independent). The downside is that as you do this, you drastically increase the size of the model and the amount of training data required.
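A sketch of the pair/triplet idea: instead of feeding single words into the filter, you feed it n-grams. Each bigram becomes one "token" for the Bayesian model, so local word order survives (at the cost of a much larger vocabulary, as noted above):

```python
def ngrams(tokens, n=2):
    # Slide a window of length n over the token list, keeping the
    # local word order that a plain bag-of-words model discards.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "pretty worth bloggers content online".split()
print(ngrams(tokens))
# ['pretty worth', 'worth bloggers', 'bloggers content', 'content online']
print(len(ngrams(tokens, 3)))  # 3
```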

Bayesian filters can also take more than just the words in the text into account - for example, they can take the submitting IP address (or perhaps /24 or ASN) into account, or a spam classification from external sources.
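One common trick for this is to encode the non-textual signals as pseudo-tokens so they flow through the same Bayesian machinery as words. A hedged sketch (the `ip24:` / `ext_spam:` token names are made up for illustration):

```python
import ipaddress

def extra_features(ip, external_spam_score=None):
    # Encode the submitting IP's /24 network as a pseudo-token, so the
    # filter learns per-network spam probabilities exactly like it
    # learns per-word probabilities.
    net = ipaddress.ip_network(f"{ip}/24", strict=False)
    feats = [f"ip24:{net.network_address}"]
    if external_spam_score is not None:
        # Classification from an external source, bucketed into a token.
        feats.append(f"ext_spam:{external_spam_score}")
    return feats

print(extra_features("203.0.113.57"))  # ['ip24:203.0.113.0']
```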

There are certainly better methods that could be built for recognising unknown templates - a simple known-state Markov model would be sufficient for the cases where templates substitute one word at a time, and you could conceivably use an unsupervised learning algorithm to discover an unknown number of models from a large corpus of comments.
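For the one-word-at-a-time substitution case, the "known-state Markov model" degenerates into something very simple: a fixed chain of states where each state either emits one specific word or acts as a wildcard slot. A toy sketch of matching against one such known template (the `*` slot convention is mine):

```python
def matches_template(template, comment):
    # Each template position is a state: '*' states accept any single
    # word (the substituted slot); every other state must emit exactly
    # its own word. This is a degenerate Markov chain with one path.
    t, c = template.split(), comment.split()
    if len(t) != len(c):
        return False
    return all(ts == "*" or ts == cs for ts, cs in zip(t, c))

tpl = "pretty * worth it blogger content online"
print(matches_template(tpl, "pretty component worth it blogger content online"))  # True
print(matches_template(tpl, "a genuinely original comment"))  # False
```

Discovering the templates themselves, rather than checking known ones, is where the unsupervised approach mentioned above would come in.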


Something like a Hidden Markov Model might be better then? That way you can keep the information about word order.
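Even before reaching for a full HMM (with hidden states), just learning visible word-to-word transition counts from known spam captures the order information a bag-of-words model drops. A minimal sketch under that simplification:

```python
from collections import defaultdict

def train_transitions(corpus):
    # Count word-to-word transitions in known spam comments. A full
    # HMM would add hidden states and emission probabilities; even
    # these visible-state transition counts preserve word order.
    counts = defaultdict(lambda: defaultdict(int))
    for text in corpus:
        words = text.lower().split()
        for a, b in zip(words, words[1:]):
            counts[a][b] += 1
    return counts

spam = ["pretty worth bloggers content online",
        "pretty worth writers content online"]
trans = train_transitions(spam)
print(trans["pretty"]["worth"])  # 2
```

Template-generated comments would score suspiciously high under these transitions, since the fixed parts of the template repeat the same word sequences every time.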




