[NCLUG] looking for patterns in text

Anthony Foiani tkil at scrye.com
Tue Sep 10 10:09:46 MDT 2013


John Gilmore <j.arthur.gilmore at gmail.com> writes:

> My first impulse would be to start with a statistical filter, the same
> sort often used to filter spam. "bayes" is the keyword you'd want.

n-grams might also be applicable here, especially for detecting direct
cut-and-paste.  Apparently they can be combined with the Bayesian
approach as well.
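
For what it's worth, here is a minimal sketch of the n-gram idea in
Python.  It is only an illustration under my own assumptions: the
function names (word_ngrams, jaccard), the choice of word trigrams, the
sample strings, and the use of Jaccard similarity as the overlap
measure are all mine, not something from the thread or from any
particular tool.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams (shingles) found in text."""
    words = text.lower().split()
    # Slide a window of n words across the text; a set drops repeats.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of overlap / size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical sample texts, made up for this example.
original = "the quick brown fox jumps over the lazy dog near the river"
pasted = "yesterday the quick brown fox jumps over the lazy dog again"
unrelated = "n-grams are short runs of adjacent words in a document"

score_pasted = jaccard(word_ngrams(original), word_ngrams(pasted))
score_unrelated = jaccard(word_ngrams(original), word_ngrams(unrelated))

print("pasted:   ", round(score_pasted, 3))
print("unrelated:", round(score_unrelated, 3))
```

A text that shares a long copied run with the original scores well
above one that shares nothing, since every copied window of three words
shows up in both sets.  The same n-grams could presumably be fed to a
Bayesian classifier as features instead of single words, though I have
not tried that here.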

Although, doing a quick bit of research to avoid sounding like a
*complete* idiot, I did find this:

  http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.116.4413&rep=rep1&type=pdf

It seems to tackle exactly this problem.  The OP might find hints
there.

t.

