[NCLUG] looking for patterns in text

Joshua Datko jbdatko at gmail.com
Mon Sep 9 18:15:07 MDT 2013


It sounds like you need a "classifier."  As John and Quentin
recommended, a Bayesian classifier is a good approach.  Your
"training set" is your standard, and you then run a "test set" against
it.  However, your training set is only one document, so I'd be
interested to know how well a Bayesian classifier handles your
inputs.
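
To make the training set / test set idea concrete, here is a minimal
sketch of a multinomial naive Bayes classifier in plain Python.  It is
an illustration only, not how Weka or any other toolkit does it.  The
class name, the labels, and the sample sentences are all made up, and
it assumes you can collect at least a few examples of "original"
feedback as well, since a classifier needs more than one class to
compare against.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()           # every word seen during training

    def train(self, label, text):
        words = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def log_scores(self, text):
        """Return {label: log P(label) + sum of log P(word | label)}."""
        total_docs = sum(self.doc_counts.values())
        words = tokenize(text)
        scores = {}
        for label, counts in self.word_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            # Add-one smoothing keeps unseen words from zeroing the score.
            denom = sum(counts.values()) + len(self.vocab)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical training data: two form-letter variants, two original notes.
nb = NaiveBayes()
nb.train("boilerplate", "I strongly oppose the proposed bridge project "
                        "because it will harm the river")
nb.train("boilerplate", "I strongly oppose the proposed project and urge "
                        "you to protect the river")
nb.train("original", "My kids walk to school along that road and the "
                     "traffic already scares me")
nb.train("original", "Could the new trail connect to the park near my house")

# The "test set": text the classifier has not seen before.
print(nb.classify("I oppose the proposed bridge project, "
                  "it will harm the river"))
print(nb.classify("The traffic near my house is bad, "
                  "my kids walk to school"))
```

The gap between the two log scores gives a rough sense of how strongly
a piece of feedback leans toward the boilerplate, but it grows with
the length of the text, so I'd treat it as a starting point rather
than a finished similarity measure.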

There are several machine learning toolkits out there that cover a
range of algorithms.  I've used Weka (in Java) for research:
http://www.cs.waikato.ac.nz/ml/weka/ and it contains many classifiers.

Apparently, people like PyML for Python: http://pyml.sourceforge.net/
but I've never used it.

Josh

On Mon, Sep 9, 2013 at 5:59 PM, Quentin Hartman <qhartman at gmail.com> wrote:
> I tinkered with some Bayesian stuff for a previous employer. This is likely
> a very good approach to this problem (the spam filter as a base is a great
> suggestion) but be warned that if you can't find something reasonably off
> the shelf you will be entering the world of Real Math(tm), and so if you
> don't have a very strong mathematical background it will likely make your
> brain hurt. It is not for the faint of heart.
>
> QH
>
>
>
>
> On Mon, Sep 9, 2013 at 5:49 PM, John Gilmore <j.arthur.gilmore at gmail.com> wrote:
>
>> My first impulse would be to start with a statistical filter, the same
>> sort often used to filter spam. "bayes" is the keyword you'd want.
>>
>> On Mon, Sep 9, 2013 at 4:10 PM, Mike Cullerton <michaelc at cullerton.com>
>> wrote:
>> > Hey Folks,
>> >
>> > I'm helping a neighbor learn python, and we're using a problem they have
>> > at work.
>> >
>> > They have text they want to parse, and compare to a known standard. They
>> > want to sort the text based on how similar it is to the standard.
>> >
>> > They receive feedback from the public during engineering projects. Some
>> > of this feedback is original. Some is copy/pasted from form letter
>> > boilerplate. They'd like to parse the feedback text and sort it based on
>> > how similar it is to the boilerplate.
>> >
>> > I'm guessing there's work out there already on this kind of stuff.
>> >
>> > I've done some basic searches, but I'm not getting what I want. I'm
>> > hoping someone here knows some terms I can use to get started on my
>> > searching.
>> >
>> > Any thoughts welcome.
>> >
>> > Thanks,
>> > Mike
>> >
>> > _______________________________________________
>> > NCLUG mailing list       NCLUG at lists.nclug.org
>> >
>> > To unsubscribe, subscribe, or modify
>> > your settings, go to:
>> > http://lists.nclug.org/mailman/listinfo/nclug