I'm currently working on a learning project to extract an individual's name from their CV/Resume.
Currently I'm working with Stanford NER and OpenNLP, which both perform with a degree of success out of the box but tend to struggle on "non-western" type names (no offence intended towards anybody).
My question is: given the general lack of sentence structure or context around an individual's name in a CV/Resume, am I likely to gain any significant improvement in name identification by creating something akin to a CV corpus?
My initial thoughts are that I'd probably have more success by sentence splitting, removing obvious non-name text and applying a bit of logic to make a best guess at the individual's name.
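To illustrate the kind of thing I have in mind, here is a rough Python sketch. The filter patterns, the `guess_name` function name and the "2 to 4 capitalised words near the top" rule are all assumptions made up for this example, not a tested rule set.

```python
import re

# Hypothetical filters for "obvious" non-name lines (emails, URLs, digits, headings).
OBVIOUS_NON_NAME = re.compile(
    r"@|https?://|www\.|\d|\b(?:curriculum vitae|resume|cv)\b", re.IGNORECASE
)

def guess_name(text, max_lines=10):
    """Return the first line near the top that looks like a short capitalised name, else None."""
    # Split into non-empty, stripped lines (CV text rarely has real sentences).
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for line in lines[:max_lines]:
        if OBVIOUS_NON_NAME.search(line):
            continue  # drop lines that are clearly not a name
        words = line.split()
        # Best guess: 2 to 4 words, each starting with an upper-case letter.
        if 2 <= len(words) <= 4 and all(w[0].isupper() for w in words):
            return line
    return None

sample = "Curriculum Vitae\nAkbar Agho\nakbar@example.com\n12 High Street\n"
print(guess_name(sample))
```

Something like this would obviously be fooled by a job title or a company name on the first line, which is part of why I'm asking.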
I can see how training would work if a name appears within a structured sentence; however, as a standalone entity without context (Akbar Agho for example) I suspect it will struggle regardless of the training.
Is there a level of AI that, if given enough data, would begin to formulate a pattern for finding a name, or should I maybe just go for applying a level of logic-based string extraction?
I'd appreciate people's thoughts, opinions and suggestions.
Side note: I have been using PHP with Apache Tika to do the initial text extraction from Doc/Pdf and am experimenting with Stanford NER and OpenNLP via PHP/command line.
Chris
I guess you'll probably improve name identification if you create a CV corpus, though this also depends on the size of your corpus (you could gather such a corpus by crawling CV websites).
Using data mining is, in my opinion, probably your best option. I don't know in detail what options Apache Tika offers, but the more information you have on the layout of the CV, the better. For instance, patterns should probably rely on the fact that names are at the top of the document, and close to the birth date / marital status / image / address.
In that case, you are no longer in a sequential labelling setting (which is what Stanford NER does): in a CV, a name is usually not surrounded by text. It should most probably be treated as a classification task over candidate segments of text, with the patterns converted into (numeric or binary) attributes.
Pattern extractors can easily be found or implemented, and should be considered a preprocessing step before machine learning. Don't forget to also use lists of first and last names (and frequent prefixes / suffixes: -son, -vitch, -man, Ben-, de, etc.), which are unavoidable criteria for deciding which segment is likely to be a name. Other names often appear in a CV, which is why I believe layout should also be an important feature.
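As a minimal sketch of what I mean by converting patterns into attributes, here is some Python that treats each line of the extracted text as a candidate segment. The tiny name lists, the `CONTACT_HINT` pattern, the feature names and the `segment_features` function are all hypothetical choices for illustration; a real system would use much larger lists and whatever layout information the extraction step provides.

```python
import re

# Tiny illustrative lists (assumptions); real lists would be far larger.
FIRST_NAMES = {"akbar", "chris", "maria"}
LAST_NAMES = {"agho", "johnson", "petrovitch"}
SUFFIXES = ("son", "vitch", "man")
PARTICLES = {"de", "ben"}
# Rough hint that a neighbouring line holds contact or personal details.
CONTACT_HINT = re.compile(r"@|\d{3,}|address|born|date of birth", re.IGNORECASE)

def segment_features(lines, index):
    """Turn the candidate line at `index` into binary/numeric attributes."""
    words = lines[index].split()
    lower = [w.lower().strip(".,") for w in words]
    # Up to two lines before and two lines after the candidate.
    neighbours = lines[max(0, index - 2):index] + lines[index + 1:index + 3]
    return {
        "line_index": index,  # names tend to be near the top
        "n_words": len(words),
        "all_capitalised": int(bool(words) and all(w[0].isupper() for w in words)),
        "has_known_first_name": int(any(w in FIRST_NAMES for w in lower)),
        "has_known_last_name": int(any(w in LAST_NAMES for w in lower)),
        "has_name_suffix": int(any(len(w) > 4 and w.endswith(SUFFIXES) for w in lower)),
        "has_particle": int(any(w in PARTICLES or w.startswith("ben-") for w in lower)),
        "near_contact_info": int(any(CONTACT_HINT.search(n) for n in neighbours)),
    }

cv_lines = [
    "Akbar Agho",
    "12 High Street, London",
    "akbar@example.com",
    "Worked with Maria Johnson on a project",
]
for i in range(len(cv_lines)):
    print(segment_features(cv_lines, i))
```

Each dictionary can then be fed to any standard classifier once you have labelled examples. Note that the last line also contains known names, which is exactly why the position and neighbourhood attributes matter.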
I'd be interested to know which features turn out to be effective... would you let us know?