Text anonymization using supervised machine learning


I have a lot of text documents containing company and personal names. I also have aligned copies of these documents in which those names have been manually anonymized (names replaced with a single unique character).

I want to use this corpus to train a system that performs automatic anonymization on unseen documents, that is, simply replaces the relevant words with a character. The primary problem is to recognize the words to be anonymized; the secondary problem is to replace those words with the unique character. I can already do the secondary part.
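For illustration, the aligned documents could be turned into per-token training labels roughly as follows. This is only a sketch under assumptions that are not stated in the question: the placeholder character is taken to be "X", tokens are split on whitespace, and each anonymized word is replaced by exactly one placeholder. The function name `label_tokens` is made up for the example.

```python
from difflib import SequenceMatcher

def label_tokens(original, anonymized, placeholder="X"):
    """Return (token, label) pairs; label is 1 if the token was anonymized."""
    src = original.split()
    tgt = anonymized.split()
    labels = [0] * len(src)
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A source span counts as anonymized only when the aligned target
        # span consists of placeholders (ignoring attached punctuation).
        replaced = tgt[j1:j2]
        if tag == "replace" and all(t.strip(".,;:!?") == placeholder for t in replaced):
            for i in range(i1, i2):
                labels[i] = 1
    return list(zip(src, labels))

pairs = label_tokens(
    "John Smith works at Acme Corp in Oslo.",
    "X X works at X X in Oslo.",
)
print(pairs)
```

The resulting 0/1 labels per token are what a supervised classifier would be trained on.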

Python is preferred and I'm thinking sklearn must contain the necessary tools.

How would I go about this? There are many posts on Stack Overflow about supervised learning, but I'm not sure they match my situation. I suspect this is a fairly simple problem to solve, and I'm not necessarily looking for a complete solution, but some starting pointers would be nice. Any insight into which algorithms would work better would also be much appreciated.
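One possible starting point is to treat the recognition step as per-token binary classification (closely related to named entity recognition). The sketch below uses sklearn's `DictVectorizer` and `LogisticRegression`; the feature set, the toy training sentences, the "X" placeholder, and the helper names `token_features`, `featurize`, and `anonymize` are all assumptions made for the example, not a recommended final design.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def token_features(tokens, i):
    """Simple hand-written features for the token at position i."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "is_upper": tok.isupper(),
        "suffix3": tok[-3:].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

def featurize(sentences):
    """Flatten tokenized sentences into one feature dict per token."""
    return [token_features(s, i) for s in sentences for i in range(len(s))]

# Toy data for illustration only: 1 = token should be anonymized.
train_sents = [
    ["John", "Smith", "works", "at", "Acme", "Corp"],
    ["the", "report", "was", "sent", "to", "Anna", "Berg"],
    ["we", "met", "at", "the", "office", "yesterday"],
]
train_labels = [1, 1, 0, 0, 1, 1,
                0, 0, 0, 0, 0, 1, 1,
                0, 0, 0, 0, 0, 0]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(featurize(train_sents), train_labels)

def anonymize(tokens, placeholder="X"):
    """Replace every token the model flags with the placeholder."""
    preds = model.predict(featurize([tokens]))
    return [placeholder if p == 1 else t for t, p in zip(tokens, preds)]

print(anonymize(["Mary", "Jones", "works", "at", "Initech"]))
```

With a real corpus the labels would come from the aligned documents rather than being typed by hand. Because each token is classified independently here, a sequence model such as a conditional random field may handle multi-word names better.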
