How can I use Weka for terminology extraction?

289 Views Asked by MSepehr At 03 January 2014 at 06:26

I need to extract domain-specific terms (for example, political terms) from a large training corpus. How can I use Weka and its filters to achieve this? Can I use the feature vector produced by Weka's StringToWordVector filter for this?


There are 1 best solutions below

Jose Maria Gomez Hidalgo On 03 January 2014 at 09:24

You can, at least partly, as long as you have an appropriate dataset. For instance, assume you have a dataset like this one:

@relation test

@attribute text String
@attribute politics {yes,no}
@attribute religion {yes,no}

@data
"this is a text about politics",yes,no
"this text is about religion",no,yes
"this text mixes everything",yes,yes

To get terms about politics, for example, you can:

1. Remove the religion attribute.
2. Apply the StringToWordVector filter to the text attribute to get terms.
3. Apply the AttributeSelection filter with Ranker and InfoGainAttributeEval to get the top ranked terms.

This last step gives you a list of the terms that are most predictive of the politics category. Most of them will be terms from the politics domain, although some terms may be predictive precisely because they do not belong to it; that is, they provide negative evidence.
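To make the negative-evidence point concrete, here is a minimal plain-Java sketch that computes information gain by hand for the three example texts above. It does not use Weka and is not Weka's implementation: the class `InfoGainSketch` and its methods are hypothetical names introduced for illustration. It also assumes whitespace tokenization and binary term presence, which may differ from the StringToWordVector settings you choose.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Weka.
class InfoGainSketch {

    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    // Entropy (in bits) of a group with `pos` positive documents out of `total`.
    static double entropy(int pos, int total) {
        if (total == 0 || pos == 0 || pos == total) return 0.0;
        double p = (double) pos / total;
        return -(p * log2(p) + (1 - p) * log2(1 - p));
    }

    // Information gain of "term is present in the document" for a binary label.
    static double infoGain(String term, List<Set<String>> docs, boolean[] labels) {
        int n = docs.size(), pos = 0, with = 0, withPos = 0;
        for (int i = 0; i < n; i++) {
            if (labels[i]) pos++;
            if (docs.get(i).contains(term)) {
                with++;
                if (labels[i]) withPos++;
            }
        }
        int without = n - with, withoutPos = pos - withPos;
        double conditional = ((double) with / n) * entropy(withPos, with)
                + ((double) without / n) * entropy(withoutPos, without);
        return entropy(pos, n) - conditional;
    }

    // Scores every distinct term and returns them sorted by descending gain.
    static Map<String, Double> rank(String[] texts, boolean[] labels) {
        List<Set<String>> docs = new ArrayList<>();
        Set<String> vocabulary = new TreeSet<>();
        for (String text : texts) {
            // Assumption: simple lowercase, whitespace-separated tokens.
            Set<String> words = new HashSet<>(Arrays.asList(text.toLowerCase().split("\\s+")));
            docs.add(words);
            vocabulary.addAll(words);
        }
        List<String> terms = new ArrayList<>(vocabulary);
        terms.sort((a, b) -> Double.compare(infoGain(b, docs, labels), infoGain(a, docs, labels)));
        Map<String, Double> ranked = new LinkedHashMap<>();
        for (String term : terms) ranked.put(term, infoGain(term, docs, labels));
        return ranked;
    }

    public static void main(String[] args) {
        String[] texts = {
            "this is a text about politics",
            "this text is about religion",
            "this text mixes everything"
        };
        boolean[] politics = {true, false, true}; // the politics column of the dataset
        rank(texts, politics).forEach((term, gain) -> System.out.printf("%-12s %.3f%n", term, gain));
    }
}
```

On this toy data the top-ranked term for the politics class is "religion" (gain of about 0.918 bits), because its presence perfectly signals politics=no, while "politics" itself scores only about 0.252. With three documents this is an artifact of the tiny sample, but it shows why the ranked list should be reviewed rather than taken as a list of in-domain terms.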

The quality of the terms you get depends on the dataset. The more topics it covers, the better your results will be; so instead of having just two classes (politics and religion, as in my dataset), it is much better to have plenty of them, with many examples for each category.
