How to decide on the correct NLP approach for a project


I'm working on an NLP project. My task is to determine the category and sentiment score of Turkish call-center conversations from the conversation text itself. We are using Python as our programming language. However, I'm stuck in the initial stages of the project, faced with hundreds of alternative solutions.

I have 300,000 rows of customer-representative and customer text data, which I have already cleaned in the preprocessing stage: everything is sentence-tokenized and has been through the other standard preprocessing steps. The representative and customer sides of each conversation sit in separate columns of a ~600 MB CSV. Before we decide on modeling algorithms, my manager expects me to prepare the training dataset. The dataset should contain the following information; we will decide later which features are necessary and which are not, and then finalize the dataset:

1. Word vectors should be extracted  
2. Sentence vectors should be extracted  
3. Summaries of the conversations should be extracted, and sentence and word vectors of these summaries should be extracted  
4. NER counts should be extracted  
5. POS counts should be extracted  
6. Morphological analysis should be performed, and affixes, conjunctions, etc. should be represented numerically  
7. Sentiment score should be extracted in terms of both subjectivity and polarity  
8. Keywords and their vectors should be extracted  
9. Topic modeling should be done, and which topic each conversation is closest to should be added to the dataset  
10. Similarity scores between summary and main conversations should be extracted  
11. The rarity ratio of the words used in a conversation should be extracted and averaged sentence by sentence (we think this will indicate how rich each sentence is in terms of meaning; see the sketch after this list for one possible interpretation)  
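
For item 11, a minimal sketch of one possible interpretation of "rarity" is shown below, using inverse document frequency over the whole corpus and averaging it per sentence. The file name, the column name, and the choice of IDF are assumptions for illustration, not requirements:

```python
import math
from collections import Counter

import pandas as pd

# Assumed layout: one conversation per row, sentences separated by newlines.
df = pd.read_csv("conversations.csv")
conversations = df["customer_text"].fillna("").tolist()

# Document frequency of each token across all conversations.
doc_freq = Counter()
for conv in conversations:
    doc_freq.update(set(conv.split()))
n_docs = len(conversations)

def word_rarity(token: str) -> float:
    """Higher IDF means the token is rarer across the corpus."""
    return math.log(n_docs / (1 + doc_freq[token]))

def conversation_rarity(conv: str) -> float:
    """Average of per-sentence mean rarity, as item 11 describes."""
    sent_scores = []
    for sent in conv.split("\n"):
        tokens = sent.split()
        if tokens:
            sent_scores.append(sum(word_rarity(t) for t in tokens) / len(tokens))
    return sum(sent_scores) / len(sent_scores) if sent_scores else 0.0

df["rarity_score"] = [conversation_rarity(c) for c in conversations]
```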
   

The problems I'm facing are as follows:

  1. What is the best word-vector library for Turkish? There are techniques like Word2vec, n-grams, GloVe, etc., as well as relatively newer ones like FastText, and BERT is also an option. Which one should I choose, and how do I compare their performance? Should I train my own model, or should I prefer pre-trained models (e.g., FastText publishes models trained on Turkish)? Which technique is superior or more current than which? Which article or piece of research should I rely on, when they all seem to use a different technique?

  2. Gensim seems to be a library whose tools provide solutions to a lot of NLP problems. It's such a big library that I haven't even been able to grasp its full capabilities. Should I just proceed using Gensim, or should I combine different tools? Will it meet all my needs, and how will I know?

  3. There are many tools that do lemmatizing, and many of them also do vectorization. Since I will be lemmatizing anyway, should I use their vectorization features, or should I rely on the most-mentioned vectorization tools above? Which one gives the best result? I've read a lot of comparison articles, and they all report different results.

  4. There are SBERT models trained on Turkish for extracting sentence vectors. If I use SBERT for sentence vectors but extract word vectors with a different tool, will the resulting vectors become meaningless together? After all, I will be producing them with different methods, and they will sit in the same dataset.

Because it's so unclear which of these alternative solutions is superior to the others, I'm confused.

Actually, my reason for writing here is to learn from the approach and discipline of people who have worked on projects like this and know how to reach accurate information, and to take it as an example. What I want from you is advice on how to conduct an NLP project correctly, as someone who wants to improve in the NLP field.


There is 1 answer below.

  1. Using word-vectors is really just a matter of tokenizing, then looking up the word-vectors – from either a pretrained model, or something trained on your own data.

If you have enough text, training your own model can better capture the unique senses of words in your domain.

You'd likely want to use the word2vec or FastText algorithms for training individual word vectors, and since you're using Python, the Gensim implementations of those algorithms work well and offer plenty of options.
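
For example, a minimal sketch of training both algorithms with Gensim on your already-tokenized text might look like this (the file name, column name, and parameter values are assumptions, not recommendations):

```python
import pandas as pd
from gensim.models import FastText, Word2Vec

# Assumed CSV layout: one conversation per row, customer text in a
# "customer_text" column that is already cleaned and tokenized.
df = pd.read_csv("conversations.csv")

# Gensim expects an iterable of token lists; here each row is treated as
# one whitespace-tokenized "sentence" purely for illustration.
sentences = [text.split() for text in df["customer_text"].fillna("")]

# Skip-gram word2vec; the parameter values are just a starting point.
w2v = Word2Vec(sentences, vector_size=300, window=5, min_count=5,
               sg=1, epochs=10, workers=4)

# FastText on the same data; min_n/max_n control the character n-grams
# that make out-of-vocabulary lookups possible.
ft = FastText(sentences, vector_size=300, window=5, min_count=5,
              min_n=3, max_n=6, epochs=10, workers=4)

print(w2v.wv.most_similar("fatura", topn=5))  # nearest neighbours of "invoice/bill"
```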

The potential benefit of FastText is that, if Turkish word-substrings often hint at a word's meaning, your end-model can offer a synthesized "guess" vector, better than nothing, when later asked for words it didn't see during initial training ("out of vocabulary"). But if your whole project will train on a fixed corpus, rather than, say, applying a frozen model to new items until the next full retrain, that's not a big benefit.
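
To make that concrete, here is a tiny illustration of the out-of-vocabulary behaviour, assuming the `w2v` and `ft` models from the sketch above and a made-up inflected form that was not in the training data:

```python
unseen = "faturalandırmalarımızdan"  # a long inflected form, assumed absent from training

# word2vec only has vectors for words seen during training.
print(unseen in w2v.wv.key_to_index)  # likely False

# FastText synthesizes a "guess" vector from the word's character n-grams.
print(ft.wv[unseen].shape)            # (300,)
```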

And: the easy interchangeability of these algorithms means that as soon as you have your own repeatable downstream evaluation of end-models, you can (& should) try them both, even with a variety of training parameters, to see what works best.

  2. Because Gensim is a grab-bag of algorithms, you wouldn't really choose to "use Gensim". You'd decide which approaches you want to try, and use Gensim as part of your solution if/when it has good implementations of the relevant algorithms. It doesn't really preclude or require much else.

  3. Unsure what tools you're referring to, but keep in mind that you may not need to lemmatize at all. If you've got plenty of text with examples of all your word forms, and/or you're using modern dense modelling (like FastText word vectors, whether trained yourself or taken from a larger domain), you may not need to coalesce multiple word variants via lemmatizing. (This answer might change based on Turkish word morphology, which I know nothing about. If it has an especially wide array of related word forms – such that some would be too rare to model on their own – lemmatizing might help a lot. But it's definitely not something you should assume you need; rather, test approaches with and without it.)
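
If you do want to measure that, one sketch of a with/without comparison follows, reusing the same assumed conversations.csv layout as above. `lemmatize_tr` is a hypothetical stand-in for whichever Turkish lemmatizer you choose, and the "category" label column is assumed to exist:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

df = pd.read_csv("conversations.csv")
texts = df["customer_text"].fillna("")
labels = df["category"]  # assumed label column

def lemmatize_tr(text: str) -> str:
    # Hypothetical stand-in: replace with a real Turkish lemmatizer.
    # Identity here so the sketch runs end to end.
    return text

variants = {"raw": list(texts), "lemmatized": [lemmatize_tr(t) for t in texts]}

for name, docs in variants.items():
    pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipe, docs, labels, cv=5, scoring="f1_macro")
    print(f"{name}: macro-F1 = {scores.mean():.3f}")
```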

  4. If you were using a particular BERT-like model for longer-text vectorization, I think you'd want to stick with that same model's own word vectors for any text-vector to word-vector comparisons that need to stay compatible. But I'm not sure that comparison would even come up in a real approach. (If you can get a quality dense text-vector for your texts from any neural model, you may already be past needing to worry about per-word vectors.)
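
For the sentence-vector side, a minimal sketch with the sentence-transformers package might look like the following; the model name is a placeholder for whichever Turkish SBERT model you pick, and the example sentences are made up:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder model name: substitute the Turkish SBERT model you decide on.
model = SentenceTransformer("your-turkish-sbert-model")

conversations = [
    "Faturam çok yüksek geldi, kontrol edebilir misiniz?",
    "İnternetim iki gündür çalışmıyor.",
]
summaries = [
    "Müşteri fatura tutarından şikayetçi.",
    "Müşteri internet arızası bildiriyor.",
]

conv_vecs = model.encode(conversations)  # one dense vector per conversation
sum_vecs = model.encode(summaries)       # one dense vector per summary

# Requirement 10 from the question: similarity between summary and conversation.
print(cosine_similarity(conv_vecs, sum_vecs).diagonal())
```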

More general advice:

You could start with a bog-simple, classic text-classification/sentiment-classification approach – a bag-of-words representation of each text plus a simple classifier algorithm, similar to those in the sklearn algorithm-selection flowchart, following a process like the one shown in the sklearn text-classification guides.
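
As a concrete starting point, here is a minimal sketch of that baseline; the file name, column names, and the assumption that you already have (or will obtain) a labeled "category" column are all placeholders:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Assumed layout: one conversation per row, plus a labeled category column.
df = pd.read_csv("conversations.csv")
texts = df["customer_text"].fillna("")
labels = df["category"]

# Bag-of-words (TF-IDF) representation feeding a simple linear classifier.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Repeatable, quantitative evaluation that every later tweak is compared against.
scores = cross_val_score(baseline, texts, labels, cv=5, scoring="f1_macro")
print(f"baseline macro-F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```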

Then, even if the initial results are weak, you already have:

  • a skeleton/pipeline of key steps into which you can substitute alternatives (such as text representations that add subword n-grams, or word-vector averages, or BERT-like longer-text vectors, rather than a simple bag-of-words)
  • some repeatable quantitative evaluation of the end result, to check when other tweaks help or hurt

Then, you improve from there, trying alternative algorithms/parameters, comparing results.
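
For instance, swapping the bag-of-words representation for averaged word vectors, while keeping the classifier and the evaluation identical, is one such small, measurable change. This sketch assumes the `w2v` model and the `texts`/`labels` columns from the earlier sketches:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def average_word_vector(doc: str, wv, dim: int = 300) -> np.ndarray:
    """Mean of the vectors of in-vocabulary tokens; zeros if none are known."""
    vecs = [wv[tok] for tok in doc.split() if tok in wv.key_to_index]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Same labels, same classifier, same scoring; only the representation changes.
X = np.vstack([average_word_vector(t, w2v.wv) for t in texts])
scores = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5,
                         scoring="f1_macro")
print(f"word-vector-average macro-F1: {scores.mean():.3f}")
```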