Why are special characters like () "" : [] often removed from data before training a machine translation model?

I see that people often remove special characters like () "" : [] from data before training a machine translation model. Could you explain the benefits of doing so?
796 Views · Asked by phan-anh.tuan

There is 1 answer below.
Data clean-up or pre-processing is performed so that algorithms can focus on important, linguistically meaningful "words" instead of "noise". See "Removing Special Characters":

Whenever this noise finds its way into a model, it can produce output at inference time that contains these unexpected characters (or sequences of characters), and it can even degrade overall translation quality. This happens frequently with brackets in Japanese translations.
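As a minimal sketch of what such a clean-up step might look like (the exact character set is an assumption and should be adapted to your corpus, e.g. to include full-width CJK brackets for Japanese data):

```python
import re

# Hypothetical noise pattern: ASCII brackets, quotes, colons, plus
# curly quotes and full-width/CJK brackets often seen in Japanese text.
NOISE = re.compile(r'[()\[\]":\u201c\u201d\u3010\u3011\uff08\uff09]')

def clean_line(line: str) -> str:
    """Replace noisy special characters with spaces, then collapse
    the resulting extra whitespace before tokenization/training."""
    line = NOISE.sub(' ', line)
    return re.sub(r'\s+', ' ', line).strip()

print(clean_line('The 【model】 said: "hello (world)"'))
# -> 'The model said hello world'
```

In practice you would apply the same cleaning to both sides of the parallel corpus so that source and target stay consistent; modern toolkits often handle some of this via tokenizer normalization instead of a separate regex pass.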