Why are special characters like () "" : [] often removed from data before training a machine translation system?

I have noticed that people often remove special characters such as (), "", :, and [] from data before training a machine translation system. Could you explain the benefits of doing so?
798 Views · Asked by phan-anh.tuan
1 Answer
Data clean-up, or pre-processing, is performed so that algorithms can focus on important, linguistically meaningful "words" instead of "noise". See "Removing Special Characters".

Whenever this noise finds its way into a model, it can produce output at inference time that contains these unexpected characters (or sequences of them), and it can even degrade overall translation quality. This is a frequent problem with brackets in Japanese translations, for example.
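A minimal sketch of such a clean-up step might look like the following. The exact character set is an assumption here — real pipelines tune it per language pair (note the full-width Japanese brackets, which relate to the bracket problem mentioned above):

```python
import re

# Assumed set of "noise" characters: ASCII brackets, quotes, colons,
# plus full-width Japanese brackets. Tune this per corpus/language pair.
NOISE_CHARS = re.compile(r'[()\[\]"«»“”:（）「」]')

def clean_line(line: str) -> str:
    """Replace noise characters with spaces, then collapse extra whitespace."""
    cleaned = NOISE_CHARS.sub(" ", line)
    return re.sub(r"\s+", " ", cleaned).strip()

print(clean_line('He said: "the model (a Transformer) works"'))
# → He said the model a Transformer works
```

In a parallel corpus, the same clean-up must be applied to both the source and target side of each sentence pair, so the aligned segments stay consistent.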