Why are special characters like () "" : [] often removed from data before training a machine translation model?


I see that people often remove special characters like () "" : [] from the data before training a machine translation model. Could you explain the benefits of doing so?


Best answer

Data clean-up, or pre-processing, is performed so that algorithms can focus on important, linguistically meaningful "words" instead of "noise". See "Removing Special Characters":

Special characters, as you know, are non-alphanumeric characters. These characters are most often found in comments, references, currency numbers etc. These characters add no value to text-understanding and induce noise into algorithms.

Whenever this noise finds its way into a model, the model can produce output at inference time that contains these unexpected characters (or sequences of them), and the noise can even degrade the translations overall. This is frequently the case with brackets in Japanese translations.
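
As a rough illustration of the clean-up step described above, here is a minimal Python sketch that strips the characters from the question out of a sentence pair before training. The names `SPECIAL_CHARS`, `clean_line` and `clean_pair` are made up for this example, and the character set (including the full-width Japanese brackets) is an assumption you would adjust for your own corpus; it is not taken from any particular toolkit.

```python
import re
from typing import Tuple

# Assumed character set: the ASCII characters from the question plus
# full-width Japanese brackets. Adjust this for your own data.
SPECIAL_CHARS = re.compile(r'[()\[\]":（）「」]')
WHITESPACE = re.compile(r"\s+")


def clean_line(text: str) -> str:
    """Replace the listed special characters with spaces, then collapse whitespace."""
    text = SPECIAL_CHARS.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


def clean_pair(source: str, target: str) -> Tuple[str, str]:
    """Apply the same clean-up to both sides of a parallel sentence pair."""
    return clean_line(source), clean_line(target)


if __name__ == "__main__":
    print(clean_line('He said: "hello" (twice) [sic]'))
    print(clean_pair("「こんにちは」（笑）", '"Hello" (laughs)'))
```

Replacing with a space rather than deleting outright avoids gluing neighbouring words together (for example `word(note)` becoming `wordnote`). Whether to remove these characters at all depends on the task: if the translations are expected to keep quotes or brackets, you would leave them in the data.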