I need to clean an html string from accents and html accents code, and of course I have found a lot of codes that do this, however, none seems to work with the file i need to clean.
This file contains words like Postulación Ayudantías
and also Gestión
or Árbol
I found a lot of codes with text.normalize and regex use to clean the String, which work well with short strings but I'm using very long strings and those codes, which work with short string, doesn't work with long Strings
I am really lost here and I need help please!
This are the codes I tried and didnt work
Easy way to remove UTF-8 accents from a string? (return "?" for every accent in the String)
and I used regular expression to remove the html accent code but neither is working:
string=string.replaceAll("á","a");
string=string.replaceAll("é","e");
string=string.replaceAll("í","i");
string=string.replaceAll("ó","o");
string=string.replaceAll("ú","u");
string=string.replaceAll("ñ","n");
Edit: nvm the replaceAll is working I wrote it wrong ("/á instead of "á)
Any help or ideas?
I think there are several options that would work. I would suggest that you first use StringEscapeUtils.unescapeHtml4(String) to unescape your html entities (that is convert them to their normal Java "utf-8" form). Then you could use an ASCIIFoldingFilter to filter to "ASCII" equivalents.