I have a string description of a company, which is nasty written by different users (hand-typed). Here is a example (focus on the dots, spaces, first letters etc..):
XXXX is a Global menagement consulting,Technology services and outsourcing company, with 257000people serving clients in more than 120 countries.. combining unparalleled experience, comprehensive capabilities across all industries and business functions,and extensive research on the worlds most successfull companies, XXXX collaborates with clients to help them become high-performance businesses and governments., the company generated net revenues of US$27.9 Billion for the fiscal year ended 31.07.2012..
Now what i want is to format the string to a bit nicer version like this:
XXXX is a global management consulting, technology services and outsourcing company, with 257,000 people serving clients in more than 120 countries. Combining unparalleled experience, comprehensive capabilities across all industries and business functions, and extensive research on the world’s most successful companies, XXXX collaborates with clients to help them become high-performance businesses and governments. The company generated net revenues of US$27.9 billion for the fiscal year ended Aug. 31, 2012.
My question is: Is there any library with already defined methods which could do all the spelling corrections, unneeded space removal, etc .. ?
So far, I do it be replacing stuff like " ," with ", " and toUpperCase() if the is a "///." in front etc..
desc = desc.replace(" ", " ");
desc = desc.replace("..", ".");
desc = desc.replace(" .", ".");
desc = desc.replace(" ,", ", ");
desc = desc.replace(".,", ".");
desc = desc.replace(",.", ".");
desc = desc.replace(", .", ".");
desc = desc.replace("*", "");
I'm sure there is a cleaner and better version to do this. Using regex maybe??
Any solution would be appreciated.
If I were trying to solve your problem, I would probably read the text 1
charat a time, and format it as you go. For example, in psuedocode...Basically if you read the text 1
charat a time, you can detect any of your chosen formatting problems as you go, and correct them immediately. This provides a performance improvement, as you're only reading over the text string once (not 8 times, like you are currently doing), and allows you to add as many different/complex formatting changes as you want. The downside, however, is that you need to write the logic yourself rather than relying on in-built functions.