Gazette Document ("The Government of the Hong Kong Special Administrative Region Gazette")
I'd like to extract the company information from the PDF document above. Due to company restrictions, I can only use UiPath's ReadPDF activity to extract the text from the PDF. I have already dropped the header and footer, so only the company entries in the body remain. Each company entry is structured as follows:
BRNumber EnglishName ChineseName
BRNumber is a combination of letters and digits and is always 8 characters long. EnglishName is a combination of letters and special characters. ChineseName is a combination of Chinese characters and special characters.
When either EnglishName or ChineseName is too long, it wraps onto a second line.
Either EnglishName or ChineseName can be empty.
BRNumber, EnglishName and ChineseName are separated by a space. Words within EnglishName are also separated by spaces.
How can I extract BRNumber, EnglishName and ChineseName?
I tried to split a single line with this regex:
([A-Za-z0-9\-]*)\s*(([A-Za-z0-9-'&.,\s()/]*)\s*([A-Za-z0-9-'&.,\s()/]*)\s*([A-Za-z0-9-'&.,\s()/]*))\s*([\d\u4e00-\u9fff-\s()()]*)
But when a ChineseName does not start with a Chinese character, the result is incorrect.
For example,
C1234567 | 20 Hello Co.Ltd | 20 你好有限公司
will become
C1234567 | 20 Hello Co.Ltd 20 | 你好有限公司
The bars are only there to show the split more clearly. Please ignore them.


Instead of a regex, just use plain old string manipulation...
Search from the front for the first space to get BRNumber. Search from the back for the last space to get ChineseName. Everything in between must be EnglishName. Add an extra check for a multi-line string to handle the wrapped-name edge case.
Something like:
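A minimal Python sketch of the first-space / last-space split described above. The function name `split_entry` is hypothetical, and the sketch assumes each entry arrives as one string in which a wrapped line break can be treated as an ordinary space. It also assumes the ChineseName is present and contains no spaces of its own.

```python
from typing import Tuple


def split_entry(entry: str) -> Tuple[str, str, str]:
    """Split one entry into (br_number, english_name, chinese_name).

    Assumption: a wrapped entry is passed in as a single string, and the
    line break inside it can be replaced by a single space.
    """
    # Collapse every run of whitespace (including a line break) to one space.
    text = " ".join(entry.split())

    # Everything before the first space is the BRNumber.
    br_number, _, rest = text.partition(" ")

    # Everything after the last space is taken as the ChineseName;
    # whatever is left in between is the EnglishName.
    english_name, _, chinese_name = rest.rpartition(" ")

    return br_number, english_name, chinese_name


if __name__ == "__main__":
    print(split_entry("C1234567 Hello Co.Ltd 你好有限公司"))
    print(split_entry("C1234567 Hello World\nCo.Ltd 你好有限公司"))
```

Because the split relies only on the last space, a ChineseName that itself contains a space (such as `20 你好有限公司`) or an entry with an empty ChineseName would still be split in the wrong place. Those cases need an additional check, for example testing whether the trailing part contains any Chinese characters.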
Here's the class being used in a Button click event:
Here's the output from the IDE: