VB: How to extract BRNumber, EnglishName and ChineseName from a space-separated string?

Gazette Document ("The Government of the Hong Kong Special Administrative Region Gazette")

I'd like to extract the company information from the above PDF document. Due to the company's restrictions, I can only use UiPath's ReadPDF activity to extract the text from the PDF. I have already stripped the header and footer and have all the company entries from the body. The structure of a company entry is as follows:

BRNumber EnglishName ChineseName

BRNumber is a combination of letters and digits with a length of 8. EnglishName can be a combination of letters and special characters. ChineseName can be a combination of Chinese characters and special characters.

When either EnglishName or ChineseName is too long, it wraps onto 2 lines.

Either EnglishName or ChineseName can be empty.

BRNumber, EnglishName and ChineseName are separated by spaces. Words in EnglishName are also separated by spaces.

How can I extract BRNumber, EnglishName and ChineseName?

I tried to split a single line with the following regex:

([A-Za-z0-9\-]*)\s*(([A-Za-z0-9-'&.,\s()/]*)\s*([A-Za-z0-9-'&.,\s()/]*)\s*([A-Za-z0-9-'&.,\s()/]*))\s*([\d\u4e00-\u9fff-\s()()]*)

However, when a ChineseName does not start with a Chinese character, the result is incorrect.

For example,

C1234567 | 20 Hello Co.Ltd | 20 你好有限公司

will become

C1234567 | 20 Hello Co.Ltd 20 | 你好有限公司

The bars are only there to show the split more clearly; please ignore them.

There are 3 best solutions below

6
Idle_Mind On

Instead of RegEx, just use plain old string manipulation...

Search from the front for the first space to get BRNumber. Search from the back for the last space to get ChineseName. Everything in-between must be EnglishName. Add an extra check for a multi-line string to handle the edge case.

Something like:

Public Class DataValues

    Public BRNumber As String
    Public EnglishName As String
    Public ChineseName As String

    Public Shared Function GetDataValues(ByVal data As String) As DataValues
        Dim dv As New DataValues
        Dim firstSpace As Integer = data.IndexOf(" ")
        Dim lastSpace As Integer = data.LastIndexOf(" ")
        Dim newLine As Integer = data.IndexOf(Environment.NewLine)
        If newLine <> -1 AndAlso firstSpace <> -1 AndAlso firstSpace < newLine Then
            ' Two-line entry: the line break separates EnglishName from ChineseName.
            dv.BRNumber = data.Substring(0, firstSpace)
            dv.EnglishName = data.Substring(firstSpace + 1, newLine - firstSpace - 1)
            dv.ChineseName = data.Substring(newLine + Environment.NewLine.Length)
        ElseIf firstSpace <> -1 AndAlso lastSpace <> -1 AndAlso firstSpace <> lastSpace Then
            ' Single-line entry: split at the first and the last space.
            dv.BRNumber = data.Substring(0, firstSpace)
            dv.EnglishName = data.Substring(firstSpace + 1, lastSpace - firstSpace - 1)
            dv.ChineseName = data.Substring(lastSpace + 1)
        End If
        ' If neither branch applies, all three fields stay Nothing.
        Return dv
    End Function

    Public Overrides Function ToString() As String
        Return "BRNumber: " & BRNumber & Environment.NewLine &
            "EnglishName: " & EnglishName & Environment.NewLine &
            "ChineseName: " & ChineseName & Environment.NewLine
    End Function

End Class

Here's the class being used in a Button click event:

Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
    Dim data As String = "C1234567 20 Hello Co.Ltd 20 你好有限公司"
    Dim data2 As String = "C1234567 20 Hello Co.Ltd 20" & Environment.NewLine & "你好有限公司"

    Dim dv1 As DataValues = DataValues.GetDataValues(data)
    Debug.Print(dv1.ToString())

    Dim dv2 As DataValues = DataValues.GetDataValues(data2)
    Debug.Print(dv2.ToString())
End Sub

Here's the output from the IDE:

BRNumber: C1234567
EnglishName: 20 Hello Co.Ltd 20
ChineseName: 你好有限公司

BRNumber: C1234567
EnglishName: 20 Hello Co.Ltd 20
ChineseName: 你好有限公司
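
For the case in the question where the ChineseName itself starts with a number (20 你好有限公司), one possible variation is to look for the first Chinese character instead of the last space, and then pull in a purely numeric token directly in front of it. This is only a sketch under assumptions: the names SplitEntry, IsCjk and IsAllDigits are made up for illustration, each entry is assumed to have been joined into a single line already, and a number directly before the first Chinese character is assumed to belong to ChineseName (which cannot be told apart from an EnglishName that happens to end in a number).

```vb
Imports System

Module EntrySplitSketch

    ' True for characters in the range U+4E00..U+9FFF,
    ' the same range used by the regex in the question.
    Private Function IsCjk(c As Char) As Boolean
        Dim code As Integer = Convert.ToInt32(c)
        Return code >= &H4E00 AndAlso code <= &H9FFF
    End Function

    ' True when s is non-empty and consists of digits only.
    Private Function IsAllDigits(s As String) As Boolean
        If s.Length = 0 Then Return False
        For Each c As Char In s
            If Not Char.IsDigit(c) Then Return False
        Next
        Return True
    End Function

    ' Hypothetical helper: returns {BRNumber, EnglishName, ChineseName}.
    Public Function SplitEntry(line As String) As String()
        Dim entry As String = line.Trim()
        Dim firstSpace As Integer = entry.IndexOf(" "c)
        If firstSpace = -1 Then Return New String() {entry, "", ""}

        Dim brNumber As String = entry.Substring(0, firstSpace)
        Dim rest As String = entry.Substring(firstSpace + 1).Trim()

        ' Find the first Chinese character in the remainder.
        Dim cjkIndex As Integer = -1
        For i As Integer = 0 To rest.Length - 1
            If IsCjk(rest(i)) Then
                cjkIndex = i
                Exit For
            End If
        Next
        If cjkIndex = -1 Then Return New String() {brNumber, rest, ""}

        ' Start of the space-separated token containing that character.
        Dim start As Integer = rest.LastIndexOf(" "c, cjkIndex) + 1

        ' Assumption: a purely numeric token directly before it belongs to ChineseName.
        If start >= 2 Then
            Dim prevStart As Integer = rest.LastIndexOf(" "c, start - 2) + 1
            Dim prevToken As String = rest.Substring(prevStart, start - 1 - prevStart)
            If IsAllDigits(prevToken) Then start = prevStart
        End If

        Return New String() {brNumber, rest.Substring(0, start).Trim(), rest.Substring(start)}
    End Function

    Sub Main()
        Dim parts As String() = SplitEntry("C1234567 20 Hello Co.Ltd 20 你好有限公司")
        Console.WriteLine(String.Join(" | ", parts))
    End Sub

End Module
```

With the sample line from the question, this should split into C1234567, 20 Hello Co.Ltd and 20 你好有限公司. An entry without any Chinese character ends up with an empty ChineseName.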
0
Albert D. Kallal On

Assuming this string:

    Dim sTest As String = "BRNumber EnglishName ChineseName"
    ' Split on single spaces; this yields exactly three parts only when neither name contains a space.
    Dim sParts As String() = Split(sTest, " ")

    Debug.Print($"BRNumber = {sParts(0)}")
    Debug.Print($"English Name  = {sParts(1)}")
    Debug.Print($"Chinese name  = {sParts(2)}")
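
Since the question says that words in EnglishName are themselves separated by spaces, a plain split on spaces returns more than three parts for such entries. A small sketch of one way around that for the first field, using the count argument of String.Split so that only the BRNumber is cut off and the remainder keeps its inner spaces (the module name and the sample line are just for illustration):

```vb
Imports System

Module SplitLimitSketch

    Sub Main()
        Dim sTest As String = "C1234567 20 Hello Co.Ltd 20 你好有限公司"

        ' Limit the split to two parts: the text before the first space,
        ' and everything after it (inner spaces are preserved).
        Dim sParts As String() = sTest.Split(New Char() {" "c}, 2)

        Console.WriteLine("BRNumber = " & sParts(0))
        If sParts.Length = 2 Then
            Console.WriteLine("Remainder = " & sParts(1))
        End If
    End Sub

End Module
```

The remainder still has to be divided into EnglishName and ChineseName by some other rule, for example by locating the first Chinese character.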



 
0
AlexRivax On

If you want to use only RegEx then something like this should work:

(\d{8})\s((\d*)[A-Za-z-'&.,\s()/\d]{1,45})(?(?=\s\d{8})()|((?:\3)(?:(?:\s?)(?:[\u4e00-\u9fff()()]+))))

You just need to adjust the value {1,45} to limit the maximum length of the EnglishName.

It is basically doing the work in four groups:

  • Group 1 - BRNumber (Limited to 8 digits)
  • Group 2 - EnglishName (Limited to 1 to 45 characters)
  • Group 3 - Used to extract the numbers of the EnglishName
  • Group 4 - ChineseName, using the result of Group 3 to also include the number. (Cases without ChineseName will not show anything under this group.)
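
To read captured groups from VB.NET, Regex.Match and the Groups collection can be used. The sketch below deliberately uses a simplified pattern with named groups rather than the pattern above, so the group names br, en and zh and the pattern itself are assumptions for illustration only. It assumes a single-line entry, an 8-character BRNumber made of letters and digits, and a ChineseName that starts with optional digits followed by a Chinese character.

```vb
Imports System
Imports System.Text.RegularExpressions

Module NamedGroupSketch

    ' Illustrative pattern (an assumption, not the pattern from this answer):
    '   br - 8 letters or digits
    '   en - as little text as possible before the Chinese part
    '   zh - optional: optional leading digits, a Chinese character, then the rest
    Private Const Pattern As String =
        "^(?<br>[A-Za-z0-9]{8})\s+(?<en>.*?)\s*(?<zh>(?:\d+\s*)?[\u4e00-\u9fff].*)?$"

    Sub Main()
        Dim line As String = "C1234567 20 Hello Co.Ltd 20 你好有限公司"
        Dim m As Match = Regex.Match(line, Pattern)

        If m.Success Then
            ' A group that did not take part in the match has an empty Value.
            Console.WriteLine("BRNumber: " & m.Groups("br").Value)
            Console.WriteLine("EnglishName: " & m.Groups("en").Value)
            Console.WriteLine("ChineseName: " & m.Groups("zh").Value)
        Else
            Console.WriteLine("No match")
        End If
    End Sub

End Module
```

Named groups avoid having to count parentheses when the pattern changes. A ChineseName that begins with a bracket or another special character would need the zh part of the pattern to be extended.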