VB how to extract BRNumber, EnglishName, ChineseName from a string that is separated by space?

Question

78 Views Asked by user8314628 At 20 March 2024 at 16:52

Gazette Document ("The Government of the Hong Kong Special Administrative Region Gazette")

I'd like to extract the company information from the above PDF document. Due to company restrictions, I can only use UiPath's ReadPDF activity to extract the text from the PDF. I have already removed the header and footer, leaving only the company entries in the body. Each company entry is structured as follows:

BRNumber EnglishName ChineseName

The BRNumber is an 8-character combination of letters and digits. EnglishName can be a combination of letters and special characters. ChineseName can be a combination of Chinese characters and special characters.

When either EnglishName or ChineseName is too long, it wraps onto a second line.

Either EnglishName or ChineseName can be empty.

BRNumber, EnglishName, ChineseName are separated by space. Words in EnglishName are also separated by space.

How can I extract BRNumber, EnglishName and ChineseName?

I tried to split a single line with this regex:

    ([A-Za-z0-9\-]*)\s*(([A-Za-z0-9-'&.,\s()/]*)\s*([A-Za-z0-9-'&.,\s()/]*)\s*([A-Za-z0-9-'&.,\s()/]*))\s*([\d\u4e00-\u9fff-\s()（）]*)

But when the ChineseName does not start with a Chinese character, the result is incorrect.

For example,

C1234567 | 20 Hello Co.Ltd | 20 你好有限公司

will become

C1234567 | 20 Hello Co.Ltd 20 | 你好有限公司

The bars are only there to show the field boundaries more clearly; they are not part of the actual text.

**Idle_Mind** · Answer 1 · 2024-03-20T17:35:33.087000

Instead of RegEx, just use plain old string manipulation...

Search from the front for the first space to get BRNumber. Search from the back for the last space to get ChineseName. Everything in-between must be EnglishName. Add an extra check for a multi-line string to handle the edge case.

Something like:

Public Class DataValues

    Public BRNumber As String
    Public EnglishName As String
    Public ChineseName As String

    Public Shared Function GetDataValues(ByVal data As String) As DataValues
        Dim dv As New DataValues
        ' Locate the delimiters: first space, last space, and an optional line break.
        Dim firstSpace As Integer = data.IndexOf(" ")
        Dim lastSpace As Integer = data.LastIndexOf(" ")
        Dim newLine As Integer = data.IndexOf(Environment.NewLine)
        If newLine <> -1 AndAlso firstSpace <> -1 AndAlso firstSpace < newLine Then
            ' Multi-line entry: ChineseName is everything after the line break.
            dv.BRNumber = data.Substring(0, firstSpace)
            dv.EnglishName = data.Substring(firstSpace + 1, newLine - firstSpace - 1)
            dv.ChineseName = data.Substring(newLine + Environment.NewLine.Length)
        ElseIf firstSpace <> -1 AndAlso lastSpace <> -1 AndAlso firstSpace <> lastSpace Then
            ' Single-line entry: ChineseName is everything after the last space.
            dv.BRNumber = data.Substring(0, firstSpace)
            dv.EnglishName = data.Substring(firstSpace + 1, lastSpace - firstSpace - 1)
            dv.ChineseName = data.Substring(lastSpace + 1)
        End If
        Return dv
    End Function

    Public Overrides Function ToString() As String
        Return "BRNumber: " & BRNumber & Environment.NewLine &
            "EnglishName: " & EnglishName & Environment.NewLine &
            "ChineseName: " & ChineseName & Environment.NewLine
    End Function

End Class

Here's the class being used in a Button click event:

Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
    Dim data As String = "C1234567 20 Hello Co.Ltd 20 你好有限公司"
    Dim data2 As String = "C1234567 20 Hello Co.Ltd 20" & Environment.NewLine & "你好有限公司"

    Dim dv1 As DataValues = DataValues.GetDataValues(data)
    Debug.Print(dv1.ToString())

    Dim dv2 As DataValues = DataValues.GetDataValues(data2)
    Debug.Print(dv2.ToString())
End Sub

Here's the output from the IDE:

BRNumber: C1234567
EnglishName: 20 Hello Co.Ltd 20
ChineseName: 你好有限公司

BRNumber: C1234567
EnglishName: 20 Hello Co.Ltd 20
ChineseName: 你好有限公司
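
Note that in the output above the second 20 lands in EnglishName, whereas the question expects it at the start of ChineseName. Below is a minimal sketch of one way to extend the string-manipulation approach to cover that case. It rests on an assumption that the source document does not confirm: ChineseName begins either with a Chinese character in the range U+4E00 to U+9FFF or with the same number that opens EnglishName. The names `NameSplitSketch`, `IsCjk` and `SplitEntry` are hypothetical.

```vb
Imports System
Imports System.Text.RegularExpressions

Module NameSplitSketch

    ' True for characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF).
    Private Function IsCjk(c As Char) As Boolean
        Dim code As Integer = AscW(c)
        Return code >= &H4E00 AndAlso code <= &H9FFF
    End Function

    ' Returns {BRNumber, EnglishName, ChineseName} for one single-line entry.
    Public Function SplitEntry(line As String) As String()
        Dim trimmed As String = line.Trim()
        Dim firstSpace As Integer = trimmed.IndexOf(" "c)
        If firstSpace = -1 Then Return {trimmed, "", ""}

        Dim brNumber As String = trimmed.Substring(0, firstSpace)
        Dim rest As String = trimmed.Substring(firstSpace + 1).Trim()

        ' Find the first Chinese character in the remainder.
        Dim cut As Integer = -1
        For i As Integer = 0 To rest.Length - 1
            If IsCjk(rest(i)) Then
                cut = i
                Exit For
            End If
        Next
        If cut = -1 Then Return {brNumber, rest, ""}

        Dim english As String = rest.Substring(0, cut).TrimEnd()
        Dim chinese As String = rest.Substring(cut)

        ' Assumed heuristic: if EnglishName opens with a number and the token just
        ' before the Chinese text is that same number, move it to ChineseName.
        Dim tokens As String() = english.Split(" "c)
        Dim lead As String = tokens(0)
        If tokens.Length > 1 AndAlso Regex.IsMatch(lead, "^\d+$") AndAlso tokens(tokens.Length - 1) = lead Then
            english = String.Join(" ", tokens, 0, tokens.Length - 1)
            chinese = lead & " " & chinese
        End If

        Return {brNumber, english, chinese}
    End Function

End Module
```

With the sample line, `SplitEntry("C1234567 20 Hello Co.Ltd 20 你好有限公司")` yields `C1234567`, `20 Hello Co.Ltd` and `20 你好有限公司`. A ChineseName that opens with some other non-Chinese character, such as a bracket, would still be cut in the wrong place, so the heuristic needs checking against the real entries.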

**Albert D. Kallal** · Answer 2 · 2024-03-20T18:16:46.603000

Assuming this string:

    Dim sTest As String = "BRNumber EnglishName ChineseName"
    Dim sParts As String() = Split(sTest, " ")

    Debug.Print($"BRNumber = {sParts(0)}")
    Debug.Print($"English Name  = {sParts(1)}")
    Debug.Print($"Chinese name  = {sParts(2)}")
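
Splitting on every space only works when each name is a single word, but the question states that words in EnglishName are also separated by spaces. Below is a minimal sketch of one way to adapt the split: treat the first token as BRNumber, the last token as ChineseName, and join everything in between as EnglishName. It assumes both names are present and that ChineseName contains no spaces. `SplitSketch` and `SplitFields` are hypothetical names.

```vb
Imports System

Module SplitSketch

    ' Returns {BRNumber, EnglishName, ChineseName}, or Nothing when the line has
    ' fewer than three tokens (the sketch cannot tell which name is missing).
    Public Function SplitFields(line As String) As String()
        Dim parts As String() = line.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)
        If parts.Length < 3 Then Return Nothing

        ' Re-join the middle tokens so a multi-word EnglishName stays intact.
        Dim english As String = String.Join(" ", parts, 1, parts.Length - 2)
        Return {parts(0), english, parts(parts.Length - 1)}
    End Function

End Module
```

For example, `SplitFields("C1234567 Hello World Co.Ltd 你好有限公司")` yields `C1234567`, `Hello World Co.Ltd` and `你好有限公司`.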

**AlexRivax** · Answer 3 · 2024-03-22T14:42:30.773000

If you want to use only RegEx then something like this should work:

(\d{8})\s((\d*)[A-Za-z-'&.,\s()/\d]{1,45})(?(?=\s\d{8})()|((?:\3)(?:(?:\s?)(?:[\u4e00-\u9fff()（）]+))))

You just need to adjust the value {1,45} to limit the maximum length of the EnglishName.

It is basically doing the work in four groups:

Group 1 - BRNumber (Limited to 8 digits)
Group 2 - EnglishName (Limited to 1 to 45 characters)
Group 3 - Used to extract the numbers of the EnglishName
Group 4 - ChineseName, reusing the result of Group 3 so that the number is included as well. (Cases without a ChineseName will not show anything under this group.)
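
For reference, here is a minimal sketch of how the same idea (capture the number that opens EnglishName and reuse it through a backreference at the start of ChineseName) might look when called from VB.NET with named groups. The pattern is a simplified variant, not the pattern given above. It assumes BRNumber is any 8 letters or digits, as the question describes, and that ChineseName starts with an optional repeat of that number followed by a character in the range U+4E00 to U+9FFF. `RegexSketch`, `EntryPattern` and `Parse` are hypothetical names.

```vb
Imports System
Imports System.Text.RegularExpressions

Module RegexSketch

    ' br: 8 letters/digits. en: optional leading number (num), then non-Chinese text.
    ' zh: optional repeat of num plus a space, then text starting with a Chinese character.
    Private Const EntryPattern As String =
        "^(?<br>[A-Za-z0-9]{8})\s+(?<en>(?<num>\d*)[^\u4e00-\u9fff]*?)\s*" &
        "(?<zh>(?:\k<num>\s)?[\u4e00-\u9fff].*)?$"

    ' Returns {BRNumber, EnglishName, ChineseName}, or Nothing when the line does not match.
    Public Function Parse(line As String) As String()
        Dim m As Match = Regex.Match(line.Trim(), EntryPattern)
        If Not m.Success Then Return Nothing
        Return {m.Groups("br").Value, m.Groups("en").Value, m.Groups("zh").Value}
    End Function

End Module
```

With the sample line from the question, `Parse("C1234567 20 Hello Co.Ltd 20 你好有限公司")` yields `C1234567`, `20 Hello Co.Ltd` and `20 你好有限公司`. When the ChineseName is absent, the `zh` group is simply empty.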
