Regular Expression to Extract Text Bounded by '/'

I need a regular expression to extract names from a GEDCOM file. The format is:

Fred Joseph /Smith/

Here the text bounded by the / characters is the surname, and Fred Joseph are the forenames. The complication is that the surname can appear anywhere in the text, or may be missing altogether. I need something that extracts the surname and captures everything else as the forenames.

This is as far as I have got. I have tried making groups optional with the ? quantifier, but to no avail:

What I have so far

As you can see it has several problems: If the surname is missing nothing gets captured, the forename(s) sometimes have leading and trailing spaces, and I have 3 capture groups when I'd really like 2. Even better would be if the capture group for the surname didn't include the '/' characters.

Any help would be much appreciated.

There are 5 best solutions below

1
On BEST ANSWER

Regarding your last point, I'm not sure there is a way to join group 1 and group 3 into a single group.

Here is my proposed solution. It doesn't capture spaces around forenames.

^(?:\h*([a-z\h]+\b)\h*)?(?:\/([a-z\h]+)\/)?(?:\h*([a-z\h]+\b)\h*)?$

To match the names correctly, use the case-insensitive flag; if you test all lines at once, use the multiline flag as well.

See the demo

Explanation

  • ^ start of the line
  • (?:\h*([a-z\h]+\b)\h*)? first non-capturing group that matches 0 or 1 time:
    • \h* 0 or more horizontal spaces
    • ([a-z\h]+\b) captures in a group letters and spaces, but stops at the end of the last word
    • \h* matches the possible remaining spaces without capturing
  • (?:\/([a-z\h]+)\/)? second non-capturing group that matches 0 or 1 time a name in a capturing group surrounded by slashes
  • (?:\h*([a-z\h]+\b)\h*)? third non-capturing group doing the same as first one, capturing the names in a third group.
  • $ end of the line
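
As a rough illustration of the pattern above, here is a minimal Python sketch. Python's `re` module has no `\h`, so the sketch substitutes `[ \t]` (my assumption about the closest equivalent), and the helper name `split_name` is made up for this example. Groups 1 and 3 are joined in code, since the regex alone does not merge them.

```python
import re

# \h (horizontal whitespace) is PCRE syntax; [ \t] stands in for it here.
NAME_RE = re.compile(
    r"^(?:[ \t]*([a-z \t]+\b)[ \t]*)?"   # group 1: forenames before the surname
    r"(?:/([a-z \t]+)/)?"                # group 2: surname without the slashes
    r"(?:[ \t]*([a-z \t]+\b)[ \t]*)?$",  # group 3: forenames after the surname
    re.IGNORECASE,
)

def split_name(line):
    """Return (forenames, surname); surname is None when it is absent."""
    m = NAME_RE.match(line)
    if m is None:
        return None
    before, surname, after = m.groups()
    # Join whichever forename groups actually matched.
    forenames = " ".join(part for part in (before, after) if part)
    return forenames, surname

print(split_name("Fred Joseph /Smith/"))  # ('Fred Joseph', 'Smith')
print(split_name("Fred /Smith/ Joseph"))  # ('Fred Joseph', 'Smith')
print(split_name("Fred Joseph"))          # ('Fred Joseph', None)
```

The sketch matches one line at a time, so the multiline flag is not needed here.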
1
On

For your requirements

([A-z a-z /])+\w*

Sample

4
On

I am not sure I follow what language is being used to extract the data, but based on what you have so far, you simply need to add '?':

(.*)(\/?.*\/?)(.*)

Note that this does not give you a group for EACH name, as some solutions will have multiple names in a single group.
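
To see how the three groups divide the text, it may help to run the pattern on the sample line. The snippet below is only a quick check in Python (my choice of language, not something the question specifies). Because the first `.*` is greedy and everything after it is allowed to match nothing, the first group appears to take the whole line.

```python
import re

# Same pattern as above; "/" needs no escaping in a Python raw string.
m = re.match(r"(.*)(/?.*/?)(.*)", "Fred Joseph /Smith/")
groups = m.groups()
print(groups)  # ('Fred Joseph /Smith/', '', '')
```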

Edit:

Extending Niitaku's solution, and looking at having each individual name in its own group, you could use:

^\s*(?:\/?([a-z]+)\/?)\s*(?:\/?([a-z]+)\/?)\s*(?:\/?([a-z]+)\/?)\s*$
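
A small Python check of the three-group pattern above, assuming the case-insensitive flag is set. It is only a sketch, and the variable names are invented for illustration. The pattern needs three names to fill its three groups, so a two-name line is still matched but split in an odd place, and nothing records which name carried the slashes.

```python
import re

EACH_NAME_RE = re.compile(
    r"^\s*(?:/?([a-z]+)/?)\s*(?:/?([a-z]+)/?)\s*(?:/?([a-z]+)/?)\s*$",
    re.IGNORECASE,
)

three = EACH_NAME_RE.match("Fred Joseph /Smith/").groups()
print(three)  # ('Fred', 'Joseph', 'Smith')

# With only two names the engine backtracks to fill the third group.
two = EACH_NAME_RE.match("Fred Smith").groups()
print(two)  # ('Fred', 'Smit', 'h')
```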

As explained though, if using a language like ruby it would simply be:

ruby -pe '$_ = $_.scan(/\w+/)' file
1
On

Hope this helps: (.*?)\/(.*?)\/(.*)
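
For reference, a quick Python check of this pattern, written as a raw string. This is only a sketch and the name `LAZY_RE` is made up here. It suggests that the forenames keep their trailing space and that a line without a surname does not match at all, because both slashes are required.

```python
import re

LAZY_RE = re.compile(r"(.*?)/(.*?)/(.*)")

with_surname = LAZY_RE.match("Fred Joseph /Smith/")
print(with_surname.groups())  # ('Fred Joseph ', 'Smith', '')

without_surname = LAZY_RE.match("Fred Joseph")
print(without_surname)  # None
```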

0
On

Try this: ^([^/\n]*)(/[^/\n]+/)?([^/\n]*)$

This matches the following:

  • ^ start of string (or with multiline modifier start of line)
  • ([^/\n]*) anything other than / or new line zero or more times - this is captured as group 1
  • (/[^/\n]+/)? a single / followed by one or more non / or new line characters, then a single '/' character - this is captured as group 2, and is optional
  • ([^/\n]*) anything other than / or new line zero or more times - this is captured as group 3
  • $ end of string (or with multiline modifier end of line)

You can see in action with your example text here: https://regex101.com/r/9kmKpy/1

To avoid capturing the slashes, make the second set of brackets non-capturing by adding ?: and put another capturing pair between the slashes: ^([^\/\n]*)(?:\/([^\/\n]+)\/)?([^\/\n]*)$

https://regex101.com/r/9kmKpy/2
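
If the regex is used from code, the remaining wishes in the question (two results instead of three groups, no stray spaces) can be handled after the match. Below is a minimal Python sketch under that assumption; the function name `parse_names` and the sample lines are invented for illustration and do not come from a real GEDCOM file.

```python
import re

# Non-capturing variant: group 2 holds the surname without the slashes.
GEDCOM_NAME_RE = re.compile(r"^([^/\n]*)(?:/([^/\n]+)/)?([^/\n]*)$")

def parse_names(text):
    """Yield (forenames, surname) for each non-blank line of text."""
    for line in text.splitlines():
        m = GEDCOM_NAME_RE.match(line)
        if not line.strip() or m is None:
            continue  # skip blank lines and lines with unbalanced slashes
        before, surname, after = m.groups()
        # Merge groups 1 and 3 and collapse surplus whitespace.
        forenames = " ".join((before + " " + after).split())
        yield forenames, surname

sample = "Fred Joseph /Smith/\n/Jones/ Mary\nAnn /Lee/ Marie\nJohn Paul"
result = list(parse_names(sample))
print(result)
# [('Fred Joseph', 'Smith'), ('Mary', 'Jones'), ('Ann Marie', 'Lee'), ('John Paul', None)]
```

Matching line by line avoids the multiline flag, and a missing surname comes back as `None` rather than failing the match.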