I need to a regular expression to extract names from a GEDCOM file. The format is:
Fred Joseph /Smith/
Where the text bounded by the / is the surname and the Fred Joseph are the forenames. The complication is that the surname could be at any place in the text or may not be there at all. I need something that will extract the surname and capture everything else as the forenames.
This is as far as I have got and I have tried making groups optional with the ? qualifier but to no avail:
As you can see it has several problems: If the surname is missing nothing gets captured, the forename(s) sometimes have leading and trailing spaces, and I have 3 capture groups when I'd really like 2. Even better would be if the capture group for the surname didn't include the '/' characters.
Any help would be much appreciated.
For your last line, I'm not sure there is a way to join the group 1 with group 3 into a single group.
Here is my proposed solution. It doesn't capture spaces around forenames.
To correctly match the names, care to use the insensitive flag, and if you test all lines at once, use multiline flag.
See the demo
Explanation
^
start of the line(?:\h*([a-z\h]+\b)\h*)?
first non-capturing group that matches 0 or 1 time:\h*
0 or more horizontal spaces([a-z\h]+\b)
captures in a group letters and spaces, but stops at the end of the last word\h*
matches the possible remaining spaces without capturing(?:\/([a-z\h]+)\/)?
second non-capturing group that matches 0 or 1 time a name in a capturing group surrounded by slashes(?:\h*([a-z\h]+\b)\h*)?
third non-capturing group doing the same as first one, capturing the names in a third group.$
end of the line