This pattern
/^(.*?)\b((?:[Vv][ao]n|(?:[Dd][eu]\s+)?[Ll]a|[Dd][eu]|St\.|Le|Auf\s+der)\s+\p{L}+\.?)(.*)/gum
parses name tokens.
I had help deriving this pattern (ECMAScript Flavor) and have made small adjustments, but I'm stuck on the third name token in the test string.
Van H. Manning properly parses to Van H. Manning (just use trim() to remove extra space)
Lionel Van Deerlin properly parses to Lionel Van Deerlin
But Van Taylor does not parse to Van Taylor
Can this pattern be adjusted to properly parse Van Taylor along with the other instances of Van?
I'm still working out how this pattern works and how to understand this particular regex wizardry.
TIA
** Update **
Fool's errand though it may be, I am doing the best possible version of a parse.
Per the comments, Van H. Manning is distinct because Van is a first name whereas Van Deerlin is a surname.
Similarly to Van H. Manning, Van Taylor consists of Van as a first name and Taylor as a surname.
I can see that part of the logic is that Van occurring at the beginning of the string distinguishes a first name from a surname. However, the pattern is already grouping Van \w+ properly, so it seems like only a small adjustment is needed.
As far as Van H. Manning being parsed as Van H. Manning, I am using a conditional to handle that. It's beyond me on how to regex that one with everything else and I've already asked for a lot of heavy lifting here.
I think it will get rather complicated to handle all cases because as everybody pointed out, you'll probably get the first name in front or behind the surname (last name or family name). In some countries I even think that your last name can come from your parent's first name, so imagine how complicated it can get to try and detect the order.
But if you want to stick to a regular expression, you could just use your assumption that if Van is at the beginning of the string, then it's the first name. In this case, just add two alternatives to your regular expression and capture the parts in several groups. I've named them for easier access, compared to indexed groups. You'll then have to add some logic to see which group is filled or empty. I also used the i flag for case-insensitive matching instead of handling it with [Dd].
I personally think that having several regular expressions, or trying to find a library to handle that for you, might be a better idea, especially if you also know the origin of the person, which could help you apply specific rules by region of the planet.
The PCRE regex :
The JavaScript version to enhance :
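As a rough illustration of the idea (a minimal sketch, not the exact pattern from this answer), the approach could look something like this in JavaScript. The group names (vanFirst, vanRest, before, surname, after) and the parseName helper are made up for this example, and it assumes that a leading Van is always a first name, which will be wrong for some real names.

```javascript
// Surname particles from the question's pattern, lower-cased because the
// regex is compiled with the i flag.
const PARTICLE = String.raw`(?:v[ao]n|(?:d[eu]\s+)?la|d[eu]|st\.|le|auf\s+der)`;

const NAME_RE = new RegExp(
  // Alternative 1: "Van" opens the string, so it is assumed to be a first name.
  String.raw`^(?<vanFirst>van)\s+(?<vanRest>.+)$` +
    '|' +
    // Alternative 2: a particle appears later and starts the surname.
    String.raw`^(?<before>.*?)\b(?<surname>${PARTICLE}\s+\p{L}+\.?)(?<after>.*)$`,
  'iu' // i: case-insensitive, u: required for \p{L}
);

// Hypothetical helper: checks which named groups were filled and
// returns { first, last }, or null when neither alternative matches.
function parseName(input) {
  const match = NAME_RE.exec(input.trim());
  if (match === null) return null;
  const g = match.groups;
  if (g.vanFirst !== undefined) {
    // Middle initials are not split out; they stay in "last".
    return { first: g.vanFirst, last: g.vanRest.trim() };
  }
  return { first: g.before.trim(), last: (g.surname + g.after).trim() };
}

for (const name of ['Van H. Manning', 'Lionel Van Deerlin', 'Van Taylor']) {
  console.log(name, '->', parseName(name));
}
```

With these assumptions, Van Taylor comes back as first "Van" and last "Taylor", Van H. Manning as first "Van" and last "H. Manning", and Lionel Van Deerlin as first "Lionel" and last "Van Deerlin". A name with no particle at all returns null, so the caller still needs a fallback for those.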