Capturing string parts in RegEx


I would like to capture different parts of a string. Some parts are optional, and some are always present. I'm using Calibre's built-in function (based on Python regex), but the question is general: how can I do this with a regex?

Sample strings:

!!Mixed Fortunes - An Economic History of China Russia and the West 0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf
!Mixed Fortunes - An Economic History of China Russia and the West 0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf
Mixed Fortunes - An Economic History of China Russia and the West 0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf
!!Mixed Fortunes - An Economic History of China Russia and the West by Vladimir Popov (Jun 17, 2014 4_1).pdf
!!Mixed Fortunes - An Economic History of China Russia and the West by 1 Vladimir Popov (Jun 17, 2014 4_1).pdf

The strings have the following structure:

[importance markings if any, it can be '!' or '!!'][title][ISBN-10 if available]by[author]([publication date and other metadata]).[file type]

In the end I came up with this regular expression, but it is not perfect: if an ISBN is present, the title group also contains the ISBN.

(?P<title>[A-Za-z0-9].+(?P<isbn>[0-9]{10})|([A-Za-z0-9].*))\sby\s.*?(?P<author>[A-Z0-9].*)(?=\s\()
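The problem can be reproduced outside Calibre. This is a minimal sketch, assuming Python's standard `re` module behaves like Calibre's engine for this pattern; the names `PATTERN` and `sample` are only for illustration.

```python
import re

# The pattern from the question, unchanged, split over two lines for readability.
PATTERN = re.compile(
    r"(?P<title>[A-Za-z0-9].+(?P<isbn>[0-9]{10})|([A-Za-z0-9].*))"
    r"\sby\s.*?(?P<author>[A-Z0-9].*)(?=\s\()"
)

# First sample string from the question.
sample = (
    "!!Mixed Fortunes - An Economic History of China Russia and the West "
    "0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf"
)

m = PATTERN.search(sample)
print(m.group("title"))   # the title still ends with the ISBN digits
print(m.group("isbn"))
print(m.group("author"))
```

The isbn group is nested inside the title group, so whatever isbn captures is necessarily part of title as well.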

Here is my sandbox: https://regex101.com/r/K2FzpH/1

I really appreciate any help!

Best answer

Instead of using an alternation, you could use:

^!*(?P<title>[A-Za-z0-9].+?)(?:\s+(?P<isbn>[0-9]{10}))?\s+by\s+(?P<author>[A-Z0-9][^(]+)(?=\s\()
  • ^ Start of the string
  • !* Match optional exclamation marks
  • (?P<title>[A-Za-z0-9].+?) Named group title: match one char from the ranges in the character class, followed by 1 or more chars, as few as possible
  • (?:\s+(?P<isbn>[0-9]{10}))? Optionally match 1+ whitespace chars and named group isbn, which matches 10 digits
  • \s+by\s+ Match by between 1 or more whitespace chars
  • (?P<author>[A-Z0-9][^(]+) Named group author: match either A-Z or 0-9, followed by 1+ times any char except (
  • (?=\s\() Positive lookahead to assert a whitespace char followed by ( directly to the right
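
To check the pattern against the sample strings from the question, here is a small sketch using Python's standard `re` module. It assumes Calibre's engine treats this pattern the same way; the helper name `parse` and the list `samples` are hypothetical and only for illustration.

```python
import re

# The pattern from this answer, split over three lines for readability.
PATTERN = re.compile(
    r"^!*(?P<title>[A-Za-z0-9].+?)"
    r"(?:\s+(?P<isbn>[0-9]{10}))?"
    r"\s+by\s+(?P<author>[A-Z0-9][^(]+)(?=\s\()"
)

def parse(name):
    """Return the named groups as a dict, or None if the name does not match."""
    m = PATTERN.match(name)
    return m.groupdict() if m else None

# The five sample strings from the question.
samples = [
    "!!Mixed Fortunes - An Economic History of China Russia and the West 0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf",
    "!Mixed Fortunes - An Economic History of China Russia and the West 0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf",
    "Mixed Fortunes - An Economic History of China Russia and the West 0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf",
    "!!Mixed Fortunes - An Economic History of China Russia and the West by Vladimir Popov (Jun 17, 2014 4_1).pdf",
    "!!Mixed Fortunes - An Economic History of China Russia and the West by 1 Vladimir Popov (Jun 17, 2014 4_1).pdf",
]

for s in samples:
    print(parse(s))
```

When the ISBN is missing, the isbn group is `None` rather than an empty string. Because the title is matched lazily, a title that itself contains " by " would be cut at the first occurrence, so this pattern suits names like the samples rather than arbitrary titles.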
