Regex to Extract #hashtags from MMD metadata in Python


I'm trying to extract all the #hashtags from the "Tags: #tag1 #tag2" line of a multimarkdown plaintext file. (I'm in Python multiline mode.)

I've tried using lookaheads:

^(?=Tags:\s.*)#(\w+)\b

and lookbehinds:

#(\w+)\b(?<=Tags:^\s)

Plain vanilla #(\w+)\b works, except it picks up any #hashtag that might appear later in the document.

Any hints, help, instruction appreciated.


There are 2 best solutions below

On BEST ANSWER
import re

text = "\n\n#bogus\nTags: #foo #bar\n"

First, you need to get the line:

line = re.findall(r'Tags:.+\n', text)
# line = ['Tags: #foo #bar\n']

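Then, extract the tags from that line. Use a capture group if you want the names without the leading #, or drop the group to keep it:

# without the leading '#'
tags = re.findall(r'#(\w+)', line[0])
# tags = ['foo', 'bar']

# with the leading '#'
tags = re.findall(r'#\w+', line[0])
# tags = ['#foo', '#bar']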
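
The two steps can also be wrapped in one helper. This is only a sketch: the function name extract_tags is made up for illustration, and it assumes you want the first "Tags:" line and an empty list when there is none. It uses re.MULTILINE so that ^ and $ match at line boundaries, which means the Tags line does not have to end with a newline.

```python
import re

def extract_tags(text):
    """Return hashtag names (without '#') from the first 'Tags:' line, or []."""
    # With re.MULTILINE, ^ and $ match at the start and end of each line,
    # so the Tags line can sit anywhere in the document.
    match = re.search(r'^Tags:(.*)$', text, re.MULTILINE)
    if match is None:
        return []
    # Only the captured remainder of the Tags line is searched,
    # so hashtags elsewhere in the document are ignored.
    return re.findall(r'#(\w+)', match.group(1))

print(extract_tags("\n\n#bogus\nTags: #foo #bar\n"))  # ['foo', 'bar']
print(extract_tags("no metadata here #bogus"))        # []
```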

A lookbehind won't work here: Python's re module only accepts fixed-width lookbehind patterns, and the text between "Tags:" and any given hashtag has a variable length.
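
To see the restriction for yourself, here is a small check. The helper name compiles is made up for this example; it just reports whether re.compile accepts a pattern.

```python
import re

def compiles(pattern):
    """Return True if the re module accepts the pattern, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Variable-width lookbehind (.* can match any length): rejected by re.
print(compiles(r'(?<=Tags:.*)#(\w+)'))  # False

# Fixed-width lookbehind compiles, but it only reaches a hashtag that
# directly follows "Tags: ", so later tags on the line are missed.
print(compiles(r'(?<=Tags: )#(\w+)'))   # True
print(re.findall(r'(?<=Tags: )#(\w+)', "Tags: #foo #bar"))  # ['foo']
```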

On

First find the index of the first hash in the input text, then run re.findall on the rest of the line to collect every tag. The following example prints ['#tag1', '#tag2']:

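import re

text = "Tags: #tag1 #tag2"

# Match from "Tags" up to, but not including, the first '#'.
matched = re.search(r'^Tags([^#]+)', text)
if matched:
    # Everything from the first '#' onward.
    tag_text = text[matched.end():]
    hash_tags = re.findall(r'(#(?:[^#\s]+(?:\s*?)))', tag_text)
    print(hash_tags)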