Extract multiple strings between brackets or two different delimiters

94 Views Asked by At

I'm trying to extract data between these two delimiters [,]
for two cases: Beginning with #CB or CR:

Input data:

Scenario,Order_ID,Review
1,1234,<Order No. 10><Remark>Food was good</Remark><UserID>7890</UserID><Filter>[#CB:Customer happy with service]</Filter><Rating>5</Rating>
2,1235,<Rating>Five</Rating><Order No. 17><UserID>7880</UserID><Filter>[#CB:Customer had a good time]</Filter><Remark>Food was good</Remark><Additional>Service up to par</Additional><Remark>[#CB:Customer will return again]</Remark><End>Thats all</End>
3,1236,<Start>We shall begin</Start><Intro>[CR:Fine Dining Experience]</Intro><Rating>Three</Rating><Order No. 12><Filter>[#CB:Food was well prepared]</Filter><Remark>Food was good</Remark><Additional>Ambiance was lively</Additional><Remark>[CR:Hygine was maintained]</Remark><End>Could be better</End><UserID>7880</UserID>

Data wanted:

Scenario,Order_ID,Review1,Review2,Review3,Review4
1,1234,#CB:Customer happy with service,,,
2,1235,#CB:Customer had a good time,#CB:Customer will return again,,
3,1236,CR:Fine Dining Experience,#CB:Food was well prepared,CR:Hygine was maintained,
2

There are 2 best solutions below

0
Ken Myers On BEST ANSWER

Building on the other answer, this is how you could get it in the format you asked for.

import re
import pandas as pd

p = re.compile(r"\[(.*?)\]")

s1 = "<Order No. 10>Food was good7890[#CB:Customer happy with service]Five"
s2 = " Five<Order No. 17>7880[#CB:Customer had a good time]Food was goodService up to par[#CB:Customer will return again]Thats all"
print(p.findall(s1))
print(p.findall(s2))

d = pd.DataFrame({'scenario':[1,2], 'order_id':['1234', '1235'], 'reviews':[s1,s2] })

def padList(l, length=4):
  l = l[:length]
  l += [''] * (length - len(l))
  return l

d[['review1','review2','review3','review4']] = d.apply(lambda row: padList(p.findall(row['reviews'])), axis=1, result_type='expand')

d
0
cpix On

It appears that the data is in XML format (although the <Order No. 10>. makes it invalid; I've swapped it with <Order>. You can iterate through each value and check whether they begin with #CB. Alternatively you can perform a regex search as shown below. However if you are confident that the data always follows the XML format, the former method would be more suitable.

import re

text = ...

# Extract between [#CB: and ]
cb_data = re.findall(r"\[#CB:(.*?)\]", text)
print(cb_data) # prints out CB

# Extract between [CR: and ]
cr_data = re.findall(r"\[CR:(.*?)\]", text)
print(cr_data) # prints out CR

If your data is always in XML:

import xml.etree.ElementTree as ET

xml_str1 = '<Order><Remark>Food was good</Remark><UserID>7890</UserID><Filter>[#CB:Customer happy with service]</Filter><Rating>Five</Rating></Order>'
xml_str2 = '<Order><UserID>7880</UserID><Filter>[#CB:Customer had a good time</Filter><Remark>Food was good</Remark><Additional>Service up to par</Additional><Remark>(#CB:Customer will return again]</Remark><End>Thats all</End></Order>'

root1 = ET.fromstring(xml_str1)
root2 = ET.fromstring(xml_str2)

cb_filter1 = root1.find('Filter').text
cb_data1 = cb_filter1[cb_filter1.find('#CB:')+4:cb_filter1.find(']')]

cb_filter2 = root2.find('Filter').text
cb_data2 = cb_filter2[cb_filter2.find('#CB:')+4:cb_filter2.find(']')]

print(cb_data1) # Customer happy with service
print(cb_data2) # Customer had a good time