Consider a number of lists (produced in a loop) whose strings are inconsistent and whose lengths vary. Each list is extracted from the body of an email (.eml) message.
Example list 1
['Request 1',
'String example',
'Service:xyz Request Date Time: 4/7/2022 8:20:54 PMService: Sub Service:']
Example list 2
['Request 2',
'String example 1',
'String example 2',
'Service : xyzabc Requested by : example Request Date : 4/8/2022 7:31:17 AM Service : abcdefg Sub Service : abcdefg Current Owner']
Example list 3
['Request 3',
'string example',
'Service : abcdefg Requested by : example Request Date : Thursday, 7 April 2022, 3:29:55 PM Service : abcdefg Sub Service : abcdefg Current Owner',
'SSC : abcdefg',
'Jam']
The strings need to be parsed and classified into separate DataFrame columns:
- Request
- String example
- Service
- Requested by
- Request Date (and Time)
- Service
- Sub Service
- Current Owner
- SSC
The problem is that the strings follow no exact pattern that could be used as a delimiter to split them.
Here's the code I use to read the email files. The issue is that it produces a nested list because of the if condition.
import re
from email import policy
from email.parser import BytesParser

import pandas as pd

matches = ["Service", "Requested by", "Request Date"]
for file in eml_files:
    # The with block closes the file, so no explicit fp.close() is needed.
    with open(file, 'rb') as fp:
        name = fp.name
        msg = BytesParser(policy=policy.default).parse(fp)
    # preferencelist expects a sequence, hence the one-element tuple.
    text = msg.get_body(preferencelist=('plain',)).get_content()
    file_names.append(name)
    texts.append(text)
    # Split the body into stripped, non-empty lines.
    text = [j.strip() for j in text.split("\n") if j.strip()]
    for idx, te in enumerate(text):
        if any(x in te for x in matches):
            # re.split returns a list, so this element becomes a nested list.
            text[idx] = re.split('Service :|Requested by : |Request Date : |Service : |Sub Service : | Current Owner|SSC : ', te)
    df = pd.DataFrame(text).T
As a general gist:
Due to the nature of your lists you can use the following:
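A minimal sketch of one possible approach, assuming the field labels that appear in the example lists are the only ones that occur and that none of them shows up inside a value. The names `LABELS`, `parse_fields` and `list_to_row` are hypothetical helpers introduced for illustration. Each labelled string is split on the known labels rather than on a fixed delimiter, a repeated label such as the second "Service" gets a numeric suffix, and items without any label are collected as free text.

```python
import re

# Assumed labels, taken from the example lists. Longer labels come first so
# that "Request Date Time" is tried before "Request Date".
LABELS = ["Request Date Time", "Requested by", "Current Owner",
          "Request Date", "Sub Service", "Service", "SSC"]
# One capturing group, so re.split keeps the matched label in its output.
# The colon is optional because "Current Owner" appears without one.
LABEL_RE = re.compile("(" + "|".join(map(re.escape, LABELS)) + r")\s*:?\s*")


def parse_fields(text):
    """Split 'Label : value Label : value ...' into a {label: value} dict."""
    # parts = [leading text, label 1, value 1, label 2, value 2, ...]
    parts = LABEL_RE.split(text)
    fields = {}
    for label, value in zip(parts[1::2], parts[2::2]):
        key, n = label, 2
        while key in fields:  # repeated label, e.g. a second "Service"
            key = f"{label} {n}"
            n += 1
        fields[key] = value.strip()
    return fields


def list_to_row(items):
    """Turn one list of email lines into a flat dict (one future row)."""
    row = {"Request": items[0] if items else ""}
    free_text = []
    for item in items[1:]:
        fields = parse_fields(item)
        if fields:
            row.update(fields)
        else:
            free_text.append(item)  # no known label found in this item
    row["String example"] = " | ".join(free_text)
    return row


lists = [
    ['Request 1', 'String example',
     'Service:xyz Request Date Time: 4/7/2022 8:20:54 PMService: Sub Service:'],
    ['Request 2', 'String example 1', 'String example 2',
     'Service : xyzabc Requested by : example Request Date : 4/8/2022 7:31:17 AM '
     'Service : abcdefg Sub Service : abcdefg Current Owner'],
    ['Request 3', 'string example',
     'Service : abcdefg Requested by : example Request Date : Thursday, 7 April 2022, '
     '3:29:55 PM Service : abcdefg Sub Service : abcdefg Current Owner',
     'SSC : abcdefg', 'Jam'],
]

# The loop: one flat dict per list, regardless of the list's length.
rows = [list_to_row(items) for items in lists]
```

Because every row is a flat dict, `pd.DataFrame(rows)` gives one row per email and leaves fields that are missing from a given email as NaN, which avoids the nested list altogether.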
Although you should be fine with just the loop