How to parse values that appear after the same string in Python?


I have an input text like this (the actual text file also contains tons of garbage characters surrounding these two strings):

(random_garbage_char_here)**value=xxx**;(random_garbage_char_here)**value=yyy**;(random_garbage_char_here)

I am trying to parse the text and store something like this: value1="xxx" and value2="yyy". I wrote the following Python code:

value1_start = content.find('value')
value1_end = content.find(';', value1_start)

value2_start = content.find('value')
value2_end = content.find(';', value2_start)


print "%s" %(content[value1_start:value1_end])
print "%s" %(content[value2_start:value2_end])

But it always returns:

value=xxx
value=xxx

Could anyone tell me how I can parse the text so that the output is:

value=xxx
value=yyy

There are 4 best solutions below

5
On BEST ANSWER

Use a regex approach:

re.findall(r'\bvalue=[^;]*', s)

Or, if the key before = can be any sequence of one or more word characters (letters, digits, underscores):

re.findall(r'\b\w+=[^;]*', s)


Details:

  • \b - word boundary
  • value= - a literal char sequence value=
  • [^;]* - zero or more chars other than ;.

See the Python demo:

import re
rx = re.compile(r"\bvalue=[^;]*")
s = "$%$%&^(&value=xxx;$%^$%^$&^%^*value=yyy;%$#^%"
res = rx.findall(s)
print(res)  # ['value=xxx', 'value=yyy']
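
If only the part after = is needed (the question asks for xxx and yyy), a capture group is one way to get it. This is a minimal sketch reusing the sample string from the demo above; the names rx_values and values are illustrative only:

```python
import re

# Same sample string as in the demo above.
s = "$%$%&^(&value=xxx;$%^$%^$&^%^*value=yyy;%$#^%"

# With exactly one capture group, findall returns only the captured text.
rx_values = re.compile(r"\bvalue=([^;]*)")
values = rx_values.findall(s)
print(values)  # ['xxx', 'yyy']
```

The values can then be unpacked, for example value1, value2 = values, provided exactly two matches are expected.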
0
On

You already have good answers based on the re module. That would certainly be the simplest way.

If for any reason (performance?) you prefer to use str methods, that is possible too. But you must start the search for the second string past the end of the first one:

value2_start = content.find('value', value1_end)
value2_end = content.find(';', value2_start)
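
The same idea extends to any number of occurrences by restarting each search where the previous match ended. This is a minimal sketch, assuming every value is terminated by ';'; the helper name find_all_values and the sample string are hypothetical:

```python
def find_all_values(content, key="value", terminator=";"):
    """Collect every 'key...' slice up to the terminator using only str.find."""
    results = []
    pos = 0
    while True:
        start = content.find(key, pos)
        if start == -1:
            break  # no further occurrence of the key
        # Look for the terminator only after the key itself.
        end = content.find(terminator, start + len(key))
        if end == -1:
            end = len(content)  # unterminated last value: take the rest
        results.append(content[start:end])
        pos = end  # resume the search past this match
    return results


content = "$%&value=xxx;^%$value=yyy;#@!"
print(find_all_values(content))  # ['value=xxx', 'value=yyy']
```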
1
On

For this input:

content = '(random_garbage_char_here)**value=xxx**;(random_garbage_char_here)**value=yyy**;(random_garbage_char_here)'

use a simple regex and manually strip off the first and last two characters:

import re

values = [x[2:-2] for x in re.findall(r'\*\*value=.*?\*\*', content)]
for value in values:
    print(value)

Output:

value=xxx
value=yyy

Here the assumption is that there are always two leading and two trailing * as in **value=xxx**.
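
The slicing step can also be avoided by moving the ** markers outside a capture group. This is a sketch under the same assumption (exactly two asterisks on each side), using the same sample content:

```python
import re

content = ('(random_garbage_char_here)**value=xxx**;'
           '(random_garbage_char_here)**value=yyy**;'
           '(random_garbage_char_here)')

# The group captures only the text between the ** markers.
values = re.findall(r'\*\*(value=.*?)\*\*', content)
print(values)  # ['value=xxx', 'value=yyy']
```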

0
On

Use regex to filter the data you want from the "junk characters":

>>> import re
>>> _input = '#@%value=xxx;^&^*(^%$value=yyy#%$#^&*^%;$#%$#^'
>>> matches = re.findall(r'[a-zA-Z0-9]+=[a-zA-Z0-9]+', _input)
>>> matches
['value=xxx', 'value=yyy']
>>> for match in matches:
...     print(match)
...
value=xxx
value=yyy

Summary of the regular expression:

  • [a-zA-Z0-9]+: One or more alphanumeric characters
  • =: literal equal sign
  • [a-zA-Z0-9]+: One or more alphanumeric characters
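
One thing to check before relying on this pattern: the character classes also match alphanumeric junk that directly touches the key or the value. The sketch below uses a made-up input (the string noisy is an assumption, not taken from the question) and compares it with a stricter variant that assumes the key is literally value and the value ends at ';':

```python
import re

# Hypothetical input where a junk digit sits right before the first key.
noisy = '%$3value=yyy;#value=xxx;'

# The alphanumeric classes absorb the adjacent '3' into the key.
loose = re.findall(r'[a-zA-Z0-9]+=[a-zA-Z0-9]+', noisy)
print(loose)  # ['3value=yyy', 'value=xxx']

# Matching the literal key and stopping at ';' avoids that.
strict = re.findall(r'value=[^;]*', noisy)
print(strict)  # ['value=yyy', 'value=xxx']
```

Neither pattern can separate a value from alphanumeric junk that follows it without a delimiter, so a known terminator such as ';' is what makes the match reliable.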