How can I split a string into tokens?


If I have a string

'x+13.5*10x-4e1'

how can I split it into the following list of tokens?

['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']

Currently I'm using the shlex module:

import shlex

expr = 'x+13.5*10x-4e1'  # avoid shadowing the built-in str
lexer = shlex.shlex(expr)
tokenList = []
for token in lexer:
    tokenList.append(token)  # shlex already yields strings
print(tokenList)

But this prints:

['x', '+', '13', '.', '5', '*', '10x', '-', '4e1']

So I'm trying to split the letters from the numbers. I'm considering taking the strings that contain both letters and numbers and somehow splitting them, but I'm not sure how to do this, or how to merge the pieces back into the list with the others afterwards. It's important that the tokens stay in order, and I can't have nested lists.
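For reference, that post-processing idea can be sketched like this: split each mixed shlex token at its digit runs with re.split and flatten as you go, so order is preserved and nothing nests (the tokenize name is just for illustration):

import re
import shlex

def tokenize(expr):
    # split each shlex token further at its digit runs,
    # then flatten, so order is preserved without nesting
    out = []
    for token in shlex.shlex(expr):
        out.extend(t for t in re.split(r'(\d+)', token) if t)
    return out

print(tokenize('x+13.5*10x-4e1'))
# ['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']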

In an ideal world, e and E would not be recognised as ordinary letters, so that scientific notation stays intact:

'-4e1'

would become

['-', '4e1']

but

'-4x1'

would become

['-', '4', 'x', '1']

Can anybody help?

3 Answers

BEST ANSWER

Use the regular expression module's split() function to split at

  • '\d+' -- runs of digits, and
  • '\W+' -- runs of non-word characters.

Because the alternation is wrapped in a capturing group, re.split() keeps the separators themselves in the result.

CODE:

import re

# the capturing group keeps the separators; `if i` drops the empty
# strings that re.split() produces between adjacent matches
print([i for i in re.split(r'(\d+|\W+)', 'x+13.5*10x-4e1') if i])

OUTPUT:

['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']

If you don't want the dot split off (so that floating-point numbers in the expression stay intact), use this instead:

  • '[\d.]+' -- runs of digit or dot characters (although this also accepts malformed numbers such as 13.5.5)

CODE:

print([i for i in re.split(r'([\d.]+|\W+)', 'x+13.5*10x-4e1') if i])

OUTPUT:

['x', '+', '13.5', '*', '10', 'x', '-', '4', 'e', '1']
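If the exponent should also stay attached (the '4e1' case from the question), one option is to widen the number branch of the pattern; a sketch using the same technique (this exact pattern is an assumption, not part of the answer above):

CODE:

import re

# number = digits, optional fraction, optional exponent;
# runs of non-word characters are still split points
pattern = r'(\d+(?:\.\d+)?(?:[eE][+-]?\d+)?|\W+)'

print([i for i in re.split(pattern, 'x+13.5*10x-4e1') if i])
print([i for i in re.split(pattern, '-4x1') if i])

OUTPUT:

['x', '+', '13.5', '*', '10', 'x', '-', '4e1']
['-', '4', 'x', '1']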
ANSWER

Well, the problem is not quite as simple as it looks. I think a good way to get a robust (but, unfortunately, not so short) solution is to use Python Lex-Yacc (PLY) to build a full-fledged tokenizer. Lex-Yacc is a common practice (not only in Python), so ready-made grammars for a simple arithmetic tokenizer may already exist (like this one), and you would only have to adapt them to your specific needs.
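For illustration, a minimal sketch of such a tokenizer with PLY (assuming the ply package is installed; the token set and rules below are an assumption, kept just large enough for the example expression):

import ply.lex as lex

# token names (an illustrative set, not a complete grammar)
tokens = ('NUMBER', 'NAME', 'PLUS', 'MINUS', 'TIMES')

# one-character operators
t_PLUS  = r'\+'
t_MINUS = r'-'
t_TIMES = r'\*'

# a single letter is a NAME, so '10x' splits into '10' and 'x'
t_NAME = r'[a-zA-Z]'

# function rules are matched before string rules, so a full number
# (with optional fraction and exponent) wins over a bare NAME
def t_NUMBER(t):
    r'\d+(\.\d+)?([eE][+-]?\d+)?'
    return t

t_ignore = ' \t'

def t_error(t):
    raise SyntaxError('Illegal character %r' % t.value[0])

lexer = lex.lex()
lexer.input('x+13.5*10x-4e1')
print([tok.value for tok in lexer])
# ['x', '+', '13.5', '*', '10', 'x', '-', '4e1']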

ANSWER

Another alternative, not suggested here yet, is the nltk.tokenize module.
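For example, a sketch with nltk's RegexpTokenizer (assuming nltk is installed; the pattern below is just one way to reproduce the output asked for):

from nltk.tokenize import RegexpTokenizer

# tokens are runs of digits, runs of letters, or any other
# single non-whitespace character
tokenizer = RegexpTokenizer(r'\d+|[a-zA-Z]+|\S')
print(tokenizer.tokenize('x+13.5*10x-4e1'))
# ['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']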