I am trying to replicate bug-detection software built for JavaScript files so that I can use it to find bugs in Python files. The process involves finding the start and end positions of a token as offsets within the document. Below is the output of running the acorn JS parser on a .js file:

In the image above, the start and end locations of a token are character offsets into the entire document. I have checked Python's tokenizer, which only gives values equivalent to the loc.start and loc.end values in the picture above. But how can I get start and end values for Python's tokens, just like in the acorn output?
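For example, here is a minimal illustration of what Python's tokenize module reports (using an in-memory source string for simplicity):

```python
import io
import tokenize

src = "x = 1\n"
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tok.start, tok.end, tok.string)
# Each start/end is a (line, column) pair, e.g. (1, 0) (1, 1) for 'x';
# there is no single absolute offset like acorn's start/end numbers.
```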
In principle, all you need in order to convert line-number/offset pairs into byte offsets into the document is a list of the starting byte offset of each line. So one simple way to do this would be to accumulate that information as the file is read. That's reasonably simple, since you can give tokenize your own function which returns input lines. So you can collect a mapping from line number to file position, and then wrap tokenize in a function which uses that mapping to add start and end indices.

In the following example, I use file.tell to extract the current file position. But that won't work if the input is not a seekable file; in that case, you would need to come up with some alternative, such as keeping track of the number of bytes returned [Note 1]. Depending on what you need the indices for, that might or might not be important: if you only need unique numbers, for example, it would be sufficient to keep a running total of the string lengths of each line.
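A minimal sketch of such a wrapper, assuming a seekable text-mode input whose tell() reports usable positions (the name tokenize_with_offsets and the io.StringIO demo are my own choices):

```python
import io
import tokenize

def tokenize_with_offsets(file):
    """Yield (token, start, end) where start/end are positions in `file`,
    computed from file.tell() at the start of each line plus the column."""
    line_starts = {}   # line number -> position at which that line starts
    lineno = 0

    def readline():
        nonlocal lineno
        lineno += 1
        line_starts[lineno] = file.tell()   # position before the line is read
        return file.readline()

    for token in tokenize.generate_tokens(readline):
        (srow, scol), (erow, ecol) = token.start, token.end
        # Columns count characters; for ASCII input, characters == bytes.
        yield token, line_starts[srow] + scol, line_starts[erow] + ecol

# Demo with an in-memory file, where tell() reports character offsets.
src = "x = 1\ny = 2\n"
tokens = list(tokenize_with_offsets(io.StringIO(src)))
for tok, start, end in tokens:
    assert src[start:end] == tok.string
```

With a StringIO the positions are character offsets; with a real text file they come from the file object's tell(), subject to the character-versus-byte caveats in Note 1.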
Note 1: The line returned by readline is a str, not a bytes object, so its length is measured in characters rather than bytes. Furthermore, on platforms (such as Windows) where the end of line is not a single character, the substitution of the end of line with \n means that the number of characters read does not correspond to the number of characters in the file.
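For a non-seekable input, the running-total alternative mentioned above could be sketched like this (the name tokenize_with_char_offsets is my own; per Note 1, the resulting offsets count characters, not bytes):

```python
import io
import tokenize

def tokenize_with_char_offsets(readline):
    """Works on any readline callable (no tell() needed): keep a running
    total of the lengths of the lines handed to tokenize. Offsets are in
    characters, not bytes (see Note 1)."""
    line_starts = {}
    total = 0
    lineno = 0

    def counting_readline():
        nonlocal total, lineno
        lineno += 1
        line_starts[lineno] = total   # character offset where this line starts
        line = readline()
        total += len(line)
        return line

    for token in tokenize.generate_tokens(counting_readline):
        (srow, scol), (erow, ecol) = token.start, token.end
        yield token, line_starts[srow] + scol, line_starts[erow] + ecol

src = "a = 'bc'\n"
out = list(tokenize_with_char_offsets(io.StringIO(src).readline))
for tok, start, end in out:
    assert src[start:end] == tok.string
```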