I have a variable holding lists with a varying number of elements:
['20', 'M', '10', 'M', '1', 'D', '14', 'M', '106', 'M']
['124', 'M', '19', 'M', '7', 'M']
['19', 'M', '131', 'M']
['3', 'M', '19', 'M', '128', 'M']
['12', 'M', '138', 'M']
The elements always alternate between a number and a letter, and the order matters.
I would like to add up the values of consecutive Ms only (i.e. if there is a D, skip the sum), to get:
['30', 'M', '1', 'D', '120', 'M']
['150', 'M']
['150', 'M']
['150', 'M']
['150', 'M']
P.S. The complete story is that I want to convert soft clips to matches in a BAM file, but I got stuck at that step.
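#!/usr/bin/python
import re  # needed for re.findall() below
import sys
import pysam
bamFile = sys.argv[1]
bam = pysam.AlignmentFile(bamFile, 'rb')
for read in bam:
    cigar = read.cigarstring
    # split the CIGAR string into alternating counts and operation letters
    sepa = re.findall(r'(\d+|[A-Za-z]+)', cigar)
    # relabel soft clips (S) as matches (M)
    for i in range(len(sepa)):
        if sepa[i] == 'S':
            sepa[i] = 'M'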
You can slice Python lists using a step (sometimes called a stride); you can use this to get every second element, starting at index 1 (for the first letter).
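As a small illustration of stepped slicing, here is a sketch using the first list from the question; the variable names `values`, `letters` and `numbers` are only illustrative:

```python
# `values` is an illustrative name for one of the lists from the question
values = ['20', 'M', '10', 'M', '1', 'D', '14', 'M', '106', 'M']

letters = values[1::2]  # start at index 1, take every second element
numbers = values[::2]   # start at index 0, take every second element

print(letters)  # ['M', 'M', 'D', 'M', 'M']
print(numbers)  # ['20', '10', '1', '14', '106']
```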
The `[1::2]` syntax means: start at index 1, go on until you run out of elements (nothing entered between the `:` delimiters), and step over the list to return every second value. You can do the same thing for the numbers, using `[::2]`, so begin with the value right at the start and take every other value.
If you then combine this with the `zip()` function you can pair up your numbers and letters to figure out what to sum. A function built this way takes your list of numbers and letters and:
- keeps a running sum of `"M"` values.
- if the letter is `"M"`, adds that value (as an integer) to the running sum.
- if the letter is not `"M"`, adds the running sum (if any) to the output with an `"M"`, then adds the current number and letter too.
- at the end, adds the remaining running sum with an `"M"`, if there is any.
This covers all your example inputs.
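A minimal sketch of the function described above could look like this. The name `sum_m_values` is the one used later in this answer; the accumulator name `m_sum` and the `examples` list are assumptions made for illustration:

```python
def sum_m_values(values):
    summed = []
    m_sum = 0  # running sum of consecutive "M" values (hypothetical name)
    # pair each number (even indices) with its letter (odd indices)
    for number, letter in zip(values[::2], values[1::2]):
        if letter == "M":
            m_sum += int(number)
        else:
            if m_sum:
                # flush the running sum before the non-M pair
                summed += (str(m_sum), "M")
                m_sum = 0
            summed += (number, letter)
    if m_sum:
        # flush whatever is left at the end
        summed += (str(m_sum), "M")
    return summed


examples = [
    ['20', 'M', '10', 'M', '1', 'D', '14', 'M', '106', 'M'],
    ['124', 'M', '19', 'M', '7', 'M'],
    ['19', 'M', '131', 'M'],
    ['3', 'M', '19', 'M', '128', 'M'],
    ['12', 'M', '138', 'M'],
]
for example in examples:
    print(sum_m_values(example))
# ['30', 'M', '1', 'D', '120', 'M']
# ['150', 'M']
# ['150', 'M']
# ['150', 'M']
# ['150', 'M']
```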
There are other methods of looping over a list in fixed-sized groups; you can also create an iterator for the list with `iter()` and then use `zip()` to pull consecutive elements into pairs. This works because `zip()` gets the next element for each value in the pair from the same iterator, so `"30"` first, then `"M"`, etc.
However, for short lists it is perfectly fine to use slicing, as it can be understood more easily.
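To illustrate the iterator approach, here is a short sketch; the names `values`, `it` and `pairs` are illustrative only:

```python
values = ['30', 'M', '1', 'D', '120', 'M']

it = iter(values)
# zip() asks the *same* iterator for both members of each pair,
# so consecutive elements end up grouped two at a time
pairs = list(zip(it, it))

print(pairs)  # [('30', 'M'), ('1', 'D'), ('120', 'M')]
```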
Next, you can make the summing a little easier by using the `itertools.groupby()` function to give you your number + letter pairs as separate groups. That function takes an input sequence, and a function to produce the group identifier. When you then loop over its output you are given that group identifier and an iterator to access the group members (those consecutive elements that have the same group value).
Just pass it the `zip()` iterator built before, and either `lambda pair: pair[1]` or `operator.itemgetter(1)`; the latter is a little faster but does the same thing as the `lambda`: get the letter from the number + letter pair.
With separate groups, the logic starts to look a lot simpler. The output of the function hasn't changed, only the implementation.
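A sketch of what a `groupby()`-based version might look like, assuming the same `sum_m_values` name and the iterator pairing shown earlier; the local names `it`, `group` and `total` are illustrative:

```python
from itertools import groupby
from operator import itemgetter


def sum_m_values(values):
    summed = []
    it = iter(values)
    # group consecutive (number, letter) pairs by their letter
    for letter, group in groupby(zip(it, it), key=itemgetter(1)):
        if letter == "M":
            # collapse a run of M pairs into a single summed pair
            total = sum(int(number) for number, _ in group)
            summed += (str(total), "M")
        else:
            # pass any other pairs through unchanged
            summed += (value for pair in group for value in pair)
    return summed


print(sum_m_values(['20', 'M', '10', 'M', '1', 'D', '14', 'M', '106', 'M']))
# ['30', 'M', '1', 'D', '120', 'M']
```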
Finally, we could turn the function into a generator function, by replacing the `summed += ...` statements with `yield from ...`, so it'll still generate a sequence of numeric strings and letters. You can then use `list(sum_m_values(...))` to get a list again, or just use the generator as-is. For long inputs, that could be the preferred option as that means you never need to keep everything in memory all at once.
If you can guarantee that only numbers with `M` repeat (so a `D` pair is always followed by an `M` pair or is the last pair in the sequence), you can even just drop the `if` test and just always sum. This works because there will only ever be one `number` value per `D` group, so summing won't make that into a different number.
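Both variants can be sketched as follows. The generator keeps the `sum_m_values` name from the text; `sum_values_always` is a hypothetical name for the simplified version, which relies on the stated assumption that only `M` pairs ever repeat:

```python
from itertools import groupby
from operator import itemgetter


def sum_m_values(values):
    """Generator version: yields numeric strings and letters one by one."""
    it = iter(values)
    for letter, group in groupby(zip(it, it), key=itemgetter(1)):
        if letter == "M":
            yield from (str(sum(int(number) for number, _ in group)), "M")
        else:
            yield from (value for pair in group for value in pair)


def sum_values_always(values):
    """Simplified version (hypothetical name): no `if` test, always sums.

    Only valid if non-M pairs never occur twice in a row.
    """
    it = iter(values)
    for letter, group in groupby(zip(it, it), key=itemgetter(1)):
        yield from (str(sum(int(number) for number, _ in group)), letter)


example = ['20', 'M', '10', 'M', '1', 'D', '14', 'M', '106', 'M']
print(list(sum_m_values(example)))       # ['30', 'M', '1', 'D', '120', 'M']
print(list(sum_values_always(example)))  # ['30', 'M', '1', 'D', '120', 'M']
```

Note that the simplified version passes each number through `int()` and back to `str()`, so it also merges any repeated non-`M` pairs if the assumption does not hold.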