How to count characters across strings in a pandas column

1.1k Views Asked by At

I have a dataframe with the following structure:

prod_sec     
A    
AA    
AAAAAAAAAAB    
AAAABCCCAA    
AACC   
ABCCCBAC

df = pd.DataFrame({'prod_sec': ['A','AA','AAAAAAAAAAB','AAAABCCCAA','AACC','ABCCCBAC']})

Each string is a sequence made up of letters (A to C in this example).
I would like to create a list for each letter that counts the occurrences in each position down the entire pandas column.

For example in the first string A is only in the first position/index and it's not in the other locations.
In the second string the A in the first two positions and it's not in the other locations
In the third string the A has all the positions until the last one. Etc...
I want a total count, for the column, by position.
Here is an example for A:

A            ->     [1,0,0,0,0,0,0,0,0,0,0]    
AA                  [1,1,0,0,0,0,0,0,0,0,0]
AAAAAAAAAAB   ->    [1,1,1,1,1,1,1,1,1,1,0] 
AAAABCCCAA          [1,1,1,1,0,0,0,0,0,0,1]
AACC                [1,1,0,0,0,0,0,0,0,0,0]
ABCCCBAC    ->      [1,0,0,0,0,0,1,0,0,0,0]

so for A, I would want an output similar to the following... A [6,4,2,2,1,1,2,1,1,1,0]
In the end, I'm trying to get a matrix with a row for each character.

                    [6,4,2,2,1,1,2,1,1,1,0]
                    [0,1,0,0,1,1,0,0,0,0,1]
                    [0,0,1,1,0,1,2,0,0,0,0]
1

There are 1 best solutions below

4
On BEST ANSWER

The following should work. You can adjust the result, depending on your exact needs (numpy array, data frame, dictionary, etc). Tell me if you need more help with that.

max_length=max([len(i) for i in df.prod_sec])

d={'A':[0]*max_length, 'B':[0]*max_length, 'C':[0]*max_length}

for i in df.prod_sec:
    for k in range(len(i)):
        d[i[k]][k]+=1

result=pd.DataFrame.from_dict(d, orient='index')