I am trying to analyse some data, but my data contains letters which require standardising. What I would like to be able to do is, for every datatable in the data (this csv data contains 3 datatables) replace the letter T or any other letter for that matter with the next highest integer for that table. The first table contains no errors, the second table contains 1 T and the third contains 2 x t's.
DatatableA,1
DatatableA,2
DatatableA,3
DatatableA,4
DatatableA,5
DatatableB,1
DatatableB,6
DatatableB,T
DatatableB,3
DatatableB,4
DatatableB,5
DatatableB,2
DatatableC,3
DatatableC,4
DatatableC,2
DatatableC,1
DatatableC,Q
DatatableC,5
DatatableC,T
I am expecting this to be a relatively easy thing to code, however whilst I know how to replace all T's
with a number, within a particular column or a particular row, I do not know how to replace each T with a different number depending on the Datatable it is in. Essentially I am looking to produce the following from the above:
DatatableA,1
DatatableA,2
DatatableA,3
DatatableA,4
DatatableA,5
DatatableB,1
DatatableB,6
DatatableB,7
DatatableB,3
DatatableB,4
DatatableB,5
DatatableB,2
DatatableC,3
DatatableC,4
DatatableC,2
DatatableC,1
DatatableC,6
DatatableC,5
DatatableC,6
Here nothing happened in DatatableA, DatatableB the only T was replaced with the next highest integer in this case it was replaced with a 7, in DatatableC there was two anomalous data points which were both replaced with the next highest integer, which was a 6.
If anyone can point me in the right direction or provide a snippet of something, It would be greatly appreciated. As always constructive comments are also appreciated.
Edit in reply to elyase
I attempted to run the code:
import pandas as pd
df = pd.read_csv('test.csv', sep=',', header=None, names=['datatable', 'col'])
def replace_letter(group):
letters = group.isin(['T', 'Q']) # select letters
group[letters] = int(group[~letters].max()) + 1 # replace by next max
return group
df['col'] = df.groupby('datatable').transform(replace_letter)
print df
and i received the traceback:
Traceback (most recent call last):
File "C:/test.py", line 11, in <module>
df['col'] = df.groupby('datatable').transform(replace_letter)
File "C:\Python27\lib\site-packages\pandas\core\groupby.py", line 1981, in transform
res = path(group)
File "C:\Python27\lib\site-packages\pandas\core\groupby.py", line 2006, in <lambda>
slow_path = lambda group: group.apply(lambda x: func(x, *args, **kwargs), axis=self.axis)
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 4416, in apply
return self._apply_standard(f, axis)
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 4491, in _apply_standard
raise e
ValueError: ("invalid literal for int() with base 10: 'col'", u'occurred at index col')
Is there something I have used in correctly, I could use AEAs answer, but I have been meaning to use pandas more, as the library seems so useful for data manipulations.
Pandas is ideal for this kind of tasks:
Read your csv:
Group, select and replace max:
Write to csv: