I'm sure there must be a quick fix for this, but I can't find an answer with a good explanation. I'm looking to iterate over a dataframe and build a crosstab for each pair of columns with pandas. I have subsetted 2 columns from the original data and removed rows with unsuitable data. With the remaining data I want to do a crosstab, ultimately to build a contingency table for a ChiX test. Here is my code:
my_data = pd.read_csv(DATA_MATRIX, index_col=0) #GET DATA
AM = pd.DataFrame(columns=my_data.columns, index = my_data.columns) #INITIATE DF TO HOLD ChiX-result
for c1 in my_data.columns:
    for c2 in my_data.columns:
        sample_df = pd.DataFrame(my_data, columns=[c1,c2]) #make df to do ChiX on
        sample_df = sample_df[(sample_df[c1] != 0.5) | (sample_df[c2] != 0.5)].dropna() # remove unsuitable rows
        contingency = pd.crosstab(sample_df[c1], sample_df[c2]) ##This doesn't work?
        # DO ChiX AND STORE P-VALUE IN 'AM': CODE STILL TO WRITE
The dataframe contains the values 0.0, 0.5 and 1.0. A value of 0.5 marks missing data, so I am removing these rows before making the contingency table; the remaining values that I wish to build the contingency tables from are all either 0.0 or 1.0. I have checked that the code works up to this point. The error printed to the console is:
ValueError: If using all scalar values, you must pass an index
Can anyone explain why this doesn't work, or help to solve it in any way? Even better, an alternative way to do a ChiX test on the columns would be very helpful. Thanks in advance!
EDIT: example of the structure of the first few rows of sample_df
          col1  col2
sample1      1     1
sample2      1     1
sample3      0     0
sample4      0     0
sample5      0     0
sample6      0     0
sample7      0     0
sample8      0     0
sample9      0     0
sample10     0     0
sample11     0     0
sample12     1     1
A crosstab between two identical entities is meaningless. When both arguments are the same column, pandas raises an error, because it assumes you're passing two different columns from a multi-indexed dataframe with the same name.
In your code, you're iterating over columns in a nested loop, so the situation arises where c1 == c2, and pd.crosstab errors out.
The fix would involve adding an if check and skipping that iteration if the columns are equal. So, you'd do:
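A minimal sketch of that check, not a definitive implementation. The small my_data frame and the contingency_tables dict are hypothetical stand-ins so the snippet runs on its own. The filter here uses & on the assumption that a row should be dropped when either column is 0.5; the | in the question only drops rows where both columns are 0.5.

```python
import pandas as pd

# Hypothetical stand-in for my_data: values are 0.0, 0.5 (missing) or 1.0.
my_data = pd.DataFrame(
    {
        "col1": [1.0, 1.0, 0.0, 0.0, 0.5, 0.0],
        "col2": [1.0, 0.0, 0.0, 0.5, 1.0, 0.0],
        "col3": [0.0, 1.0, 1.0, 0.0, 1.0, 0.5],
    },
    index=[f"sample{i}" for i in range(1, 7)],
)

# Hypothetical container: one contingency table per ordered pair of columns.
contingency_tables = {}

for c1 in my_data.columns:
    for c2 in my_data.columns:
        if c1 == c2:
            continue  # skip the crosstab of a column with itself

        sample_df = my_data[[c1, c2]].dropna()
        # Assumption: drop a row if EITHER column holds the 0.5 marker.
        keep = (sample_df[c1] != 0.5) & (sample_df[c2] != 0.5)
        sample_df = sample_df[keep]

        contingency_tables[(c1, c2)] = pd.crosstab(sample_df[c1], sample_df[c2])

print(contingency_tables[("col1", "col2")])
```

From here, each table could be passed to a chi-squared routine such as scipy.stats.chi2_contingency (if SciPy is available) and the resulting p-value stored in AM.loc[c1, c2].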