ValueError when using pandas' crosstab


I'm sure there must be a quick fix for this, but I can't find an answer with a good explanation. I want to iterate over a dataframe and build a crosstab for each pair of columns with pandas. For each pair I subset the two columns from the original data and remove rows with unsuitable data. From the remaining data I want a crosstab that serves as the contingency table for a chi-squared (ChiX) test. Here is my code:

import pandas as pd

my_data = pd.read_csv(DATA_MATRIX, index_col=0)  # load the data
AM = pd.DataFrame(columns=my_data.columns, index=my_data.columns)  # will hold the ChiX results

for c1 in my_data.columns:
    for c2 in my_data.columns:
        sample_df = pd.DataFrame(my_data, columns=[c1, c2])  # two-column df to run ChiX on
        sample_df = sample_df[(sample_df[c1] != 0.5) | (sample_df[c2] != 0.5)].dropna()  # remove unsuitable rows

        contingency = pd.crosstab(sample_df[c1], sample_df[c2])  # this line raises the ValueError

        # run ChiX and store the p-value in 'AM': code still to write

The dataframe contains the values 0.0, 0.5 and 1.0. A value of 0.5 marks missing data, so I remove those rows before making the contingency table; the remaining values that I want to build the contingency tables from are all either 0.0 or 1.0. I have checked that the code works up to this point. The error printed to the console is:

ValueError: If using all scalar values, you must pass an index

Can anyone explain why this doesn't work, or help me fix it? Even better, an alternative way to do a ChiX test on the columns would be very helpful. Thanks in advance!

EDIT: an example of the structure of the first few rows of sample_df:

              col1  col2
    sample1      1     1
    sample2      1     1
    sample3      0     0
    sample4      0     0
    sample5      0     0
    sample6      0     0
    sample7      0     0
    sample8      0     0
    sample9      0     0
    sample10     0     0
    sample11     0     0
    sample12     1     1

Best answer

A crosstab of a column with itself is meaningless. pandas can report this as:

ValueError: The name col1 occurs multiple times, use a level number

That is, pandas assumes you are passing two different columns that share the same name from a multi-indexed dataframe.

In your code, the nested loop over the columns eventually reaches a pair where c1 == c2, and that is where pd.crosstab errors out.
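To see where the equal pair comes from, here is a small sketch that uses a plain list in place of my_data.columns (the column names are made up for illustration). A nested loop over the same sequence visits every ordered pair, including each column paired with itself.

```python
# hypothetical column names standing in for my_data.columns
columns = ["col1", "col2", "col3"]

# the nested loop from the question visits every ordered pair
pairs = [(c1, c2) for c1 in columns for c2 in columns]

# the pairs where a column meets itself
same = [(c1, c2) for c1, c2 in pairs if c1 == c2]

print(len(pairs))  # 9
print(same)        # [('col1', 'col1'), ('col2', 'col2'), ('col3', 'col3')]
```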


The fix would involve adding an if check and skipping that iteration if the columns are equal. So, you'd do:

for c1 in my_data.columns:
    for c2 in my_data.columns:
        if c1 == c2:
            continue

        ...  # rest of your code
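
For the chi-squared step that the question leaves unwritten, a minimal sketch could look like the following. It assumes SciPy is installed and uses scipy.stats.chi2_contingency; the helper name pairwise_chi2 and the toy data are invented for illustration and are not from the question. It visits each unordered pair once with itertools.combinations, which also sidesteps the c1 == c2 case. It drops a row if either value is 0.5 (using & rather than the | in the question), on the assumption that this is what "removing these rows" was meant to do.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import chi2_contingency


def pairwise_chi2(data, missing=0.5):
    """Return a symmetric DataFrame of chi-squared p-values, one per column pair."""
    cols = data.columns
    pvals = pd.DataFrame(index=cols, columns=cols, dtype=float)  # starts as all NaN

    # combinations() yields each unordered pair once, never a column with itself
    for c1, c2 in combinations(cols, 2):
        pair = data[[c1, c2]].dropna()
        # keep a row only if BOTH values are real observations (assumed intent)
        pair = pair[(pair[c1] != missing) & (pair[c2] != missing)]
        if pair.empty:
            continue  # nothing left to test, leave NaN

        contingency = pd.crosstab(pair[c1], pair[c2])
        _, p, _, _ = chi2_contingency(contingency)
        pvals.loc[c1, c2] = p
        pvals.loc[c2, c1] = p

    return pvals


# toy data, invented for illustration only
toy = pd.DataFrame(
    {
        "col1": [1.0, 1.0, 0.0, 0.0, 0.0, 0.5, 1.0, 0.0],
        "col2": [1.0, 1.0, 0.0, 0.0, 0.5, 0.0, 1.0, 0.0],
        "col3": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    },
    index=[f"sample{i}" for i in range(1, 9)],
)
result = pairwise_chi2(toy)
print(result)
```

The diagonal stays NaN because a column is never tested against itself. With counts as small as in the toy data the p-values are not meaningful; the data is only there to make the sketch runnable.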