Concatenating DataFrames throws InvalidIndexError

I'm extracting information from URLs (RSS feeds) to build one large DataFrame with all the data I need for sentiment analysis. I wrote a function that takes each URL from a dictionary, parses the feed, and puts the entries into a DataFrame. After 5 iterations, I get the error:

InvalidIndexError: Reindexing only valid with uniquely valued Index objects.

I'm using a dictionary like {'name': 'url'} with the code below:

import feedparser
import pandas as pd

def extract_content(urls):
    df_final = pd.DataFrame()

    for url in urls.values():
        xml = feedparser.parse(url)
        entries = xml['entries']
        df = pd.DataFrame(entries)
        
        if 'media_content' in df.columns:
            df.rename(columns = {'media_content': 'content'}, inplace = True)

        if 'content' not in df.columns:
            df.rename(columns={'summary': 'content'}, inplace=True)

        df = df[['title', 'link', 'published', 'published_parsed', 'content']]
        df_final = pd.concat([df_final, df]).reset_index(drop = True)

    return df_final

How can I fix it?

I tried reset_index(), but it still doesn't work.

Answered by nemo

Possible duplicate column name

I think the error comes from a duplicate column name. For example, the following code reproduces it:

import pandas as pd

df_final = pd.DataFrame({'A': [1, 2], 'B': [3,4]})
df = pd.DataFrame({'A': [1, 2], 'B': [5,5], 'C': [5, 6]})
df.rename(columns = {'C': 'A'}, inplace=True)
df = df[['A', 'B']]
df_final = pd.concat([df_final, df]).reset_index(drop = True)
df_final

In this code, I first rename column 'C' to 'A' in the DataFrame df. rename() does not raise an error even though a column named 'A' already exists, so df ends up with two columns labelled 'A', and df[['A', 'B']] selects both of them. The error 'InvalidIndexError: Reindexing only valid with uniquely valued Index objects' is raised only later, during concatenation, because pd.concat has to align the column labels of the two frames and the labels of df are not unique. This is also why reset_index() does not help: it only resets the row index, while the duplicates are in the column labels.

I think this is what happens in your case when you rename 'media_content' to 'content'. Your code does not check whether a column named 'content' already exists in df. If it does, the rename creates a second 'content' column and the concatenation fails with the reported error.
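
To confirm that duplicated column labels are the cause, it may help to inspect df.columns right before the concatenation. The snippet below is a minimal sketch on the toy frame from the reproduction above, not on real feed data; the variable name dupes is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [5, 5], 'C': [5, 6]})
df = df.rename(columns={'C': 'A'})  # silently creates a second 'A' column

# False as soon as any column label occurs more than once
print(df.columns.is_unique)

# Labels that occur more than once (each listed once per repeat)
dupes = df.columns[df.columns.duplicated()].tolist()
print(dupes)
```

Running the same two checks on df inside the loop of extract_content should show which feed introduces the second 'content' column.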

I see two possible solutions here:

Solution 1

Remove the duplicated column before concatenation:

df_final = pd.DataFrame({'A': [1, 2], 'B': [3,4]})
df = pd.DataFrame({'A': [1, 2], 'B': [5,5], 'C': [5, 6]})
df.rename(columns = {'C': 'A'}, inplace=True)
df = df[['A', 'B']]
df = df.loc[:,~df.columns.duplicated()]
df_final = pd.concat([df_final, df]).reset_index(drop = True)
df_final

This runs without error and produces the expected output (only the first column named 'A' is kept):

    A   B
0   1   3
1   2   4
2   1   5
3   2   5
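
Which of the duplicated columns survives depends on the keep argument of duplicated(). The sketch below builds a small frame that already has two 'A' columns (an assumption made to keep the example short) and compares the default with keep='last'; the names first and last are only illustrative.

```python
import pandas as pd

# Toy frame with two columns labelled 'A'
df = pd.DataFrame([[1, 5, 5], [2, 5, 6]], columns=['A', 'B', 'A'])

# Default keep='first': the left-most 'A' is kept
first = df.loc[:, ~df.columns.duplicated()]

# keep='last': the right-most 'A' is kept instead
last = df.loc[:, ~df.columns.duplicated(keep='last')]

print(first)
print(last)
```

For the RSS case, this choice decides whether the feed's original 'content' or the renamed 'media_content' ends up in the final DataFrame, so it is worth picking deliberately.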

Solution 2

Rename a column only if the desired name doesn't already exist as a column name in the DataFrame df:

df_final = pd.DataFrame({'A': [1, 2], 'B': [3,4]})
df = pd.DataFrame({'A': [1, 2], 'B': [5,5], 'C': [5, 6]})
if 'A' not in df.columns:
    df.rename(columns = {'C': 'A'}, inplace=True)
df = df[['A', 'B']]
df_final = pd.concat([df_final, df]).reset_index(drop = True)
df_final
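
Applied to the function from the question, Solution 2 could look roughly like the sketch below. The helper normalize_content_column is hypothetical, and the fallback order (an existing 'content' first, then 'media_content', then 'summary') is my assumption about what you want. The two hand-built frames stand in for the parsed feed entries so the example runs without feedparser or network access.

```python
import pandas as pd

def normalize_content_column(df):
    """Return df with a single 'content' column (hypothetical helper)."""
    if 'content' not in df.columns:
        # Assumed fallback order: media_content first, then summary
        for fallback in ('media_content', 'summary'):
            if fallback in df.columns:
                df = df.rename(columns={fallback: 'content'})
                break
    return df

# Hand-built stand-ins for pd.DataFrame(xml['entries'])
feed_a = pd.DataFrame({'title': ['t1'], 'content': ['c1'], 'media_content': ['m1']})
feed_b = pd.DataFrame({'title': ['t2'], 'summary': ['s2']})

wanted = ['title', 'content']
frames = [normalize_content_column(f)[wanted] for f in (feed_a, feed_b)]

# One concat at the end; ignore_index=True renumbers the rows 0..n-1
df_final = pd.concat(frames, ignore_index=True)
print(df_final)
```

Collecting the per-feed frames in a list and concatenating once also replaces the repeated pd.concat(...).reset_index(drop=True) inside the loop.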