Pandas and Top2Vec data frame


Create a pandas table (DataFrame) with a row for each topic (cluster). Add the following columns for each topic:

  1. 3 columns containing the 3 words most similar to the topic
  2. 3 columns containing the 3 documents most similar to the topic
  3. 3 columns containing the similarity scores between the 3 documents from item 2 and the topic

Hint: one way to make a DataFrame is to first build a two-dimensional Python list and then construct the DataFrame from that list (see the sketch below).
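For reference, a minimal sketch of that hint with made-up values (the rows here are purely illustrative):

import pandas as pd

# Each inner list is one row of the table; the column names are supplied separately
rows = [
    [0, 'yankees', 'phillies', 'playoffs'],
    [1, 'spacecraft', 'aerospace', 'satellites'],
]
df = pd.DataFrame(rows, columns=['Topic', 'Word1', 'Word2', 'Word3'])
print(df)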

This is the idea, but it does not work:

import pandas as pd
data = []
for topic_id in range(model.get_num_topics()):
    # Get the top 3 words for the topic
    topic_words = model.topic_words[topic_id][:3]

    # Get the top 3 similar documents for the topic
    doc_indices = model.topic_doc_indices[topic_id][:3]
    similar_docs = [facts_list[idx] for idx in doc_indices]

    # Get the similarity scores between the top 3 documents and the topic
    similarity_scores = model.get_document_topic_similarity(doc_indices, topic_id)

    # Append the information for the current topic to the data list
    data.append([topic_id, topic_words, similar_docs, similarity_scores])

columns = ['Topic', 'Top 3 Words', 'Top 3 Similar Docs', 'Similarity Scores']

df = pd.DataFrame(data, columns=columns)

print(df)

There are 2 best solutions below

Corralien

There is insufficient information to correctly answer the question. IIUC, this is what I would do:

import pandas as pd
import numpy as np

# Setup of Top2Vec model
...

data = []
for topic_id in range(model.get_num_topics()):
    # Get the top 3 words for the topic
    topic_words = model.topic_words[topic_id][:3]

    # Get the top 3 documents for the topic: with return_documents=False,
    # search_documents_by_topic returns the similarity scores and the document ids
    doc_scores, doc_ids = \
        model.search_documents_by_topic(topic_id, num_docs=3, return_documents=False)

    # Append the information for the current topic to the data list
    data.append(np.hstack([topic_id, topic_words, doc_ids, doc_scores]))
    
columns = ['Topic', 'Word1', 'Word2', 'Word3', 'Doc1',
           'Doc2', 'Doc3', 'Score1', 'Score2', 'Score3']

df = pd.DataFrame(data, columns=columns)

print(df)

Output:

    Topic       Word1      Word2        Word3   Doc1   Doc2   Doc3      Score1      Score2      Score3
0       0     yankees   phillies     playoffs  10990  12698   6046    0.728086   0.7234148  0.72068757
1       1         dsl      sorry           hi   1889   6381  15574   0.5963211   0.5942726  0.58399546
2       2  spacecraft  aerospace   satellites   5822  16510   5788   0.7434824   0.7259543  0.71986336
3       3  encryption    encrypt    encrypted   7749   3850   2499    0.818774  0.81523967  0.81074286
4       4    firearms    firearm    massacres  14366   1118  14164   0.8006699  0.78988576   0.7890597
..    ...         ...        ...          ...    ...    ...    ...         ...         ...         ...
99     99         bob         or      yankees  14386   2498  10527  0.90170467  0.89703965   0.8905804
100   100        lens     camera  photography   9028   1055   3492   0.7746622    0.767372  0.75008094
101   101     candida      yeast    infection   8308   2840  15472   0.9097394   0.8660926  0.86603004
102   102      comics       hulk    wolverine   4725    739  13109  0.93123806    0.929493  0.92893505
103   103    abortion     murder    homicides   2299  15462  12252  0.78680325   0.7657954   0.7650268

[104 rows x 10 columns]
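If the Doc columns should contain the document text rather than the document ids, the same call can also return the stored documents. A sketch of that variation (assuming the model keeps its documents, which is the default):

for topic_id in range(model.get_num_topics()):
    # Get the top 3 words for the topic
    topic_words = model.topic_words[topic_id][:3]

    # With return_documents=True the call also yields the stored document text,
    # returned as (documents, document_scores, document_ids)
    documents, doc_scores, doc_ids = \
        model.search_documents_by_topic(topic_id, num_docs=3, return_documents=True)

    data.append(np.hstack([topic_id, topic_words, documents, doc_scores]))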
Sierra Garcia

To create a pandas DataFrame with the specified structure, you can modify your code as follows:

import pandas as pd

data = []
for topic_id in range(model.get_num_topics()):
    # Get the top 3 words for the topic
    topic_words = model.topic_words[topic_id][:3]

    # Get the top 3 similar documents for the topic
    doc_indices = model.topic_doc_indices[topic_id][:3]
    similar_docs = [facts_list[idx] for idx in doc_indices]

    # Get the similarity scores between the top 3 documents and the topic
    similarity_scores = model.get_document_topic_similarity(doc_indices, topic_id)

    # Append the information for the current topic to the data list
    data.append([topic_id, topic_words, similar_docs, similarity_scores])

# Flatten each topic's nested lists (words, documents, scores) into one 10-element row
reshaped_data = [[row[0], *row[1], *row[2], *row[3]] for row in data]

# Create the DataFrame
columns = ['Topic', 'Word1', 'Word2', 'Word3', 'Doc1', 'Doc2', 'Doc3', 'Score1', 'Score2', 'Score3']
df = pd.DataFrame(reshaped_data, columns=columns)

print(df)

This flattens each topic's nested lists into a single row before creating the DataFrame. The resulting DataFrame has one row per topic, with columns for the topic id, the top 3 words, the top 3 similar documents, and their similarity scores, as specified.
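For clarity, here is the flattening pattern in isolation, applied to a single dummy row shaped like one entry of data:

# Dummy row in the same shape as one entry of `data`:
# [topic_id, [3 words], [3 doc indices], [3 scores]]
row = [0, ['yankees', 'phillies', 'playoffs'], [10990, 12698, 6046], [0.73, 0.72, 0.72]]

# Unpacking the nested lists gives one flat 10-element row
flat_row = [row[0], *row[1], *row[2], *row[3]]
# -> [0, 'yankees', 'phillies', 'playoffs', 10990, 12698, 6046, 0.73, 0.72, 0.72]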