I'm trying to use upsetplot to find the intersections between the columns of a DataFrame. I am using code adapted from an example provided by the library's developers:
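import pandas as pd
import matplotlib.pyplot as plt
from upsetplot import from_indicators, plot

# `data` is the DataFrame shown further down; a cell counts as a member when it is not NaN
plot(from_indicators(indicators=pd.notna, data=data), show_counts=True)
plt.show()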
The code above gives me a plot with the number of non-empty (not NaN) cells in each column of the DataFrame. But instead of counting with notna, I would like code that counts the "core" items, i.e. the items shared between columns.
For this DataFrame (numbers changed to letters for the example), my code above would give me:
        column_1  column_2  column_3  column_4  column_5
row_1   A         A         NaN       A         NaN
row_2   B         NaN       B         B         NaN
row_3   NaN       NaN       C         NaN       NaN
row_4   D         D         NaN       D         NaN
row_5   E         NaN       E         NaN       NaN
row_6   NaN       NaN       NaN       NaN       F
...a graph sort of like this:
column_1 : **** (4 not_empty)
column_3, column_4 : *** (3 not_empty)
column_2 : ** (2 not_empty)
column_5 : * (1 not_empty)
But actually what I want is a graph with some information like this:
column_1, column_2, column_4 : ** (A, D in_common)
column_1, column_3, column_4 : * (B in_common)
column_1, column_3 : * (E in_common)
column_5 : - (F not_in_common)
Does anyone have an idea how to replace "pd.notna" with code that delivers what I'm looking for? Thanks in advance!
The UpSet plot already shows both of those graphs. The totals bar chart (one bar per column) is the former, and the intersection/subset bar chart (one bar per combination of columns) is the latter.
See https://gist.github.com/jnothman/0fc6daf3d9d75513dd3311e86e06cc8c
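To see what the intersection bars count without drawing anything, here is a minimal pandas-only sketch. It assumes the empty cells in the question's table are NaN and that every row has at least one value; the names `present` and `groups` are only illustrative. It groups rows by the combination of columns in which they are non-null, which is the same membership that `from_indicators(pd.notna, data=data)` derives.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

# Example data from the question; empty cells are assumed to be NaN.
data = pd.DataFrame(
    {
        "column_1": ["A", "B", np.nan, "D", "E", np.nan],
        "column_2": ["A", np.nan, np.nan, "D", np.nan, np.nan],
        "column_3": [np.nan, "B", "C", np.nan, "E", np.nan],
        "column_4": ["A", "B", np.nan, "D", np.nan, np.nan],
        "column_5": [np.nan, np.nan, np.nan, np.nan, np.nan, "F"],
    },
    index=[f"row_{i}" for i in range(1, 7)],
)

# Boolean membership matrix: True where a row has a value in a column.
present = data.notna()

# Map each combination of columns to the items (rows) found in exactly
# that combination. Assumes every row has at least one non-null cell.
groups = defaultdict(list)
for values, mask in zip(data.to_numpy(), present.to_numpy()):
    combination = tuple(present.columns[mask])
    groups[combination].append(values[mask][0])

# Print the largest intersections first.
for combination, items in sorted(groups.items(), key=lambda kv: -len(kv[1])):
    print(", ".join(combination), ":", len(items), items)
```

Each printed line corresponds to one bar in the intersection part of the plot. Note that row_3 (C) appears only in column_3, so it forms its own combination alongside the ones listed in the question.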