Does plotting with Koalas using TopN have any statistical meaning?


I was going through the source code of Koalas, trying to get a handle on how it actually plots large datasets. It turns out that it uses either sampling or TopN, i.e. selecting a fixed number of records.

I understand what sampling means, and internally Koalas uses spark.DataFrame.sample to do it. For TopN, however, it simply takes the first max_rows + 1 records from the Koalas DataFrame using data = data.head(max_rows + 1).to_pandas().

This seems strange to me, and I wonder whether selecting the data this way correctly reflects the statistical properties of the dataset.

Koalas DataFrame's plot accessor:

class KoalasPlotAccessor(PandasObject):
    pandas_plot_data_map = {
        "pie": TopNPlotBase().get_top_n,
        "bar": TopNPlotBase().get_top_n,
        "barh": TopNPlotBase().get_top_n,
        "scatter": SampledPlotBase().get_sampled,
        "area": SampledPlotBase().get_sampled,
        "line": SampledPlotBase().get_sampled,
    }
    _backends = {}  # type: ignore
    
    ...

class TopNPlotBase:
    def get_top_n(self, data):
        from databricks.koalas import DataFrame, Series

        max_rows = get_option("plotting.max_rows")
        # Simply use the first 1k elements and make it into a pandas dataframe
        # For categorical variables, it is likely called from df.x.value_counts().plot.xxx().
        if isinstance(data, (Series, DataFrame)):
            data = data.head(max_rows + 1).to_pandas()
        ...
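
To make the concern concrete, here is a minimal sketch using plain pandas rather than Koalas. The data, the column name x, and the local max_rows variable are all made up for illustration; max_rows only stands in for the "plotting.max_rows" option. It compares taking the first rows with taking a random sample, and then looks at the value_counts() case that the source comment mentions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: sorted values, so the first rows are NOT representative.
df = pd.DataFrame({"x": np.sort(rng.normal(size=10_000))})

max_rows = 1000  # assumption: stands in for get_option("plotting.max_rows")

# "TopN"-style selection: just the first rows, in whatever order they arrive.
head_part = df.head(max_rows)
# Sampling-style selection: rows drawn uniformly at random.
sampled_part = df.sample(n=max_rows, random_state=0)

full_mean = df["x"].mean()
head_mean = head_part["x"].mean()
sample_mean = sampled_part["x"].mean()

print(f"full mean:   {full_mean:.3f}")
print(f"head mean:   {head_mean:.3f}")
print(f"sample mean: {sample_mean:.3f}")

# The value_counts() case: the result is sorted by count in descending order,
# so head(n) returns the n most frequent categories.
cats = pd.Series(
    rng.choice(list("abcde"), size=10_000, p=[0.4, 0.3, 0.15, 0.1, 0.05])
)
counts = cats.value_counts()
top3 = counts.head(3)
print(top3)
```

On raw, ordered data the head-based mean is far from the full mean, while the sampled mean stays close to it. On the output of value_counts(), though, head(n) picks the n largest counts, which might be the case the bar and pie plots are designed for. I am not sure whether that is the intended reading of the Koalas code.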