How to display the content of a PySpark Column object

I am just starting to learn PySpark. I have created a Column object, and now I want to see what is in it. Unfortunately, everything I found while researching explains how to access a column of a Spark dataframe. What I want to know is how to see the data in a Column object that I already have. There must be a simple way, but I have not been able to find it.

The code that created the column object:

baskets=groups.agg(pyspark.sql.functions.collect_list("product_id"))['collect_list(product_id)']

I expected something like baskets.show() to work, but that just tells me

TypeError: 'Column' object is not callable

There is 1 best solution below

This creates a dataframe:

import pyspark
baskets=groups.agg(pyspark.sql.functions.collect_list("product_id"))

(Normally, though, we import the functions module under a short alias, which is less verbose.)

from pyspark.sql import functions as F
baskets = groups.agg(F.collect_list("product_id"))

Now baskets is a dataframe, and you can call baskets.show().

In your code, you also appended ['collect_list(product_id)']. That gives you a pyspark.sql.Column object, which is only a reference to the column. Spark has not computed this column, so there is nothing to display. The reference exists so that code can be more readable. None of the methods of the pyspark.sql.Column class displays the column's values. A column "gets" values only when it is displayed as part of a dataframe.

It takes some time to understand how Spark works, because it uses lazy evaluation:

lazy evaluation means that if you tell Spark to operate on a set of data, it listens to what you ask it to do, writes down some shorthand for it so it doesn’t forget, and then does absolutely nothing. It will continue to do nothing, until you ask it for the final answer.

https://developer.hpe.com/blog/the-5-minute-guide-to-understanding-the-significance-of-apache-spark/
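To make the quoted idea concrete, here is a toy pure-Python analogy, not Spark's actual implementation. The LazyNumbers class and the double function are invented for illustration. map() plays the role of a transformation that only records a step, and collect() plays the role of an action that finally runs the recorded steps.

```python
class LazyNumbers:
    """Toy stand-in for a lazily evaluated dataset (not Spark's real design)."""

    def __init__(self, source, plan=()):
        self.source = source
        self.plan = plan  # recorded steps; nothing has been executed yet

    def map(self, fn):
        # "Transformation": only append the step to the plan.
        return LazyNumbers(self.source, self.plan + (fn,))

    def collect(self):
        # "Action": now run every recorded step over the data.
        result = []
        for value in self.source:
            for fn in self.plan:
                value = fn(value)
            result.append(value)
        return result


calls = []


def double(x):
    calls.append(x)  # record that work actually happened
    return x * 2


lazy = LazyNumbers([1, 2, 3]).map(double)
calls_before_action = list(calls)  # still empty: map() did no work
collected = lazy.collect()         # the action triggers the computation
```

In this analogy a Column is like one of the recorded steps: on its own it describes a computation, and values appear only once an action runs the plan.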