I am just starting to learn PySpark. I have created a Column object, and now I want to see what is in it. Unfortunately, all my research ended in suggestions for accessing a column of a Spark dataframe. What I want to know is how to see the data in the column object that I already have. There must be a simple way, but I have not managed to find it.
The code that created the column object:
baskets=groups.agg(pyspark.sql.functions.collect_list("product_id"))['collect_list(product_id)']
I expected something like baskets.show() to work, but that just tells me:
'Column' object is not callable
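Dropping the trailing ['collect_list(product_id)'] from your line creates a dataframe instead of a column:
baskets = groups.agg(pyspark.sql.functions.collect_list("product_id"))
(However, normally we use less verbose lines.)
Now baskets is a dataframe and you can use baskets.show().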
In your code, you also appended ['collect_list(product_id)']. This way you created a reference to the column in your code. However, Spark has not computed this column, so there is nothing to display. It is just a reference that makes code more readable. Look at the methods of the pyspark.sql.Column class: none of them displays the column's values. A column "gets" values only when it is displayed as part of a dataframe.
It takes some time to understand how Spark works. It uses lazy evaluation.
https://developer.hpe.com/blog/the-5-minute-guide-to-understanding-the-significance-of-apache-spark/