I'm working through this
tutorial
on manipulating data frame Column objects. The doc string for the
Column
class shows a Columns columnar data and heading (Annex A below). In
contrast, the doc string for the
col
method shows that a Column can specify just a heading, without
containing any data (Annex B below).
While a Column can result from extracting columnar data from a
DataFrame, and possibly performing operations on that, in neither of
the above two manifestations does a Column maintain a pointer to a
source table for either the heading or data. The doc string for
DataFrame's
withColumn
method muddies that up for me. Its prototype says to not refer to
external data frames:
def withColumn(self, colName: str, col: Column) -> "DataFrame":
<...snip...>
The column expression must be an expression over this
:class:`DataFrame`; attempting to add a column from some other
:class:`DataFrame` will raise an error.
<...snip...>
As mentioned in my 2nd paragraph above, however, Columns don't
seem to refer back to any data frames for either name information or
columnar data. All the information seems to be self-contained within
the Column object itself.
How can I understand/resolve this apparent contradiction in what a
Column is, and how should I understand withColumn's stipulation
against referring to external data frames?
Further examples of the disparity can be found from other tutorials.
The DataFrame df from this
tutorial.
can be used to show that a Column object resulting from evaluating
an expression contains the columnar data, e.g., df.salary+1:
df.show(2)
+-----------------+----------+------+------+
| name| dob|gender|salary|
+-----------------+----------+------+------+
| {James, , Smith}|1991-04-01| M| 3000|
|{Michael, Rose, }|2000-05-19| M| 4000|
+-----------------+----------+------+------+
df.withColumn("salary",df.salary+1).show(2)
+-----------------+----------+------+------+
| name| dob|gender|salary|
+-----------------+----------+------+------+
| {James, , Smith}|1991-04-01| M| 3001|
|{Michael, Rose, }|2000-05-19| M| 4001|
+-----------------+----------+------+------+
In contrast, the identical DataFrame from this
tutorial can
be used to show a "salary" column with no data. The tutorial uses
this code:
df.withColumn("salary",col("salary").cast("Integer")).show()
The key difference here is that col("salary") makes no reference
to DataFrame df. In contrast to df.salary+1, therefore, the
Column object that results from evaluating col("salary") cannot
possibly contain any columnar data. The only possibility is that
col("salary").cast("Integer") contains only the column name and
(Spark) data type. This means that withColumn takes those limited
details from the Column object, looks up that column in it's own
DataFrame (the object through which withColumn is invoked, df),
and obtains the data in that way.
Since a Column object has (at least) these two manifestations, I
can picture various ways in which it might behave:
- The
Columnobject result from expression evaluation sometimes contains the column data - Sometimes, a
Columnobject contains only the name of a column and possibly the data type. IfwithColumngets such a 2nd argument (in the above context, not in its signature), perhaps it looks up the named column in in theDataFramethrough which it is invoked - If a
Columnobject is the result of an expression with operations, but it refers to no concrete columnar data, it is conceivable that the operations are stored in the column object symbolically so thatwithColumnknows what to do when it retrieves the data from theDataFrameobject through which it was invoked - If the expression that yields a
Columnobject refers to a column in a sourceDataFrame, such as indf.salary+1above, perhaps theColumnobject internally maintains a reference to the source column and stores the operations in the expression symbolically. When the resulting columnar data is needed, the data is retrieved from the source column and the operations performed. This speculated behaviour allowsColumnobjects to refer to externalDataFrames; the warnings against such external references then make sense.
With the limited doc string information, there are many possibilities
in what to expect of Column's implementation and behaviour.
It's hard to get an understanding of the intended operation and behaviour. Hopefully, someone can shed some light and drastically reduce the
possibilities. If there is more detailed documentation for Spark modules, classes, and methods, I would appreciate a pointer to it.
Thanks.
Annex A: Column class doc string excerpt
>>> df = spark.createDataFrame(
... [(2, "Alice"), (5, "Bob")], ["age", "name"])
Select a column out of a DataFrame
>>> df.name
Column<'name'>
>>> df["name"]
Column<'name'>
Create from an expression
>>> df.age + 1
Column<...>
>>> 1 / df.age
Column<...>
Annex B: col function doc string excerpt
>>> col('x')
Column<'x'>
>>> column('x')
Column<'x'>
The following is not a complete answer, but it is the best answer that I could find over weeks.
According to the comments of this answer, a
Columnobject does not store data. It stores the column name. From theColumn.cast()example in the question posted above, it may also store a data type.It is very possible that it can store a reference to a source
DataFrame. We see evidence of this in the examples of the posted question, wheredf.prefixes the column name. If references to a source table cannot be stored with aColumnobject, then there is little need to prefix column names withDataFramenames. If aColumnobject can in fact store a reference to sourceDataFrame, then it is conceivable and plausible that specifying a sourceDataFramename as a prefix allows the resultingColumnobject to inherit the data type from the named column in the sourceDataFrame. As described in the question,withColumn's warning against referencing externalDataFrame's only makes sense if aColumnobject can contain a reference to a sourceDataFrame.Finally, as suggested in the posted question, it seems that an expression that involves column operations and yields a
Columnobject also stores the operations symbolically. For example,df.salary+1might store the column namesalary, the sourceDataFramedf, and the operation+1. We know thatdf.salaryis aColumnobject that stores no data, so there seems to be little logical alternative the the supposition that the column meta-data (name and possibly type) are be stored along with the sourceDataFramedfand the+1operation.In terms of when the data is fetched and when the operations are performed, it could be that
withColumnevaluates the column expression, fetches the data, and performs the operation. If there is noDataFramereferenced, then perhapswithColumnlooks for an identically named column in theDataFrameobject through which it was called.It is also possible that the fetching of data and performance of column operations is not strictly the responsibility of the
withColumnmethod. It could be that the operator methods ofColumnobjects (e.g.,+,-, etc.) are defined such that they look up the source data, fetch it, and perform the operation corresponding to the operator.I see the latter as less likely because some
Columnobjects lack a reference to a sourceDataFrame, so it would not be possible to "look them up". If this is a valid point, then it is more likely that data fetching and performance of operations is done bywithColumn, since it has a hostDataFramethrough whichwithColumnis invoked. Hence,Columnobjects that lack a reference to aDataFrame, can be looked up through the hostDataFrame.As should be clear, the answer is somewhat speculative, though carefully reasoned. It would be nice to have more transparency into what a
Columnobject is and how it is meant to work. Short of reverse engineering the source code, if there is a better reference that describes theColumnclass, it could enhance this answer or provide a better alternative.