Partial Vertical Caching of DataFrame

I use Spark with Parquet. I'd like to be able to cache the columns we use most often for filtering, while keeping the others on disk. I'm running something like:

myDataFrame.select("field1").cache
myDataFrame.select("field1").count
myDataFrame.select("field1").where($"field1">5).count
myDataFrame.select("field1", "field2").where($"field1">5).count

The fourth line doesn't use the cache.

Any simple solutions that can help here?

1 Answer

The reason this does not hit the cache is that every transformation on a DataFrame (e.g. select) actually creates a new DataFrame. What you cached is a DataFrame containing only field1; the subsequent count and the filter where field1 is larger than 5 (you probably meant field2 there, but it doesn't matter) can be served from that cached projection.

On the fourth line you create yet another DataFrame whose plan does not build on the cached one, only on the original myDataFrame, so it cannot be served from the cache.
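For the queries in the question specifically, one way to make them all hit the cache (a minimal sketch; hotColumns is just an illustrative name, and spark.implicits._ is assumed in scope, as the question's $"field1" syntax implies) is to cache a single projection containing both columns and derive the later queries from that reference:

val hotColumns = myDataFrame.select("field1", "field2").cache()
hotColumns.count()                                        // materializes the cache
hotColumns.select("field1").where($"field1" > 5).count()  // served from the cache
hotColumns.where($"field1" > 5).count()                   // also served from the cache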

If your filters are generally very selective (i.e. they return a very small number of rows), you can do something like this:

val cachedDF = myDataFrame.select("field1", "field2", ... "fieldn").cache()
cachedDF.count()
val filteredDF = cachedDF.filter(some strong filter)
val res = myDataFrame.join(broadcast(filteredDF), cond)

i.e. cachedDF contains all the fields you filter on; you apply the very selective filter to it and then join back to the original DataFrame (with cond matching on the selected fields or on some id field), which gives you all the relevant rows along with the remaining columns. Broadcasting the small filtered result keeps that join cheap, since the large side is never shuffled.

That said, in most cases, assuming you use a columnar file format such as Parquet, caching will not help you much: Spark already reads only the columns a query references (column pruning) and can push simple filter predicates down to the Parquet scan.
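To illustrate (a minimal sketch, with /path/to/data as a hypothetical Parquet location, spark an existing SparkSession, and spark.implicits._ in scope for the $"..." syntax), an uncached query like the following reads only field1 and field2 from the files and pushes the field1 > 5 predicate into the scan:

val fromDisk = spark.read.parquet("/path/to/data")   // hypothetical path
fromDisk.select("field1", "field2").where($"field1" > 5).count()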