How to use StringIO(file.read()) to create a Spark dataframe


I have a very simple csv file. It is pretty easy to get the records loaded into a pandas dataframe in the following way. However, what I really need is to get it loaded into a spark dataframe.

How could I directly use StringIO(f.read()) to get the records into a spark dataframe directly, instead of converting a df_pandas to a df_spark?

Thank you very much!

from io import StringIO
import pandas as pd

f = open("C:\\myfolder\\test.csv", "r")
df_pandas = pd.read_csv(StringIO(f.read()), sep=";")
#df_spark = spark.read.csv(StringIO(f.read()))  # this doesn't work
f.close()

There is 1 answer below.


You could convert the pandas dataframe to a spark dataframe:

from io import StringIO
import pandas as pd

f = open("C:\\myfolder\\test.csv", "r")
df_pandas = pd.read_csv(StringIO(f.read()), sep=";")
df_spark = spark.createDataFrame(df_pandas)
f.close()

This does not make much sense if you create your StringIO object from a local file, since you could load the file directly with spark.read.csv("C:\\myfolder\\test.csv", sep=";"). It does make sense if your StringIO object comes from somewhere else, e.g. a FileUpload ipython widget.
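
If you want to skip pandas entirely, PySpark's DataFrameReader.csv also accepts an RDD of strings holding CSV rows, so the in-memory text can be handed to Spark directly. A minimal sketch, assuming a running SparkSession named spark and that the first row contains the column names:

from io import StringIO

with open("C:\\myfolder\\test.csv", "r") as f:
    csv_text = StringIO(f.read()).getvalue()  # equivalently, just f.read()

# Distribute the CSV rows as an RDD of strings and let Spark parse them
lines_rdd = spark.sparkContext.parallelize(csv_text.splitlines())
df_spark = spark.read.csv(lines_rdd, sep=";", header=True, inferSchema=True)

This avoids the detour through pandas, but parallelize() still ships the whole string from the driver, so for a file that already sits on disk spark.read.csv(path) remains the simpler option.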