PySpark DataFrame creation is throwing PySparkTypeError


I am new to PySpark and am trying to create a simple DataFrame from a list of tuples or a list of dictionaries, and in both cases the same exception is thrown. I have tried creating DataFrames from .csv files using spark.sql and those worked just fine.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

# A list of tuples, plus a separate list of column names
simpleData1 = [
    (1, 'firstname1', 'lastname1', 'address1', -97),
    (2, 'firstname1', 'lastname1', 'address1', -23),
    (3, 'firstname2', 'lastname2', 'address2', -23),
    (4, 'firstname2', 'lastname2', 'address2', -97)
]
columns = ["id", "name", "last", "address", "bookID"]
books = [-23, -44, -97, -32, -57, -76]

# The same rows as a list of dictionaries
simpleData2 = [
    {"id": 1, "name": 'firstname1', "last": 'lastname1', "address": 'address1', "bookID": -97},
    {"id": 2, "name": 'firstname1', "last": 'lastname1', "address": 'address1', "bookID": -23},
    {"id": 3, "name": 'firstname2', "last": 'lastname2', "address": 'address2', "bookID": -23},
    {"id": 4, "name": 'firstname2', "last": 'lastname2', "address": 'address2', "bookID": -97}
]

# Trying with the list of tuples:
df1 = spark.createDataFrame(simpleData1).toDF(*columns)
# Trying with the list of dictionaries:
df2 = spark.createDataFrame(simpleData2)

df1.show()
df2.show()

When I run this code, both attempts fail with the same error. Here is the start of the traceback:

Py4JJavaError                             Traceback (most recent call last)
Cell In[2], line 36
     34 df = spark.createDataFrame(simpleData).toDF(*columns)
     35 #df = spark.createDataFrame(simpleData2)
---> 36 df.show()

File C:\devTools\Anaconda\Lib\site-packages\pyspark\sql\dataframe.py:899, in DataFrame.show(self, n, truncate, vertical)
    893     raise PySparkTypeError(
    894         error_class="NOT_BOOL",
    895         message_parameters={"arg_name": "vertical", "arg_type": type(vertical).__name__},
    896     )
    898 if isinstance(truncate, bool) and truncate:
--> 899     print(self._jdf.showString(n, 20, vertical))
    900 else:
    901     try:
  1. I tried installing findspark.
  2. I tried running the same code in the PySpark shell and got the error 'Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases'. After disabling the Python entries under Manage App Execution Aliases, the error changed to 'Cannot run program "python": CreateProcess error=2, The system cannot find the file specified'.
  3. I tried creating a DataFrame from .csv files in both the PySpark shell and a Jupyter notebook, and the DataFrame is created fine in both environments.
  4. I have also tried df1 = spark.createDataFrame(simpleData1, columns), which fails in exactly the same way (the explicit-schema variant of that call is sketched below).
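
For reference, here is a sketch of that call with an explicit schema instead of inferred types, reusing spark and simpleData1 from the snippet above. StructType/StructField are standard pyspark.sql.types imports, and the column names and types simply mirror the data; this variant does not change the error, it only makes the intended schema explicit.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Declare names and types up front instead of inferring them from the rows
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("last", StringType(), True),
    StructField("address", StringType(), True),
    StructField("bookID", IntegerType(), True),
])

df1 = spark.createDataFrame(simpleData1, schema)
df1.show()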

1 Answer

Answered by Mojdeh Ebrahimi:

I was finally able to fix my issue. First, I realized that even though I had Anaconda installed, I had also installed Python and Spyder myself. Once I uninstalled my own Spyder installation and removed the standalone Python and Python\Scripts paths from my system PATH, the error message changed to 'Python worker failed to connect back when execute spark action'. Then I added the PYSPARK environment variables mentioned in this link: SparkException: Python worker failed to connect back when execute spark action
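
For anyone hitting the same wall, below is a minimal sketch of setting those variables from code before the SparkSession is built. It is an illustration under assumptions, not the answerer's exact setup: PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are the standard Spark variables for choosing the worker and driver interpreters, and using sys.executable assumes the notebook kernel is the Anaconda Python that Spark should launch. This would also square with why reading .csv files worked: that path stays inside the JVM, while createDataFrame over Python objects needs Spark to start a Python worker.

import os
import sys

# Point Spark at the interpreter running this notebook/script.
# Assumption: this is the Anaconda Python that Spark's workers should use.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

from pyspark.sql import SparkSession

# The variables must be set before the session (and its JVM) starts.
spark = SparkSession.builder.appName("example").getOrCreate()

spark.createDataFrame([(1, 'firstname1')], ["id", "name"]).show()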