How to read postgres DB tables through EMR jupyter lab notebook from amazon workspace


I'm trying to read tables from a Postgres database, but I'm getting the error below. Note: I can't reference external files from my local machine, since this is a private Amazon WorkSpace.

The JDBC URL looks like this:

"url":"jdbc:postgresql://xxxx-xxxxx-postgresql-prod01.cluster-xxxx.xx-xx-1.rds.amazonaws.com:0000/db_xxx_txxx",

The error I'm getting:

An error was encountered:
An error occurred while calling o153.jdbc.
: java.lang.ClassNotFoundException: org.postgresql.Driver
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:46)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1(JDBCOptions.scala:102)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1$adapted(JDBCOptions.scala:102)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:102)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:38)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:225)
    at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:340)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)

I've tried the code below:

tables = read_table(
    url=URL,
    table="information_schema.tables",
    driver=DRIVER,
    user=USER,
    password=PASS
)
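(For context, read_table is a local helper; judging by the o153.jdbc frame in the traceback, it presumably wraps DataFrameReader.jdbc, roughly like this:)

# Hypothetical equivalent of the helper above; URL, DRIVER, USER and PASS
# are defined elsewhere. The "driver" property names the class Spark tries
# to load, which is what raises ClassNotFoundException when the Postgres
# JAR is not on the classpath.
tables = spark.read.jdbc(
    url=URL,
    table="information_schema.tables",
    properties={"driver": DRIVER, "user": USER, "password": PASS},
)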

There is 1 answer below.

Answered by Ronak Jain:

You need to add the Postgres JDBC driver to the classpath as a dependency first.

First, copy the JAR onto the cluster or to S3, then execute this in the first cell:

%%configure -f
{
    "conf": {
        "spark.jars": "s3://JAR-LOCATION/postgresql.jar"
    }
}

Ref. Postgres JAR with EMR and Jupyter Notebooks
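To confirm the setting took effect, you can inspect the live configuration in a later cell (spark here is the session the notebook provides):

# Should print the s3:// path configured above; "not set" means the
# %%configure cell did not take effect.
print(spark.sparkContext.getConf().get("spark.jars", "not set"))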

Alternatively, you can configure it while creating the SparkSession.

spark = SparkSession.builder.config('spark.driver.extraClassPath', '/JAR-LOCATION/postgresql.jar').getOrCreate()
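Note that spark.driver.extraClassPath only covers the driver JVM; executors need spark.executor.extraClassPath as well, and both are read at JVM startup, so in an EMR notebook they usually belong in %%configure rather than on an already-running session. A minimal sketch, assuming /JAR-LOCATION/postgresql.jar exists at that path on every node:

from pyspark.sql import SparkSession

# Sketch only: extraClassPath settings take effect at JVM startup, so this
# works when it actually creates the session, not when one already exists.
spark = (
    SparkSession.builder
    .config("spark.driver.extraClassPath", "/JAR-LOCATION/postgresql.jar")
    .config("spark.executor.extraClassPath", "/JAR-LOCATION/postgresql.jar")
    .getOrCreate()
)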

Update: Based on your comment, since you can't upload a JAR, you can pull the driver in as a Maven dependency instead:

%%configure -f
{
    "conf": {"spark.jars.packages": "org.postgresql:postgresql:jar:42.4.3"}
}
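Once the package resolves (the cluster needs outbound access to Maven Central), a standard JDBC read in a later cell should work. A minimal sketch with placeholder host and credentials:

# Placeholder URL and credentials; substitute your RDS endpoint, port,
# database name, and login.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://<host>:5432/<database>")
    .option("dbtable", "information_schema.tables")
    .option("driver", "org.postgresql.Driver")
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
)
df.show(5)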