'JavaPackage' object is not callable in Archives Unleashed Toolkit


I am currently trying to run Archives Unleashed (which makes use of PySpark) in a Jupyter Notebook in order to work with some web archives. When I run the following code, I get the error message "'JavaPackage' object is not callable":

from aut import *

WebArchive(sc, sqlContext, "path/to/warcs") \
  .webpages() \
  .select("crawl_date", "domain", "url", "content") \
  .write \
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ") \
  .format("csv") \
  .option("escape", "\"") \
  .option("encoding", "utf-8") \
  .save("plain-text-df/")

Here is the link to the package documentation in case it is helpful: https://aut.docs.archivesunleashed.org/docs/text-analysis

I have tried following the PySpark setup steps from this notebook, to no avail (the same error message repeats): https://github.com/archivesunleashed/notebooks/blob/main/Parquet%20Examples/parquet_text_analyis.ipynb

I have made sure the SparkContext was set up correctly using this code I found in another Stack Overflow question:

from pyspark.sql import SparkSession, SQLContext

spark = SparkSession.builder.appName("Basics").getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)
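
One check that might help narrow this down, assuming the error comes from the AUT jar not being visible to the Spark JVM (an assumption on my part, not something I have confirmed): list the classpath-related settings of the running context and see whether any of them mention the toolkit. The helper name `find_jar_settings`, the `"aut"` search string, and the example path are hypothetical and only for illustration.

```python
def find_jar_settings(conf_pairs, needle="aut"):
    """Return classpath-related Spark settings whose value mentions `needle`.

    `conf_pairs` is a list of (key, value) tuples, e.g. the result of
    sc.getConf().getAll() in a running PySpark session.
    """
    # Standard Spark settings through which extra jars usually reach the JVM.
    keys = ("spark.jars", "spark.jars.packages", "spark.driver.extraClassPath")
    return {k: v for k, v in conf_pairs if k in keys and needle in v.lower()}


# Stand-in data so the sketch runs without Spark; the jar path is hypothetical.
example = [
    ("spark.app.name", "Basics"),
    ("spark.jars", "/hypothetical/path/aut-fatjar.jar"),
]
print(find_jar_settings(example))
```

In the notebook this would be called as `find_jar_settings(sc.getConf().getAll())`. An empty result would suggest that no AUT jar was passed to the session, which could be consistent with the error above.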