I'm encountering a challenge when attempting to read certain Delta tables from an S3 bucket in Databricks. My goal is to load Delta tables into Databricks; some tables, like table_1, load successfully, while others, such as table_2, produce the following error:
AnalysisException: Incompatible format detected.
You are trying to read from SECOND_DATALAKE/table_2/ using Delta, but there is no transaction log present. Check the upstream job to make sure that it is writing using format("delta") and that you are trying to read from the table base path.
I believe the reason behind this issue is that my AWS Glue job fails to generate the _delta_log folder for heavy tables like table_2.
Below is the code snippet from my AWS Glue job:
table_to_copy = 'TABLE_2'

try:
    # Script generated for the Oracle SQL node
    OracleSQL_node1 = glueContext.create_dynamic_frame.from_options(
        connection_type="oracle",
        connection_options={
            "url": 'jdbc:oracle:thin://@datalake.eu-west-3.rds.amazonaws.com:10523:NAME1',
            "user": username,
            "password": password,
            "dbtable": 'SCHEMA.' + table_to_copy,
        },
        transformation_ctx="OracleSQL_node1",
    )

    # Convert to a Spark DataFrame
    dataFrame = OracleSQL_node1.toDF()

    # Write in Delta format
    dataFrame.repartition(10) \
        .write.format('delta') \
        .mode('overwrite') \
        .save("SECOND_DATALAKE" + table_to_copy)

    logger.info(f"{table_to_copy} was copied to the Delta Lake correctly")
except Exception as e:
    logger.error(f"Error copying {table_to_copy} to Delta Lake: {str(e)}")

job.commit()
spark.stop()
How can I address this problem and ensure that the _delta_log folder is generated appropriately for all tables, irrespective of their size?

The AWS Glue job is responsible for extracting data from Oracle and writing it to S3 in Delta format. The error occurs specifically for tables perceived as "heavy" by AWS Glue.

How do I optimize the AWS Glue job so that the _delta_log folder is generated for all tables?
I had a similar problem too.
This link (Create a table - DataFrameWriter API) might be helpful for understanding what differs when you write with a metastore versus without one.
Anyway, let's try adding a / to your path, as below. AFAIK, this will work.
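Here is a minimal sketch of that fix, reusing dataFrame and table_to_copy from your snippet and treating "SECOND_DATALAKE" as your placeholder for the s3:// base path:

# Build the target path with an explicit "/" separator so the Delta writer
# lands in SECOND_DATALAKE/<table>/ and creates its _delta_log folder there.
target_path = "SECOND_DATALAKE" + "/" + table_to_copy

dataFrame.repartition(10) \
    .write.format("delta") \
    .mode("overwrite") \
    .save(target_path)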
If your storage is on S3, refer to the sample code link.
Further information on the metastore:
Spark needs a metastore, and it can be one of the following:
- the embedded Derby metastore that Spark creates by default
- an external Hive metastore
- the AWS Glue Data Catalog (when running on AWS / Databricks)
The metastore has the table metadata: database and table names, schema, partitioning, and the storage location.
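As an illustration only (these settings are not from your post), a Glue job created with the --enable-glue-datacatalog option can use the Glue Data Catalog as its Hive metastore, roughly like this:

# Sketch, assuming an AWS Glue job launched with the --enable-glue-datacatalog
# option: enabling Hive support lets saveAsTable() record table metadata
# (schema, format, location) in the Glue Data Catalog.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.sql.catalogImplementation", "hive") \
    .enableHiveSupport() \
    .getOrCreate()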
OK, let's take a look at your code again.
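To make the issue concrete, here is a tiny sketch using the literal values from your snippet:

# The current .save() call concatenates the base path and the table name
# without any separator, so the files (and the _delta_log) are written to a
# different location than the SECOND_DATALAKE/table_2/ path Databricks reads.
table_to_copy = 'TABLE_2'

print("SECOND_DATALAKE" + table_to_copy)        # SECOND_DATALAKETABLE_2
print("SECOND_DATALAKE" + "/" + table_to_copy)  # SECOND_DATALAKE/TABLE_2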
So, if you are using Databricks + AWS Glue, it would be better to use saveAsTable() with table names.
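For example, a sketch along these lines (the database name second_datalake and the s3:// path are placeholders I made up, adjust them to your catalog and bucket):

# Sketch: write the Delta files to an explicit S3 location and register the
# table in the metastore at the same time. saveAsTable() stores the schema,
# format and location, so Databricks can then read the table by name.
target_path = "s3://SECOND_DATALAKE/" + table_to_copy.lower() + "/"  # placeholder

dataFrame.repartition(10) \
    .write.format("delta") \
    .mode("overwrite") \
    .option("path", target_path) \
    .saveAsTable("second_datalake." + table_to_copy.lower())

After that, Databricks should be able to read the data either by its path or simply by name, e.g. spark.table("second_datalake.table_2").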