How can I define a file naming convention for incoming files in Spark?


I receive files in real time in HDFS, and they all follow the same naming convention:

id_name_..._timestamp

Can I somehow define this naming convention in Spark (Scala), so that I can later compare files by their ID, for example?

Thank you

Answer by sandevfares:

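You can use something like this. The snippet uses Spark's Java API; the same calls are available from Scala.

Register a UDF that keeps only the file name from a full path:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.input_file_name;

spark.udf()
  .register("get_only_file_name", (UDF1<String, String>) fullPath -> {
     // lastIndexOf returns -1 when there is no "/", so +1 also handles bare file names
     int lastIndex = fullPath.lastIndexOf("/");
     return fullPath.substring(lastIndex + 1);
    }, DataTypes.StringType);

Then use the UDF to keep the last token (the file name) of each row's source path:

Dataset<Row> initialDs = spark.read()
  .option("dateFormat", conf.dateFormat)
  .schema(conf.schema)
  .csv(conf.path)
  // input_file_name() returns the full path of the file the row was read from
  .withColumn("input_file_name", callUDF("get_only_file_name", input_file_name()));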
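
Once the file name is available as a column, the ID still has to be pulled out of it before files can be compared. The following is a minimal sketch in plain Java, assuming the first underscore-separated token is the ID, the last one is the timestamp, and the name has no file extension. `FileNameParts` and `parse` are hypothetical names used for illustration; they are not part of Spark.

```java
// Hypothetical helper (not a Spark class): splits "id_name_..._timestamp" into its ends.
final class FileNameParts {
    final String id;
    final String timestamp;

    private FileNameParts(String id, String timestamp) {
        this.id = id;
        this.timestamp = timestamp;
    }

    // Accepts either a bare file name or a full path such as the one
    // returned by input_file_name().
    static FileNameParts parse(String pathOrName) {
        // Keep only the last path segment; works when there is no "/" as well.
        String fileName = pathOrName.substring(pathOrName.lastIndexOf('/') + 1);

        int first = fileName.indexOf('_');
        int last = fileName.lastIndexOf('_');
        if (first < 0) {
            throw new IllegalArgumentException("Unexpected file name: " + fileName);
        }
        // Assumption: first token is the ID, last token is the timestamp.
        return new FileNameParts(fileName.substring(0, first), fileName.substring(last + 1));
    }
}
```

The same logic could be wrapped in a second UDF, registered like get_only_file_name above, so that the ID becomes its own column and rows from different files can be joined or filtered on it.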