How can I define a file naming convention for incoming files in Spark?

783 Views Asked by Malo At 27 June 2018 at 09:26

I receive files in real time in HDFS, and they all follow the same naming convention:

id_name_..._timestamp

Can I somehow define this naming convention in Spark (Scala), so that I can later compare the files by their ID, for example?

Thank you


There is 1 best solution below

sandevfares On 27 June 2018 at 10:09

You can use something like this (Java API):

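Register a UDF that strips the directory part of the path, then apply it to input_file_name():

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.input_file_name;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// Register a UDF that keeps only the last path segment (the file name).
spark.udf()
  .register("get_only_file_name", (UDF1<String, String>) fullPath -> {
     int lastIndex = fullPath.lastIndexOf("/");
     // lastIndexOf returns -1 when there is no "/", so +1 also covers bare names.
     return fullPath.substring(lastIndex + 1);
    }, DataTypes.StringType);

// Use the UDF to get the last token (file name) of the full path.
Dataset<Row> initialDs = spark.read()
  .option("dateFormat", conf.dateFormat)
  .schema(conf.schema)
  .csv(conf.path)
  .withColumn("input_file_name", callUDF("get_only_file_name", input_file_name()));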
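
The UDF above only recovers the file name. To compare files by ID, the name still has to be split according to the id_name_..._timestamp convention. Here is a minimal sketch of that step in plain Java, under a few assumptions that are not in the question: the parts are separated by "_", the ID is the first part, the timestamp is the last part, and any extension after the last "." is dropped. The class name FileNameParts and the sample path are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that splits a name of the form id_name_..._timestamp. */
final class FileNameParts {
    final String id;            // first "_"-separated part
    final List<String> middle;  // everything between the ID and the timestamp
    final String timestamp;     // last "_"-separated part

    private FileNameParts(String id, List<String> middle, String timestamp) {
        this.id = id;
        this.middle = middle;
        this.timestamp = timestamp;
    }

    static FileNameParts parse(String pathOrName) {
        // Keep only the last path segment; also works when there is no "/".
        String name = pathOrName.substring(pathOrName.lastIndexOf('/') + 1);

        // Assumption: an optional extension follows the timestamp; drop it.
        int dot = name.lastIndexOf('.');
        if (dot > 0) {
            name = name.substring(0, dot);
        }

        String[] tokens = name.split("_");
        if (tokens.length < 2) {
            throw new IllegalArgumentException(
                "Expected id_..._timestamp but got: " + name);
        }
        return new FileNameParts(
            tokens[0],
            Arrays.asList(tokens).subList(1, tokens.length - 1),
            tokens[tokens.length - 1]);
    }
}
```

For example, FileNameParts.parse("hdfs://namenode/data/in/42_sales_eu_20180627092600.csv") yields the ID "42", the middle parts "sales" and "eu", and the timestamp "20180627092600". In Spark, the same logic could be wrapped in a second UDF and applied to the input_file_name column, which gives you an ID column to join or filter on.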
