Spark reading in fixed width file

I'm new to Spark (less than 1 month!) and am working with raw input from a flat file that is fixed width. I am using sqlContext to read the file in via com.databricks.spark.csv, and then using .withColumn to substring the rows based on the set widths:

    from pyspark.sql.functions import trim
    rawData.withColumn("ID", trim(rawData['c0'].substr(1, 8)))
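
For reference, a minimal sketch of that pattern end to end; the column names, offsets, and widths below are made-up placeholders for illustration, not the real file layout:

    from pyspark.sql.functions import trim

    # Assumed layout for illustration: ID in chars 1-8, STATUS in chars 9-10,
    # and the variable-width field from char 11 on ('c0' holds each raw line).
    fixed = (rawData
             .withColumn("ID",      trim(rawData['c0'].substr(1, 8)))
             .withColumn("STATUS",  trim(rawData['c0'].substr(9, 2)))
             .withColumn("VARDATA", trim(rawData['c0'].substr(11, 1000))))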

The issue I am encountering is that the last field has a variable width. It has a fixed start point, but a variable number of 'sets' of data, each around 20 characters wide. For example:

Row 1  A 1243 B 42225 C 23213 
Row 2  A 12425
Row 3  A 111 B 2222 C 3 D 4 E55555

I need to eventually read in those variable fields, pull out just the first character of each group in the variable-width column, and then transpose so that the output looks like:

Row 1 A
Row 1 B
Row 1 C
Row 2 A
...
Row 3 D
Row 3 E

I've read in the fixed-width columns I need, but I am stuck on the variable-width field.

1 Answer

BEST ANSWER

zipWithIndex and explode can transpose the data into one row per element:

    import org.apache.spark.sql.functions.explode
    import sqlContext.implicits._  // spark.implicits._ on Spark 2.x

    sc.textFile("csv.data")
      .map(_.split("\\s+"))
      .zipWithIndex
      .toDF("dataArray", "rowId")
      .select($"rowId", explode($"dataArray"))
      .show(false)

+-----+------+
|rowId|col   |
+-----+------+
|0    |A     |
|0    |1243  |
|0    |B     |
|0    |42225 |
|0    |C     |
|0    |23213 |
|1    |A     |
|1    |12425 |
|2    |A     |
|2    |111   |
|2    |B     |
|2    |2222  |
|2    |C     |
|2    |3     |
|2    |D     |
|2    |4     |
|2    |E55555|
+-----+------+
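
The exploded output above still contains the numeric tokens. Since the goal is only the first character of each group, one option is to filter for tokens that start with a letter and take their first character. Here is a minimal PySpark sketch of the full pipeline under that assumption (same made-up "csv.data" file name as in the answer):

    from pyspark.sql.functions import col, explode

    exploded = (sc.textFile("csv.data")
                  .map(lambda line: line.split())  # split each line on whitespace
                  .zipWithIndex()                  # (tokens, rowId)
                  .toDF(["dataArray", "rowId"])
                  .select("rowId", explode("dataArray").alias("token")))

    # Keep only group markers (so 'E55555' yields 'E'), then take the first char.
    (exploded
       .filter(col("token").rlike("^[A-Za-z]"))
       .select("rowId", col("token").substr(1, 1).alias("group"))
       .show())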