Reading a fixed-length file with multiple record formats in Spark


All,

I am trying to read a file with multiple record types in Spark, but have no clue how to do it. Can someone point out if there is a way to do this, or an existing package, or a user GitHub package?

In the example below, the text file contains 2 separate record types (it could be more than 2):

00X - record_ind | First_name | Last_name

0-3 record_ind
4-10 firstname
11-16 lastname
============================
00Y - record_ind | Account_# | STATE | country
0-3 record_ind
4-8 Account #
9-10 STATE
11-15 country

input.txt
------------

    00XAtun   Varma 
    00Y00235ILUSA   
    00XDivya  Reddy  
    00Y00234FLCANDA  
    
    sample output/data frame
    output.txt
    
    record_ind | x_First_name | x_Last_name | y_Account | y_STATE | y_country
    --------------------------------------------------------------------------
    00X        | Atun         | Varma       | null      | null    | null
    00Y        | null         | null        | 00235     | IL      | USA
    00X        | Divya        | Reddy       | null      | null    | null
    00Y        | null         | null        | 00234     | FL      | CANDA

There is 1 best solution below


One way to achieve this is to load the data as 'text'. Each complete row is loaded into one column named 'value'. Then call a UDF that parses each row based on its record indicator and transforms the data so that all rows follow the same schema. At last, use that schema to create the required dataframe and save it to the database.
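
A minimal PySpark sketch of that approach follows. The slice positions are inferred from the sample lines (Python's 0-based slicing), and the file name input.txt comes from the question; adjust both for the real layout.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.appName("multi-record-fixed-width").getOrCreate()

    # Common target schema; fields that do not apply to a record type stay null.
    schema = StructType([
        StructField("record_ind", StringType()),
        StructField("x_First_name", StringType()),
        StructField("x_Last_name", StringType()),
        StructField("y_Account", StringType()),
        StructField("y_STATE", StringType()),
        StructField("y_country", StringType()),
    ])

    def parse_line(line):
        ind = line[0:3]
        if ind == "00X":
            # 00X layout: record_ind | first name | last name
            return (ind, line[3:10].strip(), line[10:16].strip(),
                    None, None, None)
        if ind == "00Y":
            # 00Y layout: record_ind | account # | state | country
            return (ind, None, None,
                    line[3:8].strip(), line[8:10].strip(), line[10:15].strip())
        # Unknown indicator: keep it, null everything else.
        return (ind, None, None, None, None, None)

    parse_udf = udf(parse_line, schema)

    # Each complete row is loaded into one column named 'value'.
    raw = spark.read.text("input.txt")

    # Parse every line into the common schema, then flatten the struct.
    df = raw.select(parse_udf(col("value")).alias("rec")).select("rec.*")
    df.show()

For the sample input this yields the dataframe shown in the question, with nulls in the columns that do not apply to a row's record type. The same slicing could also be done with the built-in substring/when column functions, which avoids the Python UDF serialization overhead.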