I have some CSV data that I read in with Spark and then write out to ORC. One string column looks like this:
ahbcshfa \r\n sfafdsf \r\n sdfsfsaf
I've managed to read it in correctly with Spark and write it out to ORC: when I open the ORC file in a separate Spark console, I get the same string value:
ahbcshfa \r\n sfafdsf \r\n sdfsfsaf
I need to get the same result when I read this data in BigQuery (BQ). So I upload the data to a GCP bucket and run:
create or replace external table XYZ
(
col1 datetime,
col2 int64,
col3 string,
col4 string,
col5 string
)
options(
format = 'ORC',
uris = [bucketY]
)
Now when I run
select * from XYZ
I see only nulls in columns 3-5 (col3 to col5). My guess is that the BQ reader breaks the records wherever it meets the \r\n characters.
I've tried doubling the backslashes in Spark, so effectively writing
\\r\\n
instead, but that didn't help.
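To make that attempt concrete, here is a minimal sketch in plain Python (not Spark) of the two shapes the value could have. I haven't verified whether the column holds literal backslash sequences or real carriage-return/line-feed control characters, so both are shown. The function names are hypothetical and only for illustration.

```python
# Plain-Python sketch; no Spark needed. Function names are hypothetical.

def classify_line_breaks(value: str) -> str:
    # Report which form of "\r\n" the string contains.
    if "\r\n" in value:
        return "control characters"    # real CR (0x0D) followed by LF (0x0A)
    if "\\r\\n" in value:
        return "literal backslashes"   # four visible characters: \ r \ n
    return "none"

def double_backslashes(value: str) -> str:
    # Replace every backslash with two backslashes.
    return value.replace("\\", "\\\\")

# Raw string: the backslashes are kept as ordinary characters.
literal = r"ahbcshfa \r\n sfafdsf \r\n sdfsfsaf"
# Normal string: \r\n is interpreted as CR + LF control characters.
control = "ahbcshfa \r\n sfafdsf \r\n sdfsfsaf"

print(classify_line_breaks(literal))
print(classify_line_breaks(control))
print(double_backslashes(literal))
# Doubling backslashes leaves real CR/LF untouched, since there is no backslash to double.
print(double_backslashes(control) == control)
```

If the value holds real control characters, doubling the backslashes changes nothing, which would be consistent with the attempt not helping.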
Is there any transformation I could apply before writing the DataFrame to ORC that would help BQ read it in correctly?
Thanks