How to read ORC data into BQ while preserving "\r\n" in a string value?


I have some CSV data that I read in with Spark and then write out to ORC. There's a string column whose values look like this:

ahbcshfa \r\n sfafdsf \r\n sdfsfsaf

I've managed to read it in correctly with Spark and write it out to ORC: when I open the ORC file in a separate Spark console, I get the same string value:

ahbcshfa \r\n sfafdsf \r\n sdfsfsaf

I need to get the same result when I read this data in BQ, so I upload the data to a GCP bucket and run:

create or replace external table XYZ
(
  col1 datetime,
  col2 int64,
  col3 string,
  col4 string,
  col5 string
)
options(
  format = 'ORC',
  uris = [bucketY]
)

Now when I run

select * from XYZ

I see only NULLs in columns 3-5. My guess is that the BQ reader breaks the records wherever it encounters the \r\n characters.

I've tried doubling the backslash in Spark, effectively writing \\r\\n, but that didn't help.
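To rule out an escaping mix-up, a check like the one below might help show whether the column holds real carriage-return/line-feed characters or the literal four-character text \r\n. This is a plain Python sketch, not Spark or BigQuery code. The helper names describe_line_breaks and escape_line_breaks are made up for illustration, and the sample strings are assumptions modelled on the value shown above.

```python
def describe_line_breaks(value: str) -> dict:
    """Count real CR+LF pairs versus the literal text backslash-r-backslash-n."""
    return {
        "real_crlf": value.count("\r\n"),        # two control characters
        "literal_crlf": value.count("\\r\\n"),   # four visible characters
    }


def escape_line_breaks(value: str) -> str:
    """Replace real CR and LF characters with their visible escape text."""
    return value.replace("\r", "\\r").replace("\n", "\\n")


# Assumed sample values: one with real control characters, one with literal text.
real = "ahbcshfa \r\n sfafdsf \r\n sdfsfsaf"
literal = r"ahbcshfa \r\n sfafdsf \r\n sdfsfsaf"

print(describe_line_breaks(real))
print(describe_line_breaks(literal))
print(escape_line_breaks(real) == literal)
```

If the value turns out to contain real control characters, applying an equivalent replacement to the column before writing is one option to try. Whether that changes what BQ returns is not something this sketch can confirm.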

Is there any transformation I could apply before writing the DataFrame to ORC that would help BQ read these values correctly?

Thanks
