read csv file where value contain comma in AWS athena

6.2k Views Asked by At

Hi Currently I have created a table schema in AWS Athena as follow

CREATE EXTERNAL TABLE IF NOT EXISTS axlargetable.AEGIntJnlActivityLogStaging (
  `clientcomputername` string,
  `intjnltblrecid` bigint,
  `processingstate` string,
  `sessionid` int,
  `sessionlogindatetime` string,
  `sessionlogindatetimetzid` bigint,
  `recidoriginal` bigint,
  `modifieddatetime` string,
  `modifiedby` string,
  `createddatetime` string,
  `createdby` string,
  `dataareaid` string,
  `recversion` int,
  `partition` bigint,
  `recid` bigint 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   'separatorChar' = ',',
   'quoteChar' = '\"',
   'escapeChar' = '\\'
   ) 
LOCATION 's3://ax-large-table/AEGIntJnlActivityLogStaging/'
TBLPROPERTIES ('has_encrypted_data'='false');

But one of the filed (processingstate) value contain comma as "Europe, Middle East, & Africa" which displace columns order.

So what would be the best way to read this file. Thanks

3

There are 3 best solutions below

0
On

This is a common messy CSV file situation where certain values contain commas. The solution in Athena for this is to use SERDEPROPERTIES as described in the AWS doc https://docs.aws.amazon.com/athena/latest/ug/csv-serde.html [the url may change so just search for 'OpenCSVSerDe for Processing']

Following is a basic create table example provided. Based on your data you would have to ensure that the data type is specified correctly (eg string)

CREATE EXTERNAL TABLE test1 ( f1 string, s2 string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ("separatorChar" = ",", "escapeChar" = "\") LOCATION 's3://user-test-region/dataset/test1/'

0
On

When I removed this part

WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
) 

I was able to read quoted text with commas in it

0
On

As workaround - look at aws glue project.

Instead of creating table via CREATE EXTERNAL TABLE:

  1. invoke get-table for your table
  2. Then make json for create-table
  3. Merge the following StorageDescriptor part:

    { "StorageDescriptor": { "SerdeInfo": { "SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde" ... } ... }

  4. perform create via aws cli. You will get this table in aws glue and athena be able to select correct columns.

Notes

  1. If your table already defined OpenCSVSerde - they may be fixed this issue and you can simple recreate this table.
  2. I do not have much knoledge about athena, but in aws glue you can delete or create table without any data loss
  3. Before adding this table via create-table you have to check first how glue or/and athena hadles table duplicates