How to COPY CSV as JSON fields


Is there a way to COPY the CSV file data directly into a JSON or JSONB array?

Example:

CREATE TABLE mytable (
    id serial PRIMARY KEY,
    info jsonb -- or JSON
);
COPY mytable(info) FROM '/tmp/myfile.csv' HEADER csv;

NOTE: each CSV line must be mapped to a JSON array. It is a normal CSV file.


Normal CSV (no embedded JSON)... /tmp/myfile.csv =

a,b,c
100,Mum,Dad
200,Hello,Bye

The correct COPY command must be equivalent to the usual COPY below.

Usual COPY (ugly but works fine)

CREATE TEMPORARY TABLE temp1 (
  a int, b text, c text
);
COPY temp1(a,b,c) FROM '/tmp/myfile.csv' HEADER csv;

INSERT INTO mytable(info) SELECT json_build_array(a,b,c) FROM temp1;

It is ugly because:

  • it needs a priori knowledge of the fields, and a previous CREATE TABLE for them.

  • for "big data" it needs a big temporary table, wasting CPU, disk and my time; the table mytable has CHECK and UNIQUE constraints for each line.

  • ... it needs more than one SQL command.

BEST ANSWER

Perfect solution!

No need to know all the CSV columns, only extract the ones you know.

In SQL, run CREATE EXTENSION plpythonu;. If the command produces an error like "could not open extension control file ... No such file", you need to install the PostgreSQL Python extra packages. On standard Ubuntu (16.04 LTS) it is simple: apt install postgresql-contrib postgresql-plpython.
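A minimal setup sketch, assuming a PostgreSQL 9.x server where the untrusted Python 2 language is available:

CREATE EXTENSION IF NOT EXISTS plpythonu;  -- the untrusted Python 2 procedural language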

CREATE FUNCTION get_csvfile(
  file text,
  delim_char char(1) = ',',
  quote_char char(1) = '"')
returns setof text[] stable language plpythonu as $$
  # each CSV record is returned as one text[] row;
  # note: a header line, if present, is not skipped
  import csv
  return csv.reader(
     open(file, 'rb'),
     quotechar=quote_char,
     delimiter=delim_char,
     skipinitialspace=True,
     escapechar='\\'
  )
$$;

INSERT INTO mytable(info)
  SELECT jsonb_build_array(c[1],c[2],c[3]) 
  FROM get_csvfile('/tmp/myfile1.csv') c;

The get_csvfile() function is adapted from a split_csv() function defined elsewhere. Python's csv.reader is very reliable (!).

Not tested with very big CSV files, but Python is expected to do the job.
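One caveat: get_csvfile() does not skip a header line. If the input has one, a sketch of filtering it out with WITH ORDINALITY (available since PostgreSQL 9.4), using the example file from the question:

INSERT INTO mytable(info)
  SELECT jsonb_build_array(c[1], c[2], c[3])
  FROM get_csvfile('/tmp/myfile.csv') WITH ORDINALITY AS t(c, line_num)
  WHERE line_num > 1;  -- drop the header row "a,b,c"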


PostgreSQL workaround

It is not a perfect solution, but it solves the main problem, that is the

"... big temporary table, wasting CPU, disk and my time" ...

This is the way we do it, a workaround with file_fdw!

  1. Adopt conventions that avoid file-copy and file-permission confusion... a standard file path for the CSV. Example: /tmp/pg_myPrj_file.csv

  2. Initialise your database or SQL script with the magic extension,

   CREATE EXTENSION file_fdw;
   CREATE SERVER files FOREIGN DATA WRAPPER file_fdw;
  3. For each CSV file, myNewData.csv:

    3.1. make a symbolic link (or an scp remote copy) for your new file: ln -sf $PWD/myNewData.csv /tmp/pg_socKer_file.csv

    3.2. configure the file_fdw foreign table for it (here temp1, which will feed mytable).

   CREATE FOREIGN TABLE temp1 (a int, b text, c text)
   SERVER files OPTIONS (
     filename '/tmp/pg_socKer_file.csv',
     format 'csv',
     header 'true'
   );

PS: after running the SQL script with psql, if you hit a permission problem, change the owner of the link with sudo chown -h postgres:postgres /tmp/pg_socKer_file.csv.
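Before moving on to step 3.3, you can sanity-check the foreign table with a quick query (a file_fdw table re-reads the CSV file on every query):

 SELECT * FROM temp1 LIMIT 3;  -- quick look at the first parsed rows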

3.3. use the file_fdw table as source (suppose populating mytable).

 INSERT INTO mytable(info)
 SELECT json_build_array(a,b,c) FROM temp1;
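A quick check of the loaded data (a sketch; the ->> operator extracts a JSON array element as text, with zero-based indexes):

 SELECT count(*) AS loaded FROM mytable;
 SELECT info->>0 AS a, info->>1 AS b, info->>2 AS c FROM mytable LIMIT 3;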

Thanks to @JosMac (and his tutorial)!


NOTE: if there were a STDIN way to do it (does one exist?), it would be easy, avoiding permission problems and the use of absolute paths. See this answer/discussion.
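For the record, a sketch of one STDIN-style route: psql's \copy reads the file on the client and streams it to the server over the connection, so no server-side file permissions or absolute server paths are involved. It reuses the temp1 temporary table from the question's "usual COPY" example (not the foreign table of the same name):

-- the file path is resolved on the client, not on the server
\copy temp1(a,b,c) FROM '/tmp/myfile.csv' CSV HEADER
INSERT INTO mytable(info) SELECT jsonb_build_array(a,b,c) FROM temp1;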