Import a csv with foreignkeys

2.5k Views Asked by At

Let's say I have 2 tables: Students and Groups.

  • The Group table has 2 columns: id, GroupName
  • The Student table has 3 columns: id, StudentName and GroupID
  • The GroupID is a foreign key to a Group field.

I need to import the Students table from a CSV, but in my CSV instead of the Group id appears the name of the group. How can I import it with pgAdmin without modifying the csv?

3

There are 3 best solutions below

0
On

Based on Laurenz answer, use follwoing scripts:

Create a temp table to insert from CSV file:

CREATE TEMP TABLE std_temp (id int, student_name char(25), group_name char(25));

Then, import the CSV file:

COPY std_temp FROM '/home/username/Documents/std.csv' CSV HEADER;

Now, create std and grp tables for students and groups:

CREATE TABLE grp (id int, name char(25));
CREATE TABLE std (id int, name char(20), grp_id int);

It's grp table's turn to be populated based on distinct value of group name. Consider how row_number() is use to provide value for id`:

INSERT INTO grp (id, name) select row_number() OVER (), * from (select distinct group_name from std_temp) as foo;

And the final step, select data based on the join then insert it into the std table:

insert into std (id, name, grp_id) select std_temp.id, std_temp.student_name,grp.id from std_temp inner join grp on std_temp.group_name = grp.name;

At the end, retreive data from final std table:

select * from std;
0
On

The suggestion by @LaurenzAlbe is the obvious approach (IMHO never load a spreadsheet directly to your tables, they are untrustworthy beasts). But I believe your implementation after loading the staging table is flawed.
First, using row_number() virtually ensures you get duplicated ids for the same group name. The ids will always increment from 1 by 1 to then number of group names no matter the number of groups previously loaded and you cannot ensure the identical sequence on a subsequent spreadsheets. What happens when you have a group that does not previously exist. Further there is no validation that the group name does not already exist. Result: Duplicate group names and/or multiple ids for the same name.
Second, you attempt to use the id from the spreadsheet as the id the student (std) table is full of error possibilities. How do you ensure that number is unique across spreadsheets? Even if unique in a single spreadsheet, how do you ensure another spreadsheet does not use the same numbers as a previous one. Or assuming multiple users create the spreadsheets that one users numbers do not overlap another users even if all users user are very conscious of the numbers they use. Result: Duplicate id numbers.
A much better approach would be to put a unique key on the group table name column then insert any group names from the stage table into the group trapping any duplicate name errors (using on conflict). Then load the student table directly from the stage table while selecting group id from the group table by the (now unique) group name.

create table csv_load_temp( junk_num integer, student_name text, group_name text);

create table groups( grp_id integer generated always as identity
                  , name text
                  , grp_key text generated always as ( lower(name) ) stored
                  , constraint grp_pk 
                               primary key (grp_id)
                  , constraint grp_bk
                               unique (grp_key) 
                  ); 

create table students (std_id integer generated always as identity 
                    , name text 
                    , grp_id integer
                    , constraint std_pk 
                                 primary key (std_id)
                    , constraint std2grp_fk
                                 foreign key (grp_id)
                                 references groups(grp_id)
                );
                
-- Function to load Groups and Students
create or replace function establish_students() 
 returns void 
 language sql 
as $$
insert into groups (name) 
     select distinct group_name
       from csv_load_temp
         on conflict (grp_key) do nothing;
         
insert into students (name, grp_id)
     select student_name, grp_id 
       from csv_load_temp t 
   join groups grp
     on (grp.name = t.group_name);
$$; 

The groups table requires Postgres v12. For prior versions remove the column grp_key couumn and and put the unique constraint directly on the name column. What to do about capitalization is up to your business logic.
See fiddle for full example. Obviously the 2 inserts in the Establish_Students function can be run standalone and independently. In that case the function itself is not necessary.

0
On

Your easiest option is to import the file into a temporary table that is defined like the CSV file. Then you can join that table with the "groups" table and use INSERT INTO ... SELECT ... to populate the "students" table.

There is of course also the option to define a view on a join of the two tables and define an INSTEAD OF INSERT trigger on the view that inserts values into the underlying tables as appropriate. Then you could load the data directly to the view.