MySQL: Using a natural primary key or adding a surrogate when tables are given

I have 5 text files which I want to import into a MySQL/MariaDB database. But there are two problems:

(1) The files are quite large: 0.5 GB to 10 GB
(2) All relevant keys have 40 characters

Point (1) I have to accept as it is; I can't change it. Point (2) is my concern. There are a lot of suggestions on the internet, for example to use ENUM instead of VARCHAR, or to use numeric surrogates. Adding a surrogate key to a single table is no problem, but the same surrogate key then has to be propagated to the other tables, and this is the point where I am stuck.

Here the specific information about the files/tables:

  • table invoice has 3 columns and 20 Mio rows:

    • invoice_id (primary key) with distinct values = number of rows
    • praxis_id with 4,000 distinct values
    • patient_id with 4 Mio distinct values

    All columns are CHAR(40) and have a fixed length of 40.
  • table diagnose has 3 columns and 25 Mio rows:

    • invoice_id CHAR(40) with 1.4 Mio distinct values
    • diagnose_type
    • diagnose_code
  • table patient has 5 columns with 5 Mio rows:

    • patient_id CHAR(40), not unique (4 Mio distinct values)
    • praxis_id CHAR(40)
    • year of birth, sex, etc.

For example, I want to join invoice with diagnose and patient, so it makes sense to index the keys. One way would be to define invoice.invoice_id as the primary key and to add an index on each of the other key columns in invoice; the same for diagnose (an INDEX on invoice_id) and patient (patient_id as primary key).
The problem is that it took a long time to define invoice.invoice_id as the primary key using:

ALTER TABLE invoice ADD PRIMARY KEY(invoice_id);

After one hour I killed the process. I think one reason for the poor performance is the datatype of invoice_id in table invoice. One idea would be to add an auto-incrementing surrogate key invoice_id_surr when loading the text file. Nevertheless, the problem remains when I want to join with table diagnose, since I have to join on invoice_id of table diagnose, which does not carry the surrogate key invoice_id_surr as a foreign key. I could add an index on diagnose.invoice_id, but then I lose the advantage of having a surrogate key on table invoice.
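
For illustration, roughly what I have in mind (an untested sketch; invoice_id_surr is just a name I made up, and it assumes invoice has no primary key yet):

ALTER TABLE invoice
  ADD COLUMN invoice_id_surr INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  ADD UNIQUE KEY uk_invoice_id (invoice_id);

-- carry the surrogate over into diagnose so later joins can use the numeric key
ALTER TABLE diagnose ADD COLUMN invoice_id_surr INT UNSIGNED DEFAULT NULL;

UPDATE diagnose d
JOIN   invoice  i ON i.invoice_id = d.invoice_id
SET    d.invoice_id_surr = i.invoice_id_surr;

ALTER TABLE diagnose ADD INDEX idx_invoice_id_surr (invoice_id_surr);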

I would be interested in a strategy for coping with this problem: several already existing tables which can be joined together, but the keys are CHAR(40) and have no index.

Thanks for your help.


UPDATE 1: Table specification
- keys have 40 characters [0-9][A-Z]
- These are tables which won't change anymore (no inserts)

-- invoice_id is primary key (unique)
-- patient_id and praxis id for foreign key and not unique in this table
CREATE TABLE invoice (
  invoice_id             CHAR(40) NOT NULL
, praxis_id              CHAR(40) DEFAULT NULL
, patient_id             CHAR(40) DEFAULT NULL
, PRIMARY KEY (invoice_id)
) ENGINE = InnoDB
;  

LOAD DATA LOCAL INFILE 'C:/data/invoice.txt'
INTO TABLE invoice
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
;

-- invoice_id is not unique in this table
CREATE TABLE diagnose (
  invoice_id             CHAR(40)    DEFAULT NULL
, diagnose_katalog       VARCHAR(20) DEFAULT NULL
, diagnose_code          VARCHAR(20) DEFAULT NULL
) ENGINE = InnoDB
;
-- patient_id is not unique in this table since a patient may change praxis
CREATE TABLE patient (
  patient_id             CHAR(40)    DEFAULT NULL
, praxis_id              CHAR(40)    DEFAULT NULL
, sex                    CHAR(1)     DEFAULT NULL
, birth_year             SMALLINT UNSIGNED DEFAULT NULL
, zip_code               VARCHAR(20) DEFAULT NULL
) ENGINE = InnoDB
;

There are 3 best solutions below

Answer 1 (score: 10)

You should avoid using natural keys as primary keys in your database, for a variety of reasons that can be found quite easily by googling.

Also, what does "tables are given" mean? You should not insist on a one-to-one correspondence between your text files and your database tables. Instead, design your tables in the way that suits your needs best. Strive for normalization.

For example, you say that patient_id in the patient "table" (you mean file, right?) is not unique. Clearly you need a table where patients are unique, so create a table with distinct patient_id values and their attributes. patient_id should be a unique key on that table, but generate a numeric surrogate (an auto_increment field, for example) to serve as the primary key.

After that, create a praxis table, with praxis_id as a unique key and a numeric surrogate as the primary key. Then you can connect patient and praxis with a third table, as appropriate for many-to-many relationships. That way you normalize your database: a patient with its attributes is always a single row in a single patient table, whereas now you have the same patient entered multiple times in your current patient file/table, which will cause you trouble sooner or later.
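
A rough, untested sketch of what that could look like (the table and column names here are only placeholders chosen by me and would have to be reconciled with your existing patient table):

-- one row per distinct patient, with a numeric surrogate as primary key
CREATE TABLE patient_norm (
  patient_pk   INT UNSIGNED NOT NULL AUTO_INCREMENT,
  patient_id   CHAR(40)     NOT NULL,   -- natural key from the file
  sex          CHAR(1)      DEFAULT NULL,
  birth_year   SMALLINT UNSIGNED DEFAULT NULL,
  zip_code     VARCHAR(20)  DEFAULT NULL,
  PRIMARY KEY (patient_pk),
  UNIQUE KEY uk_patient_id (patient_id)
) ENGINE = InnoDB;

CREATE TABLE praxis (
  praxis_pk    INT UNSIGNED NOT NULL AUTO_INCREMENT,
  praxis_id    CHAR(40)     NOT NULL,
  PRIMARY KEY (praxis_pk),
  UNIQUE KEY uk_praxis_id (praxis_id)
) ENGINE = InnoDB;

-- link table for the many-to-many relationship (a patient can visit several praxes)
CREATE TABLE patient_praxis (
  patient_pk   INT UNSIGNED NOT NULL,
  praxis_pk    INT UNSIGNED NOT NULL,
  PRIMARY KEY (patient_pk, praxis_pk)
) ENGINE = InnoDB;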

Answer 2 (score: 5)

Do you really have CHAR(40), not VARCHAR(40)? Are the values always 40 characters? Is the table CHARACTER SET utf8?

CHAR(40) utf8 always takes 120 bytes. If you store just 'Z' in such a column, it still takes 120 bytes, not 1. Even if you are importing values that are always 40 characters, declaring the column VARCHAR will trim any trailing blanks as the data is loaded.

At a minimum, I would do

ALTER TABLE foo
    MODIFY col1 VARCHAR(40) ...,  -- the "..." is other options for the col
    MODIFY col2 VARCHAR(40) ...,
    ...;

This would probably help with Point (1) a lot, and it would make everything faster. (Caveat: that ALTER itself would take a long time to finish.)

"Natural" PRIMARY KEYs are not evil. But use them where appropriate. In your case, an invoice_id is required to be unique for Business reasons, correct? How long is it? It is probably good as a PK.

What key(s) did you have on the table before you tried to do the ALTER? When initially building a table, you should have at least the PRIMARY KEY in place. (But too late now, I guess.)

Please provide SHOW CREATE TABLE for each table -- I am having to make too many guesses.

Adding a surrogate (AUTO_INCREMENT) key adds a level of indirection for all lookups; this may slow down some SELECTs (in addition to requiring changes to many SELECTs).
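
To sketch what I mean by that indirection (assuming a hypothetical surrogate column invoice_id_surr in both invoice and diagnose):

-- natural key as PK: the lookup goes straight to diagnose via its index on invoice_id
SELECT d.diagnose_code
FROM   diagnose d
WHERE  d.invoice_id = '...some 40-char id...';

-- surrogate key: the CHAR(40) value must first be mapped to the surrogate,
-- which costs an extra join (and every existing SELECT has to be rewritten)
SELECT d.diagnose_code
FROM   invoice  i
JOIN   diagnose d ON d.invoice_id_surr = i.invoice_id_surr
WHERE  i.invoice_id = '...some 40-char id...';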

When adding secondary indexes to a table, base it on what would be of use to the SELECT statements you have. I discuss that in my index cookbook. Provide SELECTs here for further discussion. Do not blindly add an index for each column.

It sounds like invoice_id should be the PRIMARY KEY for invoice and an INDEX in diagnose.

If it is practical to reload the data, declare the fields VARCHAR, not CHAR, and have PRIMARY KEY(invoice_id) in invoice.
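
A reload along those lines could look roughly like this (an untested sketch based on the CREATE TABLEs shown in the question; add further secondary indexes only as your SELECTs require):

CREATE TABLE invoice (
  invoice_id   VARCHAR(40) NOT NULL,
  praxis_id    VARCHAR(40) DEFAULT NULL,
  patient_id   VARCHAR(40) DEFAULT NULL,
  PRIMARY KEY (invoice_id)
) ENGINE = InnoDB;

CREATE TABLE diagnose (
  invoice_id        VARCHAR(40) NOT NULL,
  diagnose_katalog  VARCHAR(20) DEFAULT NULL,
  diagnose_code     VARCHAR(20) DEFAULT NULL,
  INDEX idx_invoice_id (invoice_id)
) ENGINE = InnoDB;

LOAD DATA LOCAL INFILE 'C:/data/invoice.txt'
INTO TABLE invoice
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES;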

Answer 3 (score: 0)

I am answering my own question. The reason for this posting was that I submitted queries, for example defining a primary key with ALTER TABLE ..., and the process didn't stop after several hours. As @zgguv mentioned, that duration is not plausible. I stopped the queries and restarted them (sometimes a third attempt was needed), and the process then finished after some minutes (about 10 minutes). I don't know why the queries sometimes hang. This had never happened to me until now, but the tables I used before were much smaller. Lessons learnt:

  • The long string keys should be replaced by numeric keys to make selections faster.

  • Replacing the string keys of tables with a size of 10 GB (as text files) or about 20 Mio rows is feasible (you have to pull your own hair out only once). A join between two indexed string keys took about 10 minutes.

  • If a query runs for more than 30 minutes (hang-up), stop it and try again. It would be nice to know why this happens (InnoDB, MyISAM, HeidiSQL, ...), but that is another issue.

@zgguv, thanks for your support and patience.