Why the size of MySQL MyISAM table is the same after striping some data from VARCHAR column?

498 Views Asked by At

I need to reduce the size of MySQL database. I recoded some information which striped ';' and ':' from sources column (~10% char reduction). After doing so, the size of the table is exactly the same as before. How is it possible? I'm using MyISAM engine.

btw: Unfortunately, I cannot compress the tables with myisampack.

mysql> INSERT INTO test SELECT protid1, protid2, CS, REPLACE(REPLACE(sources, ':', ''), ';', '') FROM homologs_9606; 
Query OK, 41917131 rows affected (4 min 11.30 sec)
Records: 41917131  Duplicates: 0  Warnings: 0

mysql> select TABLE_NAME name, ROUND(TABLE_ROWS/1e6, 3) 'million rows', ROUND(DATA_LENGTH/power(2,30), 3) 'data GB', ROUND(INDEX_LENGTH/power(2,30), 3) 'index GB' from information_schema.TABLES WHERE TABLE_NAME IN ('homologs_9606', 'test') ORDER BY TABLE_ROWS DESC LIMIT 10;
+---------------+--------------+---------+----------+
| name          | million rows | data GB | index GB |
+---------------+--------------+---------+----------+
| test          |       41.917 |   0.857 |    1.075 |
| homologs_9606 |       41.917 |   0.887 |    1.075 |
+---------------+--------------+---------+----------+
2 rows in set (0.01 sec)

mysql> select * from homologs_9606 limit 10;
+---------+---------+-------+--------------------------------+
| protid1 | protid2 | CS    | sources                        |
+---------+---------+-------+--------------------------------+
| 5635338 | 1028608 | 0.000 | 10:,1                          |
| 5644385 | 1028611 | 0.947 | 5:1,1;8:0.943,35;10:1,1;11:1,1 |
| 5652325 | 1028611 | 0.947 | 5:1,1;8:0.943,35;10:1,1;11:1,1 |
| 5641128 | 1028612 | 1.000 | 8:1,10                         |
| 5636414 | 1028616 | 0.038 | 8:0.038,104;10:,1              |
| 5636557 | 1028616 | 0.000 | 8:,4                           |
| 5637419 | 1028616 | 0.011 | 5:,1;8:0.011,91;10:,1          |
| 5641196 | 1028616 | 0.080 | 5:1,1;8:0.074,94;10:,1;11:,4   |
| 5642914 | 1028616 | 0.000 | 8:,3                           |
| 5643778 | 1028616 | 0.056 | 8:0.057,70;10:,1               |
+---------+---------+-------+--------------------------------+
10 rows in set (4.55 sec)

mysql> select * from test limit 10;
+---------+---------+-------+-------------------------+
| protid1 | protid2 | CS    | sources                 |
+---------+---------+-------+-------------------------+
| 5635338 | 1028608 | 0.000 | 10,1                    |
| 5644385 | 1028611 | 0.947 | 51,180.943,35101,1111,1 |
| 5652325 | 1028611 | 0.947 | 51,180.943,35101,1111,1 |
| 5641128 | 1028612 | 1.000 | 81,10                   |
| 5636414 | 1028616 | 0.038 | 80.038,10410,1          |
| 5636557 | 1028616 | 0.000 | 8,4                     |
| 5637419 | 1028616 | 0.011 | 5,180.011,9110,1        |
| 5641196 | 1028616 | 0.080 | 51,180.074,9410,111,4   |
| 5642914 | 1028616 | 0.000 | 8,3                     |
| 5643778 | 1028616 | 0.056 | 80.057,7010,1           |
+---------+---------+-------+-------------------------+
10 rows in set (0.00 sec)

mysql> describe test;
+---------+------------------+------+-----+---------+-------+
| Field   | Type             | Null | Key | Default | Extra |
+---------+------------------+------+-----+---------+-------+
| protid1 | int(10) unsigned | YES  | PRI | NULL    |       |
| protid2 | int(10) unsigned | YES  | PRI | NULL    |       |
| CS      | float(4,3)       | YES  |     | NULL    |       |
| sources | varchar(100)     | YES  |     | NULL    |       |
+---------+------------------+------+-----+---------+-------+
4 rows in set (0.00 sec)

mysql> describe homologs_9606;
+---------+------------------+------+-----+---------+-------+
| Field   | Type             | Null | Key | Default | Extra |
+---------+------------------+------+-----+---------+-------+
| protid1 | int(10) unsigned | NO   | PRI | 0       |       |
| protid2 | int(10) unsigned | NO   | PRI | 0       |       |
| CS      | float(4,3)       | YES  |     | NULL    |       |
| sources | varchar(100)     | YES  |     | NULL    |       |
+---------+------------------+------+-----+---------+-------+
4 rows in set (0.00 sec)

EDIT1: Added average column length.

mysql> select AVG(LENGTH(sources)) from test; 
+----------------------+
| AVG(LENGTH(sources)) |
+----------------------+
|               5.2177 |
+----------------------+
1 row in set (10.04 sec)

mysql> select AVG(LENGTH(sources)) from homologs_9606; 
+----------------------+
| AVG(LENGTH(sources)) |
+----------------------+
|               6.8792 |
+----------------------+
1 row in set (9.95 sec)

EDIT2: I was able to strip some more MB by setting NOT NULL to all columns.

mysql> drop table test
Query OK, 0 rows affected (0.42 sec)

mysql> CREATE table test (protid1 INT UNSIGNED NOT NULL DEFAULT '0', protid2 INT UNSIGNED NOT NULL DEFAULT '0', CS FLOAT(4,3) NOT NULL DEFAULT '0', sources VARCHAR(100) NOT NULL DEFAULT '0', PRIMARY KEY (protid1, protid2), KEY `idx_protid2` (protid2)) ENGINE=MyISAM CHARSET=ascii;
Query OK, 0 rows affected (0.06 sec)

mysql> INSERT INTO test SELECT protid1, protid2, CS, REPLACE(REPLACE(sources, ':', ''), ';', '') FROM homologs_9606; 
Query OK, 41917131 rows affected (2 min 7.84 sec)

mysql> select TABLE_NAME name, ROUND(TABLE_ROWS/1e6, 3) 'million rows', ROUND(DATA_LENGTH/power(2,30), 3) 'data GB', ROUND(INDEX_LENGTH/power(2,30), 3) 'index GB' from information_schema.TABLES WHERE TABLE_NAME IN ('homologs_9606', 'test');
Records: 41917131  Duplicates: 0  Warnings: 0

+---------------+--------------+---------+----------+
| name          | million rows | data GB | index GB |
+---------------+--------------+---------+----------+
| homologs_9606 |       41.917 |   0.887 |    1.075 |
| test          |       41.917 |   0.842 |    1.075 |
+---------------+--------------+---------+----------+
2 rows in set (0.02 sec)
2

There are 2 best solutions below

3
Emil Vikström On BEST ANSWER

They are not exactly the same. Your query clearly shows that test is about 30 MB smaller than homologs_9606:

+---------------+--------------+---------+
| name          | million rows | data GB |
+---------------+--------------+---------+
| test          |       41.917 |   0.857 | <-- 0.857 < 0.887
| homologs_9606 |       41.917 |   0.887 |
+---------------+--------------+---------+

How much storage should we expect for your table? Let us check Data Type Storage Requirements:

INTEGER(10): 4 bytes
FLOAT(4): 4 bytes
VARCHAR(100): L+1

where L is the number of character bytes, which is usually one byte per character but sometimes more if you use a Unicode character set.

Your rows on average will need:

INTEGER + INTEGER + FLOAT + VARCHAR =
4 + 4 + 4 + (L + 1) = L + 13 bytes

We can infer your original average L as (0.887*1024^3 / 41917131) - 13 = 9.72. You say that you stripped 10% from sources, which means your new L is 9.72*0.9 = 8.75. That gives an expected new total storage requirement of ((8.75 + 13) * 41917131) / 1024^3 = 0.849 GB

I suspect that the difference (between 0.849 and 0.857) might be due to the fact that test have two columns set as NULLable that homologs_9606 do not have, but I do not know enough about the MyISAM engine to calculate this exactly. I can however guess! On a minimum you would need 1 bit per column per row to store a NULL state, which in your case means two bits per row or 2*41917131 = 83834262 bits = 10 479 283 bytes = 0.010 GB. The total 0.849+0.010 = 0.859 shoots slightly over the goal (about 2 MB too much). But I have made some roundings and your 10% figure is also an estimate so I am sure the rest is lost in translation.

Another reason could be if you use a Unicode character set on sources in test, in which case some characters may use more than one byte each, but since the NULLable columns seems to account for everything I do not think this is the case for your table.

Summary

  • Your two tables are not the same size, they differ by 30 MB.
  • The size of your new table is around the expected size.
  • You can save some more space in your new table by making protid1 and protid2 into NOT NULL columns.
1
Rick James On

The "table" is stored in a .MYD file. This file will never shrink due to UPDATEs or DELETEs. SHOW TABLE STATUS (or the equivalent query into information_schema) may show Data_length shrinking, but Data_free will increase.

You can shrink the .MYD file by doing OPTIMIZE TABLE. But that will copy the table over, thereby needing extra disk space during the process. And this action is only very rarely worth doing.

Changing to NOT NULL may not free up space if you had a lot of nulls -- "" takes 1 or 2 bytes for a VARCHAR because of the length. (And your code may need to handle '' differently than NULL.)

The space taken for each row is actually 1 byte more than previously mentioned -- this byte handles knowing whether the row exists or is the beginning of a hole.

For large text fields, I like to do this to save space. (This applies to both MyISAM and InnoDB.) Compress the text and store it into a BLOB column (instead of TEXT). For most text, that is a 3:1 shrinkage. It takes a little extra code and CPU time in the client, but it saves a lot of I/O in the server. Often the net result is "faster". I would not use it for the varchar you have; I would only do it on columns bigger than, say, 50 characters average.

Back to the original question. It sounds like there were only about 30M colons and semicolons in the entire table. Could it be that the first 10 rows are not representative?