MySQL - Manual CSV Import - Column Data Type Errors

277 Views Asked by At

I am trying to import a COVID-19 dataset into MySQL Workbench manually on MySQL 8.0 with CREATE TABLE and then LOAD DATA INFILE.

For columns such as hospital patients per million for example, the data varies in number of digits (up to 8) and number of decimal places (up to 3).

Also, data in other columns such as new cases has negative figures (when number of new cases starts falling compared to previous records).

When I run the below CREATE TABLE query, the table would populate with all columns. Then when I import the CSV file with LOAD DATA INFILE, column data type "decimal" would keep giving me errors such as Error 1366 incorrect decimal value at row 1.

So I changed it to "float" and it gave me Error 1265 data truncated at row 1. Now I've changed it to "double", as another Stack Overflow user had suggested, and still get Error 1265 Data truncated at row 1.

Here is the CREATE TABLE query:

CREATE TABLE covid_deaths (
    iso_code text,
    continent text,
    location text,
    date date,
    population bigint,
    total_cases int,
    new_cases int,
    new_cases_smoothed double,
    total_deaths int,
    new_deaths int,
    new_deaths_smoothed double,
    total_cases_per_million double,
    new_cases_per_million double,
    new_cases_smoothed_per_million double,
    total_deaths_per_million double,
    new_deaths_per_million double,
    new_deaths_smoothed_per_million double,
    reproduction_rate double,
    icu_patients int,
    icu_patients_per_million double,
    hosp_patients int,
    hosp_patients_per_million double,
    weekly_icu_admissions double,
    weekly_icu_admissions_per_million double,
    weekly_hosp_admissions double,
    weekly_hosp_admissions_per_million double
);

Here is the LOAD DATA INFILE query:

LOAD DATA INFILE 'CovidDeaths.csv' 
    INTO TABLE covid_deaths 
    FIELDS TERMINATED BY ',' 
    ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 ROWS;

First 3 columns and first 3 rows of CSV (26 columns and 85,171 rows):


    iso_code, continent, location, date, population, total_cases
    AFG, Asia, Afghanistan, 2020-02-24, 38928341, 1
    AFG, Asia, Afghanistan, 2020-02-25, 38928341, 1

Additional info:

I tried importing all "double" columns above with the data type "varchar" instead, and it worked. However, when I use ORDER BY DESC it would give me e.g. 99.4 followed by 887.2, as it was ordering descending by characters (9 before 8) rather than the numeric values (887.2 is bigger than 99.4).

So I need a numeric data type rather than varchar, in order to analyze the data. But I'm just mentioning that when I used "varchar" to import, all 85,171 rows were imported perfectly, and I was able to query the data as well.

My CSV file is already in the working directory.

Using the Table Data Import Wizard finishes with 0 rows affected (i.e. no data imported except the column titles). This is why I am using the CREATE TABLE then LOAD DATA INFILE route instead.

My data had a header row (column titles), so I thought this could also be a problem. I included IGNORE 1 ROWS but it didn't help. I also removed the header row from the CSV file, leaving only the data. This didn't help either.

secure-file-priv is turned off and I have privileges to edit my.ini if required.

I am new on Stack Overflow, I have done some research, but I apologize if my question is still not up to standard. Please let me know if any additional information is required. Thank you and have a pleasant day.

1

There are 1 best solutions below

6
P.Salmon On

My first observation is that the published table does not contain all the columns in the file (at least from todays version)--it needs to. Second the file providers don't tell you the datatypes to use , which is a bit naughty. I would download the xslx file and examine that to determine appropriate datatypes.

If you don't want all the columns in the data you can discard by pushing them to user defined variables. This also a useful way of finding and/or transforming read columns see the section Input Preprocessing https://dev.mysql.com/doc/refman/8.0/en/load-data.html.

The following loaded the columns I asked for

DROP TABle if exists covid_deaths;
CREATE TABLE covid_deaths (
iso_code    varchar(20),
continent   varchar(20),
location    varchar(50),
date    varchar(20),
total_cases varchar(20),
new_cases   varchar(20),
new_cases_smoothed  varchar(20),
total_deaths    varchar(20),
new_deaths  varchar(20),
new_deaths_smoothed varchar(20),
total_cases_per_million varchar(20),
new_cases_per_million   varchar(20),
new_cases_smoothed_per_million  varchar(20),
total_deaths_per_million    varchar(20),
new_deaths_per_million  varchar(20),
new_deaths_smoothed_per_million varchar(20),
reproduction_rate   varchar(20),
icu_patients    varchar(20),
icu_patients_per_million    varchar(20),
hosp_patients   varchar(20),
hosp_patients_per_million   varchar(20),
weekly_icu_admissions   varchar(20),
weekly_icu_admissions_per_million   varchar(20),
weekly_hosp_admissions  varchar(20),
weekly_hosp_admissions_per_million  varchar(20),
total_tests varchar(20),
new_tests   varchar(20),
total_tests_per_thousand    varchar(20),
new_tests_per_thousand  varchar(20),
new_tests_smoothed  varchar(20),
new_tests_smoothed_per_thousand varchar(20),
positive_rate   varchar(20),
tests_per_case  varchar(20),
tests_units varchar(20),
total_vaccinations  varchar(20),
people_vaccinated   varchar(20),
people_fully_vaccinated varchar(20),
total_boosters  varchar(20),
new_vaccinations    varchar(20),
new_vaccinations_smoothed   varchar(20),
total_vaccinations_per_hundred  varchar(20),
people_vaccinated_per_hundred   varchar(20),
people_fully_vaccinated_per_hundred varchar(20),
total_boosters_per_hundred  varchar(20),
new_vaccinations_smoothed_per_million   varchar(20),
new_people_vaccinated_smoothed  varchar(20),
new_people_vaccinated_smoothed_per_hundred  varchar(20),
stringency_index    varchar(20),
population_density  varchar(20),
median_age  varchar(20),
aged_65_older   varchar(20),
aged_70_older   varchar(20),
gdp_per_capita  varchar(20),
extreme_poverty varchar(20),
cardiovasc_death_rate   varchar(20),
diabetes_prevalence varchar(20),
female_smokers  varchar(20),
male_smokers    varchar(20),
handwashing_facilities  varchar(20),
hospital_beds_per_thousand  varchar(20),
life_expectancy varchar(20),
human_development_index varchar(20),
population  varchar(20),
excess_mortality_cumulative_absolute    varchar(20),
excess_mortality_cumulative varchar(20),
excess_mortality    varchar(20),
excess_mortality_cumulative_per_million varchar(20)
);

LOAD DATA INFILE 'C:\\Program Files\\MariaDB 10.4\\data\\sandbox\\owid-covid-data.csv' 
    INTO TABLE covid_deaths 
    FIELDS TERMINATED BY ',' 
    ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 ROWS
     (iso_code,
    continent,
    location,
    date,
    population ,
    total_cases,
     @new_cases ,
    @new_cases_smoothed ,
    @total_deaths ,
    @new_deaths ,
    @new_deaths_smoothed ,
    @total_cases_per_million ,
    @new_cases_per_million ,
    @new_cases_smoothed_per_million ,
    @total_deaths_per_million ,
    @new_deaths_per_million ,
    @new_deaths_smoothed_per_million ,
    @reproduction_rate ,
    @icu_patients ,
    @icu_patients_per_million ,
    @hosp_patients ,
    @hosp_patients_per_million ,
    @weekly_icu_admissions ,
    @weekly_icu_admissions_per_million ,
    @weekly_hosp_admissions ,
    @weekly_hosp_admissions_per_million ,
    @total_tests    ,
    @new_tests  ,
    @total_tests_per_thousand   ,
    @new_tests_per_thousand ,
    @new_tests_smoothed ,
    @new_tests_smoothed_per_thousand    ,
    @positive_rate  ,
    @tests_per_case ,
    @tests_units    ,
    @total_vaccinations ,
    @people_vaccinated  ,
    @people_fully_vaccinated    ,
    @total_boosters ,
    @new_vaccinations   ,
    @new_vaccinations_smoothed  ,
    @total_vaccinations_per_hundred ,
    @people_vaccinated_per_hundred  ,
    @people_fully_vaccinated_per_hundred    ,
    @total_boosters_per_hundred ,
    @new_vaccinations_smoothed_per_million  ,
    @new_people_vaccinated_smoothed ,
    @new_people_vaccinated_smoothed_per_hundred ,
    @stringency_index   ,
    @population_density ,
    @median_age ,
    @aged_65_older  ,
    @aged_70_older  ,
    @gdp_per_capita ,
    @extreme_poverty    ,
    @cardiovasc_death_rate  ,
    @diabetes_prevalence    ,
    @female_smokers ,
    @male_smokers   ,
    @handwashing_facilities ,
    @hospital_beds_per_thousand ,
    @life_expectancy    ,
    @human_development_index    ,
    @population ,
    @excess_mortality_cumulative_absolute   ,
    @excess_mortality_cumulative    ,
    @excess_mortality   ,
    @excess_mortality_cumulative_per_million    
     );
    

select count(*) from covid_deaths;
+----------+
| count(*) |
+----------+
|   242447 |
+----------+
1 row in set (0.339 sec)

EDIT example with empty or null values

id,val1,val2
1,38.0,1
2,null,1
3,,1
4,48.666666666666664,1

DROP TABLE IF EXISTS T;
create table t
(id int auto_increment primary key , val1 decimal(10,2),val2 int);

LOAD DATA INFILE 'C:\\Program Files\\MariaDB 10.1\\data\\sandbox\\data.txt'
    INTO TABLE t
    FIELDS TERMINATED BY ',' 
    ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 ROWS
     (id,@val1,val2)
     set val1 = case when @val1 = 'null' or @val1 = '' then null else @val1 end;
     
    
select * from t;

+----+-------+------+
| id | val1  | val2 |
+----+-------+------+
|  1 | 38.00 |    1 |
|  2 |  NULL |    1 |
|  3 |  NULL |    1 |
|  4 | 48.67 |    1 |
+----+-------+------+
4 rows in set (0.001 sec)