How to copy/join few columns from a csv file to an existing SQL table in R?

60 Views Asked by At

I have a created a table "SQL_table" in SQL and copied contents from various existing .csv files to "SQL_table" in R using the code given below.

Code:

library(RPostgres)
library(DBI)

#Step 2: make a dataframe of all files
file_names1 <- dir("D:/Data/", full.names = TRUE, recursive = T) #where you have your files
my_data_frame <- do.call(rbind,lapply(file_names1,read.csv))

#Step 3: Establish R & PostgreSQL Connection using RPostgres
dsn_database = "...."   # Specify the name of your Database
dsn_hostname = "localhost"  # Specify host name
dsn_port = "...."                # Specify your port number. e.g. 98939
dsn_uid = "...."         # Specify your username. e.g. "admin"
dsn_pwd = "...."         # Specify your password. e.g. "xxx"

tryCatch({
  drv <- dbDriver("Postgres")
  print("Connecting to Database…")
  connec <- dbConnect(drv, 
                      dbname = dsn_database,
                      host = dsn_hostname, 
                      port = dsn_port,
                      user = dsn_uid, 
                      password = dsn_pwd)
  print("Database Connected!")
},
error=function(cond) {
  print("Unable to connect to Database.")
})

#Step 4: Run using RPostgres
dbWriteTable(connec, "SQL_table", my_data_frame, create = TRUE)
dbAppendTable(connec, "SQL_table", my_data_frame, create = TRUE)
dbReadTable(connec, "SQL_table")

The structure of my SQL_table is as follows:

Y_m_d_Time  Location_ID  Popul  Literacy  Mortality  
2001-01-01 00:00:00  NAM  5000  77  8.8
2002-01-01 00:00:00  NAM  8700  58  7.7
2003-01-01 00:00:00  NAM  3410  98  9.8
..
..
1990-01-01 00:00:00  KEN  2000  87  9.8
1991-01-01 00:00:00  KEN  3200  88  9.7
1991-01-01 00:00:00  KEN  4910  78  8.8
..
..
2010-01-01 00:00:00  PUB  6000  97  9.8
2011-01-01 00:00:00  PUB  4200  89  9.7
2012-01-01 00:00:00  PUB  5910  88  8.8
..
..
..

I have another .csv file "CSV_1" and I want to copy few columns (Latitude, Longitude, Altitude, Start_Date, End_Date) from "CSV_1" to "SQL_table" using R. The common column between both the tables is Location_ID in "SQL_table" and City in "CSV_1".

The structure of CSV file is as follows:

City  Latitude  Longitude  Altitude  Start_Date  End_Date  No. of Events  Event_Type
NAM  35  79  218  3/1/2001    10  Flood
KEN  30  81  129  2/1/1990  5/31/1999  5  Earthquake
PUB  22  76  220  1/1/2010    11  Landslide
..
..

My desired output is as follows:

Y_m_d_Time  Location_ID  Popul  Literacy  Mortality  Latitude  Longitude  Altitude  Start_Date  End_Date
2001-01-01 00:00:00  NAM  5000  77  8.8  35  79  218  3/1/2001
2002-01-01 00:00:00  NAM  8700  58  7.7  35  79  218  3/1/2001
2003-01-01 00:00:00  NAM  3410  98  9.8  35  79  218  3/1/2001
..
..
1990-01-01 00:00:00  KEN  2000  87  9.8  30  81  129  2/1/1990  5/31/1999
1991-01-01 00:00:00  KEN  3200  88  9.7  30  81  129  2/1/1990  5/31/1999
1991-01-01 00:00:00  KEN  4910  78  8.8  30  81  129  2/1/1990  5/31/1999
..
..
2010-01-01 00:00:00  PUB  6000  97  9.8  22  76  220  1/1/2010
2011-01-01 00:00:00  PUB  4200  89  9.7  22  76  220  1/1/2010
2012-01-01 00:00:00  PUB  5910  88  8.8  22  76  220  1/1/2010
..
..
..

Could anyone please help me how I can extend my code in R to obtain the desired results.

1

There are 1 best solutions below

0
On

Suppose your variable Location_ID is origin from the flights table and City is faa from the airports table. If you agreed to use {dplyr}, you could simply do a left_join().

con <- DBI::dbConnect(duckdb::duckdb())

dbplyr::copy_nycflights13(con)
#> Creating table: airlines
#> Creating table: airports
#> Creating table: flights
#> Creating table: planes
#> Creating table: weather

flights <- dplyr::tbl(con, "flights")
airports <- dplyr::tbl(con, "airports")

flights |>
  dplyr::left_join(airports, by = c("origin" = "faa"))
#> # Source:   SQL [?? x 26]
#> # Database: DuckDB 0.8.1 [root@Darwin 22.5.0:R 4.3.1/:memory:]
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     2    14     1045           1056       -11     1204           1230
#>  2  2013     2    14     1048           1049        -1     1412           1400
#>  3  2013     2    14     1048           1040         8     1333           1337
#>  4  2013     2    14     1051           1100        -9     1424           1424
#>  5  2013     2    14     1051           1100        -9     1203           1214
#>  6  2013     2    14     1057           1100        -3     1408           1420
#>  7  2013     2    14     1057           1100        -3     1244           1300
#>  8  2013     2    14     1057           1059        -2     1353           1441
#>  9  2013     2    14     1058           1056         2     1211           1220
#> 10  2013     2    14     1058           1100        -2     1337           1338
#> # ℹ more rows
#> # ℹ 18 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>, name <chr>, lat <dbl>,
#> #   lon <dbl>, alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>

DBI::dbDisconnect(con, shutdown = TRUE)