SQL select on table using IDs stored in an R data frame

2k Views Asked by At

I am trying to load a table across several SQL Teradata tables that are "joined together" using a RODBC connection. There are unique IDs that I want to specify so it only pulls their data. The SQL statement currently works with the "sqlQuery" R function when I insert a few of the IDs in the query text. However, I have many more IDs (e.g., 10,000's, 100,000's) that I want to pull the data for. I only have these IDs stored within my local R environment, so I cannot join by it from a SQL table. My question is how can I pull the data for these IDs by referencing them using the R environment variable?

I have attempted to provide an example that illustrates the problem. The example shows a working case that just "pastes" 10 IDs directly into the SQL statement. However, this option does not work when their are many more "IDs" (10,000's, 100,000's, etc.). Unfortunately I don't have permission to the Teradata DB, so I cannot join by IDs there. I only have the IDs from a .csv file, SAS dataset, etc. Also I can't just place all the IDs in the SQL statement like this example, because it is way too long with all those IDs. I have seen how others use SAS SQL statements that "reference" the IDs that are in the local SAS environment (uses * symbol), so I hope that I can do something similar in R. Thank you.

#for loading data
library(RODBC)

# Connect to Teradata SQL server
TeradataConnect <- odbcConnect(dsn="****")

#random sample of 10 IDs but their are  many more (e.g., 100,000's)
IDs = c(855820852, 1461838083,  861608839, 1498698119,  817883914,    776309362,
        799069685,  825438654,  856801569, 1749317792)

# specificy dates you want to pull data over
LowDate = Sys.Date() - 365
UpDate = Sys.Date()

#SQL query for pulling data 
SQL_query =  paste("SELECT  Distinct S.ID  
               ,S.LOC_ID
               ,S.DATE
               ,I.ITEM_DESC

               FROM  table.DATA   S
               INNER JOIN item.DATA I
               ON  S.ID=I.ID 

               WHERE  1 =1
               AND I.ITEM_TYPE= 'NORMAL'
               AND S.DATE BETWEEN ", paste(paste0("'", LowDate, "'")), " AND ", paste(paste0("'", UpDate, "'")), "
               AND S.ID IN (", paste(paste(IDs, collapse = ", ")), ") 
               GROUP BY 
               S.ID
               ,S.LOC_ID
               ,S.DATE
               ,I.ITEM_DESC
               ")

#pulls all data from a random sample of IDs
Dataset = sqlQuery(TeradataConnect, SQL_query)

Update: Thanks @RYoda for your suggestions. I am adding an example of the SAS code so you can see how it works. I am not a SAS expert, but I received these queries from someone who is. It is my understanding that anytime they use a "&" character (e.g., "ids." where ids is a variable on the SAS environment) in the SQL statement it will use the data that is within the SAS environment. Unfortunately I don't have write access to the SQL table so I can't let the database do the work. It looks like use a workaround like you suggested, because the prefiltered data is much too large for me to load within my R environment and then perform a join using a data.table or dplyr join. I was hoping that I could find a more elegant solution, since their is a lot of the time where I am trying to use R environment variables like this to load/filter on SQL table data pulls. I think the expected final filtered data size will be several millions of rows and ~10 columns. Thanks again for your help.

PROC SQL;
 CONNECT TO teradata(db=VIEWS user=&td_user. password=&td_pwd. tdpid=tableid mode=teradata);
 CREATE TABLE Str_inv AS
 SELECT * FROM connection to Teradata
                ( 
                SELECT 
                    nbr1,
                    nbr2,
                    b.item_nbr,
                    sum(inv.qty) AS QTY
                FROM 
                    table.inventory inv
                INNER JOIN
                    table.hier b
                ON
                    inv.item_id=b.item_id
                INNER JOIN 
                    table.loc c
                ON
                    inv.str_loc_id = c.loc_id
                WHERE 
                    inv.cal_dt = &fw_end_dt. AND
                    b.ITEM_TYP_DESC IN ('NORMAL') AND
                    b.CORE_ID = 1 AND
                    b.item_id IN (&ids.)
                GROUP BY
                    1,2,3
                );
QUIT;
0

There are 0 best solutions below