I'm trying to do an automatic code, I have build a loop to match people in my 2 databases. here an example of thoses two database:
some library might be needed so here all I used:
library(readr)
library(stringi)
library(stringr)
library(dplyr)
library(tidyr)
library(readr)
library(rlang)
library(Hmisc)
library(sqldf)
library(tcltk)
library(tcltk2)
library(gWidgets2)
library(gWidgets2tcltk)
df1 <- data.frame(
id = c(1, 2, 3),
name = c("John", "Jane", "Jim"),
age = c(25, 30, 35)
)
df2 <- data.frame(
id_2 = c(4, 5, 3,9),
name_2 = c("Johny", "Janey", "Jim","Gar"),
age_2 = c(26, 31, 35,NA)
)
and I have some condition in string that I got by readline():
#condtion I got by readline()
condition =c("df1$id[i]==df2$id_2[j]","df1$name[i]==df2$name_2[j]","df1$age[i]==df2$age_2[j] && !is.na(df2$age_2[j])")
I wanna have a loop that apply a score for every condition respected and put people with a minimal score of similarity in a new dataframe.
here what I tried but it take so much time to execute:
#new dataframe that assign similarity score between people
df3 <- data.frame(id=integer(), id_2=integer(), score=integer())
#only way I know to use string condition in if statement
start_time <- Sys.time()
# loop to search on df1 and df2
for (i in 1:nrow(df1)){
for (j in 1:nrow(df2)){
score = 0
# Vérifie condition
for (k in 1:length(condition)){
if(eval(parse(text = condition[k]))){
score = score + 1
}
}
#add to new dataframe if condition verified
if(score == (length(condition))/2){
df3 <- rbind(df3, data.frame(id=df1[i, 1], id_2=df2[j, 1], score=score))
}
}
}
#system to verify time spend
end_time <- Sys.time()
elapsed_time <- difftime(end_time, start_time, units = "secs")
elapsed_time
View(df3)
#the result of df3:
id id_2 score
3 3 3
#then just need to merge by ID and id_2 to get all other information
#what you should get after merging
id id_2 score name age name_2 age_2
3 3 3 "Jim" 35 "Jim" 35
I need to repoduce this for 2 dataframe of 683870 * 3681
I already did a sql code that is almost instant but there no score so some client with some small difference are left out. Im also gonna add a readline for the minimal score intended later.
edit: I got the condition by readline() since I try to make the code accesible for not regular R user:
condition = readline("what data you wanna compare exemple: [data1==data2] ")
for the if(score == (length(condition))/2) I know its not useful here but its just for the exemple I will add a readline() for the score too