R ifelse dummy variable on multiple condition lookup

101 Views Asked by At

I have a huge dataset and I would like to insert a dummy variable column based on a set of conditions:

I have my main df (A) in which I have 5 million rows and 10 columns, where 4 of them are date;hour;minute;second and these go from 2020 to 2023.

enter image description here

On the other df (B) I have the same columns but I have only 30 rows.

I want A to look at B and put a 1 to all the rows where date,hour,minute,second match the date,hour,minute,second of B, and 0 to all the rest. so, in the end, i should find my self with a column where I have 30 1 and 4.999.970 0

Even better would be to have like date,hour,minute matching exactly, and second matching "more or less" (say like +/- 5 seconds)

Can you help please?

I thought a solution could have been:

A$dummy <- for (i in A){

ifelse("A$date"=="B$date"&"A$hour"=="B$hour"&
"A$minute"=="B$minute"&or("A$second">="B$second"-5,"A$second"<="B$second"+5),1,0)
}
2

There are 2 best solutions below

4
E. Nygaard On BEST ANSWER

Here is a solution using tidyverse (including code that generates example data).

# Create data frame A
set.seed(42)  # For reproducibility
n_rows_A <- 100

df_A <- data.frame(
  date = rep(Sys.Date(), n_rows_A),
  sym = rep("EUR_ab6m_05y", n_rows_A),
  hour = sample(0:23, n_rows_A, replace = TRUE),
  minute = sample(0:59, n_rows_A, replace = TRUE),
  second = sample(0:59, n_rows_A, replace = TRUE)
)

# Create data frame B with 30 rows
n_rows_B <- 30

# Select 15 random rows from df_A
matching_indices <- sample(n_rows_A, n_rows_B / 2)
df_B_matching <- df_A[matching_indices, ]

# Create 15 rows with different 'second' values
non_matching_indices <- sample(setdiff(1:n_rows_A, matching_indices), n_rows_B / 2)
df_B_non_matching <- df_A[non_matching_indices, ]
df_B_non_matching$second <- sample(0:59, n_rows_B / 2, replace = TRUE)

# Combine matching and non-matching rows for data frame B
df_B <- rbind(df_B_matching, df_B_non_matching)

rownames(df_B) <- NULL

# EXACTLY MATCHING SECONDS
# Create dummy variable indicating whether there is an exact match between date,
# sym, hour, minute and **second** in A, and date, sym, hour, minute and
# **second** in B:
df_A <- dplyr::left_join( df_A
                        , df_B %>% mutate( dummy = 1 )
                        , by = c( "date", "sym", "hour", "minute", "second")
) %>% 
  mutate( dummy = ifelse( is.na(dummy), yes = 0, no = dummy) )

It is fully possible to use the same approach to create a dummy variable indicating whether a row in A has a matching row in B within a +/- 5 second margin. However, based on the data in the image you provided, some rows in A would probably have multiple "close-enough" matches in B. This could easily result in duplicates of rows from A.

Assuming you want to keep the A to its original number of rows and add a column to A indicating whether or not there is one or more rows in B that match within a +/- 5 second margin, you could do this:

# ALTERNATIVE 2 - SECONDS WITHIN +/- 5 

library(lubridate)
df_A <- dplyr::inner_join( df_A
                           , df_B 
                           , by = c( "date", "sym") # only use date as a key
                           , suffix = c("", "_b")
) %>% 
  mutate( 
    datetime_a = ymd_hms( paste0( date, " "
                                  , sprintf("%02d", hour), ":"
                                  , sprintf("%02d", minute), ":"
                                  , sprintf("%02d", second)
    ))
    , datetime_b = ymd_hms( paste0( date, " "
                                    , sprintf("%02d", hour_b), ":"
                                    , sprintf("%02d", minute_b), ":"
                                    , sprintf("%02d", second_b)
    ) )
    , diff_seconds = abs( as.integer( difftime( datetime_a, datetime_b, units = "secs") ) )
    
  ) %>% 
  # Remove rows where the difference between datetime_a and datetime_b is
  # greater than 5 seconds:
  filter( diff_seconds <= 5 ) %>%
  # Then add the dummy variable:
  mutate( dummy = 1 ) %>% 
  # Remove any unnecessary columns:
  select( date, sym, hour, minute, second, dummy ) %>% 
  # Add the dummy column to A using a left_join:
  dplyr::left_join( df_A, ., by = c("date", "sym", "hour", "minute", "second")) %>% 
  # Rows without any matches have dummy = NA. Recode:
  mutate( dummy = ifelse( is.na(dummy), yes = 0, no = dummy) )
0
Augusto César On

I think you could use dplyr and lubridate functions. Try this:

## load the libraries
library(lubridate)
library(dplyr)

## create df A
dfA <- data.frame(
  date = seq(as.Date("2023-08-25"), by = "days", length.out = 100),
  sym = rep("EUR_ab6m_05y", 100),
  hour = sprintf("%02d", rep(0:23, length.out = 100)),
  minute = sprintf("%02d", rep(0:59, length.out = 100)),
  second = sprintf("%02d", rep(0:59, length.out = 100))
)

## create data.frame B (just a selected part of df A)
dfB <- dfA[c(1:10), ]

## Transform variables to datetime
dfA %>% 
  mutate(new_date = str_glue('{date} {hour}:{minute}:{second}') %>% ymd_hms()) -> dfA

dfB %>% 
  mutate(new_date = str_glue('{date} {hour}:{minute}:{second}') %>% ymd_hms()) -> dfB
dfB$new_date -> list_dates

## Verify the condition
dfA %>% 
  mutate(dummy = ifelse(new_date %in% list_dates, 1, 0)) -> dfA

Most part of the code it's to create the df.