How to sanitize a df according to specific variable values?


I have two data frames. dfOne is made like this:

 X Y Z T J
 3 4 5 6 1
 1 2 3 4 1
 5 1 2 5 1

and dfTwo looks like this:

 C.1  C.2
  X    Z
  Y    T

I want to obtain a new data frame containing only the rows where the X, Y, Z and T values are simultaneously greater than specific thresholds.

Example. I need simultaneously (in the same row):

  • X, Y > 2
  • Z, T > 4

I need to use the second data frame to reach my objective. I expect something like:

dfTwo$C.1>2

so the result would be a new dataframe with this structure:

 X Y Z T J
 3 4 5 6 1

How could I do it?

4

There are 4 best solutions below

4
On BEST ANSWER

We can use the purrr package.

Here is the input data.

# Data frame from lmo's solution
dat <-
  structure(list(X = c(3L, 1L, 5L), Y = c(4L, 2L, 1L), Z = c(5L, 3L, 2L),
                 T = c(6L, 4L, 5L), J = c(1L, 1L, 1L)),
            .Names = c("X", "Y", "Z", "T", "J"),
            class = "data.frame", row.names = c(NA, -3L))

# A numeric vector to show the threshold values
# Notice that columns without any requirements need NA   
vals <- c(X = 2, Y = 2, Z = 4, T = 4, J = NA)

Here is the implementation

library(purrr)

map2_dfc(dat, vals, ~ifelse(.x > .y | is.na(.y), .x, NA)) %>% na.omit()

# A tibble: 1 x 5
      X     Y     Z     T     J
  <int> <int> <int> <int> <int>
1     3     4     5     6     1

map2_dfc loops through each column of dat and the corresponding value in vals, applying the given function to each pair. ~ifelse(.x > .y | is.na(.y), .x, NA) means that if a number in the column is larger than the corresponding value in vals, or that value in vals is NA, the output is the original value from the column. Otherwise, the value is replaced with NA. The output of map2_dfc(dat, vals, ~ifelse(.x > .y | is.na(.y), .x, NA)) is therefore a data frame with NA in every cell where the condition is not met. Finally, na.omit removes the rows that contain an NA.
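
To make the two stages easier to see, here is a minimal sketch of the same mask-then-drop logic written in base R with Map instead of purrr (a swapped-in equivalent, not the code used above). The names masked and result are only illustrative.

```r
# Same data as above, written out with data.frame() for brevity
dat <- data.frame(X = c(3L, 1L, 5L), Y = c(4L, 2L, 1L), Z = c(5L, 3L, 2L),
                  T = c(6L, 4L, 5L), J = c(1L, 1L, 1L))
vals <- c(X = 2, Y = 2, Z = 4, T = 4, J = NA)

# Stage 1: replace every cell that fails its column's threshold with NA.
# A threshold of NA means "no requirement", so the cell is kept as is.
masked <- as.data.frame(Map(function(col, thr) ifelse(col > thr | is.na(thr), col, NA),
                            dat, vals))
masked
#    X  Y  Z  T J
# 1  3  4  5  6 1
# 2 NA NA NA NA 1
# 3  5 NA NA  5 1

# Stage 2: drop every row that contains at least one NA
result <- na.omit(masked)
```

Only the first row survives, because it is the only one in which no cell was masked. Note that this approach would also drop rows that contained NA in the original data.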

Update

Here I demonstrate how to convert the dfTwo data frame to the vals vector in my example.

First, let's create the dfTwo data frame.

dfTwo <- read.table(text = "C.1  C.2
X    Z
Y    T",
                     header = TRUE, stringsAsFactors = FALSE)

dfTwo
  C.1 C.2
1   X   Z
2   Y   T

To complete the task, I load the dplyr and tidyr packages.

library(dplyr)
library(tidyr)

Now I begin the transformation of dfTwo. The first step is to use the stack function to convert it to long format.

dfTwo2 <- dfTwo %>%
  stack() %>%
  setNames(c("Col", "Group")) %>%
  mutate(Group = as.character(Group))
dfTwo2
  Col Group
1   X   C.1
2   Y   C.1
3   Z   C.2
4   T   C.2

The second step is to add the threshold information. One way to do this is to create a look-up table showing the association between Group and Value.

threshold_df <- data.frame(Group = c("C.1", "C.2"),
                           Value = c(2, 4),
                           stringsAsFactors = FALSE)

threshold_df
  Group Value
1   C.1     2
2   C.2     4

And then we can use the left_join function to combine the two data frames.

dfTwo3 <- dfTwo2 %>% left_join(threshold_df, by = "Group")
dfTwo3
  Col Group Value
1   X   C.1     2
2   Y   C.1     2
3   Z   C.2     4
4   T   C.2     4

Now for the third step. Notice that there is a column called J which does not need any threshold, so we need to add this information to dfTwo3. We can use the complete function from tidyr. The following code completes the data frame by adding a row for every Col that is in dat but not in dfTwo3, with NA as its Value.

dfTwo4 <- dfTwo3 %>% complete(Col = colnames(dat))
dfTwo4
# A tibble: 5 x 3
    Col Group Value
  <chr> <chr> <dbl>
1     J  <NA>    NA
2     T   C.2     4
3     X   C.1     2
4     Y   C.1     2
5     Z   C.2     4

The fourth step is to arrange the rows of dfTwo4 in the right order. We can achieve this by turning Col into a factor whose levels follow the order of the column names in dat.

dfTwo5 <- dfTwo4 %>%
  mutate(Col = factor(Col, levels = colnames(dat))) %>%
  arrange(Col) %>%
  mutate(Col = as.character(Col))  
dfTwo5
# A tibble: 5 x 3
    Col Group Value
  <chr> <chr> <dbl>
1     X   C.1     2
2     Y   C.1     2
3     Z   C.2     4
4     T   C.2     4
5     J  <NA>    NA

We are almost there. Now we can create vals from dfTwo5.

vals <- dfTwo5$Value
names(vals) <- dfTwo5$Col
vals
 X  Y  Z  T  J 
 2  2  4  4 NA

Now we are ready to use the purrr package to filter the data.

The above is a breakdown of the steps. We can combine all of them into the following code for simplicity.

library(dplyr)
library(tidyr)
threshold_df <- data.frame(Group = c("C.1", "C.2"),
                           Value = c(2, 4),
                           stringsAsFactors = FALSE)

dfTwo2 <- dfTwo %>%
  stack() %>%
  setNames(c("Col", "Group")) %>%
  mutate(Group = as.character(Group)) %>%
  left_join(threshold_df, by = "Group") %>%
  complete(Col = colnames(dat)) %>%
  mutate(Col = factor(Col, levels = colnames(dat))) %>%
  arrange(Col) %>%
  mutate(Col = as.character(Col))

vals <- dfTwo2$Value
names(vals) <- dfTwo2$Col
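
For comparison, here is a minimal base R sketch that builds the same kind of named threshold vector without dplyr or tidyr. It assumes one threshold per column of dfTwo, stored in a hypothetical lookup vector thresholds. The names dat_cols and vals_base are likewise only illustrative.

```r
dfTwo <- data.frame(C.1 = c("X", "Y"), C.2 = c("Z", "T"),
                    stringsAsFactors = FALSE)
dat_cols <- c("X", "Y", "Z", "T", "J")  # stands in for colnames(dat)

# Hypothetical lookup: one threshold per column (group) of dfTwo
thresholds <- c(C.1 = 2, C.2 = 4)

# Start with NA (no requirement) for every column of dat
vals_base <- setNames(rep(NA_real_, length(dat_cols)), dat_cols)

# For each group, assign its threshold to the columns it lists
for (grp in names(dfTwo)) {
  vals_base[dfTwo[[grp]]] <- thresholds[[grp]]
}

vals_base
#  X  Y  Z  T  J 
#  2  2  4  4 NA 
```

Because the vector is initialised from the column names of dat, it is already in the right order and columns such as J keep NA automatically.
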
0
On
dfOne[Reduce(intersect, list(which(dfOne["X"] > 2),
                             which(dfOne["Y"] > 2),
                             which(dfOne["Z"] > 4),
                             which(dfOne["T"] > 4))),]
#  X Y Z T J
#1 3 4 5 6 1

Or programmatically, driving the same comparisons from a named vector of thresholds:

vals = c(X = 2, Y = 2, Z = 4, T = 4) # from @lmo's answer
dfOne[Reduce(intersect, lapply(names(vals), function(x) which(dfOne[x] > vals[x]))),]
#  X Y Z T J
#1 3 4 5 6 1
1
On

I'm writing this assuming that the second DF is meant to categorize the fields in the first DF. It's way simpler if you don't need to use the second one to define the conditions:

dfNew = dfOne[dfOne$X > 2 & dfOne$Y > 2 & dfOne$Z > 4 & dfOne$T > 4, ]

Or, using dplyr:

library(dplyr)
dfNew = dfOne %>% filter(X > 2 & Y > 2 & Z > 4 & T > 4)

In case that's all you need, I'll save this comment while I poke at the more complicated version of the question.

0
On

Here is a base R method with Map and Reduce.

# build lookup table of thresholds relative to variable name
vals <- setNames(c(2, 2, 4, 4), unlist(dat2))
# subset data.frame
dat[Reduce("&", Map(">", dat[names(vals)], vals)), ]
  X Y Z T J
1 3 4 5 6 1

Here, Map returns a list of length 4 whose elements are logical vectors, one per comparison. This list is passed to Reduce, which combines them with & into a single logical vector whose length equals the number of rows in the data.frame, dat. This logical vector is used to subset dat.
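
The intermediate objects can be inspected with a short sketch like the one below. To keep it self-contained, vals is written out literally here rather than built from dat2, and the names checks and keep are only illustrative.

```r
dat <- data.frame(X = c(3L, 1L, 5L), Y = c(4L, 2L, 1L), Z = c(5L, 3L, 2L),
                  T = c(6L, 4L, 5L), J = c(1L, 1L, 1L))
vals <- c(X = 2, Y = 2, Z = 4, T = 4)

# One logical vector per thresholded column, e.g. checks$X is dat$X > 2
checks <- Map(">", dat[names(vals)], vals)

# Combine the four vectors element-wise: a row is kept only if every check is TRUE
keep <- Reduce("&", checks)
keep
# [1]  TRUE FALSE FALSE

dat[keep, ]
```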

data

dat <-
structure(list(X = c(3L, 1L, 5L), Y = c(4L, 2L, 1L), Z = c(5L, 
3L, 2L), T = c(6L, 4L, 5L), J = c(1L, 1L, 1L)), .Names = c("X", 
"Y", "Z", "T", "J"), class = "data.frame", row.names = c(NA, 
-3L))

dat2 <-
structure(list(C.1 = structure(1:2, .Label = c("X", "Y"), class = "factor"), 
    C.2 = structure(c(2L, 1L), .Label = c("T", "Z"), class = "factor")), .Names = c("C.1", 
"C.2"), class = "data.frame", row.names = c(NA, -2L))