Updating only certain values of data frame based on match


I'm trying to update a variable (popsnp) in a higher scope within an lapply, on the basis of a match. I can't quite figure out the syntax for updating the values, though; what I have currently overwrites any previously existing values with NA:

lapply(1:22, function(i){
  in.name<-paste("/data/mdp14aps/ld/chr", i, ".ld", sep="")
  out.name<-paste("/data/mdp14aps/R/ldatachr", i, ".rda", sep="")
  ldata<-read.csv(in.name, sep="", header=TRUE)
  freq<-count(ldata, c("SNP_A", "CHR_A", "BP_A"))

  #the part I'm not sure about
  popsnp$chrom<<-freq[match(popsnp$marker, freq$SNP_A),2]
  popsnp$position<<-freq[match(popsnp$marker, freq$SNP_A),3]
  popsnp$freq<<-freq[match(popsnp$marker, freq$SNP_A),4]

  rm(ldata, freq)
})

I want to preserve the values I'm setting between iterations of lapply so I end up with popsnp containing all values of chrom, position and freq, not just the last iteration.

I feel like this should be straightforward, but I'm still somewhat unfamiliar with R.

A toy example:

test<-data.frame(A = c("a", "b", "c", "d", "e"), B = c(rep(NA,5)))
test1<-data.frame(A = c("a", "b"), B = c(1, 2))
test2<-data.frame(A = c("c", "d", "e"), B = c(3, 4, 5))

test$B<-test1[match(test$A, test1$A), 2]
test$B<-test2[match(test$A, test2$A), 2]

I want test$B to have the values from 1-5 in it.


There is 1 best solution below


Update for your Toy Example

You need to subset both sides of your assignment, and also convert your conditions to logical subsetting vectors.

idx1 <- match(test$A, test1$A) # row in test1 for each test$A, NA if absent
idx2 <- match(test$A, test2$A)
logical1 <- !is.na(idx1) # TRUE/FALSE
logical2 <- !is.na(idx2)

test$B[logical1] <- test1$B[idx1[logical1]] # updates only TRUE rows
test$B[logical2] <- test2$B[idx2[logical2]]

I recommend you look at each element individually so you can see what's happening.
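
To see why the original assignment overwrote earlier values with NA, it can help to print the intermediate pieces. Here is a small sketch using the toy data from the question (the names idx1 and keep1 are only illustrative):

```r
test  <- data.frame(A = c("a", "b", "c", "d", "e"), B = rep(NA, 5))
test1 <- data.frame(A = c("a", "b"), B = c(1, 2))

# Position in test1 of each value of test$A; NA where there is no match
idx1 <- match(test$A, test1$A)
idx1
# [1]  1  2 NA NA NA

# Indexing with NA yields NA, which is what clobbered the earlier values
test1[idx1, 2]
# [1]  1  2 NA NA NA

# Logical vector marking only the rows that actually matched
keep1 <- !is.na(idx1)
keep1
# [1]  TRUE  TRUE FALSE FALSE FALSE
```

Assigning only where keep1 is TRUE leaves the unmatched rows alone instead of filling them with NA.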


I'm not exactly sure I understand what your example is trying to accomplish, so I'm going to provide you with a toy example of subsetting:

dat <- data.frame(
  A = sample(letters[3:26], 26, replace = TRUE),
  B = runif(26)
)

# Replaces everything in column B where column A == "c"
dat[dat$A == "c", "B"] <- 1

# dat$A == "c" returns a TRUE/FALSE vector, "B" returns column "B".

Best practice is to always use TRUE / FALSE conditions while subsetting to avoid future errors. You could subset by row number, but it ALWAYS gets messy.

It's important to note that your use of <<- pushes your change of the variable to the parent environment, outside of the scope of your function. This can lead to unexpected results in the future. It's better to supply the variable you want to change and then return it again at the end of your manipulation function. This way you have a clear sequence of events.

myfun <- function(x,y) { 
  # ... do stuff to y
  return(y) # hand the modified value back to the caller
}

y <- myfun(x,y) 
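
Applied to the toy example, that pattern might look like the sketch below. The helper name update_from and the for loop over a list of lookup tables are illustrative assumptions, not something from the question; in the real case each lookup table would presumably be the freq data frame for one chromosome.

```r
# Hypothetical helper: fill target$B from lookup$B for rows whose A matches,
# leave all other rows untouched, and return the updated data frame.
update_from <- function(target, lookup) {
  idx  <- match(target$A, lookup$A)
  keep <- !is.na(idx)
  target$B[keep] <- lookup$B[idx[keep]]
  target
}

test  <- data.frame(A = c("a", "b", "c", "d", "e"), B = rep(NA, 5))
test1 <- data.frame(A = c("a", "b"), B = c(1, 2))
test2 <- data.frame(A = c("c", "d", "e"), B = c(3, 4, 5))

# Each iteration passes the current data frame in and takes the result back,
# so nothing is modified behind the scenes with <<-
for (lookup in list(test1, test2)) {
  test <- update_from(test, lookup)
}

test$B
# [1] 1 2 3 4 5
```

Because the data frame is passed in and returned explicitly, values set in earlier iterations survive into later ones.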

Final Update

Lastly, with respect to dropping unnecessary columns: typical practice is to drop them after import, either by name (best practice) or by column number (which breaks if the layout of the data changes).

ldata[c('col1','col2',...)] <- NULL # drop
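
As a concrete sketch of dropping by name (the column names R2 and DP below are made up for illustration and are not taken from the question's data):

```r
ldata <- data.frame(
  SNP_A = c("rs1", "rs2"),
  CHR_A = c(1, 1),
  BP_A  = c(100, 200),
  R2    = c(0.5, 0.9),   # hypothetical column we do not need
  DP    = c(0.1, 0.2)    # hypothetical column we do not need
)

# Assigning NULL to a set of named columns removes them
ldata[c("R2", "DP")] <- NULL

names(ldata)
# [1] "SNP_A" "CHR_A" "BP_A"
```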