Scatterplot with Color Gradient from the Mean

134 Views Asked by At

I have a data frame "DF" of 2020 observations and 79066 variables. The first column is the year spanning continuously from 1 to 2020, the other columns (variables) are numbers.

For reproducibility, I created I fake data frame with 20 years from 2000 to 2020, and only 100 variables. E.g.:

set.seed(123)

i <- 100

DF <- data.frame(year=c(2000:2020), 
                 setNames(
                   as.data.frame(lapply(1:i, function(k) c(rnorm(21)))), 
                   paste("Var_", 1:i, sep = "")))

I then created a Mean by row

DF$Aver <- apply(DF[, 2:101], 1, mean, na.rm=TRUE)

I then plotted the average as a line and added the points

plot(DF$year, DF$Aver, type="l", col=1, cex=0.5, las=1, xlab="", ylab="", ylim=c(-4, 4)) 
for (i in 2:101) {
  points(DF$year, DF[, i], pch=20, cex=1, col='gray')
}

However, what I would like to have is a scatterplot where the points close to the mean are dark grey and the grey colour goes shading (light grey) towards the tail values.

1

There are 1 best solutions below

8
jay.sf On

You could normalize the variable in iteration and use it as alpha value in the rgb(red, gree, blue, alpha) function, where values are between [0, 1]. Since smallest values get zero alpha, we simply make points a second time with a very light gray.

normalize <- function(x, na.rm=FALSE) {
  na <- is.na(x)
  x[na] <- min(x, na.rm=TRUE)
  num <- x - min(x)
  denom <- max(x) - min(x)
  return(replace(num/denom, na, NA_real_))
}

plot(DF$year, DF$Aver, type="l", col=1, cex=0.5, las=1, xlab="", ylab="", ylim=c(-4, 4)) 
for (i in 2:101) {
  # points(DF$year, DF[, i], pch=20, cex=1, col=rgb(.8, .8, .8, 1 - normalize(abs(DF[, i]))))
  points(DF$year, DF[, i], pch=20, cex=1, col=rgb(.8, .8, .8, 1 - {nz <- normalize(abs(DF[, i])); replace(nz, is.na(nz), 0)}))
  points(DF$year, DF[, i], pch=20, cex=1, col=rgb(.1, .1, .1, .05))
  # points(DF$year, DF[, i], pch=20, cex=1, col='gray')
}

enter image description here

Update

It is not feasible to plot a huge number of points like 159,713,320 in your case. The human eye cannot resolve this, and a nasty surprise could be waiting for us in the copy store.

A common way of solving the issue is to use a smaller random sample s of the data with a fraction of the columns in the size= argument. (No need to care for integers, sample will round reals.) It will represent your data adequately.

set.seed(42)  ## for sake of reproducibility
s <- sample(2:ncol(DF_huge), size=(ncol(DF_huge) - 1)*.0005)  ## see data below

We might initialize an empty plot using type=n, since the line will be overlaid from the points` anyway.

Next, in the for loop for the points we iterate exactly over this subset s. Here with that many points, we could simplify to a single color '#RRGGBBaa' defined in hexadecimal format 1-F, where R=red, G=green, B=blue and a=alpha (opacity). I commented on the approach from above, but you might want to try both.

Finally, using lines we draw the average line as the top layer, where we may use the whole dataset.

plot(DF_huge$year, DF_huge$Aver, type="n", las=1, xlab="", ylab="", ylim=c(-4, 4)) 
for (i in s) {
  v <- .5
  # points(DF_huge$year, DF_huge[, i], pch='.', cex=3,
  #        col=rgb(v, v, v, 1 - {nz <- normalize(abs(DF_huge[, i])); replace(nz, is.na(nz), 0)}))
  points(DF_huge$year, DF_huge[, i], pch='.', cex=3, col="#88888833")
}
lines(DF_huge$year, DF_huge$Aver, type="l", col=1, cex=0.5, las=1, xlab="", ylab="", ylim=c(-4, 4)) 

enter image description here

See also these older questions dealing with plotting of large data: 1, 2, 3.


Data:

n <- 2020
m <- 79066
set.seed(42)
Var <- replicate(m, rnorm(n))
Var[sample(seq_along(Var), 2000)] <- NA_real_  ## produce 2000 random missings
DF_huge <- data.frame(year=seq_len(n), Var=Var)

DF_huge$Aver <- rowMeans(DF_huge[-1], na.rm=TRUE)  ## much faster than `apply(., mean)`!