Count NA Values by Group (Year and month) in several columns in R

53 Views Asked by At

I would like to count the number of missing values (NA) in each month and year of the time series (columns) and plot a bar chart. How to extract this information from a data frame? I need as result a table with the number of missing values (NA) per month each year, for each columns.

b <- read.table(text = '        Date  AAA  BBB  CCC  DDD  EEE
49 1999-12-15 24.8 21.4 25.6 35.0 17.4
50 1999-12-16   NA  0.6  1.5  6.3  2.5
51 1999-12-17   NA 16.3 20.3  NA 19.2
52 1999-12-18   13  1.6 NA  6.3  0.0
53 1999-12-19   10 36.4 12.5 26.8 24.9
54 1999-12-20   NA  0.0  0.0  0.2  0.0
55 1999-12-21  0.2  0.0  0.0  0.0  0.0
56 1999-12-22  0.0  0.0  0.0  0.0  0.0')

head(b)

Thank you

3

There are 3 best solutions below

0
jpsmith On

You ask two questions, but to address the one in your title, you can count the NA values across columns AAA through EEE by year and month in dplyr (note the data are slightly changed to include multiple year-month groups):

b %>%
  mutate(ym = format(as.Date(Date), "%Y-%m")) %>%
  summarise(NA_count = sum(across(AAA:EEE, ~sum(is.na(.x)))),
            .by = ym)

#        ym NA_count
# 1 1999-01        3
# 2 1999-02        2
# 3 1999-03        0

To create a barplot of this, there are several ways. One way in base R is (note, I saved the above data as plotdat):

barplot(plotdat$NA_count, 
        names.arg = plotdat$ym)

enter image description here

Note I changed your sample data to have multiple year-month groups:

b <- read.table(text = '        Date  AAA  BBB  CCC  DDD  EEE
49 1999-1-15 24.8 21.4 25.6 35.0 17.4
50 1999-1-16   NA  0.6  1.5  6.3  2.5
51 1999-1-17   NA 16.3 20.3  NA 19.2
52 1999-2-18   13  1.6 NA  6.3  0.0
53 1999-2-19   10 36.4 12.5 26.8 24.9
54 1999-2-20   NA  0.0  0.0  0.2  0.0
55 1999-3-21  0.2  0.0  0.0  0.0  0.0
56 1999-3-22  0.0  0.0  0.0  0.0  0.0')
0
Friede On

Based on the kind data provision given in @jpsmith's answer, I would like to add two approaches:

1. Base R

Use by; very concise:

b$Date = format(as.Date(b$Date), "%Y-%m")
by(b, b$Date, \(x) sum(is.na(x))) |> barplot()

or,

2. {collapse}

for really large data, collapse::BY + rowSums might be faster

library(collapse)
BY(b, b$Date, \(x) sum(is.na(x))) |> rowSums() |> barplot()
1
jay.sf On

Suggesting to use tapply and mean,

> tapply(b[-1], strftime(b$Date, '%Y-%m'), is.na) |> sapply(mean) |> 
+   barplot(ylab='p. NA', las=1)

enter image description here

or colMeans.

> tapply(b[-1], strftime(b$Date, '%Y-%m'), is.na) |> sapply(colMeans) |> 
+   barplot(beside=TRUE, ylab='p. NA', leg=names(b[-1]), col=hcl.colors(ncol(b) - 1), las=1)

enter image description here

Edit

I think it was already answered, but using tapply such a table you can get like so:

> tapply(b[-1], strftime(b$Date, '%Y-%m'), is.na) |> sapply(colSums)
    1999-01 1999-02 1999-03
AAA       2       1       0
BBB       0       0       0
CCC       0       1       0
DDD       1       0       0
EEE       0       0       0

Data:

Thanks to @jpsmith

> dput(b)
structure(list(Date = c("1999-1-15", "1999-1-16", "1999-1-17", 
"1999-2-18", "1999-2-19", "1999-2-20", "1999-3-21", "1999-3-22"
), AAA = c(24.8, NA, NA, 13, 10, NA, 0.2, 0), BBB = c(21.4, 0.6, 
16.3, 1.6, 36.4, 0, 0, 0), CCC = c(25.6, 1.5, 20.3, NA, 12.5, 
0, 0, 0), DDD = c(35, 6.3, NA, 6.3, 26.8, 0.2, 0, 0), EEE = c(17.4, 
2.5, 19.2, 0, 24.9, 0, 0, 0)), class = "data.frame", row.names = c("49", 
"50", "51", "52", "53", "54", "55", "56"))