Find the diameter bin that contains 50% of the total particles

Question

Find the diameter bin that contains 50% of the total particles

88 Views Asked by Maria At 31 October 2023 at 22:02

I'm trying to find the diameter at 50% of the total particles. So I was able to use this code to find the diameter size at 50% of the total particles. So the first column is the diameter size, the next column is the number of particles for each diameter size. The last column is the sum which is the sum of particles as the diameter size decreases. In this example, the total particles is equal to 81394 and the 50% of particles will equal 40697. So I want to find the diameter bin that 40697 particles are in. Example Link

    smps<-read_csv("C:/Users/missl/Desktop/smps.csv")
    transpose <- t(smps)
    transpose <- as.data.frame(transpose)
    rev_data_frame <- rev(transpose)
    rev_data_frame <- t(rev_data_frame)
    smps_1 <- as.data.frame(rev_data_frame)
    smps_1$sum <- cumsum(smps_1[, 2])
    smps_2 <- smps_1[which.min(abs((smps_1[51,3]/2)> smps_1$sum)),]

# dput(smps_1)
structure(list(Diameter = c(1000, 891, 794, 708, 631, 562, 501, 
447, 398, 355, 316, 282, 251, 224, 200, 178, 158, 141, 126, 112, 
100, 89.1, 79.4, 70.8, 63.1, 56.2, 50.1, 44.7, 39.8, 35.5, 31.6, 
28.2, 25.1, 22.4, 20, 17.8, 15.8, 14.1, 12.6, 11.2, 10, 8.91, 
7.94, 7.08, 6.31, 5.62, 5.01, 4.47, 3.98, 3.61, 3.55), Particle = c(6.3743, 
5.1308, 12.0985, 5.4748, 9.823, 21.8273, 37.5763, 51.2838, 59.5292, 
110.409, 158.339, 186.517, 273.547, 358.379, 454.531, 583.11, 
757.914, 859.94, 970.185, 1142.2, 1335.7, 1427, 1552.13, 1651.4, 
1679.82, 1606.73, 1411.52, 1305.51, 1266.84, 1309.77, 1477.73, 
1938.36, 2333.89, 2710.21, 3080.09, 3663.15, 4158.74, 4379.57, 
4005.18, 3426.03, 3795.27, 4143.3, 4466.91, 3890.09, 3297.93, 
3111.45, 2229.47, 1015.87, 1379.63, 898.331, 1382.19), sum = c(6.3743, 
11.5051, 23.6036, 29.0784, 38.9014, 60.7287, 98.305, 149.5888, 
209.118, 319.527, 477.866, 664.383, 937.93, 1296.309, 1750.84, 
2333.95, 3091.864, 3951.804, 4921.989, 6064.189, 7399.889, 8826.889, 
10379.02, 12030.42, 13710.24, 15316.969, 16728.49, 18034, 19300.84, 
20610.61, 22088.34, 24026.7, 26360.59, 29070.8, 32150.89, 35814.04, 
39972.78, 44352.35, 48357.53, 51783.56, 55578.83, 59722.13, 64189.04, 
68079.13, 71377.06, 74488.51, 76717.98, 77733.85, 79113.48, 80011.81, 
81394)), class = "data.frame", row.names = c(NA, -51L))

I'm having trouble figuring out how to expand this code to other columns. Currently, I have only been able to figure out to get the diameter column and the sum of particles as the diameter size decreases. So ultimately, I want to find the diameter for each column that it is 50% of the total particles. So for column V1 it will be 14.1 nm, V2 would be 14.1 nm and V3 will be 15.80 nm.

#dput(loopsmps6)
structure(list(Diameter = c(1000, 891, 794, 708, 631, 562, 501, 
447, 398, 355, 316, 282, 251, 224, 200, 178, 158, 141, 126, 112, 
100, 89.1, 79.4, 70.8, 63.1, 56.2, 50.1, 44.7, 39.8, 35.5, 31.6, 
28.2, 25.1, 22.4, 20, 17.8, 15.8, 14.1, 12.6, 11.2, 10, 8.91, 
7.94, 7.08, 6.31, 5.62, 5.01, 4.47, 3.98, 3.61, 3.55), V1 = c(1382.19, 
898.331, 1379.63, 1015.87, 2229.47, 3111.45, 3297.93, 3890.09, 
4466.91, 4143.3, 3795.27, 3426.03, 4005.18, 4379.57, 4158.74, 
3663.15, 3080.09, 2710.21, 2333.89, 1938.36, 1477.73, 1309.77, 
1266.84, 1305.51, 1411.52, 1606.73, 1679.82, 1651.4, 1552.13, 
1427, 1335.7, 1142.2, 970.185, 859.94, 757.914, 583.11, 454.531, 
358.379, 273.547, 186.517, 158.339, 110.409, 59.5292, 51.2838, 
37.5763, 21.8273, 9.823, 5.4748, 12.0985, 5.1308, 6.3743), V2 = c(1380.47, 
213.662, 1945.19, 2062.84, 2692.5, 3274, 3736.01, 4301.49, 4792.09, 
4626.88, 5150.13, 6270.89, 6731.52, 6553.28, 5123.29, 4635.26, 
4478.81, 3869.76, 3231.69, 2552.12, 1922.94, 1662.09, 1513.65, 
1427.39, 1411.19, 1574.37, 1730.65, 1725.3, 1578.94, 1413.3, 
1308.7, 1209.57, 1046.64, 838.961, 725.613, 535.499, 432.743, 
355.892, 257.469, 206.747, 150.875, 115.105, 85.9249, 38.3227, 
30.5767, 28.3947, 19.1909, 5.5532, 8.9367, 6.5134, 2.6135), V3 = c(1365.3, 
1082.56, 1485.89, 1486.07, 1854.67, 2457.91, 3456.4, 3481.57, 
3272.76, 3598.13, 4004.16, 4480.2, 5368.56, 5734.28, 4916.71, 
4610.23, 4441.32, 3655.14, 2915.26, 2380.36, 2821.82, 2313.77, 
1942.49, 1732.05, 1471.91, 1637.06, 1691.77, 1626.86, 1512.16, 
1466.37, 1275.49, 1152.71, 1007.67, 821.176, 659.546, 584.179, 
456.626, 340.22, 279.581, 198.135, 165.915, 130.029, 83.975, 
35.9218, 17.6912, 18.6721, 14.289, 3.4856, 12.5274, 3.2755, 2.8518
), V4 = c(164.171, 135.322, 630.374, 1421.33, 2090.21, 2529.67, 
2567.1, 4216.73, 6077.57, 5994.51, 6616.98, 7851.76, 7280.47, 
6554.56, 5801.08, 5730.74, 5781.37, 4454.9, 3352.84, 2660.91, 
2259.52, 2082.14, 1949.51, 1774.88, 1534.58, 1650.44, 1624.44, 
1601.32, 1574.22, 1460.39, 1370.78, 1227.98, 1047.13, 840.175, 
617.41, 564.059, 440.58, 329.58, 288.765, 196.878, 167.035, 114.322, 
61.5434, 63.1678, 33.1914, 21.8862, 14.2781, 7.3284, 10.9794, 
7.4324, 3.9122), V5 = c(935.992, 981.076, 1343.07, 2085.9, 2688.29, 
3198.92, 3554.72, 4366.22, 5170.72, 5025.25, 5299.9, 5939.16, 
6302.48, 6363.15, 5824.72, 4750.64, 3525.67, 2888.71, 2346.75, 
1995.1, 1893.54, 1638.47, 1514.59, 1483.24, 1439.87, 1610.69, 
1677.33, 1686.02, 1618.14, 1426.04, 1245.07, 1145.19, 1036.26, 
890.568, 751.837, 567.525, 464.186, 375.798, 259.373, 217.886, 
152.946, 105.769, 76.1406, 50.6423, 34.3214, 22.6856, 12.8905, 
5.3805, 4.7017, 4.5505, 3.2585)), row.names = c(NA, 51L), class = "data.frame")

[![enter image description here][4]][4]

Original Q&A

There are 1 best solutions below

**uke** · Accepted Answer · 2023-11-01T02:11:50.230000

Without really understanding your analysis code, I understand:

You have a sequence of observations, each observation is one particle.

Each row in your dataset is an aggregation of observations. For humans, this is nice, for R this is not the most comfortable data format. R wants one row to equal one observations (=particle). By that I mean a long format, where the diameter is printed explicitly for each particle, even though this means many duplicate diameters.

After this, it should be easy to apply the quantile() function. Because I think that's what you are looking for when you say: "I want the diameter bin that 50% of the particles are [in or below]". So you want the 50%-quantile, aka the median. There even is the median() function for that.

Using the data you provided:

# Using dput() of smps_1 from question

# Step 1: Expand dataset into longer format
# This is dirty: rounding Particle because it is not a whole number
# But the overall outcome of n = 81394 rows is equal to the cumsum end value, so it cannot be that bad. 

long_smps <- tidyr::uncount(smps_1, weights = round(Particle))

# Step 2: Compute Median

median(long_smps$Diameter)
#> 14.1
# Same as manually computed by OP

We get the same 50%-quantile of 14.1 as in your question! Now we can easily visualize the distribution of the variable as well with histograms.

Note that the rounding I apply is a dirty fix to accomodate the fact, that your particle counts are not whole numbers. So somewhere, we are overlooking something which is reflected by this rounding. This means, maybe this is not the correct solution for your data.

expanding this to multiple columns of count data

As of my understanding, in smps_6 the variables V1 to V5 contain particle counts for each diameter at 5 different measurement dates.

head(smps_6)
#  Diameter       V1       V2      V3       V4       V5
#1     1000 1382.190 1380.470 1365.30  164.171  935.992
#2      891  898.331  213.662 1082.56  135.322  981.076
#3      794 1379.630 1945.190 1485.89  630.374 1343.070
#4      708 1015.870 2062.840 1486.07 1421.330 2085.900
#5      631 2229.470 2692.500 1854.67 2090.210 2688.290
#6      562 3111.450 3274.000 2457.91 2529.670 3198.920

We again need to expand the diameter column. The goal is to have the diameters of each variable in a long format, e.g. for V1 the "uncounted" diameter sequence would look like this:

1000 1000 1000 ... (repeated 1382.19 times) followed by 891 891 891 ... (repeated 898.331 times) - and so on for all the other rows.

This is what tidyr::uncount() accomplishes. You see here why I am rounding, because you cannot repeat the diameter 1000 for .19 times. Either there was a counted observation or not. The following answer ignores this problem and assumes it is OK to round to whole numbers.

We cannot expand all 5 variables into a long format if they are in the same data frame sharing the same diameter column. Because e.g. in V1, a diameter of 1000 is occurring 1382 times, in V2 - V5 this number is different.

So we have to create individual data frames, where there is just a combination of the Diameter column and the specific V... column containing the count data. We have to do this 5 times to accommodate the particle counts V1 to V5.

I achieve this by using a for loop.

This is my solution code. What it does:

Take smps_6, then
select only the Diameter column and the current V... column
make a data frame in long format by "uncounting"
take the resulting long diameter sequence and
compute the median

# allocate storage for the results
results <- vector("double", length = 5)

# this is one of many ways to do it
# i chose iterating over 2:6 because your particle colums in smps_6 are at 2:6

for (i in 2:6) { 
results[i - 1] <- smps_6 |> 
  # narrow down selection to only the important variables
  # diameter column is always present
    dplyr::select(Diameter, !!i) |>
  # put the data in a longer format
  # where each observation = 1 particle = 1 row
  # the weights is always the count data of the current column
    tidyr::uncount(weights = round(smps_6[[i]])) |>
  # we don't need the count data any more, we only need the diameter
  # we need it as a vector to pass it into median
  # this equals median(dataframe$diameter) but is compatible with the pipe
    dplyr::pull(Diameter) |>
    median()
}

# recreate your variable names to clearly see which results belong where
names(results) <- paste0("V", 1:5)

results
#  V1  V2  V3  V4  V5 
# 224 224 200 224 224

Note that I am not getting the same results as you are: V1 is not 14.1 in my case. So clearly there is something wrong we have to investigate, but this is not the fault of the loop. The the median of the smps_6 data you dput() (as loopsmps_6, i shortened the name) is in fact 224 for V1

Find the diameter bin that contains 50% of the total particles

There are 1 best solutions below

expanding this to multiple columns of count data

Related Questions in R

Related Questions in CLOSEST

Trending Questions

Popular # Hahtags

Popular Questions