I'm trying to find the diameter at 50% of the total particles. So I was able to use this code to find the diameter size at 50% of the total particles. So the first column is the diameter size, the next column is the number of particles for each diameter size. The last column is the sum which is the sum of particles as the diameter size decreases. In this example, the total particles is equal to 81394 and the 50% of particles will equal 40697. So I want to find the diameter bin that 40697 particles are in. Example Link
smps<-read_csv("C:/Users/missl/Desktop/smps.csv")
transpose <- t(smps)
transpose <- as.data.frame(transpose)
rev_data_frame <- rev(transpose)
rev_data_frame <- t(rev_data_frame)
smps_1 <- as.data.frame(rev_data_frame)
smps_1$sum <- cumsum(smps_1[, 2])
smps_2 <- smps_1[which.min(abs((smps_1[51,3]/2)> smps_1$sum)),]
# dput(smps_1)
structure(list(Diameter = c(1000, 891, 794, 708, 631, 562, 501,
447, 398, 355, 316, 282, 251, 224, 200, 178, 158, 141, 126, 112,
100, 89.1, 79.4, 70.8, 63.1, 56.2, 50.1, 44.7, 39.8, 35.5, 31.6,
28.2, 25.1, 22.4, 20, 17.8, 15.8, 14.1, 12.6, 11.2, 10, 8.91,
7.94, 7.08, 6.31, 5.62, 5.01, 4.47, 3.98, 3.61, 3.55), Particle = c(6.3743,
5.1308, 12.0985, 5.4748, 9.823, 21.8273, 37.5763, 51.2838, 59.5292,
110.409, 158.339, 186.517, 273.547, 358.379, 454.531, 583.11,
757.914, 859.94, 970.185, 1142.2, 1335.7, 1427, 1552.13, 1651.4,
1679.82, 1606.73, 1411.52, 1305.51, 1266.84, 1309.77, 1477.73,
1938.36, 2333.89, 2710.21, 3080.09, 3663.15, 4158.74, 4379.57,
4005.18, 3426.03, 3795.27, 4143.3, 4466.91, 3890.09, 3297.93,
3111.45, 2229.47, 1015.87, 1379.63, 898.331, 1382.19), sum = c(6.3743,
11.5051, 23.6036, 29.0784, 38.9014, 60.7287, 98.305, 149.5888,
209.118, 319.527, 477.866, 664.383, 937.93, 1296.309, 1750.84,
2333.95, 3091.864, 3951.804, 4921.989, 6064.189, 7399.889, 8826.889,
10379.02, 12030.42, 13710.24, 15316.969, 16728.49, 18034, 19300.84,
20610.61, 22088.34, 24026.7, 26360.59, 29070.8, 32150.89, 35814.04,
39972.78, 44352.35, 48357.53, 51783.56, 55578.83, 59722.13, 64189.04,
68079.13, 71377.06, 74488.51, 76717.98, 77733.85, 79113.48, 80011.81,
81394)), class = "data.frame", row.names = c(NA, -51L))
I'm having trouble figuring out how to expand this code to other columns. Currently, I have only been able to figure out to get the diameter column and the sum of particles as the diameter size decreases. So ultimately, I want to find the diameter for each column that it is 50% of the total particles. So for column V1 it will be 14.1 nm, V2 would be 14.1 nm and V3 will be 15.80 nm.
#dput(loopsmps6)
structure(list(Diameter = c(1000, 891, 794, 708, 631, 562, 501,
447, 398, 355, 316, 282, 251, 224, 200, 178, 158, 141, 126, 112,
100, 89.1, 79.4, 70.8, 63.1, 56.2, 50.1, 44.7, 39.8, 35.5, 31.6,
28.2, 25.1, 22.4, 20, 17.8, 15.8, 14.1, 12.6, 11.2, 10, 8.91,
7.94, 7.08, 6.31, 5.62, 5.01, 4.47, 3.98, 3.61, 3.55), V1 = c(1382.19,
898.331, 1379.63, 1015.87, 2229.47, 3111.45, 3297.93, 3890.09,
4466.91, 4143.3, 3795.27, 3426.03, 4005.18, 4379.57, 4158.74,
3663.15, 3080.09, 2710.21, 2333.89, 1938.36, 1477.73, 1309.77,
1266.84, 1305.51, 1411.52, 1606.73, 1679.82, 1651.4, 1552.13,
1427, 1335.7, 1142.2, 970.185, 859.94, 757.914, 583.11, 454.531,
358.379, 273.547, 186.517, 158.339, 110.409, 59.5292, 51.2838,
37.5763, 21.8273, 9.823, 5.4748, 12.0985, 5.1308, 6.3743), V2 = c(1380.47,
213.662, 1945.19, 2062.84, 2692.5, 3274, 3736.01, 4301.49, 4792.09,
4626.88, 5150.13, 6270.89, 6731.52, 6553.28, 5123.29, 4635.26,
4478.81, 3869.76, 3231.69, 2552.12, 1922.94, 1662.09, 1513.65,
1427.39, 1411.19, 1574.37, 1730.65, 1725.3, 1578.94, 1413.3,
1308.7, 1209.57, 1046.64, 838.961, 725.613, 535.499, 432.743,
355.892, 257.469, 206.747, 150.875, 115.105, 85.9249, 38.3227,
30.5767, 28.3947, 19.1909, 5.5532, 8.9367, 6.5134, 2.6135), V3 = c(1365.3,
1082.56, 1485.89, 1486.07, 1854.67, 2457.91, 3456.4, 3481.57,
3272.76, 3598.13, 4004.16, 4480.2, 5368.56, 5734.28, 4916.71,
4610.23, 4441.32, 3655.14, 2915.26, 2380.36, 2821.82, 2313.77,
1942.49, 1732.05, 1471.91, 1637.06, 1691.77, 1626.86, 1512.16,
1466.37, 1275.49, 1152.71, 1007.67, 821.176, 659.546, 584.179,
456.626, 340.22, 279.581, 198.135, 165.915, 130.029, 83.975,
35.9218, 17.6912, 18.6721, 14.289, 3.4856, 12.5274, 3.2755, 2.8518
), V4 = c(164.171, 135.322, 630.374, 1421.33, 2090.21, 2529.67,
2567.1, 4216.73, 6077.57, 5994.51, 6616.98, 7851.76, 7280.47,
6554.56, 5801.08, 5730.74, 5781.37, 4454.9, 3352.84, 2660.91,
2259.52, 2082.14, 1949.51, 1774.88, 1534.58, 1650.44, 1624.44,
1601.32, 1574.22, 1460.39, 1370.78, 1227.98, 1047.13, 840.175,
617.41, 564.059, 440.58, 329.58, 288.765, 196.878, 167.035, 114.322,
61.5434, 63.1678, 33.1914, 21.8862, 14.2781, 7.3284, 10.9794,
7.4324, 3.9122), V5 = c(935.992, 981.076, 1343.07, 2085.9, 2688.29,
3198.92, 3554.72, 4366.22, 5170.72, 5025.25, 5299.9, 5939.16,
6302.48, 6363.15, 5824.72, 4750.64, 3525.67, 2888.71, 2346.75,
1995.1, 1893.54, 1638.47, 1514.59, 1483.24, 1439.87, 1610.69,
1677.33, 1686.02, 1618.14, 1426.04, 1245.07, 1145.19, 1036.26,
890.568, 751.837, 567.525, 464.186, 375.798, 259.373, 217.886,
152.946, 105.769, 76.1406, 50.6423, 34.3214, 22.6856, 12.8905,
5.3805, 4.7017, 4.5505, 3.2585)), row.names = c(NA, 51L), class = "data.frame")
[![enter image description here][4]][4]


Without really understanding your analysis code, I understand:
You have a sequence of observations, each observation is one particle.
Each row in your dataset is an aggregation of observations. For humans, this is nice, for R this is not the most comfortable data format. R wants one row to equal one observations (=particle). By that I mean a long format, where the diameter is printed explicitly for each particle, even though this means many duplicate diameters.
After this, it should be easy to apply the
quantile()function. Because I think that's what you are looking for when you say: "I want thediameterbin that 50% of the particles are [in or below]". So you want the 50%-quantile, aka the median. There even is themedian()function for that.Using the data you provided:
We get the same 50%-quantile of 14.1 as in your question! Now we can easily visualize the distribution of the variable as well with histograms.
Note that the rounding I apply is a dirty fix to accomodate the fact, that your particle counts are not whole numbers. So somewhere, we are overlooking something which is reflected by this rounding. This means, maybe this is not the correct solution for your data.
expanding this to multiple columns of count data
As of my understanding, in
smps_6the variablesV1toV5contain particle counts for each diameter at 5 different measurement dates.We again need to expand the diameter column. The goal is to have the diameters of each variable in a long format, e.g. for V1 the "uncounted" diameter sequence would look like this:
1000 1000 1000 ...(repeated 1382.19 times) followed by891 891 891 ...(repeated 898.331 times) - and so on for all the other rows.This is what
tidyr::uncount()accomplishes. You see here why I am rounding, because you cannot repeat the diameter 1000 for .19 times. Either there was a counted observation or not. The following answer ignores this problem and assumes it is OK to round to whole numbers.We cannot expand all 5 variables into a long format if they are in the same data frame sharing the same diameter column. Because e.g. in V1, a diameter of 1000 is occurring 1382 times, in V2 - V5 this number is different.
So we have to create individual data frames, where there is just a combination of the
Diametercolumn and the specificV...column containing the count data. We have to do this 5 times to accommodate the particle countsV1toV5.I achieve this by using a
forloop.This is my solution code. What it does:
smps_6, thenDiametercolumn and the currentV...columnNote that I am not getting the same results as you are:
V1is not 14.1 in my case. So clearly there is something wrong we have to investigate, but this is not the fault of the loop. The the median of thesmps_6data youdput()(asloopsmps_6, i shortened the name) is in fact 224 forV1