Preferentially Sampling Based upon Value Size

Question

115 Views Asked by jjniev01 At 20 May 2021 at 10:13

I think I'm overcomplicating this, but it has some of my colleagues stumped as well.

I've got a set of areas represented by polygons, and a column in the dataframe holding their areas. The distribution of areas is heavily right-skewed. Essentially, I want to randomly sample the polygons using sampling probabilities that are inversely related to their area. Rescaling the values to between zero and one (using the (x - min(x)) / (max(x) - min(x)) method) and subtracting them from 1 would seem to be the intuitive approach, but this would simply mean that the smallest polygons are almost always the ones sampled.
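To make the intuitive approach concrete, here is a minimal sketch of the min-max rescaling described above, applied to the same kind of simulated data as in the reproducible code further down. The fixed seed and the names `scaled`, `w_linear`, `p_linear`, `share_smallest` and `share_largest` are assumptions introduced only for illustration.

```r
set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible
n <- 500
df <- data.frame(ID = 1:n, AREA = replicate(n, sum(rexp(n = 8, rate = 0.1))))

# Min-max rescale AREA to [0, 1], then invert so small areas get large weights
scaled <- (df$AREA - min(df$AREA)) / (max(df$AREA) - min(df$AREA))
w_linear <- 1 - scaled

# sample() normalises `prob` internally; dividing by the sum here only makes
# the resulting probabilities explicit
p_linear <- w_linear / sum(w_linear)

# Share of the total probability held by the smallest and largest 10% of areas
ord <- order(df$AREA)
tenth <- n / 10
share_smallest <- sum(p_linear[ord[1:tenth]])
share_largest <- sum(p_linear[ord[(n - tenth + 1):n]])
```

Comparing `share_smallest` with `share_largest` shows how strongly this scheme favours small polygons. Note that the largest polygon always receives a probability of exactly zero, because its rescaled value is 1.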

I'd like a flatter (but not uniform!) right-skewed distribution of sampling probabilities across the values, but I am unsure how to do this while taking the area values into account. I don't think stratifying them is what I am looking for either, as that would introduce arbitrary bounds on the probability allocations.

Reproducible code is below; the item of interest (the vector of probabilities) is prob_vector. That is, how can I generate prob_vector given the above scenario and desired outcome?

```r
# Data
n <- 500
df <- data.frame("ID" = 1:n, "AREA" = replicate(n, sum(rexp(n = 8, rate = 0.1))))

# Generate the sampling probability somehow based upon the AREA values,
# with smaller areas having higher sampling probability:
prob_vector <- ??????

# Sampling:
s <- sample(df$ID, size = 1, prob = prob_vector)
```


**Martin Wettstein** · Accepted Answer · 2021-05-20T13:01:44.830000

There is no single best solution to this question, since a wide range of probability vectors is possible. You can add any kind of curvature and slope. In this small script, I simulate an extremely right-skewed distribution of areas (0-100 units), and you can define and directly visualize any probability vector you want.

```r
# Simulate an extremely right-skewed distribution of areas, capped at 100
area.dist <- rgamma(1000, 1, 3) * 40
area.dist[area.dist > 100] <- 100
hist(area.dist, main = "Probability functions")

# Candidate probability functions over the 0-100 area range
area <- seq(0, 100, 0.1)
scaled <- (area - min(area)) / (max(area) - min(area))  # min-max rescaled to [0, 1]
prob_vector1 <- 1 - scaled          ## linear
prob_vector2 <- .8 - (.6 * scaled)  ## low slope
prob_vector3 <- 1 / (1 + scaled)^4  ## strong curve
prob_vector4 <- .4 / (.4 + scaled)  ## low curve
legend("topright", c("linear", "low slope", "strong curve", "low curve"),
       col = c("red", "green", "blue", "orange"), lwd = 1)

# Overlay the curves, multiplied by 500 so they show up on the histogram's count axis
lines(area, prob_vector1 * 500, col = "red")
lines(area, prob_vector2 * 500, col = "green")
lines(area, prob_vector3 * 500, col = "blue")
lines(area, prob_vector4 * 500, col = "orange")
```

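The output is a histogram of the simulated areas with the four probability curves drawn over it (plot not reproduced here).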

The red line is your solution; the others are adjustments that make the preference for small areas weaker. Just change the numbers in the probability function until you get one that fits your expectations.
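As a rough sketch of how one of these curves could be plugged back into the question's setup, the "low curve" function can be evaluated on the rescaled AREA column and normalised into `prob_vector`. The fixed seed and the names `scaled`, `k` and `w` are assumptions introduced here for illustration; `k = 0.4` is simply the constant used in the "low curve" line.

```r
set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible
n <- 500
df <- data.frame(ID = 1:n, AREA = replicate(n, sum(rexp(n = 8, rate = 0.1))))

# Min-max rescale the areas to [0, 1]
scaled <- (df$AREA - min(df$AREA)) / (max(df$AREA) - min(df$AREA))

# "Low curve" weights: 1 for the smallest area, k / (k + 1) for the largest.
# A larger k gives a flatter curve; a smaller k gives a steeper one.
k <- 0.4
w <- k / (k + scaled)

# Normalise so the weights sum to 1 (sample() would also do this internally)
prob_vector <- w / sum(w)

# Draw one polygon ID with the area-dependent probabilities
s <- sample(df$ID, size = 1, prob = prob_vector)
```

With this function every polygon keeps a non-zero chance of being drawn, and the smallest polygon is (k + 1) / k times as likely as the largest, which is 3.5 times for `k = 0.4`. Adjusting `k` is one way to tune how flat the sampling distribution is.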
