Context
I have the following Tables in Microsoft SQL Server:
MainTable
UserId
EmailType
Column 1
Column 2
..
Column N
EmailDistribution
EmailType
WeightageRequired
EmailType has about 10 distinct value posibility, say A-J, for both the tables. Also, the sum of WeightageRequired for the 10 values togather is 1.
In MainTable, there are multiple combinations of UserId & EmailType. However, UserId-EmailType Combination would have a unique value. Thus, following is possibility:
UserId - EmailType
1-A
1-B
1-D
2-A
2-C
2-G
3-A
3-B
etc. Also, one particular UserId need not have a combination with all the EmailType values. Thus, in the given example above, UserId 1 only has rows corresponding to A,B & D and not for the rest of the possible values of EmailType.
Now, the requirement:
Out of, say, 50000 rows in the MainTable, I want to select 1 row for each UserId such that the distribution of the EmailType is as close as possible to the EmailDistribution's WeightageRequired.
For example, if the 50000 rows have 12000 unique UserIds, the resultset has to have 12000 rows (and only 12000 rows). However, within the set of rows for a particular UserId, a random row should be picked so as to work towards the required weightage distribution.
There is no hard requirement for the weightage to be exactly what is required, but the closer it is to the requirement, the better is the model fit.
I hope I am able to explain my problem correctly.
Request the great minds on StackOverflow for assistance.
Here's a non-loop version, albeit not ideal, but you could create a bunch of CTEs to get distinct userids.
Let's pretend these are your weights by emailType:
And let's pretend you have 10,000 distinct UserIDs in the main table. So group A should represent ~10 rows (10,000 * 0.00097).
This "works" to some degree but you're not guaranteed to get all 10,000 user ids. For example, the first 3 CTEs may grab some people you need for Group D. If you are looking for a non-loop solution, then perhaps start by first counting how many unique IDs you have by EmailType and then compare to your Weights. That may help determine how to order the CTEs (should you choose this method).
Again, not ideal but might get you close to what you need.