Objective-C NSIndexSet / NSArray - Selecting the "Best" Index from Set using Standard Dev


I have a question about standard deviation, and whether I'm using it properly for my case as laid out below.

The indexes are all unique. Here are a few questions I have about standard deviation:
1) Since I'm using all of the data, should I use the population standard deviation or the sample standard deviation?
2) Does the length (range) of the full playlist (1...15) matter?

I have a program which takes a Playlist of Songs and gets recommendations for each song from Spotify.

Say the playlist has a length of 15.
Each track gets an array of about 30 suggested tracks.
And in the end my program will filter down all of the suggestions to create a new playlist of only 15 tracks.

There are often duplicates among the recommendations.
I have devised a method for finding these duplicates and putting their indexes into an NSIndexSet.

In my example there is a duplicate track that was suggested for tracks
in the original playlist at indexes 4, 6, 7, 12

I'm trying to work out which of the duplicates is the best one to pick. The NSSet methods and the like were not going to help me, because they would not take into account "where" the duplicates were placed. To me it makes sense that the more often a track was suggested within a "zone", the more sense it makes to "use" it at that location in the final suggested playlist.

Originally I was just selecting the index closest to the mean (7.25), which is 7.
But to me, 6 seems like a better choice than 7.
The 12 seems to throw it off.
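
For reference, the "closest to the mean" selection described above can be sketched in plain C, which also compiles unchanged inside an Objective-C .m file. This is only an illustration: the names `abs_diff` and `closest_to_mean` are made up here and are not part of Foundation or of my actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute difference between two doubles (avoids needing math.h). */
static double abs_diff(double a, double b)
{
    return a > b ? a - b : b - a;
}

/* Hypothetical helper: returns the entry of `indexes` that lies closest
   to the arithmetic mean of all entries. On a tie the earlier entry wins. */
static unsigned long closest_to_mean(const unsigned long *indexes, size_t count)
{
    assert(count > 0);

    double sum = 0.0;
    for (size_t i = 0; i < count; i++) {
        sum += (double)indexes[i];
    }
    double mean = sum / (double)count;

    unsigned long best = indexes[0];
    double best_dist = abs_diff((double)best, mean);
    for (size_t i = 1; i < count; i++) {
        double dist = abs_diff((double)indexes[i], mean);
        if (dist < best_dist) {
            best = indexes[i];
            best_dist = dist;
        }
    }
    return best;
}
```

With {4, 6, 7, 12} this picks 7. If the 12 is removed by hand, leaving {4, 6, 7}, it picks 6, which is consistent with the feeling that the 12 is pulling the mean upward.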

So I started investigating standard deviation and figured it could help me out.
What do you think of my approach here?


NSMutableIndexSet* dupeIndexsSet;  // contains indexes 4,6,7,12
// I have an extension on NSIndexSet to create a NSArray from it
NSArray* dupeIndexSetArray = [dupeIndexsSet indexSetAsArray];
// @[@4, @6, @7, @12]
NSUInteger dupeIndexsCount = [dupeIndexSetArray count]; // 4
NSUInteger dupeIndexSetFirst = [dupeIndexsSet firstIndex]; // 4
NSUInteger dupeIndexSetLast = [dupeIndexsSet lastIndex]; // 12

// I have an extension on NSArray to calculate the mean
NSNumber* dupeIndexsMean = [dupeIndexSetArray meanOf]; // 7.25

the populationSD is 2.9475  
the populationVariance is 8.6875  

the sampleSD is 3.4034  
the sampleVariance is 11.5833

Which SD should I use?
Or does it not matter?
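
To make the difference between the two concrete: both start from the same sum of squared deviations and differ only in the divisor, N for the population form and N - 1 for the sample form. Below is a minimal sketch in plain C (usable from an Objective-C file as well). The function names are hypothetical and this is not the NSArray extension I actually use.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of squared deviations from the mean, divided by `divisor`. */
static double variance_with_divisor(const double *values, size_t count, double divisor)
{
    double sum = 0.0;
    for (size_t i = 0; i < count; i++) {
        sum += values[i];
    }
    double mean = sum / (double)count;

    double squares = 0.0;
    for (size_t i = 0; i < count; i++) {
        double d = values[i] - mean;
        squares += d * d;
    }
    return squares / divisor;
}

/* Population variance: divide by N. */
static double population_variance(const double *values, size_t count)
{
    assert(count > 0);
    return variance_with_divisor(values, count, (double)count);
}

/* Sample variance: divide by N - 1 (Bessel's correction). */
static double sample_variance(const double *values, size_t count)
{
    assert(count > 1);
    return variance_with_divisor(values, count, (double)(count - 1));
}
```

For {4, 6, 7, 12} these return 8.6875 and 11.5833, the two variances listed above; their square roots are the 2.9475 and 3.4034 SD values. The sample form is always the larger of the two, and the gap shrinks as the number of indexes grows.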

I learned that the SD describes a spread around the mean (mean - SD to mean + SD),
so I figured I would calculate what those bounds are.


double mean = [dupeIndexsMean doubleValue];
double dev = [populationSD doubleValue];

NSUInteger stdDevRangeStart = MAX(round(mean - dev), dupeIndexSetFirst);
// 7.25 - 2.9475 = 4.3025, round 4, dupeIndexSetFirst = 4
// stdDevRangeStart = 4;

NSUInteger stdDevRangeEnd = MIN(round(mean + dev), dupeIndexSetLast);
// 7.25 + 2.9475 = 10.1975, round 10, dupeIndexSetLast = 12
// stdDevRangeEnd = 10;

NSUInteger stdDevRangeLength1 = stdDevRangeEnd - stdDevRangeStart; // 6
NSUInteger stdDevRangeLength2 = MAX(round(dev * 2), stdDevRangeLength1);
// 2.9475 * 2 = 5.895, round 6, stdDevRangeLength1 = 6
// stdDevRangeLength2 = 6;

NSRange dupeStdDevRange = NSMakeRange(stdDevRangeStart, stdDevRangeLength2);
// startIndex = 4, length 6 (covers indexes 4 through 9)
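
One detail worth checking in the range above: an NSRange with location 4 and length 6 covers indexes 4 through 9, so index 10 (the computed stdDevRangeEnd) falls just outside it. That makes no difference for this data, since there is no duplicate at index 10, but it could for other sets. The sketch below mimics the containment rule in plain C; `IndexRange` and both function names are hypothetical stand-ins, not Foundation types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for NSRange: a start plus a count of indexes. */
typedef struct {
    size_t location;
    size_t length;
} IndexRange;

/* True when location <= index < location + length,
   the same half-open rule NSLocationInRange applies. */
static bool index_range_contains(IndexRange range, size_t index)
{
    return index >= range.location && index - range.location < range.length;
}

/* Builds a range that covers first...last with both ends included. */
static IndexRange index_range_inclusive(size_t first, size_t last)
{
    assert(first <= last);
    IndexRange range = { first, last - first + 1 };
    return range;
}
```

A range built as {4, 6} contains 9 but not 10, while `index_range_inclusive(4, 10)` has length 7 and does contain 10. If the intent is to include the end index, the length would be end - start + 1.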

I figured this new range would give me a tighter set, with a more
representative SD, that does not include the 12.

I create a new index set from the original one that only includes the indexes
that fall inside my new dupeStdDevRange.


NSMutableIndexSet* stdDevIndexSet = [NSMutableIndexSet new];
// Enumerate serially (options:0): NSMutableIndexSet is not thread-safe,
// so adding to it from an NSEnumerationConcurrent block could race.
[dupeIndexsSet enumerateIndexesInRange:dupeStdDevRange
                               options:0
                            usingBlock:^(NSUInteger idx, BOOL * _Nonnull stop)
{
    [stdDevIndexSet addIndex:idx];
}];

// stdDevIndexSet has indexes = 4, 6, 7

The new stdDevIndexSet now includes indexes 4, 6, 7.
The 12 was not included, which is great, because I thought it was throwing everything off.

Now I check this new stdDevIndexSet against the original index set. If the stdDevIndexSet count is smaller, I load the new index set back into the whole process and calculate everything again.


if ([stdDevIndexSet count] < dupeIndexsCount)
{
    // Something was trimmed, so run the whole process again on the smaller set
    [self loadDupesIndexSet:stdDevIndexSet];
}
else {
    doneTrim = YES;
}

In this case the count is different, so I start the whole process again with an index set that
includes 4, 6, 7.

Updated calculations:

dupeIndexsMean = 5.6667;  

populationSD = 1.2472;  
populationVariance = 1.5556;  
sampleSD = 1.5275;  
sampleVariance = 2.3333;  

stdDevRangeStart = 4;  
stdDevRangeEnd = 7;  

The newly trimmed index set now "fits" the standard deviation range.

So I use the new mean, rounded to 6.

My best index to use is 6, from the original (4, 6, 7, 12),
which now makes sense to me.


So the big question: am I approaching this correctly?

Do things like the original size (length) of the "potential" range matter?
I.e., would it matter if the original playlist length was 20 tracks as compared to 40 tracks? (I'm thinking not.)
