I've written a genetic algorithm using DEAP that maximises the sum of the values selected from column "a" by picking a single i2 value from each i1 group in the following multi-index dataframe:
i1 i2 a
0 0 9.8
0 1 2.3
0 2 5.2
0 3 9.7
0 4 9.7
0 5 9.7
0 6 9.7
1 0 7.5
1 1 5.4
1 2 2.7
1 3 1.1
1 4 1.5
1 5 7.6
1 6 7.9
2 0 6.7
2 1 8.0
2 2 0.5
2 3 8.2
2 4 5.6
2 5 5.6
2 6 5.6
3 0 8.9
3 1 5.3
3 2 3.3
3 3 3.3
3 4 3.3
3 5 3.3
3 6 3.3
4 0 8.6
4 1 0.5
4 2 9.0
4 3 3.0
4 4 0.6
4 5 0.6
4 6 0.6
For instance, one possible solution would be [0, 3, 4, 1, 3], which gives a sum of 24.8, and another solution (individual) would be [2, 6, 4, 0, 2], giving a sum of 36.6. The dataframe contains duplicated values for every i1 group that has fewer values than the largest i1 group. This padding lets mutations and crossovers run without any issues, since every i1 group then has exactly the same range of i2 values. Although the algorithm performs as expected and produces a solution, this approach has two main issues:
- It increases the amount of data quite significantly, since I'm dealing with more than 80,000 i1 groups at a time and need to expand every group to have as many i2 values as the largest i1 group.
- It increases the processing time, because the search space contains many redundant combinations: solution [0, 3, 4, 1, 3] is the same as [0, 3, 5, 1, 3] or [0, 3, 6, 1, 3].
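To make the redundancy concrete, here is a minimal sketch in plain Python (no DEAP or pandas) that rebuilds the padded table from the data above, checks the two example sums, and compares the size of the padded search space with the number of genuinely distinct solutions. The names `groups`, `padded` and `fitness` are illustrative only and do not come from my actual code.

```python
from math import prod

# Column "a" per i1 group, without padding (values taken from the data above).
groups = [
    [9.8, 2.3, 5.2, 9.7],
    [7.5, 5.4, 2.7, 1.1, 1.5, 7.6, 7.9],
    [6.7, 8.0, 0.5, 8.2, 5.6],
    [8.9, 5.3, 3.3],
    [8.6, 0.5, 9.0, 3.0, 0.6],
]

# Pad every group to the width of the largest one by repeating its last value,
# which reproduces the expanded dataframe.
width = max(len(g) for g in groups)
padded = [g + [g[-1]] * (width - len(g)) for g in groups]


def fitness(individual, table):
    """Sum of the selected "a" values; gene k is the i2 chosen for i1 == k."""
    total = sum(table[i1][i2] for i1, i2 in enumerate(individual))
    return round(total, 1)  # rounding hides floating-point noise


print(fitness([0, 3, 4, 1, 3], padded))  # 24.8
print(fitness([2, 6, 4, 0, 2], padded))  # 36.6

# The three "different" individuals from the second bullet score identically.
same = {fitness(ind, padded) for ind in ([0, 3, 4, 1, 3],
                                         [0, 3, 5, 1, 3],
                                         [0, 3, 6, 1, 3])}
print(same)  # {24.8}

padded_space = width ** len(groups)            # 7**5 = 16807
distinct_space = prod(len(g) for g in groups)  # 4*7*5*3*5 = 2100
print(padded_space, distinct_space)
```

Even on this five-group example the padded encoding has roughly eight times as many individuals as there are distinct solutions, and the ratio grows with the number of groups.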
Would it be possible to avoid the duplication of i2 by limiting each gene of the individual to the maximum for its position? For example, the first i2 in the individual could only vary from 0 to 3, the second from 0 to 6, and so forth. Would it still be possible to perform crossovers and mutations, discarding the individuals that do not conform to these rules, e.g. an individual whose first i2 is 4? Below is the dataframe without duplicates:
i1 i2 a
0 0 9.8
0 1 2.3
0 2 5.2
0 3 9.7
1 0 7.5
1 1 5.4
1 2 2.7
1 3 1.1
1 4 1.5
1 5 7.6
1 6 7.9
2 0 6.7
2 1 8.0
2 2 0.5
2 3 8.2
2 4 5.6
3 0 8.9
3 1 5.3
3 2 3.3
4 0 8.6
4 1 0.5
4 2 9.0
4 3 3.0
4 4 0.6
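The per-gene limit described above could look roughly like the following sketch. It is written in plain Python rather than DEAP so that it runs on its own; `upper`, `random_individual`, `mutate`, `crossover` and `is_valid` are hypothetical helper names, and the operators are simplified stand-ins, not DEAP's implementations. The assumption is that gene k always encodes the i2 choice for i1 == k, so any crossover that swaps genes at the same position cannot push a gene outside its own range, and only mutation needs to know the bounds.

```python
import random

# Number of i2 options per i1 group in the de-duplicated dataframe above.
sizes = [4, 7, 5, 3, 5]
upper = [n - 1 for n in sizes]  # inclusive upper bound for each gene


def random_individual(rng):
    """Draw each gene from its own range 0..upper[k]."""
    return [rng.randint(0, up) for up in upper]


def mutate(individual, indpb, rng):
    """Redraw each gene with probability indpb, within that gene's range."""
    for k, up in enumerate(upper):
        if rng.random() < indpb:
            individual[k] = rng.randint(0, up)
    return individual


def crossover(ind1, ind2, rng):
    """Uniform crossover: swap genes found at the same position."""
    for k in range(len(ind1)):
        if rng.random() < 0.5:
            ind1[k], ind2[k] = ind2[k], ind1[k]
    return ind1, ind2


def is_valid(individual):
    """True when every gene respects the bound for its position."""
    return all(0 <= g <= up for g, up in zip(individual, upper))


rng = random.Random(42)
a, b = random_individual(rng), random_individual(rng)
for _ in range(1000):
    a, b = crossover(a, b, rng)
    mutate(a, 0.2, rng)
    mutate(b, 0.2, rng)
    assert is_valid(a) and is_valid(b)  # no out-of-range individual appears

print(is_valid([4, 0, 0, 0, 0]))  # False: the first gene may only be 0..3
```

Under that assumption no individual would ever need to be discarded. As far as I can tell from the DEAP documentation, `tools.mutUniformInt` accepts either a single value or a sequence for `low` and `up`, so a list like `upper` could probably be passed to it directly, and position-preserving operators such as `tools.cxUniform` or `tools.cxTwoPoint` should keep the bounds intact; this is worth checking against the installed DEAP version.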