Suppose I have the following dataset:
| Personalnumber | Category | Year | Month | Index_ID | Previous_Index_ID |
|---|---|---|---|---|---|
| 1 | 100 | 2022 | 8 | 42100 | |
| 1 | 100 | 2022 | 9 | 9534 | 42100 |
| 1 | 9400 | 2023 | 9 | 4 | |
| 1 | 9400 | 2023 | 10 | 485 | 4 |
| 2 | 100 | 2022 | 1 | 214 | 102 |
| 2 | 100 | 2022 | 2 | 194231 | 214 |
| 3 | 200 | 2022 | 2 | 2111 | |
| 3 | 200 | 2022 | 3 | 1012 | 2111 |
| 3 | 200 | 2022 | 4 | 9876 | 1012 |
| 3 | 200 | 2022 | 5 | 8794 | 9876 |
| 3 | 200 | 2022 | 6 | 24142 | 8794 |
| 4 | 100 | 2022 | 4 | 42100 | |
| 4 | 200 | 2022 | 7 | 12 | |
| 4 | 200 | 2022 | 8 | 14 | 12 |
| 4 | 200 | 2022 | 9 | 485 | 14 |
The first column (Personalnumber) is a number that specifies a person. There is an additional column (Category) that gives a category. There is an entry for year and month (Year Month). There is an index column (Index_ID) and most importantly a column stating a reference, the previous index a case might relate to (Previous_Index_ID). So, let's make it more understandable:
The first case belongs to person 1 within category 100. We have to entries that belong to this case. It starts with the index 42100. The next record has the index 9534, it is related to the first one, as the column "Previous_Index_ID" has the value 42100.
The second case belongs to person 1 within category 9400. We have two entries that belong to this case. It starts with the index 4. The next record has the index 485, it is related to the first one, as the column "Previous_Index_ID" has the value 4.
The third case:
2;100;2022;1;214;102
2;100;2022;2;194231;214
belongs to person 2 within category 100. Here we can see that we do not have the first record that would have index 102 in our dataset.
It continues like this, for example person 3 has 5 records:
3;200;2022;2;2111;
3;200;2022;3;1012;2111
3;200;2022;4;9876;1012
3;200;2022;5;8794;9876
3;200;2022;6;24142;8794
This is one case.
Now I want to add a column with an unique identifier for each case.
My code is as follows:
import pandas as pd
myfile = pd.read_csv(r"C:\pathtofile\testfile.csv", sep=";")
myfile['newID'] = myfile.groupby(['Personalnumber','Category'], sort=False).ngroup().add(1)
print(myfile)
And indeed the output is as desired:
Personalnumber Category Year Month Index_ID Previous_Index_ID newID
0 1 100 2022 8 42100 NaN 1
1 1 100 2022 9 9534 42100.0 1
2 1 9400 2023 9 4 NaN 2
3 1 9400 2023 10 485 4.0 2
4 2 100 2022 1 214 102.0 3
5 2 100 2022 2 194231 214.0 3
6 3 200 2022 2 2111 NaN 4
7 3 200 2022 3 1012 2111.0 4
8 3 200 2022 4 9876 1012.0 4
9 3 200 2022 5 8794 9876.0 4
10 3 200 2022 6 24142 8794.0 4
11 4 100 2022 4 42100 NaN 5
12 4 200 2022 7 12 NaN 6
12 4 200 2022 8 14 12 6
12 4 200 2022 9 485 14 6
The column newID shows the correct case numbering.
Now an additional case comes into play:
1;100;2022;8;101;
1;100;2022;9;204;101
1;100;2022;10;4344;204
1;100;2022;11;2069;4344
This case also belongs to person 1, category 100. Now the data looks like this:
| Personalnumber | Category | Year | Month | Index_ID | Previous_Index_ID |
|---|---|---|---|---|---|
| 1 | 100 | 2022 | 8 | 42100 | |
| 1 | 100 | 2022 | 8 | 101 | |
| 1 | 100 | 2022 | 9 | 9534 | 42100 |
| 1 | 100 | 2022 | 9 | 204 | 101 |
| 1 | 100 | 2022 | 10 | 4344 | 204 |
| 1 | 100 | 2022 | 11 | 2069 | 4344 |
| 1 | 9400 | 2023 | 9 | 4 | |
| 1 | 9400 | 2023 | 10 | 485 | 4 |
| 2 | 100 | 2022 | 1 | 214 | 102 |
| 2 | 100 | 2022 | 2 | 194231 | 214 |
| 3 | 200 | 2022 | 2 | 2111 | |
| 3 | 200 | 2022 | 3 | 1012 | 2111 |
| 3 | 200 | 2022 | 4 | 9876 | 1012 |
| 3 | 200 | 2022 | 5 | 8794 | 9876 |
| 3 | 200 | 2022 | 6 | 24142 | 8794 |
| 4 | 100 | 2022 | 4 | 42100 | |
| 4 | 200 | 2022 | 7 | 12 | |
| 4 | 200 | 2022 | 8 | 14 | 12 |
| 4 | 200 | 2022 | 9 | 485 | 14 |
As you can see it gets mixed up and my code leads to wrong results. Reason is that the new case falls into the same "place", it also has category 100 and belongs to person 1. However, from the column Index_ID and Previous_Index_ID it is clear that this is another case. These two columns show the traces from which one can differentiate between them and see that these are two different cases. (Of course there could be also even further cases that "fall into the same place", so it is not limited to just two as here in this example.) So my problem now is to get the following desired output:
Personalnumber Category Year Month Index_ID Previous_Index_ID newID
0 1 100 2022 8 42100 NaN 1
1 1 100 2022 8 101 NaN 2
2 1 100 2022 9 9534 42100.0 1
3 1 100 2022 9 204 101.0 2
4 1 100 2022 10 4344 204.0 2
5 1 100 2022 11 2069 4344.0 2
6 1 9400 2023 9 4 NaN 3
7 1 9400 2023 10 485 4.0 3
8 2 100 2022 1 214 102.0 4
9 2 100 2022 2 194231 214.0 4
10 3 200 2022 2 2111 NaN 5
11 3 200 2022 3 1012 2111.0 5
12 3 200 2022 4 9876 1012.0 5
13 3 200 2022 5 8794 9876.0 5
14 3 200 2022 6 24142 8794.0 5
15 4 100 2022 4 42100 NaN 6
16 4 200 2022 7 12 NaN 7
16 4 200 2022 8 14 12 7
16 4 200 2022 9 485 14 7
How can I do this?
The Index_ID is not unique over the complete dataset, it is only unique per year and month. So you can see that the Index_ID 42100 occurs in 2022 month 8 (personalnumber 1) and also in 2022 in month 4 (personalnumber 4). Or Index_ID 485 occurcs in 2023 month 10 (personalnumber 1) and also in 2022 month 9 (personalnumber 4). However, of course it is unique within a year and month.
(The index numbers are set completey random. So sorting ascending or descending on the Index_ID or Previous_Index_Id column is not a solution.)
EDIT regarding my comment below to the answer from Muhammed Samed Özmen:
Consider the following example:
Personalnumber;Category;Year;Month;Index_ID;Previous_Index_ID
398;14;2022;1;10708;1
398;14;2022;2;50242;10708
398;14;2022;3;76850;50242
398;14;2022;4;120861;76850
398;14;2022;5;110883;120861
398;14;2022;6;188043;110883
398;14;2022;7;9432;188043
398;14;2022;8;175715;9432
398;14;2022;9;142837;175715
398;14;2022;10;152659;142837
398;14;2022;11;52335;152659
398;14;2022;12;156366;52335
398;14;2023;1;16416;156366
398;14;2023;2;163499;16416
398;14;2023;3;1;163499
With the last line (398;14;2023;3;1;163499) the code throws a recursion error. I think the recursion error might arise due to 398;14;2022;1;10708;1 and 398;14;2023;3;1;163499.
However if I change the last record to Index_ID = 2, like this:
Personalnumber;Category;Year;Month;Index_ID;Previous_Index_ID
398;14;2022;1;10708;1
398;14;2022;2;50242;10708
398;14;2022;3;76850;50242
398;14;2022;4;120861;76850
398;14;2022;5;110883;120861
398;14;2022;6;188043;110883
398;14;2022;7;9432;188043
398;14;2022;8;175715;9432
398;14;2022;9;142837;175715
398;14;2022;10;152659;142837
398;14;2022;11;52335;152659
398;14;2022;12;156366;52335
398;14;2023;1;16416;156366
398;14;2023;2;163499;16416
398;14;2023;3;2;163499
Then it works and it sets a newID for this case as it should (all these records belong to one case).
Let's stick to the statement:
This means that
Previous_Index_IDpoints to anIndex_IDat some fixed past period. Based on the data provided, I assume thatPrevious_Index_IDis theIndex_IDfor the previous calendar month.The plan is to join the indices of the corresponding previous records to the current ones, then replace group roots with unique ids and pointers to previous rows with the corresponding data in the records they're pointing at.
First, let's prepare the data to work with:
Then check if the initial statement is true (just in case):
Prepare supporting data such as dates shifted one month forward, unique record identifiers (by year, month, index ID), and a group starting point marker:
Now [left] join the corresponding previous row numbers to the data:
And the final touch, iterate over the new column, replacing the marked starting points with unique group identifiers and populating the corresponding cells in the chain with them:
Note that in the for-loop we can replace
...= current if ...with some other provider of unique group identifiers if necessary, for example:We can as well rewrite this loop in Numba to run faster, as far as the data index is a range index starting at zero:
Code to experiment with:
P.S.
The described algorithm works only if all previous indexes physically precede the subsequent ones inside the table. Otherwise, we must first fill in all the starting points (roots) and then populate the chains with the appropriate group identifiers. For this purpose, we can use 2 reference sequences with indices of the next and previous entries (one to identify roots, the other to fill in the chains):