I am working with the following table:
+------+------+------+------+---------------+---------+-------+
| ID 1 | ID 2 | Date | Type | Marked_Latest | Updated | Notes |
+------+------+------+------+---------------+---------+-------+
| 1 | 100 | 2001 | SMT | | | |
| 1 | 101 | 2005 | SMT | | | |
| 1 | 102 | 2020 | SMT | Latest | | |
| 1 | 103 | 2020 | SMT | | | |
| 1 | 103 | 2020 | ABT | | | |
| 2 | 201 | 2009 | CMT | Latest | | |
| 2 | 202 | 2022 | SMT | | | |
| 2 | 203 | 2022 | SMT | | | |
+------+------+------+------+---------------+---------+-------+
I am trying to perform the following steps using a df.query() but since there are so many caveats I am not sure how to fit them all in.
Step 1: Only looking at Type == "SMT" or Type == "CMT", group by ID 1 and identify latest date, compare this (grouped ID 1 data) to date of Marked_Latest == "Latest (essentially, just verifying that the date is correct)
Step 2: If the date values are the same, do nothing. If different, then supply ID 2 next to original Marked_Latest == "Latest" in Updated
Step 3: If multiple Latest have the same max Date, put a note in Notes that says "multiple".
This will result in the following table:
+------+------+------+------+---------------+---------+----------+
| ID 1 | ID 2 | Date | Type | Marked_Latest | Updated | Notes |
+------+------+------+------+---------------+---------+----------+
| 1 | 100 | 2001 | SMT | | | |
| 1 | 101 | 2005 | SMT | | | |
| 1 | 102 | 2020 | SMT | Latest | | multiple |
| 1 | 103 | 2020 | SMT | | | multiple |
| 1 | 103 | 2020 | ABT | | | |
| 2 | 201 | 2009 | CMT | Latest | 203 | |
| 2 | 202 | 2022 | SMT | | | multiple |
| 2 | 203 | 2022 | SMT | | | multiple |
+------+------+------+------+---------------+---------+----------+
To summarize: check that the latest date is actually marked as latest date. If it is not marked as latest date, write the updated ID 2 next to the original (incorrect) latest date. And when there are multiple cases of latest date, inputting "multiple" for each ID of latest date.
I have gotten only as far as identifying the actual latest date, using
q = df.query('Type' == "SMT" or 'Type' == "CMT").groupby('ID 1').last('ID 2')
q
This will return a subset with the latest dates marked, but I am not sure how to proceed from here, i.e. how to now compare this dataframe with the date field corresponding to Marked_Latest.
All help appreciated.
Try:
Output: