Suppose you have a dataframe df_matches with transaction data structured in the following way:
| REPORT_ID| VALUE | SIDE | COUNTRY | CP1 | CP2 |...
| -------- | -------- | -------- | -------- | -------- | -------- |...
| ABC123 | 20 | B | DE | A | B |...
| ABC123 | 20 | S | FR | B | A |...
| DEF456 | 60 | B | DE | A | C |...
| DEF456 | 62 | S | AT | C | A |...
| GHI789 | 75 | B | NL | D | E |...
| GHI789 | 65 | S | NL | E | D |...
|... |... |... |... |... |... |...
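For reproducibility, here is a minimal version of the data containing only the columns shown above (the remaining columns are omitted):

```r
# Hypothetical minimal version of df_matches, matching the table above
df_matches <- data.frame(
  REPORT_ID = rep(c("ABC123", "DEF456", "GHI789"), each = 2),
  VALUE     = c(20, 20, 60, 62, 75, 65),
  SIDE      = rep(c("B", "S"), times = 3),
  COUNTRY   = c("DE", "FR", "DE", "AT", "NL", "NL"),
  CP1       = c("A", "B", "A", "C", "D", "E"),
  CP2       = c("B", "A", "C", "A", "E", "D"),
  stringsAsFactors = FALSE
)
```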
I want to calculate similarity measures per REPORT_ID for several attributes, so I want to reshape the dataframe into one row per REPORT_ID, like this:
| REPORT_ID| VALUE_1 | VALUE_2 | SIDE_1 | SIDE_2 | CP1_1 | CP1_2 |...
| -------- | -------- | -------- | --------| -------| -------| --------|...
| ABC123 | 20 | 20 | B | S | A | B |...
| DEF456 | 60 | 62 | B | S | A | C |...
| GHI789 | 75 | 65 | B | S | D | E |...
|... |... |... |... |... |... |... |...
Is the most efficient way to do this with dplyr to group_by REPORT_ID and pick the two rows per group with first() and last() inside summarise(), like this?
df_sim_calc <- df_matches %>%
  dplyr::group_by(REPORT_ID) %>%
  dplyr::summarise(VALUE_1 = dplyr::first(VALUE),
                   VALUE_2 = dplyr::last(VALUE),
                   SIDE_1 = dplyr::first(SIDE),
                   SIDE_2 = dplyr::last(SIDE),
                   CP1_1 = dplyr::first(CP1),
                   CP1_2 = dplyr::last(CP1),
                   CP2_1 = dplyr::first(CP2),
                   CP2_2 = dplyr::last(CP2))
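For comparison, the same reshape can be sketched with tidyr::pivot_wider, assuming exactly two rows per REPORT_ID (the inline df_matches below is an illustrative stand-in for the real data, and the row counter is an added helper column):

```r
library(dplyr)
library(tidyr)

# Hypothetical minimal version of df_matches, matching the table above
df_matches <- data.frame(
  REPORT_ID = rep(c("ABC123", "DEF456", "GHI789"), each = 2),
  VALUE     = c(20, 20, 60, 62, 75, 65),
  SIDE      = rep(c("B", "S"), times = 3),
  COUNTRY   = c("DE", "FR", "DE", "AT", "NL", "NL"),
  CP1       = c("A", "B", "A", "C", "D", "E"),
  CP2       = c("B", "A", "C", "A", "E", "D"),
  stringsAsFactors = FALSE
)

df_sim_calc <- df_matches %>%
  group_by(REPORT_ID) %>%
  mutate(row = row_number()) %>%   # 1 for the first report, 2 for the second
  ungroup() %>%
  pivot_wider(
    id_cols     = REPORT_ID,
    names_from  = row,
    values_from = c(VALUE, SIDE, CP1, CP2)
  )
# yields columns VALUE_1, VALUE_2, SIDE_1, SIDE_2, CP1_1, CP1_2, CP2_1, CP2_2
```

Unlike the first()/last() version, this errors visibly if a REPORT_ID has more than the expected number of rows per attribute, which can be a useful safety check.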
Thanks!