Get latest date by comparing multiple date columns in Pandas DataFrame

2.4k Views Asked by At

In a DataFrame, I have multiple Date columns as below.

+---+----------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
|   | order_id | dateColumn1 | dateColumn2 | dateColumn3 | dateColumn4 | dateColumn5 | dateColumn6 | dateColumn7 |
+---+----------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
| 0 | 22161577 | 2019-03-26  |             |             | 2019-03-24  |             | 2019-03-23  | 2019-03-26  |
| 1 | 22719649 | 2019-05-22  |             | 2019-05-21  | 2019-05-28  |             | 2019-05-24  | 2019-05-28  |
| 2 | 23693247 | 2019-08-13  |             | 2019-08-02  | 2019-08-13  |             | 2019-08-10  | 2019-08-07  |
| 3 | 24269635 | 2019-09-23  |             | 2019-10-04  | 2019-09-25  |             | 2019-10-04  |             |
| 4 | 24887829 | 2019-11-12  |             | 2020-11-27  | 2020-11-27  |             | 2020-11-27  |             |
| 5 | 25325867 | 2019-12-19  |             | 2019-12-19  | 2019-12-19  |             | 2019-12-19  |             |
| 6 | 26096549 | 2020-03-18  |             | 2020-06-25  | 2020-04-20  |             | 2020-06-25  | 2020-06-25  |
+---+----------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+

Trying to find the latest date among the given dates in each row and update it on a new column as below. In few cases, dateColumns 2 and 5 also have dates.

+---+----------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+------------+
|   | order_id | dateColumn1 | dateColumn2 | dateColumn3 | dateColumn4 | dateColumn5 | dateColumn6 | dateColumn7 | latestDate |
+---+----------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+------------+
| 0 | 22161577 | 2019-03-26  |             |             | 2019-03-24  |             | 2019-03-23  | 2019-03-26  | 2019-03-26 |
| 1 | 22719649 | 2019-05-22  |             | 2019-05-21  | 2019-05-28  |             | 2019-05-24  | 2019-05-28  | 2019-05-28 |
| 2 | 23693247 | 2019-08-13  |             | 2019-08-02  | 2019-08-13  |             | 2019-08-10  | 2019-08-07  | 2019-08-13 |
| 3 | 24269635 | 2019-09-23  |             | 2019-10-04  | 2019-09-25  |             | 2019-10-04  |             | 2019-10-04 |
| 4 | 24887829 | 2019-11-12  |             | 2020-11-27  | 2020-11-27  |             | 2020-11-27  |             | 2020-11-27 |
| 5 | 25325867 | 2019-12-19  |             | 2019-12-19  | 2019-12-19  |             | 2019-12-19  |             | 2019-12-19 |
| 6 | 26096549 | 2020-03-18  |             | 2020-06-25  | 2020-04-20  |             | 2020-06-25  | 2020-06-25  | 2020-06-25 |
+---+----------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+------------+

Do we’ve any in-build function in Pandas DataFrame to find the latest date, comparing the multiple date columns. Verified this question. Yet, looking for alternative method or function to compare multiple date column values directly and get the most recent date. The DataFrame contains approximately 25K rows.

2

There are 2 best solutions below

0
On

following function should serve the purpose.

function will check dates (row wise) for latest one and writes into a new column.

i have checked it at online, have a visit if need, Online Editor.

import pandas as pd
import numpy as np
from datetime import *
from dateutil.parser import *

def find_lastest_date(dataframe):
    lss = []
    max_date = []
    df = dataframe.astype(str)
    for row in range(len(df)):
        for col in df:
            try:
                lss.append(parse(str(df.loc[row,col])))
            except:
                pass
        try:
            max_date.append(max(lss).strftime("%Y/%m/%d %H:%M"))  #change format for output column
        except:
            max_date.append("could not parse date from string")
    else:
        return dataframe.assign(lastest_date = np.array(max_date))
                

print(find_lastest_date(mydf)) #change mydf to yours
0
On
df = pd.DataFrame({0: {0: 22161577, 1: 22719649, 2: 23693247, 3: 24269635, 4: 24887829, 5: 25325867, 6: 26096549},
    1: {0: '2019-03-26', 1: '2019-05-22', 2: '2019-08-13', 3: '2019-09-23', 4: '2019-11-12', 5: '2019-12-19', 6: '2020-03-18'},
    2: {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan},
    3: {0: nan, 1: '2019-05-21', 2: '2019-08-02', 3: '2019-10-04', 4: '2020-11-27', 5: '2019-12-19', 6: '2020-06-25'},
    4: {0: '2019-03-24', 1: '2019-05-28', 2: '2019-08-13', 3: '2019-09-25', 4: '2020-11-27', 5: '2019-12-19', 6: '2020-04-20'},
    5: {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan},
    6: {0: '2019-03-23', 1: '2019-05-24', 2: '2019-08-10', 3: '2019-10-04', 4: '2020-11-27', 5: '2019-12-19', 6: '2020-06-25'},
    7: {0: '2019-03-26', 1: '2019-05-28', 2: '2019-08-07', 3: nan, 4: nan, 5: nan, 6: '2020-06-25'}})
df.columns=["order_id","dateColumn1","dateColumn2","dateColumn3","dateColumn4","dateColumn5","dateColumn6","dateColumn7"]
df[df.filter(regex="date").columns] = df.filter(regex="date").apply(pd.to_datetime)

df:

   order_id dateColumn1 dateColumn2 dateColumn3 dateColumn4 dateColumn5 dateColumn6 dateColumn7
0  22161577  2019-03-26         NaT         NaT  2019-03-24         NaT  2019-03-23  2019-03-26
1  22719649  2019-05-22         NaT  2019-05-21  2019-05-28         NaT  2019-05-24  2019-05-28
2  23693247  2019-08-13         NaT  2019-08-02  2019-08-13         NaT  2019-08-10  2019-08-07
3  24269635  2019-09-23         NaT  2019-10-04  2019-09-25         NaT  2019-10-04         NaT
4  24887829  2019-11-12         NaT  2020-11-27  2020-11-27         NaT  2020-11-27         NaT
5  25325867  2019-12-19         NaT  2019-12-19  2019-12-19         NaT  2019-12-19         NaT
6  26096549  2020-03-18         NaT  2020-06-25  2020-04-20         NaT  2020-06-25  2020-06-25

get max date in each row (in only date columns):

df['latest_date'] = df.filter(regex="date").max(1)

df:

   order_id dateColumn1 dateColumn2 dateColumn3 dateColumn4 dateColumn5 dateColumn6 dateColumn7 latest_date
0  22161577  2019-03-26         NaT         NaT  2019-03-24         NaT  2019-03-23  2019-03-26  2019-03-26
1  22719649  2019-05-22         NaT  2019-05-21  2019-05-28         NaT  2019-05-24  2019-05-28  2019-05-28
2  23693247  2019-08-13         NaT  2019-08-02  2019-08-13         NaT  2019-08-10  2019-08-07  2019-08-13
3  24269635  2019-09-23         NaT  2019-10-04  2019-09-25         NaT  2019-10-04         NaT  2019-10-04
4  24887829  2019-11-12         NaT  2020-11-27  2020-11-27         NaT  2020-11-27         NaT  2020-11-27
5  25325867  2019-12-19         NaT  2019-12-19  2019-12-19         NaT  2019-12-19         NaT  2019-12-19
6  26096549  2020-03-18         NaT  2020-06-25  2020-04-20         NaT  2020-06-25  2020-06-25  2020-06-25