What is the right way to parse the Received headers of a .eml file in order to extract the information for every hop? In particular, I need to extract the following information:
- Sender URL
- Sender IP
- Receiver URL
- Receiver IP
- Date
- Protocol
I found several specs, but it appears that there is no standard convention for the format of the Received headers, and it may vary depending on the server.
For me, the clearest explanation was the one from the RFC 822 spec:
received    =  "Received"    ":"            ; one per relay
                  ["from" domain]           ; sending host
                  ["by"   domain]           ; receiving host
                  ["via"  atom]             ; physical path
                 *("with" atom)             ; link/mail protocol
                  ["id"   msg-id]           ; receiver msg id
                  ["for"  addr-spec]        ; initial form
                   ";"    date-time         ; time received
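To make the grammar concrete, here is a minimal Java sketch that splits one header into its keyword clauses without relying on their order. The class name ReceivedClauses and the method split are hypothetical, not from any library. It assumes the last ';' separates the clauses from the date, and that a keyword inside a parenthesised comment should not start a new clause.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not a library class): splits one Received header into clauses.
final class ReceivedClauses {
    // Clause keywords taken from the RFC 822 grammar.
    private static final List<String> KEYWORDS =
            Arrays.asList("from", "by", "via", "with", "id", "for");

    static Map<String, String> split(String header) {
        // Unfold: collapse line breaks and runs of whitespace into single spaces.
        String text = header.replaceAll("\\s+", " ").trim();
        if (text.regionMatches(true, 0, "Received:", 0, 9)) {
            text = text.substring(9).trim();
        }
        Map<String, String> out = new LinkedHashMap<>();
        // Assumption: the last ';' separates the clauses from the date-time.
        int semi = text.lastIndexOf(';');
        String clauses = semi >= 0 ? text.substring(0, semi) : text;

        String key = null;
        StringBuilder value = new StringBuilder();
        int depth = 0; // nesting level of parenthesised comments
        for (String token : clauses.split(" ")) {
            if (token.isEmpty()) continue;
            // A keyword only starts a new clause outside of comments.
            if (depth == 0 && KEYWORDS.contains(token.toLowerCase(Locale.ROOT))) {
                store(out, key, value);
                key = token.toLowerCase(Locale.ROOT);
                value.setLength(0);
                continue;
            }
            for (char c : token.toCharArray()) {
                if (c == '(') depth++;
                else if (c == ')' && depth > 0) depth--;
            }
            value.append(token).append(' ');
        }
        store(out, key, value);
        if (semi >= 0) out.put("date", text.substring(semi + 1).trim());
        return out;
    }

    // Repeated clauses (e.g. several "with") are joined with ", ".
    private static void store(Map<String, String> out, String key, StringBuilder value) {
        if (key != null) out.merge(key, value.toString().trim(), (a, b) -> a + ", " + b);
    }
}
```

Calling ReceivedClauses.split(rawHeader) returns a map keyed by "from", "by", "via", "with", "id", "for" and "date"; clauses that are absent from the header are simply missing from the map.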
Consider the following Received headers:
Received: from VE1PR01MB5599.eurprd01.prod.exchangelabs.com
(2603:10a6:7:7c::43) by HE1PR0102MB2714.eurprd01.prod.exchangelabs.com with
HTTPS via HE1PR0402CA0054.EURPRD04.PROD.OUTLOOK.COM; Thu, 9 Jan 2020 16:34:13
+0000
Received: from VI1PR0102CA0029.eurprd01.prod.exchangelabs.com
(2603:10a6:802::42) by VE1PR01MB5599.eurprd01.prod.exchangelabs.com
(2603:10a6:803:11f::30) with Microsoft SMTP Server (version=TLS1_2,
cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.2602.12; Thu, 9 Jan
2020 16:34:13 +0000
Received: from DB5EUR01FT034.eop-EUR01.prod.protection.outlook.com
(2a01:111:f400:7e02::203) by VI1PR0102CA0029.outlook.office365.com
(2603:10a6:802::42) with Microsoft SMTP Server (version=TLS1_2,
cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.2623.9 via Frontend
Transport; Thu, 9 Jan 2020 16:34:13 +0000
Received: from relay-out.ohc.cu (200.55.138.44) by
DB5EUR01FT034.mail.protection.outlook.com (10.152.4.246) with Microsoft SMTP
Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
15.20.2623.9 via Frontend Transport; Thu, 9 Jan 2020 16:34:12 +0000
Received: from relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1])
by relay-out.ohc.cu (Postfix) with ESMTP id 69EA722DD
for <[email protected]>; Thu, 9 Jan 2020 11:29:43 -0500 (CST)
Received: from relay-out.ohc.cu ([127.0.0.1])
by relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1]) (amavisd-new, port 10024)
with ESMTP id 7CZku5Y59vGC for <[email protected]>;
Thu, 9 Jan 2020 11:29:38 -0500 (CST)
Received: from correo.patrimonio.ohc.cu (unknown [192.168.229.20])
by relay-out.ohc.cu (Postfix) with ESMTP id B83BA22F5
for <[email protected]>; Thu, 9 Jan 2020 11:29:36 -0500 (CST)
Received: from localhost (localhost.localdomain [127.0.0.1])
by correo.patrimonio.ohc.cu (Postfix) with ESMTP id 65413232A001
for <[email protected]>; Thu, 9 Jan 2020 11:40:05 -0500 (CST)
Received: from correo.patrimonio.ohc.cu ([127.0.0.1])
by localhost (correo.patrimonio.ohc.cu [127.0.0.1]) (amavisd-new, port 10024)
with ESMTP id hNMp-6lHHtzH for <[email protected]>;
Thu, 9 Jan 2020 11:40:05 -0500 (CST)
Received: from correoweb.patrimonio.ohc.cu (unknown [192.168.229.23])
by correo.patrimonio.ohc.cu (Postfix) with ESMTPA id EC62A232A00A;
Thu, 9 Jan 2020 11:39:53 -0500 (CST)
The fields that vary the most seem to be:
host domain, e.g.
- from VE1PR01MB5599.eurprd01.prod.exchangelabs.com (2603:10a6:7:7c::43)
- by HE1PR0102MB2714.eurprd01.prod.exchangelabs.com
- from VI1PR0102CA0029.eurprd01.prod.exchangelabs.com (2603:10a6:802::42)
- by VE1PR01MB5599.eurprd01.prod.exchangelabs.com (2603:10a6:803:11f::30)
- from DB5EUR01FT034.eop-EUR01.prod.protection.outlook.com (2a01:111:f400:7e02::203)
- by VI1PR0102CA0029.outlook.office365.com (2603:10a6:802::42)
- from relay-out.ohc.cu (200.55.138.44)
- by DB5EUR01FT034.mail.protection.outlook.com (10.152.4.246)
- from relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1])
- by relay-out.ohc.cu (Postfix)
- from relay-out.ohc.cu ([127.0.0.1])
- by relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1]) (amavisd-new, port 10024)
- from correo.patrimonio.ohc.cu (unknown [192.168.229.20])
- by relay-out.ohc.cu (Postfix)
- from localhost (localhost.localdomain [127.0.0.1])
- by correo.patrimonio.ohc.cu (Postfix)
- from correo.patrimonio.ohc.cu ([127.0.0.1])
- by localhost (correo.patrimonio.ohc.cu [127.0.0.1]) (amavisd-new, port 10024)
- from correoweb.patrimonio.ohc.cu (unknown [192.168.229.23])
- by correo.patrimonio.ohc.cu (Postfix)
mail protocol, e.g.
- with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384)
- with ESMTP
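To illustrate how the host variants listed above could be normalised, here is a rough Java sketch that takes the text of a single "from" or "by" clause and pulls out the host name plus the first IPv4 or IPv6 literal found inside ( ) or [ ]. The class name HopEndpoint is hypothetical, and the IP pattern is deliberately loose: it is an assumption that a loose match is good enough here, and it does not validate the address.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical value class (not a library class) for one side of a hop.
final class HopEndpoint {
    final String host; // first token of the clause
    final String ip;   // null when the clause carries no address literal

    // Loose match: an IPv4 or IPv6-looking literal wrapped in ( ) or [ ],
    // optionally prefixed with "IPv6:". This does not validate the address.
    private static final Pattern IP = Pattern.compile(
            "[\\[(](?:IPv6:)?((?:\\d{1,3}\\.){3}\\d{1,3}"
          + "|[0-9A-Fa-f]{0,4}(?::[0-9A-Fa-f]{0,4}){2,7})[\\])]");

    private HopEndpoint(String host, String ip) {
        this.host = host;
        this.ip = ip;
    }

    // clause is the text after "from" or "by", e.g. "relay-out.ohc.cu (200.55.138.44)".
    static HopEndpoint parse(String clause) {
        String trimmed = clause.trim();
        String host = trimmed.split("\\s+")[0];
        Matcher m = IP.matcher(trimmed);
        return new HopEndpoint(host, m.find() ? m.group(1) : null);
    }
}
```

With this sketch, a clause such as "relay-out.ohc.cu (Postfix)" yields a host but a null ip, so the caller has to decide how to treat hops that do not report an address.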
What is the established approach to extracting such info, given how much the format varies? Other answers on SO discourage the use of regular expressions for this task, but then how can one do this parsing? It would be fine for me if there existed a tested regex, or a Java library or some Java code, to parse the Received headers and extract the above info.
I want to offer the following solution. You can find a full explanation of the regular expression used here.
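As a rough illustration of a regex-based approach in Java, here is a minimal sketch. It is not necessarily the expression referred to above: the class name ReceivedRegexSketch, the group names and the pattern itself are my own assumptions. It only handles headers that contain "from", "by" and "with" in that order, and it relies on java.time's RFC_1123_DATE_TIME formatter for the date.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch (not a library class): one regex with named groups per hop.
final class ReceivedRegexSketch {
    private static final String[] GROUPS =
            {"fromHost", "fromInfo", "byHost", "byInfo", "protocol", "date"};

    // Assumption: "from", "by" and "with" all appear, in that order.
    private static final Pattern HOP = Pattern.compile(
            "^Received:\\s*from\\s+(?<fromHost>\\S+)"
          + "(?:\\s+\\((?<fromInfo>[^)]*)\\))?"      // first comment after the sender
          + ".*?\\s+by\\s+(?<byHost>\\S+)"
          + "(?:\\s+\\((?<byInfo>[^)]*)\\))?"        // first comment after the receiver
          + ".*?\\bwith\\s+(?<protocol>.+?)"
          + "(?:\\s+(?:id|via|for)\\s.*)?"           // remaining clauses are skipped
          + ";\\s*(?<date>[^;()]+?)\\s*(?:\\([^)]*\\))?\\s*$",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the named groups, or an empty map when the header does not match.
    static Map<String, String> parse(String header) {
        // Unfold the header before matching.
        String text = header.replaceAll("\\s+", " ").trim();
        Map<String, String> hop = new LinkedHashMap<>();
        Matcher m = HOP.matcher(text);
        if (!m.matches()) return hop;
        for (String name : GROUPS) {
            hop.put(name, m.group(name)); // value is null for an absent optional group
        }
        return hop;
    }

    // The trailing zone comment such as "(CST)" is already excluded by the regex.
    static OffsetDateTime parseDate(String date) {
        return OffsetDateTime.parse(date, DateTimeFormatter.RFC_1123_DATE_TIME);
    }
}
```

The fromInfo and byInfo groups hold the raw comment text, for example "unknown [192.168.229.20]", so the IP still has to be picked out of them in a second step. Headers that lack a "from" or "with" clause return an empty map and need a fallback.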