# The Problem #
When parsing millions of emails, the method Mail.read_from_string(mail_as_string) is too slow.
# The Question #
How can I speed up email parsing?
# The Context #
I have put in enough context for you to understand my use case.
## Fetching an email ##
I connect to an external IMAP server via Ruby's Net::IMAP.
@imap = Net::IMAP.new("imap.gmail.com", 993, true) # A few login steps are omitted here
I fetch an email:
email = @imap.uid_fetch("85113", ["BODY[HEADER.FIELDS (FROM TO DATE SUBJECT)]", "RFC822"]) # => #<struct Net::IMAP::FetchData seqno=55395, attr={"UID"=>85113, "RFC822"=>"Delivered-To: [email protected]\r\nReceived: by 10.223.148.78 with SMTP id o14csp218630fav;\r\n Tue, 18 Dec 2012 16:55:50 -0800 (PST)\r\nX-Received: by 10.194.177.199 with SMTP id cs7mr8044338wjc.41.1355878548414;\r\n Tue, 18 Dec 2012 16:55:48 -0800 (PST)\r\nReturn-Path: <[email protected]>\r\nReceived: from exproxy-1.exserver.dk (exproxy-1.exserver.dk. [195.69.129.162])\r\n by mx.google.com with ESMTP id m13si17440569wie.32.2012.12.18.16.55.47;\r\n Tue, 18 Dec 2012 16:55:48 -0800 (PST)\r\nReceived-SPF: pass (google.com: domain of [email protected] designates 195.69.129.162 as permitted sender) client-ip=195.69.129.162;\r\nAuthentication-Results: mx.google.com; spf=pass (google.com: domain of [email protected] designates 195.69.129.162 as permitted sender) [email protected]\r\nReceived: by exproxy-1.exserver.dk (Postfix, from userid 65534)\r\n\tid 5330511CDCB; Wed, 19 Dec 2012 01:55:42 +0100 (CET)\r\nReceived: from EXHUB02.exchangeserver.dk (exhub02.exchangeserver.dk [193.239.98.62])\r\n\tby exproxy-1.exserver.dk (Postfix) with ESMTP id 4735211A58E\r\n\tfor <[email protected]>; Wed, 19 Dec 2012 01:55:42 +0100 (CET)\r\nReceived: from front07.exserver.dk (195.69.129.92) by\r\n EXHUB02.exchangeserver.dk (193.239.98.60) with Microsoft SMTP Server id\r\n 8.2.176.0; Wed, 19 Dec 2012 01:58:49 +0100\r\nReceived: from localhost (front07.exserver.dk [127.0.0.1])\tby\r\n front07.exserver.dk (Postfix) with ESMTP id 0B8287B4015\tfor\r\n <[email protected]>; Wed, 19 Dec 2012 01:55:45 +0100 (CET)\r\nX-Virus-Scanned: amavisd-new at exserver.dk\r\nReceived: from front07.exserver.dk ([127.0.0.1])\tby localhost\r\n (front07.exserver.dk [127.0.0.1]) (amavisd-new, port 10024)\twith ESMTP id\r\n vrjzzlpsuXn6 for <[email protected]>;\tWed, 19 Dec 2012 01:55:42 +0100 (CET)\r\nReceived: from shopmail.scannet.dk 
(shopmail.scannet.dk [195.69.129.120])\tby\r\n front07.exserver.dk (Postfix) with ESMTP id A6F797B4002\tfor\r\n <[email protected]>; Wed, 19 Dec 2012 01:55:42 +0100 (CET)\r\nReceived: from WebSrv100 (unknown [193.239.97.100])\tby shopmail.scannet.dk\r\n (Postfix) with ESMTP id 6DFEF7FE4E\tfor <[email protected]>; Wed, 19 Dec\r\n 2012 01:55:34 +0100 (CET)\r\nMIME-Version: 1.0\r\nFrom: me <[email protected]>\r\nTo: me <[email protected]>\r\nReply-To: <[email protected]>\r\nDate: Wed, 19 Dec 2012 01:55:44 +0100\r\nSubject: Ordre (Kopi)\r\nContent-Type: text/html; charset=\"utf-8\"\r\nContent-Transfer-Encoding: base64\r\nMessage-ID: <[email protected]>\r\nX-ScanNet-Forward: TTL=5\r\n\r\n\r\nT3JkcmUgZnJhIEdhbWVQSU1QOjxicj4NCi0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t\r\nLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS08YnI+DQpPcmRy\r\nZWRhdG86IDE5LTEyLTIwMTIgMDE6NTU6NDM8YnI+DQpPcmRyZW51bW1lcjogMTA4NjU0PGJy\r\nPg0KVHJhbnNha3Rpb25zSUQ6IDE2NzI4Ng0KPGJyPjxicj4NCkZha3R1cmVyaW5nc2FkcmVz\r\nc2U6PGJyPg0KLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t\r\nLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLTxicj48YnI+DQpMYXJzIFBldGVyc2VuPGJy\r\nIC8+QnlzdMOmdm5ldmVqIDY2LCBCw7hqZGVuPGJyIC8+NTYwMCBGYWFib3JnPGJyIC8+RGVu\r\nbWFyazxiciAvPlRMRjo6IDYwNjczNzY3PGJyIC8+PGEgaHJlZj0ibWFpbHRvOmZhc3RodWdv\r\nQGhvdG1haWwuY29tIj5mYXN0aHVnb0Bob3RtYWlsLmNvbTwvYT48YnIgLz4NCjxicj48YnI+\r\nDQpMZXZlcmluZ3NhZHJlc3NlOjxicj4NCi0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t\r\nLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS08YnI+PGJyPg0K\r\nTGFycyBQZXRlcnNlbjxiciAvPkJ5c3TDpnZuZXZlaiA2NiwgQsO4amRlbjxiciAvPjU2MDAg\r\nRmFhYm9yZzxiciAvPkRlbm1hcms8YnIgLz5UTEY6OiA2MDY3Mzc2NzxiciAvPjxhIGhyZWY9\r\nIm1haWx0bzpmYXN0aHVnb0Bob3RtYWlsLmNvbSI+ZmFzdGh1Z29AaG90bWFpbC5jb208L2E+\r\nPGJyIC8+DQo8YnI+PGJyPg0KT3JkcmVkYXRhOjxicj4NCi0tLS0tLS0tLS0tLS0tLS0tLS0t\r\nLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS08\r\nYnI+DQoNCiAgMSwwMCBzdGsuIFgzOiBUZXJyYW4gQ29uZmxpY3QgUEMgKDMxNDMyKSDDoSBE\r\nS0sgMjk4LDM5
IC0gSWFsdDogREtLIDM3Miw5OQ0KPGJyPg0KICAxLDAwIHN0ay4gQ3J5c2lz\r\nIE1heGltdW0gRWRpdGlvbiBQQyAoNDgwNDgpIMOhIERLSyAxNTcsNTkgLSBJYWx0OiBES0sg\r\nMTk2LDk5DQo8YnI+DQo8YnI+DQpCZXRhbGluZzogMjogRGFuc2tlIGtyZWRpdGtvcnQgW3Ry\r\nYW5zYWt0aW9uc2dlYnlyIDEsMjUlXSAoREtLIDcsMTMpDQo8YnI+DQpGb3JzZW5kZWxzZTog\r\nIChES0sgMCwwMCkNCjxicj48YnI+DQpTYW1sZXQgcHJpcyA6IERLSyA1NzcsMTENCjxicj4N\r\nCkhlcmFmIG1vbXM6IERLSyAxMTUsNDMNCg==\r\n\r\n", "BODY[HEADER.FIELDS (FROM TO DATE SUBJECT)]"=>"From: me <[email protected]>\r\nTo: me <[email protected]>\r\nDate: Wed, 19 Dec 2012 01:55:44 +0100\r\nSubject: Ordre (Kopi)\r\n\r\n"}>
## Getting the email information ##
header_attr = email.attr["BODY[HEADER.FIELDS (FROM TO DATE SUBJECT)]"]
header = Mail.read_from_string(header_attr) # => #<Mail::Message:70179653144480, Multipart: false, Headers: <Date: Wed, 19 Dec 2012 01:55:44 +0100>, <From: me <[email protected]>>, <To: me <[email protected]>>, <Subject: Ordre (Kopi)>>
I can then access the following:
header.date.to_time # => 2012-12-18 16:55:44 -0800
header.from.first # => [email protected]
header.to.first # => [email protected]
header.subject # => Ordre (Kopi)
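Because only four header fields are fetched, the lookups above could in principle be done without a full mail parser. The following is a minimal sketch under stated assumptions, not a replacement for the Mail gem: `parse_header_fields` is a hypothetical helper, the addresses are placeholders, and it does not decode RFC 2047 encoded words or handle multiple addresses per field.

```ruby
require "time"

# Hypothetical helper (not part of the Mail gem): turns a raw header block
# into a Hash of lowercase field name => unfolded value.
def parse_header_fields(raw)
  unfolded = raw.gsub(/\r?\n[ \t]+/, " ") # join folded continuation lines
  unfolded.each_line.with_object({}) do |line, fields|
    name, value = line.chomp.split(":", 2)
    next if value.nil? # skip the blank terminator and malformed lines
    fields[name.strip.downcase] = value.strip
  end
end

# Sample header with placeholder addresses (assumed for illustration).
raw = "From: me <me@example.com>\r\n" \
      "To: you <you@example.com>\r\n" \
      "Date: Wed, 19 Dec 2012 01:55:44 +0100\r\n" \
      "Subject: Ordre (Kopi)\r\n\r\n"

fields = parse_header_fields(raw)

# Pull the bare address out of "name <address>", falling back to the raw value.
from_address = fields["from"][/<([^>]+)>/, 1] || fields["from"]
sent_at = Time.rfc2822(fields["date"])
```

Whether this is actually faster than Mail.read_from_string for a given mailbox would need to be measured; it trades correctness on unusual headers for less work per message.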
Delay 1: Parsing the header takes 0.010000 seconds of CPU time:
puts Benchmark.measure { Mail.read_from_string(header_attr) } # => 0.010000 0.000000 0.010000 ( 0.004163)
## Getting the email message (body) ##
message_attr = email.attr["RFC822"]
message = Mail.read_from_string(message_attr) # => #<Mail::Message:70179643743140, Multipart: false, Headers: <Return-Path: <[email protected]>>, <Received: by 10.223.148.78 with SMTP id o14csp218630fav; Tue, 18 Dec 2012 16:55:50 -0800 (PST)>, <Received: from exproxy-1.exserver.dk (exproxy-1.exserver.dk. [195.69.129.162]) by mx.google.com with ESMTP id m13si17440569wie.32.2012.12.18.16.55.47; Tue, 18 Dec 2012 16:55:48 -0800 (PST)>, <Received: by exproxy-1.exserver.dk (Postfix, from userid 65534) id 5330511CDCB; Wed, 19 Dec 2012 01:55:42 +0100 (CET)>, <Received: from EXHUB02.exchangeserver.dk (exhub02.exchangeserver.dk [193.239.98.62]) by exproxy-1.exserver.dk (Postfix) with ESMTP id 4735211A58E for <[email protected]>; Wed, 19 Dec 2012 01:55:42 +0100 (CET)>, <Received: from front07.exserver.dk (195.69.129.92) by EXHUB02.exchangeserver.dk (193.239.98.60) with Microsoft SMTP Server id 8.2.176.0; Wed, 19 Dec 2012 01:58:49 +0100>, <Received: from localhost (front07.exserver.dk [127.0.0.1]) by front07.exserver.dk (Postfix) with ESMTP id 0B8287B4015 for <[email protected]>; Wed, 19 Dec 2012 01:55:45 +0100 (CET)>, <Received: from front07.exserver.dk ([127.0.0.1]) by localhost (front07.exserver.dk [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id vrjzzlpsuXn6 for <[email protected]>; Wed, 19 Dec 2012 01:55:42 +0100 (CET)>, <Received: from shopmail.scannet.dk (shopmail.scannet.dk [195.69.129.120]) by front07.exserver.dk (Postfix) with ESMTP id A6F797B4002 for <[email protected]>; Wed, 19 Dec 2012 01:55:42 +0100 (CET)>, <Received: from WebSrv100 (unknown [193.239.97.100]) by shopmail.scannet.dk (Postfix) with ESMTP id 6DFEF7FE4E for <[email protected]>; Wed, 19 Dec 2012 01:55:34 +0100 (CET)>, <Date: Wed, 19 Dec 2012 01:55:44 +0100>, <From: me <[email protected]>>, <Reply-To: <[email protected]>>, <To: me <[email protected]>>, <Message-ID: <[email protected]>>, <Subject: Ordre (Kopi)>, <Mime-Version: 1.0>, <Content-Type: text/html; charset="utf-8">, 
<Content-Transfer-Encoding: base64>, <Delivered-To: [email protected]>, <X-Received: by 10.194.177.199 with SMTP id cs7mr8044338wjc.41.1355878548414; Tue, 18 Dec 2012 16:55:48 -0800 (PST)>, <Received-SPF: pass (google.com: domain of [email protected] designates 195.69.129.162 as permitted sender) client-ip=195.69.129.162;>, <Authentication-Results: mx.google.com; spf=pass (google.com: domain of [email protected] designates 195.69.129.162 as permitted sender) [email protected]>, <X-Virus-Scanned: amavisd-new at exserver.dk>, <X-ScanNet-Forward: TTL=5>>
In order to ensure UTF-8 encoding I do the following:
if message.multipart?
body = message.text_part.decoded.force_encoding("UTF-8").encode("UTF-8")
else
body = message.body.decoded.force_encoding(message.charset).encode("UTF-8") # => "Ordre fra mig:<br>\r\n-------------------------------------------------------------------------<br>\r\nOrdredato: 19-12-2012 01:55:43<br>\r\nOrdrenummer: 108654<br>\r\nTransaktionsID: 167286\r\n<br><br>\r\nFaktureringsadresse:<br>\r\n-------------------------------------------------------------------------<br><br>\r\nLars Larsen<br />En vej 66, Bøjden<br />1900 Frederiksberg<br />Denmark<br />TLF:: 12345678<br /><a href=\"mailto:[email protected]\">[email protected]</a><br />\r\n<br><br>\r\nLeveringsadresse:<br>\r\n-------------------------------------------------------------------------<br><br>\r\nLars Larsen<br />En vej 66<br />1900 Frederiksberg<br />Denmark<br />TLF:: 12345678<br /><a href=\"mailto:[email protected]\">[email protected]</a><br />\r\n<br><br>\r\nOrdredata:<br>\r\n-------------------------------------------------------------------------<br>\r\n\r\n 1,00 stk. X3: Terran Conflict PC (31432) á DKK 298,39 - Ialt: DKK 372,99\r\n<br>\r\n 1,00 stk. Crysis Maximum Edition PC (48048) á DKK 157,59 - Ialt: DKK 196,99\r\n<br>\r\n<br>\r\nBetaling: 2: Danske kreditkort [transaktionsgebyr 1,25%] (DKK 7,13)\r\n<br>\r\nForsendelse: (DKK 0,00)\r\n<br><br>\r\nSamlet pris : DKK 577,11\r\n<br>\r\nHeraf moms: DKK 115,43\r\n"
end
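Two details in the snippet above may be worth guarding against: calling encode("UTF-8") on a string that is already tagged as UTF-8 does not by itself repair invalid byte sequences, and force_encoding raises if the charset is missing. The sketch below shows one way to handle both using only core Ruby. The helper name `to_utf8` and the "?" replacement character are assumptions for illustration.

```ruby
# Hypothetical helper: tags raw bytes with their declared charset and
# returns a valid UTF-8 string, replacing bytes that cannot be converted.
def to_utf8(bytes, charset)
  source = begin
    Encoding.find(charset || "UTF-8")
  rescue ArgumentError
    Encoding::UTF_8 # unknown charset label: assume UTF-8
  end

  text = bytes.dup.force_encoding(source)
  if source == Encoding::UTF_8
    text.scrub("?") # same encoding, so only repair invalid sequences
  else
    text.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
  end
end

# A base64 body fragment, decoded with core String#unpack1.
raw_bytes = "QsO4amRlbg==".unpack1("m")
body = to_utf8(raw_bytes, "utf-8")
```

With the Mail gem, the same helper could be applied to message.body.decoded and message.charset, again assuming those return the raw bytes and the declared charset.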
Delay 2: Parsing the full message takes 0.050000 seconds of CPU time:
puts Benchmark.measure { Mail.read_from_string(message_attr) } # => 0.050000 0.000000 0.050000 ( 0.054013)
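A single Benchmark.measure run can be noisy, so averaging over many iterations gives a steadier per-call figure. The sketch below assumes a hypothetical helper `average_seconds` and uses a trivial string split as a stand-in workload; in practice the block body would be the Mail.read_from_string call. It also projects the measured 0.054013 seconds per message onto one million messages.

```ruby
require "benchmark"

# Hypothetical helper: runs the block `iterations` times and returns the
# mean wall-clock seconds per call.
def average_seconds(iterations)
  total = Benchmark.realtime { iterations.times { yield } }
  total / iterations
end

# Stand-in workload (assumed); swap in the real parsing call to measure it.
sample = "Subject: Ordre (Kopi)\r\n\r\n"
per_call = average_seconds(1_000) { sample.split("\r\n") }
puts format("%.6f s per call", per_call)

# Projection from the wall-clock time reported above for one full message.
measured_seconds = 0.054013
estimated_hours = measured_seconds * 1_000_000 / 3600.0
puts format("about %.1f hours for one million messages", estimated_hours)
```

At that rate, parsing alone comes to roughly 15 hours per million messages on a single core, which is why the per-message cost matters here.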
If you've got the email already parsed into fields...
...then why make Mail::new parse it out all over again? Instead of calling Mail.read_from_string(message_attr), try something like this: