Should hyphen-minus (U+002D) or hyphen (U+2010) be used for ISO 8601 datetimes?

The Python interpreter gives the following when generating an ISO 8601 formatted date/time string:

>>> import datetime
>>> datetime.datetime.now().isoformat(timespec='seconds')
'2023-10-12T22:35:02'

Note that the '-' character in the string is a hyphen-minus character (U+002D). Going the other way, to produce the datetime object from the string, we do the following:

>>> datetime.datetime.strptime('2023-10-12T22:35:02', '%Y-%m-%dT%H:%M:%S')
datetime.datetime(2023, 10, 12, 22, 35, 2)

This all checks out.

However, when the ISO 8601 formatted date/time string comes from an external source, such as a parameter in a GET/POST request or a .csv file, the hyphens are sometimes sent as the hyphen character (U+2010), which causes parsing to break:

>>> datetime.datetime.strptime('2023‐10‐12T22:35:02', '%Y-%m-%dT%H:%M:%S')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/_strptime.py", line 568, in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/_strptime.py", line 349, in _strptime
    raise ValueError("time data %r does not match format %r" %
ValueError: time data '2023‐10‐12T22:35:02' does not match format '%Y-%m-%dT%H:%M:%S'

What does the standard actually require? Is it hyphen-minus (U+002D), as produced by Python's .isoformat(), or hyphen (U+2010)?

Would it be best practice to accept both?

1
RandomCoder368 On

I would recommend ASCII 0x2D (hyphen-minus), because ASCII is supported almost everywhere and is less likely to break. If you care about compatibility, call .replace("\u2010", "-") on incoming strings to normalize them to ASCII; if a consumer insists on the U+2010 form, .replace("-", "\u2010") converts the other way. If you don't care, just let your users deal with it (I recommend ASCII).
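As a minimal sketch of that normalization, using the format string from the question: `parse_timestamp` is a hypothetical helper name, not part of the standard library, and it only handles the one format shown.

```python
import datetime

HYPHEN = "\u2010"  # Unicode HYPHEN, as seen in the failing input


def parse_timestamp(text):
    """Parse 'YYYY-MM-DDTHH:MM:SS', tolerating U+2010 in place of '-'."""
    normalized = text.replace(HYPHEN, "-")
    return datetime.datetime.strptime(normalized, "%Y-%m-%dT%H:%M:%S")


# The string that raised ValueError in the question now parses.
print(parse_timestamp("2023\u201010\u201012T22:35:02"))
```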

3
Keith Thompson On

The ISO 8601 standard is not publicly available for free. Perhaps someone who has a copy can post a more definitive answer.

ISO has published a brief summary of the ISO 8601 standard. The summary consistently uses HYPHEN-MINUS (0x2D). (Thanks to Giacomo Catenazzi for pointing this out in a comment.)

RFC 3339 is based on ISO 8601, and it consistently uses the HYPHEN-MINUS character (0x2D), not the Unicode HYPHEN character (U+2010). Note that using HYPHEN-MINUS, which is an ASCII character, avoids issues with differing character sets.

Reference: https://datatracker.ietf.org/doc/html/rfc3339

If you create timestamps intended to be consistent with ISO 8601, you should definitely use HYPHEN-MINUS.

If you receive timestamps that are supposed to be ISO 8601 but include HYPHEN (U+2010) characters, you can choose to accept them. Whether you should depends on the requirements of your project. If possible, ask whoever is generating the timestamps to use the correct HYPHEN-MINUS characters. Once you start accepting non-standard input, you might be signing up for an open-ended amount of work.
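If you decide to reject such input, it may help to tell the sender exactly which characters are wrong. The following is only a sketch; `describe_non_ascii` is a made-up helper name, and it reports every non-ASCII character rather than just U+2010.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (index, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 0x7F
    ]


# Reports the two U+2010 HYPHEN characters at indexes 4 and 7.
for problem in describe_non_ascii("2023\u201010\u201012T22:35:02"):
    print(problem)
```

An empty list means the string is pure ASCII, so any remaining parse failure has some other cause.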

0
Crissov On

TL;DR: Implementations should accept both, U+2010 as a date component separator and U+2212 as a sign for year and timezone offsets, and generate neither.

Previous editions of ISO 8601 and its predecessors did not say anything about the actual characters to be used, i.e. they assumed the unified - Hyphen-Minus which became U+002D – the early editions predate Unicode. The 2019 revision, however, has this to say, in §3.2.1:

All characters used in date and time expressions and representations are part of the ISO/IEC 646 repertoire, except for “hyphen”, “minus” and “plus-minus”. In an environment where use is made of a character repertoire based on ISO/IEC 646, “hyphen” and “minus” should be both mapped onto “hyphen-minus”.

The Hyphen is U+2010 in Unicode and the Minus is U+2212.

This paragraph has been interpreted differently. Some understand it as ISO recommending those characters, others say that ISO/IEC 10646 (= Unicode) is based upon ISO/IEC 646 (≈ ASCII) which makes the mapping apply.

The safest conclusion according to Postel’s Law is that robust implementations should accept these for reading, but should only write them if the ominous “environment” actually calls for it.
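A rough Python sketch of "accept when reading, write only ASCII" could look like this. The names `TO_ASCII`, `read_iso` and `write_iso` are my own, and this blanket mapping does not check which role each character plays in the string.

```python
import datetime

# Map HYPHEN (U+2010) and MINUS SIGN (U+2212) onto HYPHEN-MINUS (U+002D).
TO_ASCII = str.maketrans({"\u2010": "-", "\u2212": "-"})


def read_iso(text):
    """Accept hyphen and minus on input by mapping them to hyphen-minus."""
    return datetime.datetime.fromisoformat(text.translate(TO_ASCII))


def write_iso(value):
    """Always emit plain ASCII; isoformat() only produces hyphen-minus."""
    return value.isoformat(timespec="seconds")


stamp = read_iso("2023\u201010\u201012T22:35:02\u221205:00")
print(write_iso(stamp))  # 2023-10-12T22:35:02-05:00
```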


Personally, I would go as far as recommending some other lenient replacements when parsing a date time string, e.g.:

  • Case insensitive.
  • The separator between the date and the time component T can be a space or underscore _ instead.
  • The period separator / can be substituted by -- or En Dash U+2013.

However, regular implementations should reject Hyphen used as a negative sign and Minus used as a separator!
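To illustrate the role distinction, here is a sketch of a lenient parser that accepts Hyphen only as a date separator and Minus only as an offset sign. `PATTERN` and `parse_lenient` are hypothetical names. It covers a single date/time layout with an optional Z or ±hh:mm offset, and leaves out signed years and the period separator.

```python
import datetime
import re

# Date separators: HYPHEN-MINUS or HYPHEN (U+2010), but not MINUS SIGN.
# Offset sign: '+', HYPHEN-MINUS or MINUS SIGN (U+2212), but not HYPHEN.
PATTERN = re.compile(
    r"(\d{4})[-\u2010](\d{2})[-\u2010](\d{2})"
    r"[T _]"  # date/time separator: T, space or underscore
    r"(\d{2}):(\d{2}):(\d{2})"
    r"(?:(Z)|([+\-\u2212])(\d{2}):(\d{2}))?",
    re.IGNORECASE | re.ASCII,  # 't' and 'z' are accepted; \d means 0-9 only
)


def parse_lenient(text):
    match = PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not an accepted date/time string: {text!r}")
    groups = match.groups()
    year, month, day, hour, minute, second = (int(g) for g in groups[:6])
    zulu, sign, off_h, off_m = groups[6:]
    tzinfo = None
    if zulu:
        tzinfo = datetime.timezone.utc
    elif sign:
        delta = datetime.timedelta(hours=int(off_h), minutes=int(off_m))
        tzinfo = datetime.timezone(delta if sign == "+" else -delta)
    return datetime.datetime(year, month, day, hour, minute, second, tzinfo=tzinfo)


print(parse_lenient("2023\u201010\u201012t22:35:02"))        # hyphen as separator
print(parse_lenient("2023-10-12 22:35:02\u221205:00"))       # minus as offset sign
```

A string such as "2023−10−12T22:35:02" (Minus as separator) or an offset written with Hyphen fails the match and raises ValueError, which is the rejection behaviour described above.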