How to correctly detect file encodings with LispWorks?

I am trying to detect file encoding using LispWorks.

LispWorks should be capable of this; see "External Formats and File Streams" in its documentation.

[Note: details based on comments from @rainer-joswig and @svante]

system:*file-encoding-detection-algorithm* is set to its default,

(setf system:*file-encoding-detection-algorithm*
      '(find-filename-pattern-encoding-match
       find-encoding-option
       detect-utf32-bom
       detect-unicode-bom
       detect-utf8-bom
       specific-valid-file-encoding
       locale-file-encoding))

And also,

;; Specify the correct characters
(lw:set-default-character-element-type 'cl:character)

Some verifiable files available here:

UNICODE and LATIN-1 are properly detected:

;; UNICODE
;; http://www.humancomp.org/unichtm/tongtwst.htm
(with-open-file (ss "/tmp/tongtwst.htm")
  (stream-external-format ss))
;; => (:UNICODE :LITTLE-ENDIAN T :EOL-STYLE :CRLF)

;; LATIN-1
(with-open-file (ss "/tmp/windows-1252-2000.ucm")
  (stream-external-format ss))
;; => (:LATIN-1 :EOL-STYLE :LF)

Detecting UTF-8 does not work right away,

;; UTF-8 encoding
;; http://www.humancomp.org/unichtm/tongtwst8.htm
(with-open-file (ss "/tmp/tongtws8.htm")
  (stream-external-format ss))
;; => (:LATIN-1 :EOL-STYLE :CRLF)

Adding UTF-8 to *specific-valid-file-encodings* makes it work,

(pushnew :utf-8 system:*specific-valid-file-encodings*)
;; system:*specific-valid-file-encodings*
;; => (:UTF-8)

;; http://www.humancomp.org/unichtm/tongtwst8.htm
(with-open-file (ss "/tmp/tongtws8.htm")
  (stream-external-format ss))
;; => (:UTF-8 :EOL-STYLE :CRLF)

But now the same LATIN-1 file as above is detected as UTF-8,

(with-open-file (ss "/tmp/windows-1252-2000.ucm")
  (stream-external-format ss))
;; => (:UTF-8 :EOL-STYLE :LF)
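
One possible reading of this result, offered as an assumption rather than as documented LispWorks behavior: if the .ucm file contains only 7-bit ASCII bytes, it is also well-formed UTF-8, so a check based purely on validity cannot tell the two encodings apart. The portable Common Lisp sketch below illustrates the idea. `valid-utf-8-p` is a hypothetical helper written for this example, not a LispWorks function, and it checks only the lead/continuation byte structure (it does not reject overlong forms or surrogates).

```lisp
;; Hypothetical helper, not part of LispWorks: a simplified structural
;; UTF-8 check over a sequence of octets.
(defun valid-utf-8-p (octets)
  "Return true if OCTETS has well-formed UTF-8 lead/continuation structure."
  (let ((pending 0))                    ; continuation bytes still expected
    (and (every (lambda (octet)
                  (cond ((plusp pending)
                         ;; A continuation byte must look like 10xxxxxx.
                         (when (= (logand octet #xC0) #x80)
                           (decf pending)
                           t))
                        ((< octet #x80) t)                          ; ASCII
                        ((= (logand octet #xE0) #xC0) (setf pending 1) t)
                        ((= (logand octet #xF0) #xE0) (setf pending 2) t)
                        ((= (logand octet #xF8) #xF0) (setf pending 3) t)
                        (t nil)))       ; stray continuation or invalid lead
                octets)
         ;; The data must not end in the middle of a multi-byte sequence.
         (zerop pending))))

;; ASCII-only bytes ("abc") pass the UTF-8 check.
(valid-utf-8-p #(97 98 99))
;; "é" encoded as UTF-8 (C3 A9) passes as well.
(valid-utf-8-p #(#xC3 #xA9))
;; "é" encoded as LATIN-1 (a lone E9) does not.
(valid-utf-8-p #(#xE9))
```

If the file does contain bytes above #x7F that are not well-formed UTF-8, this explanation does not apply and something else is going on.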

Pushing LATIN-1 to *specific-valid-file-encodings* as well,

(pushnew :latin-1 system:*specific-valid-file-encodings*)
;; system:*specific-valid-file-encodings*
;; => (:LATIN-1 :UTF-8)

;; This one works again
(with-open-file (ss "/tmp/windows-1252-2000.ucm")
  (stream-external-format ss))
;; => (:LATIN-1 :EOL-STYLE :LF)

;; But this one, which was properly detected as `UTF-8`,
;; is now detected as `LATIN-1`, *which is wrong.*
(with-open-file (ss "/tmp/tongtws8.htm")
  (stream-external-format ss))
;; => (:LATIN-1 :EOL-STYLE :CRLF)

What am I doing wrong?

How can I correctly detect file encoding using LispWorks?
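
One thing that may be worth trying, assuming that specific-valid-file-encoding tries the entries of system:*specific-valid-file-encodings* in list order and accepts the first one for which the file is valid (this ordering behavior is an assumption here and should be checked against the LispWorks documentation): pushnew adds to the front of the list, so after the second pushnew :latin-1 is tried before :utf-8. Since every byte sequence decodes as LATIN-1, it would then always win. Under that assumption, listing the stricter encoding first might help:

```lisp
;; Assumption: entries are tried in list order and the first valid one wins.
;; Put the stricter encoding (:utf-8) before the catch-all (:latin-1).
(setf system:*specific-valid-file-encodings* '(:utf-8 :latin-1))

;; Re-run the detection on both test files from above.
(with-open-file (ss "/tmp/tongtws8.htm")
  (stream-external-format ss))

(with-open-file (ss "/tmp/windows-1252-2000.ucm")
  (stream-external-format ss))
```

With this order, a file containing only ASCII bytes would presumably still be reported as UTF-8, but UTF-8 and LATIN-1 decode ASCII bytes identically, so the text read from such a file would be the same either way.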
