I am trying to detect file encoding using LispWorks.
LispWorks should be capable of this; see External Formats and File Streams.
[Note: details based on @rainer-joswig and @svante comments]
system:*file-encoding-detection-algorithm* is set to its default:
(setf system:*file-encoding-detection-algorithm*
      '(find-filename-pattern-encoding-match
        find-encoding-option
        detect-utf32-bom
        detect-unicode-bom
        detect-utf8-bom
        specific-valid-file-encoding
        locale-file-encoding))
And also,
;; Specify the correct characters
(lw:set-default-character-element-type 'cl:character)
Using some publicly available test files, the UNICODE and LATIN-1 ones are detected properly:
;; UNICODE
;; http://www.humancomp.org/unichtm/tongtwst.htm
(with-open-file (ss "/tmp/tongtwst.htm")
  (stream-external-format ss))
;; => (:UNICODE :LITTLE-ENDIAN T :EOL-STYLE :CRLF)
;; LATIN-1
(with-open-file (ss "/tmp/windows-1252-2000.ucm")
  (stream-external-format ss))
;; => (:LATIN-1 :EOL-STYLE :LF)
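Since the same check is repeated for every file below, it can be wrapped in a small helper. This is only a convenience sketch in standard Common Lisp; the names `detected-external-format` and `report-detected-formats` are made up for illustration and are not part of LispWorks.

```lisp
;; Hypothetical helpers (not a LispWorks API): they only wrap the
;; WITH-OPEN-FILE / STREAM-EXTERNAL-FORMAT pattern used in this question.

(defun detected-external-format (path)
  "Open PATH with default arguments and return the stream's external format."
  (with-open-file (stream path)
    (stream-external-format stream)))

(defun report-detected-formats (paths)
  "Print one line per path; return an alist of (path . external-format)."
  (loop for path in paths
        for detected = (detected-external-format path)
        do (format t "~&~A => ~S~%" path detected)
        collect (cons path detected)))
```

For example, `(report-detected-formats '("/tmp/tongtwst.htm" "/tmp/windows-1252-2000.ucm"))` prints the detected format of both files in one call, which makes it easier to compare results after each change to the detection settings.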
Detecting UTF-8 does not work right away:
;; UTF-8 encoding
;; http://www.humancomp.org/unichtm/tongtwst8.htm
(with-open-file (ss "/tmp/tongtws8.htm")
  (stream-external-format ss))
;; => (:LATIN-1 :EOL-STYLE :CRLF)
Adding UTF-8 to *specific-valid-file-encodings* makes it work:
(pushnew :utf-8 system:*specific-valid-file-encodings*)
;; system:*specific-valid-file-encodings*
;; => (:UTF-8)
;; http://www.humancomp.org/unichtm/tongtwst8.htm
(with-open-file (ss "/tmp/tongtws8.htm")
  (stream-external-format ss))
;; => (:UTF-8 :EOL-STYLE :CRLF)
But now the same LATIN-1 file as above is detected as UTF-8:
(with-open-file (ss "/tmp/windows-1252-2000.ucm")
  (stream-external-format ss))
;; => (:UTF-8 :EOL-STYLE :LF)
Pushing LATIN-1 to *specific-valid-file-encodings* as well:
(pushnew :latin-1 system:*specific-valid-file-encodings*)
;; system:*specific-valid-file-encodings*
;; => (:LATIN-1 :UTF-8)
;; This one works again
(with-open-file (ss "/tmp/windows-1252-2000.ucm")
  (stream-external-format ss))
;; => (:LATIN-1 :EOL-STYLE :LF)
;; But this one, which was properly detected as UTF-8,
;; is now detected as LATIN-1, which is wrong.
(with-open-file (ss "/tmp/tongtws8.htm")
  (stream-external-format ss))
;; => (:LATIN-1 :EOL-STYLE :CRLF)
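The results above would be consistent with a "first valid encoding wins" check, where the order of *specific-valid-file-encodings* decides the outcome. The following is a minimal model of that idea in portable Common Lisp, not LispWorks' actual implementation: `utf-8-valid-p`, `latin-1-valid-p`, `*validators*` and `first-valid-encoding` are all hypothetical names, and the UTF-8 check is simplified (it checks byte structure only, not overlong forms or surrogates).

```lisp
;; Assumed model only: try each candidate encoding in order and
;; return the first one whose validator accepts the bytes.

(defun utf-8-valid-p (octets)
  "True if the vector OCTETS has well-formed UTF-8 lead/continuation structure."
  (let ((pending 0))                    ; continuation bytes still expected
    (loop for octet across octets
          do (cond ((plusp pending)
                    ;; Inside a multi-byte sequence: must be 10xxxxxx.
                    (if (= (logand octet #xC0) #x80)
                        (decf pending)
                        (return-from utf-8-valid-p nil)))
                   ((< octet #x80))     ; plain ASCII, nothing to do
                   ((= (logand octet #xE0) #xC0) (setf pending 1))
                   ((= (logand octet #xF0) #xE0) (setf pending 2))
                   ((= (logand octet #xF8) #xF0) (setf pending 3))
                   (t (return-from utf-8-valid-p nil))))
    (zerop pending)))

(defun latin-1-valid-p (octets)
  "Every octet maps to some Latin-1 character, so this check never fails."
  (declare (ignore octets))
  t)

(defparameter *validators*
  (list (cons :utf-8 #'utf-8-valid-p)
        (cons :latin-1 #'latin-1-valid-p)))

(defun first-valid-encoding (octets candidates)
  "Return the first encoding in CANDIDATES whose validator accepts OCTETS."
  (find-if (lambda (encoding)
             (funcall (cdr (assoc encoding *validators*)) octets))
           candidates))

(defparameter *utf-8-sample* #(#x63 #x61 #x66 #xC3 #xA9))  ; "café" as UTF-8
(defparameter *latin-1-sample* #(#x63 #xE9))               ; "cé" as Latin-1
(defparameter *ascii-sample* #(#x61 #x62 #x63))            ; "abc"

(first-valid-encoding *utf-8-sample* '(:latin-1 :utf-8))    ; => :LATIN-1
(first-valid-encoding *utf-8-sample* '(:utf-8 :latin-1))    ; => :UTF-8
(first-valid-encoding *latin-1-sample* '(:utf-8 :latin-1))  ; => :LATIN-1
(first-valid-encoding *ascii-sample* '(:utf-8 :latin-1))    ; => :UTF-8
```

Under this model, a list starting with :LATIN-1 can never report UTF-8, because any byte sequence passes the Latin-1 check. A file containing only ASCII bytes is valid in both encodings, so it is reported as whichever comes first.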
What am I doing wrong?
How can I correctly detect file encoding using LispWorks?