Within a wxPython application, I am using the following code to detect the 'dialect' of a CSV file:
pathname = dlg.GetPath()
try:
    self.file = open(pathname, 'r', encoding='utf-8')
except IOError:
    # self.file is not assigned when open() fails, so report the path itself
    wx.LogError("Cannot open file '%s'." % ntpath.basename(pathname))
    return
# check for file format with sniffer
try:
    # decoding happens in read(), so it belongs inside the try block
    sample = self.file.read(1024)
    dialect = csv.Sniffer().sniff(sample)
except UnicodeDecodeError:
    wx.LogError("Cannot decode file '%s'." % ntpath.basename(pathname))
    return
except csv.Error:
    wx.LogError("Cannot determine dialect of '%s'." % ntpath.basename(pathname))
    return
The first lines of the csv file I am using this on are:
t;3.1.A.;"UN ECE R51; Sound levels"
;;Is covered by the type approval of the vehicle stage 1, refer to Annex S.
t;3.2.A.;"715/2007/EC; Emissions light duty vehicles Euro 6"
;;Is covered by the type approval of the vehicle stage 1, refer to Annex S.
t;3.3.A.;"UN ECE R34; Fuel tanks"
;;Is covered by the type approval of the vehicle stage 1.
t;3.16.A.;"UN ECE R26; Exterior projections"
;3.16.A.1.;Test and inspections
The delimiter is supposed to be ';' and the quote char '"'. I am aware that there are lots of commas, semicolons and quote chars to confuse the sniffer, but when running this code with Python 3.6 on Windows, it works perfectly. Running it on Linux (also with Python 3.6) invariably raises the csv.Error (also on other CSV files with the same delimiters and quote chars). I have tried this with read(1024), with other values and also with readline, but always get the same result.
Any explanation for this different behaviour?
The default end-of-record delimiter differs between Windows and Linux. Records are commonly terminated with a CR-LF pair on Windows, while on *nix a single LF is the norm. It may be that your sniffer is locking itself into Windows mode and needs help deciding what the real line terminator should be.
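To see which line terminators a given file really contains, something like the following could help. It is only a diagnostic sketch: the helper name count_line_endings is made up for illustration, and the idea is to run it on the raw bytes (file opened in 'rb' mode) on both machines and compare the results.

```python
def count_line_endings(raw: bytes) -> dict:
    """Count CR-LF pairs, lone LFs and lone CRs in a chunk of raw bytes."""
    crlf = raw.count(b"\r\n")
    # Every CR-LF pair also contains one LF and one CR, so subtract the pairs.
    lone_lf = raw.count(b"\n") - crlf
    lone_cr = raw.count(b"\r") - crlf
    return {"CRLF": crlf, "LF": lone_lf, "CR": lone_cr}


if __name__ == "__main__":
    # Hypothetical bytes; in practice use open(pathname, 'rb').read(4096)
    print(count_line_endings(b"t;3.1.A.;x\r\n;;y\r\n"))
```

If the counts differ between the two systems, the file itself was changed somewhere along the way (for example by a transfer tool or version control), which would point away from the csv module.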
From the docs it seems that the sniffer defaults to
\r\n
which I think is Windows-flavoured. It should cope with alternate line terminators, but perhaps something is being forced somewhere. If the records in your data file are longer than 1024 characters, or the sample does not contain enough line terminators to guess the format correctly, that might have something to do with it.
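One way to rule out the sample-size issue is to hand the sniffer only complete lines and to restrict the candidate delimiters through the delimiters argument of Sniffer.sniff(). This is a rough sketch, not a confirmed fix for the Linux behaviour: the function name sniff_whole_lines and the defaults of 20 lines and ";," are assumptions chosen for illustration.

```python
import csv
import io


def sniff_whole_lines(text_file, max_lines=20, delimiters=";,"):
    """Sniff a dialect from complete lines; return None if sniffing fails."""
    lines = []
    for _ in range(max_lines):
        line = text_file.readline()
        if not line:  # end of file reached
            break
        lines.append(line)
    sample = "".join(lines)  # never ends in the middle of a record
    try:
        # Only characters listed in `delimiters` are considered as delimiter.
        return csv.Sniffer().sniff(sample, delimiters=delimiters)
    except csv.Error:
        return None


if __name__ == "__main__":
    demo = io.StringIO("a;b;c\nd;e;f\ng;h;i\n")
    dialect = sniff_whole_lines(demo)
    print(dialect.delimiter if dialect else "could not determine dialect")
```

In the wxPython code this would be called with self.file, followed by self.file.seek(0) before the file is passed to csv.reader, since the sampled lines have already been consumed.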