I'm trying to build a method that imports multiple kinds of CSV and Excel files and standardizes them. Everything ran smoothly until a certain CSV showed up and raised this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcd in position 133: invalid continuation byte

I'm building a chain of try/excepts to cover variations of the files, but for this one I couldn't figure out how to prevent the error.

    if csv_or_excel_path[-3:]=='csv':
        try: table=pd.read_csv(csv_or_excel_path)
        except:
            try: table=pd.read_csv(csv_or_excel_path,sep=';')
            except:
                try: table=pd.read_csv(csv_or_excel_path,sep='\t')
                except:
                    try: table=pd.read_csv(csv_or_excel_path,encoding='utf-8')
                    except:
                        try: table=pd.read_csv(csv_or_excel_path,encoding='utf-8',sep=';')
                        except: table=pd.read_csv(csv_or_excel_path,encoding='utf-8',sep='\t')

By the way, the separator of the file is ";".

So:

a) I understand it would be easier to track down the problem if I could identify what the character at "position 133" actually is, but I'm not sure how to find that out. Any suggestions?

b) Does anyone have a suggestion on what to include in that try/except sequence to get around this problem?

There are 3 answers below

BEST ANSWER
For the record, this is probably better than multiple try/excepts:

import os
import pandas as pd

def read_csv(filepath):
    if os.path.splitext(filepath)[1] != '.csv':
        return  # or whatever
    seps = [',', ';', '\t']                    # ',' is default
    encodings = [None, 'utf-8', 'ISO-8859-1']  # None is default
    for sep in seps:
        for encoding in encodings:
            try:
                return pd.read_csv(filepath, encoding=encoding, sep=sep)
            except Exception:  # should really be more specific
                pass
    raise ValueError("{!r} has no encoding in {} or separator in {}"
                     .format(filepath, encodings, seps))
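As a side note, the separator does not have to be brute-forced at all: the standard library's `csv.Sniffer` can usually detect it from a sample of the text. A small sketch (the sample string here stands in for the first lines of a real file):

```python
import csv

# Let the stdlib guess the delimiter from a sample of the text,
# restricted to the same candidates as above (sample is illustrative).
sample = "name;value\nJose;42\nAna;7\n"
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")
print(dialect.delimiter)  # ";"
```

The detected `dialect.delimiter` can then be passed straight to `pd.read_csv` as `sep`.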
ANSWER
Another possibility is to do:

with open(path_to_file, encoding="utf8", errors="ignore") as f:
    table = pd.read_csv(f, sep=";")

With errors="ignore", problematic byte sequences are omitted from read() calls; with errors="replace", they are substituted with a replacement character instead. In general this reduces the need for lots of painful nested try/excepts.
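The difference between the two error-handling modes can be seen directly on a bytes object (the byte string here is illustrative, reusing the 0xcd byte from the traceback):

```python
# 0xcd is an invalid continuation byte in UTF-8, as in the traceback.
raw = b"Jos\xcd;42\n"
print(raw.decode("utf-8", errors="ignore"))   # bad byte dropped: "Jos;42"
print(raw.decode("utf-8", errors="replace"))  # bad byte becomes U+FFFD
```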

ANSWER
Thanks for the support @woblers and @FHTMitchell. The problem was that the CSV had an unusual encoding: ISO-8859-1.

I fixed it by adding a few lines to the try/except sequence. Here you can see the full version of it.

    if csv_or_excel_path[-3:]=='csv':
        try: table=pd.read_csv(csv_or_excel_path)
        except:
            try: table=pd.read_csv(csv_or_excel_path,sep=';')
            except:
                try: table=pd.read_csv(csv_or_excel_path,sep='\t')
                except:
                    try: table=pd.read_csv(csv_or_excel_path,encoding='utf-8')
                    except:
                        try: table=pd.read_csv(csv_or_excel_path,encoding='utf-8',sep=';')
                        except:
                            try: table=pd.read_csv(csv_or_excel_path,encoding='utf-8',sep='\t')
                            except:
                                try: table=pd.read_csv(csv_or_excel_path,encoding="ISO-8859-1")
                                except:
                                    try: table=pd.read_csv(csv_or_excel_path,encoding="ISO-8859-1",sep=";")
                                    except: table=pd.read_csv(csv_or_excel_path,encoding="ISO-8859-1",sep="\t")
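As a quick sanity check, the 0xcd byte from the original traceback cannot appear where it did in valid UTF-8, but in ISO-8859-1 every byte maps to a character, which is consistent with this fix:

```python
# 0xcd is invalid as a UTF-8 continuation byte, but in ISO-8859-1
# (Latin-1) it is simply the character "Í".
print(b"\xcd".decode("iso-8859-1"))  # Í
```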