numpy genfromtxt - how to detect bad int input values

Here is a trivial example of a bad int value passed to numpy.genfromtxt. For some reason, I can't detect this bad value, because it shows up as a valid int of -1.

>>> bad = '''a,b
0,BAD
1,2
3,4'''.splitlines()

My input here has 2 columns of ints, named a and b. b has a bad value, where we have a string "BAD" instead of an integer. However, when I call genfromtxt, I cannot detect this bad value.

>>> import numpy as np
>>> out = np.genfromtxt(bad, delimiter=',', dtype=(np.dtype('int64'), np.dtype('int64')), names=True, usemask=True, usecols=tuple('ab'))
>>> out

masked_array(data=[(0, -1), (1, 2), (3, 4)],
         mask=[(False, False), (False, False), (False, False)],
   fill_value=(999999, 999999),
        dtype=[('a', '<i8'), ('b', '<i8')])

>>> out['b'].data
array([-1,  2,  4])

I print out the column 'b' from my output, and I'm shocked to see that it has a -1 where the string "BAD" is supposed to be. The user has no idea that there was bad input. In fact, if you only look at the output, this is totally indistinguishable from the following input:

>>> bad2 = '''a,b
0,-1
1,2
3,4'''.splitlines()

I feel like I must be using genfromtxt wrong. How is it possible that it can't detect bad input?

There is 1 best solution below

In np.lib._iotools I found this method of the StringConverter class:

def _loose_call(self, value):
    try:
        return self.func(value)
    except ValueError:
        return self.default

When genfromtxt converts the values it has read, it does

if loose:
    rows = list(
        zip(*[[conv._loose_call(_r) for _r in map(itemgetter(i), rows)]
              for (i, conv) in enumerate(converters)]))

where loose is a genfromtxt parameter (it defaults to True). So for the int converter it tries

int(astring)

and if that produces a ValueError it returns the default value (e.g. -1) instead of raising an error. Similarly for float and np.nan.
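Since loose is an ordinary genfromtxt keyword, one way to surface the bad value is to switch it off. The following is a minimal sketch, assuming a NumPy version in which loose=False sends conversion through the strict path and raises ValueError for a string that cannot be converted. load_strict is a hypothetical helper name, not part of NumPy, and the exact error message may differ between versions.

```python
import numpy as np


def load_strict(lines):
    """Parse comma-separated integer columns, refusing unconvertible values.

    Hypothetical helper: it only forwards to np.genfromtxt with loose=False.
    """
    return np.genfromtxt(lines, delimiter=',', dtype=np.int64,
                         names=True, loose=False)


bad = ['a,b', '0,BAD', '1,2', '3,4']
good = ['a,b', '0,-1', '1,2', '3,4']

try:
    load_strict(bad)
    bad_was_detected = False
except ValueError as exc:
    # Assumption: strict conversion reports the offending string here.
    bad_was_detected = True
    print('rejected:', exc)

# A genuine -1 in the input still loads normally.
out = load_strict(good)
print(out['b'])
```

With this, the "BAD" input and the genuine -1 input are no longer indistinguishable: the first is rejected, the second loads.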

The usemask parameter is applied in:

        if usemask:
            append_to_masks(tuple([v.strip() in m
                                   for (v, m) in zip(values,
                                                     missing_values)]))
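To see why "BAD" is never masked, the membership test above can be mimicked in plain Python. This is a simplified sketch, assuming each column's set of missing markers contains only the empty string; mask_row is a hypothetical name, not a NumPy function.

```python
def mask_row(values, missing_values):
    """Return one mask flag per field: True only if the raw text is 'missing'."""
    return tuple(v.strip() in m for v, m in zip(values, missing_values))


# Assumption: one set of "missing" markers per column, empty string only.
missing_values = [{''}, {''}]

print(mask_row(['0', 'BAD'], missing_values))  # (False, False)
print(mask_row(['', ' '], missing_values))     # (True, True)

# Treating 'BAD' as a missing marker for column 1 would mask it.
print(mask_row(['0', 'BAD'], [{''}, {'', 'BAD'}]))  # (False, True)
```

The mask depends only on the raw text matching a missing marker, not on whether the conversion succeeded. That is the role of genfromtxt's missing_values argument.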

Define 2 converters to give more information on what's processed:

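def myint(astr):
    # Convert to int; report the raw text and return a sentinel on failure.
    try:
        v = int(astr)
    except ValueError:
        print('err', astr)
        v = -999
    return v

def myfloat(astr):
    # Same idea for floats, with -inf as the sentinel.
    try:
        v = float(astr)
    except ValueError:
        print('err', astr)
        v = float('-inf')
    return v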

A sample text:

txt='''1,2
3,nan
,foo
bar,
'''.splitlines()

And using the converters:

In [242]: np.genfromtxt(txt, delimiter=',', converters={0:myint, 1:myfloat})
err b''
err b'bar'
err b'foo'
err b''
Out[242]: 
array([(   1,   2.), (   3,  nan), (-999, -inf), (-999, -inf)],
      dtype=[('f0', '<i8'), ('f1', '<f8')])

And to see what usemask does:

In [243]: np.genfromtxt(txt, delimiter=',', converters={0:myint, 1:myfloat}, usemask=True)
err b''
err b'bar'
err b'foo'
err b''
Out[243]: 
masked_array(data=[(1, 2.0), (3, nan), (--, -inf), (-999, --)],
             mask=[(False, False), (False, False), ( True, False),
                   (False,  True)],
       fill_value=(999999, 1.e+20),
            dtype=[('f0', '<i8'), ('f1', '<f8')])

A missing value is an empty string '', and int('') raises a ValueError just as int('bad') does. So to the converter, whether the default one or one of my custom ones, a missing value looks the same as a bad one. Your own converter could make the distinction, but only 'missing' values set the mask.
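One way to make that distinction yourself is to scan the raw text before handing it to genfromtxt. The sketch below is only an illustration under stated assumptions: every column is meant to hold ints, fields are split on a plain delimiter with no quoting, and classify and find_problems are hypothetical names.

```python
def classify(cell):
    """Label a raw text cell as 'missing', 'ok' or 'bad' for an int column."""
    text = cell.strip()
    if text == '':
        return 'missing'
    try:
        int(text)
    except ValueError:
        return 'bad'
    return 'ok'


def find_problems(lines, delimiter=',', skip_header=0):
    """Yield (row_index, column_index, raw_text, label) for non-'ok' cells."""
    for r, line in enumerate(lines[skip_header:]):
        for c, cell in enumerate(line.split(delimiter)):
            label = classify(cell)
            if label != 'ok':
                yield (r, c, cell, label)


txt = ['1,2', '3,4', ',foo', 'bar,']
problems = list(find_problems(txt))
for problem in problems:
    print(problem)
```

Here the empty fields are reported as 'missing' and 'foo' and 'bar' as 'bad', so the caller can decide whether to mask, fill, or reject each case.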