Select records of specific data type from numpy recarray


I have a NumPy recarray whose fields have different data types (dtypes).

import numpy as np
a = np.array([1,2,3,4], dtype=int)
b = np.array([6,6,6,6], dtype=int)
c = np.array(['p', 'q', 'r', 's'], dtype=object)
d = np.array(['a', 'b', 'c', 'd'], dtype=object)

X = np.rec.fromarrays([a, b, c, d], names=['a', 'b', 'c', 'd'])
X

>>> rec.array([(1, 6, 'p', 'a'), (2, 6, 'q', 'b'), (3, 6, 'r', 'c'),
           (4, 6, 's', 'd')],
          dtype=[('a', '<i8'), ('b', '<i8'), ('c', 'O'), ('d', 'O')])

I tried to select the fields of object dtype using select_dtypes, but I get an AttributeError:

X.select_dtypes(include='object')

>>>AttributeError: recarray has no attribute select_dtypes

Is there an equivalent of the pandas select_dtypes method for NumPy recarrays that lets me select the columns (fields) of a specific data type?

hpaulj On
In [74]: X
Out[74]: 
rec.array([(1, 6, 'p', 'a'), (2, 6, 'q', 'b'), (3, 6, 'r', 'c'),
           (4, 6, 's', 'd')],
          dtype=[('a', '<i4'), ('b', '<i4'), ('c', 'O'), ('d', 'O')])

A recarray lets you access a field either as an attribute or by indexing with the field name:

In [75]: X.a
Out[75]: array([1, 2, 3, 4])    
In [76]: X['a']
Out[76]: array([1, 2, 3, 4])

In [77]: X.dtype.fields
Out[77]: 
mappingproxy({'a': (dtype('int32'), 0),
              'b': (dtype('int32'), 4),
              'c': (dtype('O'), 8),
              'd': (dtype('O'), 16)})

Testing the pandas approach by converting the recarray to a DataFrame:

In [78]: import pandas as pd

In [79]: df=pd.DataFrame(X)
In [80]: df
Out[80]: 
   a  b  c  d
0  1  6  p  a
1  2  6  q  b
2  3  6  r  c
3  4  6  s  d
In [83]: df.select_dtypes(include=object)
Out[83]: 
   c  d
0  p  a
1  q  b
2  r  c
3  s  d

Exploring the dtype of the recarray itself:

In [84]: X.dtype
Out[84]: dtype((numpy.record, [('a', '<i4'), ('b', '<i4'), ('c', 'O'), ('d', 'O')]))

In [85]: X.dtype.fields
Out[85]: 
mappingproxy({'a': (dtype('int32'), 0),
              'b': (dtype('int32'), 4),
              'c': (dtype('O'), 8),
              'd': (dtype('O'), 16)})

Checking the dtype of individual fields:

In [89]: X['a'].dtype
Out[89]: dtype('int32')    
In [90]: X['c'].dtype
Out[90]: dtype('O')    
In [91]: X['c'].dtype == object
Out[91]: True

So a list comprehension over the field names finds the object-dtype fields:

In [93]: [name for name in X.dtype.names if X[name].dtype==object]
Out[93]: ['c', 'd']
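
The comprehension only returns field names. To get something closer to what select_dtypes returns, the names can be used to index the recarray. The following is a minimal sketch, not a NumPy API: fields_of_kind is a hypothetical helper name, and it assumes at least one field matches. It uses np.issubdtype so that a general kind such as np.integer matches both int32 and int64 fields.

```python
import numpy as np

def fields_of_kind(arr, kind):
    """Return the names of the fields of `arr` whose dtype is a subtype of `kind`.

    Hypothetical helper for illustration; not part of NumPy.
    """
    # arr.dtype.fields maps each name to a tuple whose first item is the field dtype.
    return [name for name in arr.dtype.names
            if np.issubdtype(arr.dtype.fields[name][0], kind)]

# Rebuild the example recarray from the question.
a = np.array([1, 2, 3, 4], dtype=int)
b = np.array([6, 6, 6, 6], dtype=int)
c = np.array(['p', 'q', 'r', 's'], dtype=object)
d = np.array(['a', 'b', 'c', 'd'], dtype=object)
X = np.rec.fromarrays([a, b, c, d], names=['a', 'b', 'c', 'd'])

# Indexing with a list of field names selects just those fields.
obj_fields = X[fields_of_kind(X, np.object_)]
int_fields = X[fields_of_kind(X, np.integer)]

print(obj_fields.dtype.names)
print(int_fields.dtype.names)
```

In recent NumPy versions, indexing with a list of field names returns a view on the original data rather than a copy, so call .copy() on the result if an independent array is needed.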

df.select_dtypes is also written in Python, but it is fairly complex because it has to handle the include and exclude lists. In these timings the list comprehension is several times faster:

In [95]: timeit [name for name in X.dtype.names if X[name].dtype==object]
16.5 µs ± 269 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [96]: timeit df.select_dtypes(include=object)
110 µs ± 2.24 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)