numpy: creating recarray fast with different column types

I am trying to create a recarray from a series of numpy arrays with column names and mixed variable types.

The following works but is slow:

    import numpy as np
    a = np.array([1,2,3,4], dtype=np.int64)
    b = np.array([6,6,6,6], dtype=np.int64)
    c = np.array([-1.,-2.-1.,-1.], dtype=np.float32)
    d = np.array(list(zip(a,b,c)), dtype=[('a',np.int64),('b',np.int64),('c',np.float32)])
    d = d.view(np.recarray)

I think there should be a way to do this with np.stack((a,b,c), axis=-1), which is faster than the list(zip()) method. However, there does not seem to be a trivial way to do the stacking while preserving the column types. This link does seem to show how to do it, but it's pretty clunky and I hope there is a better way.

Thanks for the help!

Paul Panzer On

np.rec.fromarrays is probably what you want:

>>> np.rec.fromarrays([a, b, c], names=['a', 'b', 'c'])
rec.array([(1, 6, -1.), (2, 6, -2.), (3, 6, -1.), (4, 6, -1.)],
          dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<f4')])
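
For completeness, here is a minimal self-contained sketch of the call above, assuming the three input arrays from the question (with the typo in c corrected and np.int64 used in place of the removed np.int alias). The variable name rec is only illustrative.

```python
import numpy as np

# Input columns with mixed types (int64, int64, float32).
a = np.array([1, 2, 3, 4], dtype=np.int64)
b = np.array([6, 6, 6, 6], dtype=np.int64)
c = np.array([-1., -2., -1., -1.], dtype=np.float32)

# Build the record array; each column keeps its own dtype.
rec = np.rec.fromarrays([a, b, c], names=['a', 'b', 'c'])

# Fields are reachable both as attributes and by name.
print(rec.a)         # the integer column
print(rec['c'])      # the float32 column
print(rec.dtype)     # per-field dtypes are preserved
```

Because the result is already a recarray, no extra .view(np.recarray) step should be needed.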
hpaulj On

Here's the field-by-field approach that I commented on:

In [308]: a = np.array([1,2,3,4], dtype=np.int64)
     ...: b = np.array([6,6,6,6], dtype=np.int64)
     ...: c = np.array([-1.,-2.,-1.,-1.], dtype=np.float32)
     ...: dt = np.dtype([('a',np.int64),('b',np.int64),('c',np.float32)])

(I had to correct your copy-n-pasted c).

In [309]: arr = np.zeros(a.shape, dtype=dt)
In [310]: for name, x in zip(dt.names, [a,b,c]):
     ...:     arr[name] = x
     ...:     
In [311]: arr
Out[311]: 
array([(1, 6, -1.), (2, 6, -2.), (3, 6, -1.), (4, 6, -1.)],
      dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<f4')])

Since the array will typically have many more records (rows) than fields, this should be faster than the list-of-tuples approach. In this case, with only four rows, it is probably comparable in speed.
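
One way to check the speed claim on your own machine is a rough timing sketch like the one below. The array size n, the helper names by_field and by_tuples, and the repeat count are arbitrary choices for illustration, not part of the original answer; actual timings depend on the NumPy version and hardware, so none are quoted here.

```python
import timeit
import numpy as np

n = 50_000  # arbitrary number of rows, chosen only for illustration
a = np.arange(n, dtype=np.int64)
b = np.full(n, 6, dtype=np.int64)
c = np.linspace(-1.0, 1.0, n).astype(np.float32)
dt = np.dtype([('a', np.int64), ('b', np.int64), ('c', np.float32)])

def by_field():
    # Allocate once, then copy one whole column per iteration.
    arr = np.zeros(a.shape, dtype=dt)
    for name, x in zip(dt.names, [a, b, c]):
        arr[name] = x
    return arr

def by_tuples():
    # Build one Python tuple per row, then convert.
    return np.array(list(zip(a, b, c)), dtype=dt)

# Both approaches should produce identical structured arrays.
assert (by_field() == by_tuples()).all()

print("field by field:", timeit.timeit(by_field, number=3))
print("list of tuples:", timeit.timeit(by_tuples, number=3))

# The structured array can be viewed as a recarray without copying.
rec = by_field().view(np.recarray)
```

The field loop runs once per column, while the tuple approach does Python-level work once per row, which is why the gap is expected to grow with the number of rows.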

In [312]: np.array(list(zip(a,b,c)), dtype=dt)
Out[312]: 
array([(1, 6, -1.), (2, 6, -2.), (3, 6, -1.), (4, 6, -1.)],
      dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<f4')])

rec.fromarrays, after some setup to determine the dtype, does:

_array = recarray(shape, descr)
# populate the record array (makes a copy)
for i in range(len(arrayList)):
    _array[_names[i]] = arrayList[i]

The only way to use stack is to create recarrays first:

In [315]: [np.rec.fromarrays((i,j,k), dtype=dt) for i,j,k in zip(a,b,c)]
Out[315]: 
[rec.array((1, 6, -1.),
           dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<f4')]),
 rec.array((2, 6, -2.),
           dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<f4')]),
 rec.array((3, 6, -1.),
           dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<f4')]),
 rec.array((4, 6, -1.),
           dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<f4')])]
In [316]: np.stack(_)
Out[316]: 
array([(1, 6, -1.), (2, 6, -2.), (3, 6, -1.), (4, 6, -1.)],
      dtype=(numpy.record, [('a', '<i8'), ('b', '<i8'), ('c', '<f4')]))
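
If you do want to start from np.stack((a,b,c), axis=-1) as in the question, a different route (not the one shown above) is numpy.lib.recfunctions.unstructured_to_structured, available in NumPy 1.16 and later. The sketch below assumes the same a, b, c and dt as before. Note that stacking mixed columns first promotes everything to a common dtype (float64 here), so integers larger than 2**53 would lose precision on the round trip, and no claim is made here that this is faster than the field-by-field copy.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

a = np.array([1, 2, 3, 4], dtype=np.int64)
b = np.array([6, 6, 6, 6], dtype=np.int64)
c = np.array([-1., -2., -1., -1.], dtype=np.float32)
dt = np.dtype([('a', np.int64), ('b', np.int64), ('c', np.float32)])

# Stacking promotes all columns to one common dtype (float64 here).
stacked = np.stack((a, b, c), axis=-1)   # shape (4, 3)

# Convert the last axis into fields, casting each column back to its dtype.
arr = rfn.unstructured_to_structured(stacked, dtype=dt)

# View the structured array as a recarray for attribute access.
rec = arr.view(np.recarray)
print(rec.a, rec.b, rec.c)
```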