build design matrix python


Suppose I have an R x C contingency table, i.e. one with R rows and C columns. I want a matrix, X, of dimension RC × (R + C − 2) that contains the R − 1 “main effects” for the rows and the C − 1 “main effects” for the columns. For example, if you have R = C = 2 (R = [0, 1], C = [0, 1]) and main effects only, there are various ways to parameterize the design matrix (X), but below is one way:

1 0
0 1
1 0
0 0

Note that this is 4 x 2 = RC x (R + C - 2), because you omit one level of the rows and one level of the columns.

How can I do this in Python for any values of R and C, e.g. R = 3, C = 4 ([0 1 2] and [0 1 2 3])? I only have the values of R and C, but I can use them to construct arrays using np.arange(R) and np.arange(C).


There are 2 best solutions below

On BEST ANSWER

The following should work:

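import numpy as np

R = 3
C = 2

# Indicator for row 0: ones across the first row, flattened in row-major order
ir = np.zeros((R, C))
ir[0, :] = 1
ir = ir.ravel()

mat = []
for i in range(R):
    mat.append(ir)
    ir = np.roll(ir, C)  # shift by one full row to get the next row's indicator

# Indicator for column 0: ones down the first column, flattened
ic = np.zeros((R, C))
ic[:, 0] = 1
ic = ic.ravel()

for i in range(C):
    mat.append(ic)
    ic = np.roll(ic, 1)  # shift by one cell (not R) to get the next column's indicator

# Stack the R + C indicator vectors as columns of an RC x (R + C) matrix
mat = np.asarray(mat).T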

and the result is:

array([[ 1.,  0.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  0.,  1.],
       [ 0.,  1.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  1.],
       [ 0.,  0.,  1.,  1.,  0.],
       [ 0.,  0.,  1.,  0.,  1.]])
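
Note that the result above has R + C columns, one indicator per level, while the question asks for R + C − 2 columns. A minimal sketch of one way to get there is to drop one indicator column from each group. The function name `main_effects_design` is hypothetical, and the sketch assumes the cells are ordered row-major and that level 0 is the one omitted from both the rows and the columns; other parameterizations are equally valid.

```python
import numpy as np

def main_effects_design(R, C):
    """Hypothetical helper: RC x (R + C - 2) main-effects indicator matrix.

    Assumes row-major cell order and drops level 0 of rows and of columns.
    """
    rows = np.repeat(np.arange(R), C)  # row index of each cell: 0,0,...,1,1,...
    cols = np.tile(np.arange(C), R)    # column index of each cell: 0,1,...,0,1,...

    row_ind = np.eye(R)[rows]          # RC x R one-hot indicators for the rows
    col_ind = np.eye(C)[cols]          # RC x C one-hot indicators for the columns

    # Drop the first level of each factor so R - 1 and C - 1 columns remain
    return np.hstack([row_ind[:, 1:], col_ind[:, 1:]])

X = main_effects_design(3, 4)
print(X.shape)  # (12, 5)
```

Dropping a different level only changes which level serves as the reference; the shape stays RC x (R + C − 2).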

Thanks everyone for your help!

On

Use LabelBinarizer or One-Hot Encoding to create a design matrix

Since all the labels are in the same column, we can use scikit-learn's preprocessing module, which provides LabelBinarizer and OneHotEncoder. These convert the labels in a single column into multiple indicator columns, putting a 1 in the column of the label that occurs in each row.

Example:

NA
PA
PD
NA

After LabelBinarizer:

NA PA PD
1  0  0
0  1  0
0  0  1
1  0  0
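
Applied to the contingency table from the question, the one-hot approach could look roughly like the sketch below. It assumes scikit-learn is installed, builds the row and column labels with np.repeat and np.tile (an assumption about cell ordering, not something stated in this answer), and uses OneHotEncoder with drop="first" so that one level of each factor is omitted.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

R, C = 3, 4

# One (row, column) label pair per cell of the table, in row-major order
rows = np.repeat(np.arange(R), C)
cols = np.tile(np.arange(C), R)
labels = np.column_stack([rows, cols])  # shape (R*C, 2)

# drop="first" omits the first level of each factor,
# leaving (R - 1) + (C - 1) indicator columns
encoder = OneHotEncoder(drop="first")
X = encoder.fit_transform(labels).toarray()  # the default output is sparse

print(X.shape)  # (12, 5)
```

OneHotEncoder is used here rather than LabelBinarizer because it encodes both factors in one call and has a built-in option for dropping a level.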