I am trying to subtract one column from all the other columns in the dataframe (I have 500000 columns btw)


I tried this:

for col in cols1:
    reg_df[col] = reg_df[col].sub(reg_df['Intercept'])
    print(col)

Because I have 500,000 columns, it is taking forever: something like 220 hours. Is there any way to speed up the process?


There are 2 best solutions below

Panda Kim On

Code

Use broadcasting:

out = reg_df.sub(reg_df['Intercept'], axis=0)
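For a concrete picture of what axis=0 does here, below is a minimal sketch on a made-up three-row frame. The frame `small` and its columns `a` and `b` are hypothetical and only for illustration; they are not from the question.

```python
import pandas as pd

# Hypothetical tiny frame: 'Intercept' plus two other columns.
small = pd.DataFrame({"Intercept": [1, 2, 3], "a": [10, 20, 30], "b": [5, 5, 5]})

# axis=0 aligns the Series with the row index, so each row's Intercept
# value is subtracted from every column in that row, all in one call.
result = small.sub(small["Intercept"], axis=0)
print(result)
```

Note that the Intercept column itself comes out as all zeros, because it is subtracted from itself. Without axis=0, pandas would try to align the Series index against the column labels rather than the rows, which is not what is wanted here.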

Sample

import pandas as pd
import numpy as np
np.random.seed(0)
reg_df = pd.DataFrame(np.random.randint(0, 10, (10, 500000))).rename({0: 'Intercept'}, axis=1)

  1. Vectorized operation

import time

start = time.time()

out = reg_df.sub(reg_df['Intercept'], axis=0)

end = time.time()

print(f"{end - start:.5f} sec")

time:

0.01237 sec



  2. Your for loop

start = time.time()

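# Exclude 'Intercept' itself; otherwise it is zeroed on the first pass
# and every later subtraction would subtract 0.
cols1 = reg_df.columns.drop('Intercept')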
for col in cols1:
    reg_df[col] = reg_df[col].sub(reg_df['Intercept'])

end = time.time()

print(f"{end - start:.5f} sec")

time:

It has been 10 minutes and it is still not finished. 

Use vectorized operations.

jri On

You can use a vectorized approach for this. The positional slice assumes 'Intercept' is the first column:

import pandas as pd

reg_df.iloc[:, 1:] = reg_df.iloc[:, 1:].sub(reg_df['Intercept'], axis=0)
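If 'Intercept' is not guaranteed to be the first column, one possible variant is to select the other columns by label instead of by position. This is only a sketch: the sample frame is a smaller version of the one in the other answer (1,000 columns, to keep it quick), and the names `others`, `intercept` and `out` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Smaller stand-in for the sample frame used in the other answer.
np.random.seed(0)
reg_df = pd.DataFrame(np.random.randint(0, 10, (10, 1000))).rename({0: "Intercept"}, axis=1)

# Every column label except 'Intercept', wherever that column sits.
others = reg_df.columns.drop("Intercept")
intercept = reg_df["Intercept"]

# One vectorized subtraction across all remaining columns.
out = reg_df[others].sub(intercept, axis=0)

# Put the untouched Intercept column back at the front.
out.insert(0, "Intercept", intercept)
```

This builds a new frame rather than assigning back into reg_df, so the original stays available for checking the result.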