I'm using Scanpy to analyze an integrated single-cell dataset composed of two different datasets.
In the default preprocessing workflow provided by Scanpy's authors, cells are filtered based on, among other criteria, their mitochondrial gene expression.
Next, the effects of total counts and mitochondrial gene expression are regressed out using sc.pp.regress_out, and the genes are then scaled to unit variance using sc.pp.scale.
Should these preprocessing steps, specifically regressing out and scaling, be applied to each dataset separately prior to integration? Or should they be performed after integration?
For example, running sc.pp.scale prior to integration will give the genes in both datasets similar distributions (zero mean and unit variance per gene), and thus remove possible differences between the datasets once they are integrated. So it seems that this step should be performed after integration.
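A toy illustration of this concern, using plain NumPy rather than Scanpy: one hypothetical gene is simulated in two datasets with different mean expression, and a simple z-score stands in for sc.pp.scale (which uses a slightly different variance estimator). The means, sizes, and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# One gene in two hypothetical datasets with different mean expression.
gene_a = rng.normal(loc=2.0, scale=1.0, size=500)
gene_b = rng.normal(loc=5.0, scale=1.0, size=500)


def zscore(x):
    """Center to zero mean and scale to unit variance."""
    return (x - x.mean()) / x.std()


# Scaling each dataset separately: both are centered on 0,
# so the between-dataset difference disappears.
gap_separate = abs(zscore(gene_a).mean() - zscore(gene_b).mean())

# Scaling after concatenation: the between-dataset shift is preserved.
joint = zscore(np.concatenate([gene_a, gene_b]))
gap_joint = abs(joint[:500].mean() - joint[500:].mean())

print(f"gap when scaled separately: {gap_separate:.3f}")
print(f"gap when scaled jointly:    {gap_joint:.3f}")
```

Whether the preserved gap is real biology or a batch effect is exactly what I cannot tell, which is why the order of operations matters here.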
However, since each dataset originally has a different number of sequenced genes, applying sc.pp.regress_out after integration seems like a mistake, because the total counts are affected by the number of genes sequenced in each dataset. So it seems that this step should be performed prior to integration.
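The concern about total counts can be sketched the same way. In this toy simulation (plain NumPy; the gene numbers, Poisson rate, and seed are assumptions), both datasets have the same per-gene expression level, yet total counts differ simply because one dataset measures twice as many genes, so in the pooled data the total-counts covariate is almost perfectly confounded with the dataset label.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 300

# Same per-gene expression rate, but a different number of genes measured.
counts_a = rng.poisson(1.0, size=(n_cells, 2000))
counts_b = rng.poisson(1.0, size=(n_cells, 1000))

# Per-cell total counts, as used for the regress_out covariate.
total_a = counts_a.sum(axis=1)
total_b = counts_b.sum(axis=1)
ratio = total_a.mean() / total_b.mean()

# In the pooled data, total counts track the dataset label almost exactly.
pooled_total = np.concatenate([total_a, total_b])
dataset_label = np.concatenate([np.zeros(n_cells), np.ones(n_cells)])
corr = np.corrcoef(pooled_total, dataset_label)[0, 1]

print(f"mean total counts ratio (A / B): {ratio:.2f}")
print(f"correlation of total counts with dataset label: {corr:.3f}")
```

Regressing out total counts on the pooled data would therefore also regress out much of the dataset identity, which may or may not be desirable.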