I'm looking for an efficient way to create a "parametrized" design matrix for genotypes in R. I have a big file (about 3 gb), with animals and their genotypes. Sample data looks like this:
snp id a1 a2 code
snp1 an1 A A 0
snp1 an2 A B 1
snp1 an3 B B -1
snp2 an1 A B 1
snp2 an2 A A 0
snp2 an3 B B -1
snp is name of snp (each animal have each snp), id is animal's id (each animal has unique id), a1 is allele 1, a2 is allele 2, code denotes genotype based on alleles, if animal has two A's it's code is 0, if animal has AB, it's code is -1, and if it's BB the code is 1.
Now I need to create based on that design matrix, which in row's will have animal's (id column in data), and in columns SNP's (snp column in data) and in the "cell" (at the intersection of the row and column) I need value from code column. So at the end, it should look like that:
an1 0 1
an2 1 0
an3 -1 -1
I know that in the case of efficiency and speed R has a limitation, but still, I need the fastest solution for this I can obtain in R.
Usually the data.table package is pretty performant in these type of cases. Example below:
Created on 2021-10-13 by the reprex package (v2.0.1)
If you need the output as a matrix you could use:
For a ~3Gb file you might expect this to run for about 10 seconds:
Created on 2021-10-13 by the reprex package (v2.0.1)