I'm trying to speed up a pattern search-and-replace between two large text files (>10 MB). File1 has two columns, and every name in it is unique to its row. File2 has one column: it contains the shared names from File1, in no particular order, each followed by some text that spans a variable number of lines. They look something like this:
File1:
uniquename1 sharedname1
uniquename2 sharedname2
...
File2:
>sharedname45
dklajfwiffwf
flkewjfjfw
>sharedname196
lkdsjafwijwg
eflkwejfwfwf
weklfjwlflwf
My goal is to use File1 to replace the sharedname variables with their corresponding uniquename, as follows:
New File2:
>uniquename45
dklajfwiffwf
flkewjfjfw
>uniquename196
lkdsjafwijwg
eflkwejfwfwf
weklfjwlflwf
This is what I've tried so far:
while read -r uniquenames sharednames; do
    sed -i "s/$sharednames/$uniquenames/g" "$File2"
done < "$File1"
It works, but it's ridiculously slow on files this big. CPU usage is the rate-limiting step, so I tried to parallelize the modification across the 8 cores at my disposal, but couldn't get it to work. I also tried splitting File1 and File2 into smaller chunks and running the batches simultaneously, but I couldn't get that to work either. How would you implement this in parallel? Or do you see a different way of doing it?
Any suggestions would be welcome.
UPDATE 1
Fantastic! Great answers from @Cyrus and @JJoao, and helpful suggestions from other commenters. On @JJoao's recommendation I implemented both in my script to compare the compute times, and it's an improvement (~3 hours instead of ~5). However, this is just text file manipulation, so I don't see why it should take more than a couple of minutes. I'm still tinkering with the suggestions to make better use of the available CPUs and speed it up further.
UPDATE 2 (correction to UPDATE 1)
I had included the modifications in my script and run it as a whole, but another chunk of my code was slowing it down. So instead I ran the suggested bits of code individually on the target intermediary files. Here's what I saw:
Time for @Cyrus' sed to complete
real 70m47.484s
user 70m43.304s
sys 0m1.092s
Time for @JJoao's Perl script to complete
real 0m1.769s
user 0m0.572s
sys 0m0.244s
Looks like I'll be using the Perl script. Thanks for helping, everyone!
UPDATE 3
Here's the time taken by @Cyrus' improved sed command:
time sed -f <(sed -E 's|(.*) (.*)|s/^\2/>\1/|' File1 | tr "\n" ";") File2
real 21m43.555s
user 21m41.780s
sys 0m1.140s
I prefer @Cyrus's solution, but if you need to do this often, you can use the previous Perl script (chmod + install) as a dict-replacement command.
Usage:
dict-replacement File1 File* > output
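For readers without the Perl script at hand, here is a rough sketch of the same dictionary idea as a shell function wrapping awk. It is not the Perl script referred to above. The function name dict_replacement and the sample data are made up for illustration. It assumes File1 is whitespace-separated with the unique name first and the shared name second, that File1 is not empty, and that each header line in the other files is exactly ">" followed by the shared name. The mapping is loaded into a lookup table once, so each input file is read in a single pass instead of once per name.

```shell
#!/bin/sh
# Hypothetical stand-in for a dict-replacement command (an awk sketch,
# not the Perl script mentioned above).
# Usage: dict_replacement File1 File2 [more files...] > output
dict_replacement() {
    awk '
        # First file only: build the table sharedname -> uniquename.
        # (Assumes File1 is non-empty; otherwise NR == FNR would also
        # hold for the second file.)
        NR == FNR { map[$2] = $1; next }

        # Header lines: swap the whole name after ">" if it is in the table.
        # The lookup is an exact match, so sharedname1 cannot clobber
        # sharedname196.
        /^>/ {
            name = substr($0, 2)
            if (name in map) $0 = ">" map[name]
        }

        # Print every line of the remaining files, changed or not.
        { print }
    ' "$@"
}

# Tiny made-up sample to exercise the function.
tmp=$(mktemp -d)
printf '%s\n' 'uniquename45 sharedname45' 'uniquename196 sharedname196' > "$tmp/File1"
printf '%s\n' '>sharedname45' 'dklajfwiffwf' '>sharedname196' 'lkdsjafwijwg' > "$tmp/File2"

result=$(dict_replacement "$tmp/File1" "$tmp/File2")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Unlike the sed loop in the question, this writes to standard output rather than editing File2 in place, which matches the usage line above.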
It would be nice if you could tell us the timings of the various solutions...