I have six genomes. I am trying to identify regions of homology across these genomes. I have created a script below that uses nucmer across the 6 genomes, creating 15 .delta files. I ran into a few problems with concatenating the show-coords without removing the headers but I think I managed to fix it. I now get some errors relating to tab delimited but the file looks good. The error I get from running that script gives me this. I am by no means a computer scientist and am a wetlab but I try to dabble in it by using a few scripts. The first looping through the genomes took me probably a few hours. Thanks.
against subject-file "output_5_6.ntref" \
COMPLETETIME /usr/bin/mummer output_5_6.ntref 6.26 \
SPACE /usr/bin/mummer output_5_6.ntref 9.56 4: FINISHING DATA \
Error: unable to open file or unable to determine types for file \
output_5_6_sorted_coords.txt
Please ensure that your file is TAB delimited (e.g., cat -t FILE).
Also ensure that your file has integer
chromosome coordinates in the expected columns
(e.g., cols 2 and 3 for BED).
user@DESKTOP:~/pao$ cat -t output_5_6_sorted_coords.txt
1004679 1071087 1741050 99.66 CPXXXXXX.1 CPXXXXXX.1
1071043 1071447 1671795 89.90 CPXXXXXX.1 CPXXXXXX.1
1077206 1079524 1670333 96.94 CPXXXXXX.1 CPXXXXXX.1
1079618 1109035 1668024 99.27 CPXXXXXX.1 CPXXXXXX.1
1109164 1121651 1638491 98.21 CPXXXXXX.1 CPXXXXXX.1
1136749 1137105 1621353 91.62 CPXXXXXX.1 CPXXXXXX.1
1138386 1138741 2546099 87.11 CPXXXXXX.1 CPXXXXXX.1
1139723 1140751 1620579 88.19 CPXXXXXX.1 CPXXXXXX.1
1144665 1145158 1624489 90.52 CPXXXXXX.1 CPXXXXXX.1
1145284 1146239 1623990 89.45 CPXXXXXX.1 CPXXXXXX.1
This is the script that I ran. Resulted in the above error about TAB delimited.
# Loop through all pairs of genomes
for i in {1..6}; do
for j in $(seq $((i + 1)) 6); do
genome1="genome${i}.fasta"
genome2="genome${j}.fasta"
output_prefix="output_${i}_${j}"
# Run NUCmer
nucmer -p "${output_prefix}" "$genome1" "$genome2"
# Show coordinates
show-coords -r -l -c -T "${output_prefix}.delta" > "${output_prefix}_coords.txt"
# Filter and format delta file
delta-filter -r -q "${output_prefix}.delta" > "${output_prefix}_filtered.delta"
# Check if the filtered.delta file is not empty
if [ -s "${output_prefix}_filtered.delta" ]; then
# Extract relevant columns from show-coords output and create a reformatted coords file
show-coords -r -T "${output_prefix}_filtered.delta" | tail -n +6 | awk '{print $1, $2, $3, $7, $8, $9, $10, $11}' > "${output_prefix}_reformatted_coords.txt"
# Remove header lines
awk '!/^=/ {print}' "${output_prefix}_reformatted_coords.txt" > "${output_prefix}_reformatted_coords_no_header.txt"
# Concatenate reformatted coords without headers, ensuring tab-delimited format
cat -T "${output_prefix}_reformatted_coords_no_header.txt" >> all_coords_no_header.txt
# Sort the reformatted coords
sort -k1,1 -k2,2n "${output_prefix}_reformatted_coords_no_header.txt" > "${output_prefix}_sorted_coords.txt"
# Merge overlapping regions, ensuring tab-delimited format
bedtools merge -i "${output_prefix}_sorted_coords.txt" -c 1,2,3 -o distinct,count,distinct | sed 's/,/\t/g' > "${output_prefix}_merged_coords.txt"
else
echo "Error: Filtered delta file is empty or not generated successfully for ${genome1} and ${genome2}."
fi
# Cleanup intermediate files
rm -f "${output_prefix}.delta"
done
done