Identifying homologous regions across genomes

47 Views Asked by At

I have six genomes. I am trying to identify regions of homology across these genomes. I have created a script below that uses nucmer across the 6 genomes, creating 15 .delta files. I ran into a few problems with concatenating the show-coords without removing the headers but I think I managed to fix it. I now get some errors relating to tab delimited but the file looks good. The error I get from running that script gives me this. I am by no means a computer scientist and am a wetlab but I try to dabble in it by using a few scripts. The first looping through the genomes took me probably a few hours. Thanks.

 against subject-file "output_5_6.ntref" \
COMPLETETIME /usr/bin/mummer output_5_6.ntref 6.26 \
SPACE /usr/bin/mummer output_5_6.ntref 9.56 4: FINISHING DATA \
Error: unable to open file or unable to determine types for file \
output_5_6_sorted_coords.txt

Please ensure that your file is TAB delimited (e.g., cat -t FILE).
Also ensure that your file has integer 
chromosome coordinates in the expected columns 
(e.g., cols 2 and 3 for BED). 

user@DESKTOP:~/pao$ cat -t output_5_6_sorted_coords.txt 

1004679 1071087 1741050 99.66 CPXXXXXX.1 CPXXXXXX.1 
1071043 1071447 1671795 89.90 CPXXXXXX.1 CPXXXXXX.1 
1077206 1079524 1670333 96.94 CPXXXXXX.1 CPXXXXXX.1 
1079618 1109035 1668024 99.27 CPXXXXXX.1 CPXXXXXX.1 
1109164 1121651 1638491 98.21 CPXXXXXX.1 CPXXXXXX.1 
1136749 1137105 1621353 91.62 CPXXXXXX.1 CPXXXXXX.1 
1138386 1138741 2546099 87.11 CPXXXXXX.1 CPXXXXXX.1 
1139723 1140751 1620579 88.19 CPXXXXXX.1 CPXXXXXX.1 
1144665 1145158 1624489 90.52 CPXXXXXX.1 CPXXXXXX.1 
1145284 1146239 1623990 89.45 CPXXXXXX.1 CPXXXXXX.1

This is the script that I ran. Resulted in the above error about TAB delimited.

# Loop through all pairs of genomes
for i in {1..6}; do
    for j in $(seq $((i + 1)) 6); do
        genome1="genome${i}.fasta"
        genome2="genome${j}.fasta"

        output_prefix="output_${i}_${j}"

        # Run NUCmer
        nucmer -p "${output_prefix}" "$genome1" "$genome2"

        # Show coordinates
        show-coords -r -l -c -T "${output_prefix}.delta" > "${output_prefix}_coords.txt"

        # Filter and format delta file
        delta-filter -r -q "${output_prefix}.delta" > "${output_prefix}_filtered.delta"

        # Check if the filtered.delta file is not empty
        if [ -s "${output_prefix}_filtered.delta" ]; then
            # Extract relevant columns from show-coords output and create a reformatted coords file
            show-coords -r -T "${output_prefix}_filtered.delta" | tail -n +6 | awk '{print $1, $2, $3, $7, $8, $9, $10, $11}' > "${output_prefix}_reformatted_coords.txt"

            # Remove header lines
            awk '!/^=/ {print}' "${output_prefix}_reformatted_coords.txt" > "${output_prefix}_reformatted_coords_no_header.txt"

            # Concatenate reformatted coords without headers, ensuring tab-delimited format
            cat -T "${output_prefix}_reformatted_coords_no_header.txt" >> all_coords_no_header.txt

            # Sort the reformatted coords
            sort -k1,1 -k2,2n "${output_prefix}_reformatted_coords_no_header.txt" > "${output_prefix}_sorted_coords.txt"

            # Merge overlapping regions, ensuring tab-delimited format
            bedtools merge -i "${output_prefix}_sorted_coords.txt" -c 1,2,3 -o distinct,count,distinct | sed 's/,/\t/g' > "${output_prefix}_merged_coords.txt"
        else
            echo "Error: Filtered delta file is empty or not generated successfully for ${genome1} and ${genome2}."
        fi

        # Cleanup intermediate files
        rm -f "${output_prefix}.delta"
    done
done
0

There are 0 best solutions below