Python / ete3: Target most closely related leaf to a specific species in a phylogenetic tree

722 Views Asked by At

I am using the Python package ete3. I have trees such as:

((Species1_order1,(Species2_order2,Species3_order2)),Species4_order3,Species5_order5);

I would like to see the most closely related leaf to a specific node in the tree (here the tree is Species1_order1). In the example, the most closely related leaves are Species2_order2/ Species3_order2, and Species4_order3/Species5_order5.

Code:

tree = ete3.Tree('((Species1_order1, \
                    (Species2_order2, Species3_order2)), \
                   Species4_order3, Species5_order5);')

New example :

    tree=ete3.Tree('((((((A,B),C),D),(E,F)),G),(H,I));')

The result I get is :

         A    B    C    D    E    F    G    H    I
    A  0.0  2.0  3.0  4.0  6.0  6.0  6.0  8.0  8.0
    B  2.0  0.0  3.0  4.0  6.0  6.0  6.0  8.0  8.0
    C  3.0  3.0  0.0  3.0  5.0  5.0  5.0  7.0  7.0
    D  4.0  4.0  3.0  0.0  4.0  4.0  4.0  6.0  6.0
    E  6.0  6.0  5.0  4.0  0.0  2.0  4.0  6.0  6.0
    F  6.0  6.0  5.0  4.0  2.0  0.0  4.0  6.0  6.0
    G  6.0  6.0  5.0  4.0  4.0  4.0  0.0  4.0  4.0
    H  8.0  8.0  7.0  6.0  6.0  6.0  4.0  0.0  2.0
    I  8.0  8.0  7.0  6.0  6.0  6.0  4.0  2.0  0.0

But for instance E and F have an equaly distance to A,B,C and D in the tree and in the result they appear to be clother to D.

A good matrix result should rather be :


        A   B   C   D   E   F   G   H   I
    A   0   1   2   3   4   4   5   6   6
    B   1   0   2   3   4   4   5   6   6
    C   2   2   0   3   4   4   5   6   6
    D   3   3   3   0   4   4   5   6   6
    E   4   4   4   4   0   1   5   6   6
    F   4   4   4   4   1   0   5   6   6
    G   5   5   5   5   5   5   0   6   6
    H   6   6   6   6   6   6   6   0   1
    I   6   6   6   6   6   6   6   1   0

is not it ?

2

There are 2 best solutions below

4
On BEST ANSWER

As discussed in the comments, ete3 gives us a function called Tree.get_closest_leaf, but it's output is not what is expected (and I am not sure what this value even represents here):

>>> t=ete3.Tree('((Species1_order1,(Species2_order2,Species3_order2)),Species4_order3,Species5_order5);')
>>> t.get_closest_leaf('Species2_order2')
(Tree node 'Species4_order3' (0x115b2f29), 0.0)

Instead, you can get the node distance like this:

import ete3
import pandas as pd

def make_matrix(tree):
    def get_root_path(node):
        root_path = [node]
        if node.up:
            root_path.extend(get_root_path(node.up))
        return root_path
    leaves = tree.get_leaves()
    leaf_ct = len(leaves)
    paths = {node.name: set(get_root_path(node)) for node in leaves}
    col_lbls = [leaf.name for leaf in leaves]
    dist_matrix = pd.np.array([pd.np.zeros(leaf_ct)] * leaf_ct)
    df = pd.DataFrame(dist_matrix, index=col_lbls, columns=col_lbls)
    for node1_name, col in df.iteritems():
        for node2_name in col.keys():
            path = paths[node2_name].symmetric_difference(paths[node1_name])
            dist = sum(node.dist for node in path)
            df.at[node1_name, node2_name] = dist
            df.at[node2_name, node1_name] = dist
    return df

Note: This is a suboptimal solution for several reasons, but this question is not asking for the most most efficient solution. see this link for much more information about phylogenetic distance matrix methods.

This solution also uses pandas which is overkill, since it is really just for the convenience of row/column labels. It would not be difficult to remove the pandas dependencies and do it with native lists instead.

Here is the output:

>>> tree=ete3.Tree('((Species1_order1, (Species2_order2, Species3_order2)), Species4_order3, Species5_order5);')
>>> make_matrix(tree)
                 Species1_order1  Species2_order2  Species3_order2  Species4_order3  Species5_order5
Species1_order1              0.0              3.0              3.0              3.0              3.0
Species2_order2              3.0              0.0              2.0              4.0              4.0
Species3_order2              3.0              2.0              0.0              4.0              4.0
Species4_order3              3.0              4.0              4.0              0.0              2.0
Species5_order5              3.0              4.0              4.0              2.0              0.0

For the updates posted, I am not seeing anything wrong. It appears to give correct results. Here is the tree as rendered by ete3 (I highlighted the 4 hops that are counted in the distance from Interest_sequence to Rhopalosiphum_maidis_Hemiptera):

and here is the matrix column for Interest_sequence that corresponds to it:

>>> m['Interest_sequence']
Rhopalosiphum_maidis__Hemiptera            4.0
Drosophila_novamexicana__Hemiptera         5.0
Drosophila_arizonae__Hemiptera             6.0
Drosophila_navojoa__Hemiptera              6.0
Interest_sequence                          0.0
Heliothis_virescens_droso_3a__nan          5.0
Mythimna_separata_droso__nan               6.0
Heliothis_virescens_droso_3i__nan          6.0
Scaptodrosophila_lebanonensis__Diptera     5.0
Mythimna_unipuncta_droso_A__nan            6.0
Xestia_c-nigrum_droso__nan                 8.0
Helicoverpa_armigera_droso__nan            8.0
Mocis_latipes_droso__nan                   7.0
Drosophila_busckii__Diptera                4.0
Drosophila_bipectinata__Diptera            5.0
Drosophila_mojavensis__Diptera             7.0
Drosophila_yakuba__Diptera                 7.0
Drosophila_hydei__Diptera                  7.0
Drosophila_serrata__Diptera                8.0
Drosophila_takahashii__Diptera             9.0
Drosophila_eugracilis__Diptera            11.0
Drosophila_ficusphila__Diptera            11.0
Drosophila_erecta__Diptera                12.0
Drosophila_melanogaster__Diptera          13.0
Sequence_A_nan__nan                       14.0
Drosophila_sechellia__Diptera             15.0
Drosophila_simulans__Diptera              15.0
Drosophila_suzukii__Diptera               12.0
Drosophila_biarmipes__Diptera             12.0
Name: Interest_sequence, dtype: float64
2
On

The ete3 function works as expected, you just need to iterate over the nodes of the trees as follows:

from ete3 import Tree t = Tree() # Creates an empty tree 
A = t.add_child(name="A") # Adds a new child to the current tree root and returns it 
B = t.add_child(name="B") # Adds a second child to the current tree root and returns it 
C = A.add_child(name="C") # Adds a new child to one of the branches D =
C.add_sister(name="D") # Adds a second child to same branch as before, but using a sister as the starting point 
R = A.add_child(name="R") # Adds a third child to the branch. Multifurcations are supported
# Next, I add 6 random leaves to the R branch names_library is an optional argument. If no names are provided, they will be generated randomly. 
R.populate(6, names_library=["r1","r2","r3","r4","r5","r6"])
for node in t.traverse():
    print(node.get_closest_leaf())

NOTE: if the internal node is a terminal node, it will have 2 closest leaves, get them with node.get_children()