|
version 3.6
TREEDIST -- distances between trees
© Copyright 2002 by The University of
Washington. Written by Joseph Felsenstein. Permission is granted to copy
this document provided that no fee is charged for it and that this copyright
notice is not removed.
This program computes distances between trees. The distance that is
computed is the Symmetric Distance of Robinson and Foulds (1981). This
does not use branch length information, only the tree topologies. It
must also be borne in mind that the distance does not have any immediate
statistical interpretation -- we cannot say whether a larger distance is
significantly larger than a smaller one.
The Symmetric Distance is computed by considering each of the branches of
the two trees. Each branch divides the set of species into two groups --
the ones connected to one end of the branch and the ones connected to the
other. This makes a partition of the full set of species. (in Newick notation)
((A,C),(D,(B,E)))
has two internal branches. One induces the partition {A, C | B, D, E}
and the other induces the partition {A, C, D | B, E}. A different tree
with the same set of species,
(((A,D),C),(B,E)))
has internal branches that correspond to the two partitions {A, C, D | B, E}
and {A, D | B, C, E}. Note that the other branches, all of which are
external branches, induce partitions that separate one species from all the
others. Thus there are 5 partitions like this: {C | A, B, D, E} on each
of these trees. These are always present on all trees, provided that each
tree has each species at the end of its own branch.
The Symmetric Distance is simply a count of how many partitions there are,
among the two trees, that are on one tree and not on the other. In the
example above there are two partitions, {A, C | B, D, E} and {A, D | B, C, E},
each of which is present on only one of the two trees. The Symmetric
Distance between the two trees is therefore 2. When the two trees are
fully resolved bifurcating trees, their symmetric distance must be an even
number; it can range from 0 to twice the number of internal branches, which
for n species is 4n-6.
We have assumed that nothing is lost if the trees are treated as unrooted trees.
It is easy to define a counterpart to the Symmetric Distance for rooted trees.
each branch then defines a set of species, namely the clade defined by that
branch. Thus if the first of the two trees above were considered as a rooted
tree it would define the three clades {A, C}, {B, D, E}, and {B, E}. The
symmetric distance between two rooted trees is simply the count of the number
of clades that are defined by one but not by the other. For the second tree
the clades would be {A, D}, {B, C, E}, and {B, E}. The Symmetric Distance
between thee two rooted trees would then be 4.
Although the examples we have discussed have involved fully
bifurcating trees, the input trees can have multifurcations.
This can lead to distances that are odd numbers.
INPUT AND OPTIONS
The program reads one or two input tree files. If there is one input tree
file, its default name is intree. If there are two their default
names are intree and intree2. The tree files may either
have the number of trees on their first line, or not. If the number of
trees is given, it is actually ignored and all trees in the tree file
are considered, even if there are more trees than indicated by the number.
(This is a bug and it will be fixed in the future).
The options are selected from a menu, which looks like this:
Tree distance program, version 3.6a3
Settings for this run:
O Outgroup root: No, use as outgroup species 1
R Trees to be treated as Rooted: No
T Terminal type (IBM PC, ANSI, none): (none)
1 Print indications of progress of run: Yes
2 Tree distance submenu: Distance between adjacent pairs
Are these settings correct? (type Y or the letter for one to change)
|
The O option allows you to root the trees using an outgroup. It is specified
by giving its number, where the species are numbered in the order they
appear in the first tree. Outgroup-rooting all the trees does not
affect the unrooted Symmetric Distance, and if it is done and trees are
treated as rooted, the distances turn out to be the same as the unrooted
ones. Thus it is unlikely that you will find this option of interest.
The R option controls whether the Summetric Distance that is computed is
to treat the trees as unrooted or rooted. Unrooted is the default.
The terminal type (0) and progress (1) options do not need description here.
Option 2 controls how many tree files are read in, which trees are to
be compared, and how the output is to be presented. It causes
another menu to appear:
Tree Pairing Submenu:
A Distances between adjacent pairs in tree file.
P Distances between all possible pairs in tree file.
C Distances between corresponding pairs in one tree file and another.
L Distances between all pairs in one tree file and another.
|
Option A computes the distances between successive pairs of trees in the
tree input file -- between trees 1 and 2, trees 3 and 4, trees
5 and 6, and so on. If there are an odd number of trees in the input tree
file the last tree will be ignored and a warning message printed to
remind the user that nothing was done with it.
Option P computes distances between all pairs of trees in the input tree
file. Thus with 10 trees 10 x 10 = 100 distances will be computed,
including distances between each tree and itself.
Option C takes input from two tree files and cmputes distances between
corresponding members of the two tree files. Thus distances will be
computed between tree 1 of the first tree file and tree 1 of the second one,
between tree 2 of the first file and tree 2 of the second one, and so on.
If the number of trees in the two files differs, the extra trees in the
file that has more of them are ignored and a warning is printed out.
Option L computes distances between all pairs of trees, where one tree is
taken from one tree file and the other from the other tree file. Thus if
the first tree file has 7 trees and the second has 5 trees, 7 x 5 = 35
different distances will be computed. Note -- this option seems not
to work at the moment. We hope to fix this soon.
If option 2 is not selected, the program defaults to looking at one tree
file and computing distances of adjacent pairs (so that option A is
the default).
OUTPUT
The results of the analysis are written onto an output file whose
default file name is outfile.
If any of the four types of analysis are selected, the program asks the
user how they want the results presented. Here is that menu for options
P or L:
Distances output options:
F Full matrix.
V One pair per line, verbose.
S One pair per line, sparse.
Choose one: (F,V,S)
|
The Full matrix (choice F) is a table showing all distances. It is
written onto the output file. The table is presented as groups of
10 columns. Here is the Full matrix for the 12 trees in the input
tree file which is given as an example at the end of this page.
Tree distance program, version 3.6
Symmetric differences between all pairs of trees in tree file:
1 2 3 4 5 6 7 8 9 10
\------------------------------------------------------------
1 | 0 4 2 10 10 10 10 10 10 10
2 | 4 0 2 10 8 10 8 10 8 10
3 | 2 2 0 10 10 10 10 10 10 10
4 | 10 10 10 0 2 2 4 2 4 0
5 | 10 8 10 2 0 4 2 4 2 2
6 | 10 10 10 2 4 0 2 2 4 2
7 | 10 8 10 4 2 2 0 4 2 4
8 | 10 10 10 2 4 2 4 0 2 2
9 | 10 8 10 4 2 4 2 2 0 4
10 | 10 10 10 0 2 2 4 2 4 0
11 | 2 2 0 10 10 10 10 10 10 10
12 | 10 10 10 2 4 2 4 0 2 2
11 12
\------------
1 | 2 10
2 | 2 10
3 | 0 10
4 | 10 2
5 | 10 4
6 | 10 2
7 | 10 4
8 | 10 0
9 | 10 2
10 | 10 2
11 | 0 10
12 | 10 0
|
The Full matrix is only available for analyses P and L (not for A or C).
Option V (Verbose) writes one distance per line. The Verbose
output is the default. Here it is for the example data set given below:
Tree distance program, version 3.6a3
Symmetric differences between adjacent pairs of trees:
Trees 1 and 2: 4
Trees 3 and 4: 10
Trees 5 and 6: 4
Trees 7 and 8: 4
Trees 9 and 10: 4
Trees 11 and 12: 10
|
Option S (Sparse or terse) is similar except that all that is
given on each line are the numbers of the two trees and the distance,
separated by blanks. This may be a convenient format if you want to
write a program to read these numbers in, and you want to spare yourself
the effort of having the program wade through the words on each line
in the Verbose output.
The first four lines of the Sparse output are titles that your program would
want to skip past. Here is the Sparse output for the example trees.
Tree distance program, version 3.6
Symmetric differences between adjacent pairs of trees:
1 2 4
3 4 10
5 6 4
7 8 4
9 10 4
11 12 10
|
CREDITS AND FUTURE
TREEDIST was written by Dan Fineman. In the future we hope to expand it
to consider a distance based on branch lengths as well as tree topologies.
The Branch Score distance defined by Kuhner and Felsenstein (1994) is
the one we have in mind (the Branch Score defined by them is actually
the square of the distance). We also hope to compute a distance based on
quartets shared and not shared by trees (implicit in the work of Estabrook, McMorris, and
Meacham, 1985).
TEST DATA SET
(A,(B,(H,(D,(J,(((G,E),(F,I)),C))))));
(A,(B,(D,((J,H),(((G,E),(F,I)),C)))));
(A,(B,(D,(H,(J,(((G,E),(F,I)),C))))));
(A,(B,(E,(G,((F,I),((J,(H,D)),C))))));
(A,(B,(E,(G,((F,I),(((J,H),D),C))))));
(A,(B,(E,((F,I),(G,((J,(H,D)),C))))));
(A,(B,(E,((F,I),(G,(((J,H),D),C))))));
(A,(B,(E,((G,(F,I)),((J,(H,D)),C)))));
(A,(B,(E,((G,(F,I)),(((J,H),D),C)))));
(A,(B,(E,(G,((F,I),((J,(H,D)),C))))));
(A,(B,(D,(H,(J,(((G,E),(F,I)),C))))));
(A,(B,(E,((G,(F,I)),((J,(H,D)),C)))));
|
The output from default settings for this test set is given above (it is the
Verbose output example).
|