Phylogenetic trees are a specialization of hierarchical clustering which elegantly capture relatedness between observations, grouping like with like. Yet hierarchical clusterings have one common complaint, as compared to density/distribution based clustering, the ability to classify the data into different types. Finding a meaningful cut of the tree into subtrees remains an open question in the field The prolific Mattias Prosperi (9 publications in the first 9 months of 2012) has proposed a method for automatically partitioning phylogenetic trees of pathogens into transmission clusters. The intuition is that a group of pathogens represent a transmission cluster if their sequences are monophyletic and more closely related than those from two randomly selected individuals. To capture this intuition, he proposes a depth first search of the phylogeny. A node is classified as a cluster if the median pairwise patristic distance (MPPD) between the nodes of its subtree is below a threshold. Since phylogenetic trees often come with some measurable uncertainty, he further suggests that the node should have high reliability (bootstrap support, SH test, ...). So let's do it. Fast and dirty. In R. [The code is available on Figshare, along with sample data.]We'll need the ape and geiger packages to work with the phylogenies, and igraph makes it easy to get the nodes in depth first search order.
require(ape, quietly=TRUE)
require(geiger, quietly=TRUE)
require(igraph, quietly=TRUE)
|
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
| ## Given
## a node, tree, and distance matrix
## Return
## median pairwise patristic distance (MPPD) of its leaves
get.node.leaf.MPPD <- function(node,tree,distmat){
nlist <- node.leaves(tree,node)
foo <- distmat[nlist,nlist]
return(median(foo[upper.tri(foo,diag=FALSE)]))
}
## Given
## a node, tree, and distance matrix
## Return
## median pairwise patristic distance (MPPD) of all of its decendants
get.node.full.MPPD <- function(node,tree,distmat){
nlist <- node.leaves(tree,node)
elist <- tree$edge[which.edge(tree,nlist),2]
foo <- distmat[elist,elist]
return(median(foo[upper.tri(foo,diag=FALSE)]))
}
|
and a helper function to call them:
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
| ## Given
## a tree and (optionally) a distance matrix
## Return
## a vector giving the median pairwise
## patristic distance of the subtree under
## each internal node
## SLOW!! May be a good idea to save/cache results
pdist.clusttree <- function(tree,distmat=NULL,mode=c('leaf','all')){
mode <- match.arg(mode)
if(is.null(distmat)){
if(mode=='leaf'){ distmat <- cophenetic.phylo(tree) }
else{ distmat <- dist.nodes(tree) }
}
ntips<- Ntip(tree)
nint <- tree$Nnode ## number of internal nodes
if(mode=='leaf'){
return(sapply((ntips+1):(ntips+nint),get.node.leaf.MPPD,tree,distmat))
}
else{
return(sapply((ntips+1):(ntips+nint),get.node.full.MPPD,tree,distmat))
}
}
|
The actual clustering is done as follows:
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
| ## Given
## a tree and a threshold, and (optionally)
## a vector of the MPPD of the subtree at each internal node,
## the reliability threshold and vector,
## and specify if you want the membership for all nodes or
## only the tips
## Return
## a vector indicating, for each internal node, which
## cluster it belongs to
prosperi.cluster <- function(tree,thresh,distvec=NULL,
rthresh=NULL,reliabilityvec=NULL,
retval=c("tips","all")){
## take care of optional arguments
retval <- match.arg(retval)
if(is.null(distvec)){
cat("Calculating MPPD for each node ...\n")
distvec <- pdist.clusttree(tree,mode='all')
}
if(is.null(rthresh) || is.null(reliabilityvec)){
reliabilityvec=NULL
cat("No reliability thresholding.\n")
}
## set up clustering
ntips<-Ntip(tree)
cnum <- 0 ## cluster number
assign <- rep(0,ntips+tree$Nnode) ## cluster assignment
igraph.tree <- graph.edgelist(tree$edge) ## tree in igraph form
dfs <- graph.dfs(igraph.tree,root=ntips+1,neimode='out',
order=TRUE,dist=TRUE)
## travese the tree in depth first order
for(i in 1:length(dfs$order)){
node <- dfs$order[i]
## skip leaves
if(node < ntips+1){ next }
## skip unreliable nodes (if reliability measure is available)
if(! is.null(reliabilityvec) &&
reliabilityvec[node-ntips] >= rthresh){ next }
## If the node's subtree is below the threshold, mark it and
## its subtree as members of a new cluster
if(distvec[node-ntips]<=thresh && assign[node]<=0){
cnum <- cnum+1
subtree <- graph.dfs(igraph.tree,node,
neimode='out',unreachable=FALSE)$order
subtree <- subtree[! is.nan(subtree)]
assign[subtree] <- cnum
}}
ans <- list(membership=assign,allcsize=table(assign),
leafclustsize=table(assign[1:ntips]),
ntips=ntips,threshold=thresh)
if(retval=="tips"){ans$membership <- ans$membership[1:ntips]}
class(ans) <- c(class(ans),'p.cluster')
return(ans)
}
|
And what is the point of it all without a colorful graphic?
99
100
101
102
103
104
105
106
107
| testplot <- function(testtree,thresh,plotfile=NULL){
graphics.off()
clustering <- prosperi.cluster(testtree,thresh)$membership
if(class(plotfile)=="character"){ png(plotfile) }
plot(testtree,show.tip.label=FALSE)
tiplabels(rep(" ",Ntip(testtree)),
bg=clustering) # better colors if you use RColorBrewer
if(class(plotfile)=="character"){ dev.off() }
}
|
To test it out, we need some data. You can grab a nice tree from the figshare site if you don't have your own handy.
108
| samp <- read.tree('sampbalanced.tre')
|
where to threshold?
109
110
111
| quantile(dist.nodes(samp),seq(0.1,0.5,by=0.05))
## 10% 15% 20% 25% 30% 35% 40% 45%
##0.4643290 0.5712468 0.6980250 0.8462455 1.0000370 1.1277150 1.2418286 1.3255760
|
suggests that 0.46 would restrict clusters to tips which are more closely related than those from 90% of randomly selected individuals. If one wants to be generous, a threshold of 1 clusters individuals more similar than 70% of the population. Here is what it looks like:
112
113
| testplot(samp,0.46)
testplot(samp,1.0)
|
Clustering phylogenetic trees based on median patristic distances less than 10% (left) or 30% (right) of the entire tree
114
115
116
117
118
119
| bigtree <- read.tree("aBigTree.tre")
bigtree.mppd <- pdist.clusttree(bigtree,mode="all")
bigtree.support <- as.numeric(bigtree$node.label)
bigtree.support[is.na(bigtree.support)] <- 0
clustering <- prosperi.cluster(bigtree,0.5,distvec=bigtree.mppd,
rthresh=0.90,reliabilityvec=bigtree.support)
|
Analyze the clustering as you like (perhaps by making a table of the membership values, perhaps by plotting) to see if the threshold is reasonable. Source paper for the algorithm: Prosperi MC, Ciccozzi M, Fanti I, Saladini F, Pecorari M, Borghi V, Di Giambenedetto S, Bruzzone B, Capetti A, Vivarelli A, Rusconi S, Re MC, Gismondo MR, Sighinolfi L, Gray RR, Salemi M, Zazzi M, De Luca A, & ARCA collaborative group (2011). A novel methodology for large-scale phylogeny partition. Nature communications, 2 PMID: 21610724Abstract
Understanding the determinants of virus transmission is a fundamental step for effective design of screening and intervention strategies to control viral epidemics. Phylogenetic analysis can be a valid approach for the identification of transmission chains, and very-large data sets can be analysed through parallel computation. Here we propose and validate a new methodology for the partition of large-scale phylogenies and the inference of transmission clusters. This approach, on the basis of a depth-first search algorithm, conjugates the evaluation of node reliability, tree topology and patristic distance analysis. The method has been applied to identify transmission clusters of a phylogeny of 11,541 human immunodeficiency virus-1 subtype B pol gene sequences from a large Italian cohort. Molecular transmission chains were characterized by means of different clinical/demographic factors, such as the interaction between male homosexuals and male heterosexuals. Our method takes an advantage of a flexible notion of transmission cluster and can become a general framework to analyse other epidemics.
No comments:
Post a Comment