Conceived and designed the experiments: BM SLKP KS. Performed the experiments: BM SLKP. Analyzed the data: BM SLKP KS. Wrote the paper: BM TdO CS SLKP KS.
The authors have declared that no competing interests exist.
The evolution of substitutions conferring drug resistance to HIV-1 is both episodic, occurring when patients are on antiretroviral therapy, and strongly directional, with site-specific resistant residues increasing in frequency over time. While methods exist to detect episodic diversifying selection and continuous directional selection, no evolutionary model combining these two properties has been proposed. We present two models of episodic directional selection (MEDS and EDEPS) which allow the
When exposed to treatment, HIV-1 and other rapidly evolving viruses have the capacity to acquire drug resistance mutations (DRAMs), which limit the efficacy of antivirals. There are a number of experimentally well characterized HIV-1 DRAMs, but many mutations whose roles are not fully understood have also been reported. In this manuscript we construct evolutionary models that identify the locations and targets of mutations conferring resistance to antiretrovirals from viral sequences sampled from treated and untreated individuals. While the evolution of drug resistance is a classic example of natural selection, existing analyses fail to detect the majority of DRAMs. We show that, in order to identify resistance mutations from sequence data, it is necessary to recognize that in this case natural selection is both episodic (it only operates when the virus is exposed to the drugs) and directional (only mutations to a particular amino acid confer resistance while allowing the virus to continue replicating). The new class of models that allow for the episodic and directional nature of adaptive evolution performs very well at recovering known DRAMs, can be useful for identifying unknown resistance-associated mutations, and is generally applicable to a variety of biological scenarios where similar selective forces are at play.
Among positively selected evolutionary changes, a distinction can be made between
A second distinction is that between selective pressure that is constant over time, and selective pressure that changes over time, possibly instantaneously – we shall refer to the latter as
Here, we consider the evolution of drug resistance in HIV-1 following the treatment of a subset of the host population. We expect that selective pressure will be both episodic, with drug-induced adaptive amino acid changes occurring only in patients receiving therapy, and directional, with site-specific target residues increasing in frequency over time in the treated subset. HIV-1 experiences a variety of other selective pressures, most prominently due to host immune response (e.g.
Previous approaches to detect positive selection driving treatment resistance have had variable success. Crandall
In this paper we aim to demonstrate that explicitly modeling the directional and episodic character of the evolution of drug resistance improves the power and accuracy with which drug resistance sites are detected. We introduce a codon-based Model of Episodic Directional Selection (MEDS) and a model of protein evolution called Episodic Directional Evolution of Protein Sequences (EDEPS), and show that both models outperform models that lack either the episodic or directional components.
Our codon model of episodic directional selection assumes that branches on the phylogenetic tree can be partitioned into foreground (F) and background (B) subsets
MEDS extends two previously proposed models of coding sequence evolution: 1) the episodic component of MEDS is structurally identical to the Internal Fixed Effects Likelihood (IFEL) model proposed in
Model | Data | Baseline model | Site variation | Lineage variation | Selection test | Citation |
MEDS | Codon | MG94 | Fixed effects | Episodic | Directional | This paper |
FEEDS | Codon | MG94 | Fixed effects | Episodic | Diversifying | |
DEPS | Protein | HIV-Between | Random effects | Constant | Directional | |
EDEPS | Protein | HIV-Between | Random effects | Episodic | Directional | This paper |
FEEDS has the same structure as a model called IFEL in that paper, but the use here is novel.
Following Seoighe
Model fitting proceeds in two stages: (a) estimating the parameters shared across sites, and (b) site-wise analysis
The above test treats nucleotide substitution rates and branch length parameters at a single site as known, even though these are estimated across sites under a simpler model. It is possible that this could affect inference if these estimates were substantially biased. Our simulations suggest that the test performs well in spite of this computational shortcut, and using different models to infer these parameters does not substantially affect the test results on the empirical data we analyze here. Additionally, the
Scanning a site for selection towards any possible amino acid (
To assess the importance of the directional component of MEDS, we adapt IFEL to test for episodic diversifying selection along foreground branches and use it as a benchmark for MEDS. As the branches of interest are mostly terminal, the name, IFEL, is no longer appropriate, and we rename the model FEEDS, for ‘Fixed Effects Episodic Diversifying Selection’. The alternative model for FEEDS is identical to the null model for MEDS, allowing
Throughout the analyses we also compare our results against DEPS (full results in tables S1 to S3), a method for detecting non-episodic directional selection proposed by Kosakovsky Pond
It is a straightforward exercise to modify DEPS to incorporate the episodic nature of MEDS – namely, we restrict accelerated substitutions towards a target residue
All models and their accompanying LRTs are implemented in a HyPhy Batch Language script
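As an illustration of how a site-wise likelihood ratio test of this kind can be evaluated, the following Python sketch converts a pair of log-likelihoods into a p-value and scans several candidate target residues at one site. It is not the HyPhy implementation. It assumes a chi-square null distribution with one degree of freedom and a Bonferroni correction over the residues tested; both are assumptions of this sketch rather than details taken from the text, and the function names and log-likelihood values are hypothetical.

```python
import math


def lrt_pvalue(lnl_null, lnl_alt):
    """P-value of an LRT with one extra parameter (assumed chi-square, 1 df)."""
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


def scan_site(lnl_null, lnl_alt_by_target, alpha=0.01):
    """Test each candidate target residue at one site.

    P-values are Bonferroni-corrected for the number of residues tested
    (an assumption of this sketch). Returns {target: corrected p-value}
    for the targets that pass the threshold.
    """
    n_tests = len(lnl_alt_by_target)
    hits = {}
    for target, lnl_alt in lnl_alt_by_target.items():
        corrected = min(1.0, n_tests * lrt_pvalue(lnl_null, lnl_alt))
        if corrected <= alpha:
            hits[target] = corrected
    return hits


# Hypothetical log-likelihoods for one site and two candidate targets.
example_hits = scan_site(-100.0, {"L": -85.0, "V": -99.5})
```

With these made-up values only the target with the large likelihood gain survives the correction.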
We analyzed three HIV-1 datasets obtained from the South African mirror of the Stanford HIV Drug Resistance Database (HIVdb)
We screened each sequence for evidence of recombination (known to have a biasing effect on selection detection, e.g.
The first dataset comprises pairs of reverse transcriptase (RT) isolates obtained before and after the initiation of highly active anti-retroviral therapy (HAART) from 241 patients (482 sequences). The data were obtained from the Stanford HIVdb using a query that retrieved paired samples from the same patient, filtered on the earlier sample being Reverse Transcriptase Inhibitor (RTI) naive, and the later sample taken during therapy with at least one Non-Nucleoside RTI (NNRTI)
A dataset consisting of 49 protease isolates (from 37 patients), sampled after Protease Inhibitor (PI) treatment, was retrieved from HIVdb (query: Number of PIs = 3, Subtype = C). Additionally, the entire collection of treatment-naive protease isolates was obtained, and all full-length sequences were searched for the two sequences nearest (under the Hamming distance) to each of the 49 post-treatment sequences. The final dataset was constructed by combining the post-treatment and closely related naive sequences: a total of 122 sequences, as some naive sequences were closely related to more than one post-PI sequence. Since protease is only 297 nucleotides long, we were concerned that convergent evolution due to drug resistance might inflate the apparent relatedness between some of the treatment resistant sequences
Foreground branches are marked in red. All terminal foreground branches lead to sequences obtained from patients who had been receiving antiretroviral therapy. See text for details of how we determined which internal branches were assigned to foreground. MEDS and EDEPS allow the presence of a directional component along the foreground branches where antiretroviral therapy exerts selective pressure.
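The selection of closely related naive sequences for the protease dataset can be sketched in a few lines of Python. This is a minimal illustration, assuming aligned sequences of equal length held in dictionaries keyed by a sequence identifier. The function names, the tie-breaking rule (alphabetical by identifier) and the toy sequences are hypothetical, and handling of gaps or ambiguous bases is not addressed.

```python
def hamming(a, b):
    """Number of mismatched positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def nearest_naive(post, naive, k=2):
    """Union of the k nearest naive sequence ids for every post-treatment sequence.

    post and naive map sequence id -> aligned sequence. Ties are broken
    alphabetically by id, which is an arbitrary choice made for this sketch.
    """
    selected = set()
    for query in post.values():
        ranked = sorted(naive, key=lambda name: (hamming(query, naive[name]), name))
        selected.update(ranked[:k])
    return sorted(selected)


# Toy alignment: both post-treatment sequences share the same two neighbours,
# so the union holds fewer than k * len(post) naive sequences.
post = {"p1": "ACGT", "p2": "ACGA"}
naive = {"n1": "ACGT", "n2": "ACGG", "n3": "TTTT", "n4": "ACCA"}
chosen = nearest_naive(post, naive, k=2)
```

Taking the union is what makes the final dataset smaller than the number of post-treatment sequences times k plus the post-treatment sequences themselves, as with the 122 sequences reported for protease.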
The post-treatment sequences for the final empirical dataset were 83 integrase isolates sampled from 40 patients after Integrase Inhibitor (II, Raltegravir) therapy. 1237 II-naive isolates were obtained from the Stanford HIVdb, and the final Raltegravir dataset was made up of 315 sequences: the 83 post-II isolates, plus the union of the 25 II-naive isolates nearest to each of the 83 post-II isolates under the HKY85 distance
We investigated the power of MEDS by simulating alignments over a balanced 64-taxon phylogeny (see
In real evolving systems, the modeling assumption of selection towards a single target amino acid could be violated. We investigated how such deviations would impact the power of the model by simulating directional selection towards two target amino acids, with substitutions towards one target accelerated on
We used exactly the same simulation configuration and parameters to assess the rates of false positives under the null model (
In evolving proteins, each site could have its own site-specific selective constraints governing amino acid distributions. MEDS assumes that background equilibrium frequencies are the same for all sites, and a potential concern is that deviations from this modeling assumption could lead to excessive false positives. To investigate this, we simulated data under a version of the null model where each site's amino acid equilibrium frequencies were sampled from a symmetric Dirichlet distribution with density
MEDS detected twenty substitutions at seventeen sites under significant directional selection at
Site | Target | MEDS p-value | FEEDS p-value | EDEPS Bayes Factor | Resistance |
|
L |
|
- |
- | NRTI |
|
|
V | - | - | - | 313 | NRTI |
|
K |
|
|
- | ||
|
L | - | - | - | 211 | NRTI |
|
S |
|
- | - | ||
|
I |
|
- |
|
NNRTI |
|
|
|
- | - | 0.0025 | - | |
|
N |
|
|
|
NNRTI | |
|
Y |
|
- | - | ||
|
F | - | - | - |
|
NRTI |
|
Y |
|
- | - | NRTI | |
|
M |
|
- |
|
NRTI | |
|
Q |
|
- | - | ||
|
S | - | - | - | 1772 | |
|
L |
|
- | 2245 | ||
|
R | - | - | - | 105 | |
|
I |
|
- |
|
NNRTI | |
|
V |
|
- |
|
NRTI | |
|
L |
|
|
|
NNRTI | |
|
Y |
|
- | - | ||
|
S |
|
- |
|
NNRTI | |
|
|
- | - |
|
- | |
|
F |
|
- | 2727 | NRTI | |
|
T |
|
- | - | ||
|
R |
|
- | 1401 | NRTI accessory | |
|
L |
|
- |
|
NNRTI | |
|
|
- | - | 0.0006 | - | |
|
A |
|
- | - |
MEDS versus FEEDS LRT, testing for directional selection.
the lower bound of the approximate
Empirical Bayes analysis, testing for directional selection on protein data.
‘-’: not significant.
Nucleoside reverse-transcriptase inhibitor.
Non-nucleoside reverse-transcriptase inhibitor.
Remarkably, FEEDS detected only six sites under diversifying selection (table S5), two of which are known resistance mutations, strongly supporting the inclusion of a directional component in the model. A continuous directional selection model (DEPS) detected 46 sites under directional selection with Bayes factors
MEDS detected nine substitutions under directional selection at
Site | Target | MEDS p-value | FEEDS p-value | EDEPS Bayes Factor | Resistance |
|
|
- |
- | 0.0005 | - | PI |
|
T |
|
- | - | ||
|
V |
|
- | 145 | PI accessory | |
|
D |
|
- | - | ||
|
|
- | - | 0.0026 | - | PI |
|
E |
|
- | - | PI accessory | |
|
E |
|
- | - | ||
|
V | - | - | 0.0011 | 257 | PI accessory |
|
S |
|
0.0013 | - | PI accessory | |
|
A | - | - |
|
|
PI |
|
V |
|
- |
|
PI | |
|
M |
|
|
|
PI | |
|
L |
|
- | - | PI accessory |
MEDS versus FEEDS LRT, testing for directional selection.
Empirical Bayes analysis, testing for directional selection on protein data.
‘-’: not significant.
Protease inhibitor.
FEEDS identified six sites involved in diversifying selection (table S7), with all six listed on HIVdb. In addition to two sites already detected by MEDS (74 and 90), sites 10 and 71 are listed as accessory mutations, while 54 and 82 are major resistance mutations. DEPS appeared to be much more conservative on this dataset, detecting four sites under directional selection, two of which are listed on HIVdb (see table S2).
MEDS detected six substitutions under significant directional selection at the 1% level (see
Site | Target | MEDS p-value | FEEDS p-value | EDEPS Bayes Factor | Resistance |
|
I |
|
- |
- | INI |
|
|
A |
|
|
|
INI accessory | |
|
S |
|
|
|
INI | |
|
R |
|
|
|
INI | |
|
H |
|
|
|
INI | |
|
H |
|
|
|
INI | |
|
R | - | - | - |
|
INI accessory |
|
Q | - | - | - |
|
|
|
|
- | - | 0.0064 | - | |
|
|
- | - | 0.0048 | - | INI accessory |
MEDS versus FEEDS LRT, testing for directional selection.
Empirical Bayes analysis, testing for directional selection on protein data.
‘-’: not significant.
Integrase inhibitor.
FEEDS found seven sites under diversifying selection (table S9), six of which harbor known resistance mutations. Site 230 is the only correctly identified resistance site in the integrase dataset that FEEDS detects as being under diversifying selection but MEDS does not detect as being under directional selection; 230R and 230N are listed as selected by Raltegravir. DEPS detected 39 substitutions under directional selection (see table S3), nine of which appear on the HIVdb list.
Comparing the fit of FEEDS and MEDS on
The power of MEDS, like that of other codon methods, strongly depends on the information content of the sequences, specifically on the number of times that substitutions toward the target occur along the foreground lineages. For example, even when
Hence, we tabulate MEDS results only for sites with at least one substitution towards the target on any foreground branch.
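The conditional tabulation described here can be sketched as follows. This is a hedged illustration, not the script used for the paper: the record fields `fg_subs_to_target` and `p_value`, the significance threshold and the toy records are all hypothetical.

```python
def conditional_power(sites, alpha=0.05, min_subs=1):
    """Fraction of informative simulated sites at which the test is significant.

    A site is informative if at least min_subs substitutions towards the
    target occurred along foreground branches. Returns (power, count of
    informative sites), mirroring the "power (count)" cells of the tables.
    """
    informative = [s for s in sites if s["fg_subs_to_target"] >= min_subs]
    if not informative:
        return float("nan"), 0
    detected = sum(1 for s in informative if s["p_value"] <= alpha)
    return detected / len(informative), len(informative)


# Toy simulation records (made-up values).
records = [
    {"fg_subs_to_target": 0, "p_value": 0.90},
    {"fg_subs_to_target": 1, "p_value": 0.40},
    {"fg_subs_to_target": 3, "p_value": 0.001},
    {"fg_subs_to_target": 2, "p_value": 0.03},
    {"fg_subs_to_target": 1, "p_value": 0.20},
]
power, n_informative = conditional_power(records)
```

Sites with no substitution towards the target are excluded from the denominator, since no test can be expected to detect selection there.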
# FG branches | 2 | 5 | 10 | 100 | 1000 |
4 | 0 (8) | 0 (16) | 0 (37) | 0.31 (110) | 0.79 (155) |
8 | 0 (11) | 0 (18) | 0.04 (62) | 0.51 (129) | 0.73 (170) |
16 | 0 (31) | 0.018 (54) | 0.036 (83) | 0.59 (177) | 0.71 (201) |
32 | 0.02 (62) | 0.03 (71) | 0.16 (116) | 0.68 (223) | 0.80 (282) |
Numbers in brackets are the number of times at least one substitution towards the target occurred along foreground branches:
# FG branches | # substitutions to target AA |
 | 0 | 1 | 2 | 3 | 4 | |
4 | 0 (1674) | 0 (119) | 0.2 (58) | 0.77 (48) | 0.99 (111) | N/A |
8 | 0 (1610) | 0 (146) | 0.23 (53) | 0.69 (26) | 1 (21) | 0.99 (144) |
16 | 0 (1454) | 0 (200) | 0.34 (92) | 0.49 (39) | 0.79 (34) | 0.97 (181) |
32 | 0 (1246) | 0.03 (234) | 0.4 (107) | 0.41 (70) | 0.70 (46) | 0.97 (297) |
Numbers in brackets are the number of times that many substitutions towards the target occurred along foreground branches:
For data simulated with two target residues, each on eight foreground branches, the occurrence of at least one substitution towards
Substitutions to both targets | | | | | | | | |
MEDS detects at least one target: | 0.64 | 0.81 | 0.89 | 0.92 | 0.95 | 0.98 | 1 | 1 |
MEDS detects both targets: | 0.19 | 0.36 | 0.48 | 0.52 | 0.63 | 0.76 | 0.78 | 0.81 |
Total sites: | 538 | 288 | 214 | 179 | 132 | 99 | 69 | 32 |
Substitutions along foreground branches. Each target has 8 foreground branches along which changes towards it were accelerated.
MEDS behaves conservatively. With data simulated under the null model, far fewer sites are identified as under episodic directional selection than would be expected from the nominal p-value thresholds. Across all four foreground configurations, only one false positive detection (
 | 0.005 | 0.05 | 0.5 | 5 |
 | 0.005 | 0.0025 | 0.0025 | 0.0075 |
 | 0.02 | 0.0175 | 0.02 | 0.015 |
 | 0.0325 | 0.0325 | 0.035 | 0.0375 |
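To make the site-specific equilibrium frequencies used in the null simulations more concrete, the sketch below draws one frequency vector from a symmetric Dirichlet distribution by normalizing independent gamma variates. It is an illustration only: the name `alpha` for the concentration parameter, the choice of 20 amino acid states and the seed are assumptions of this sketch, not settings taken from the simulations.

```python
import random


def sample_symmetric_dirichlet(alpha, k=20, rng=None):
    """Draw k frequencies from a symmetric Dirichlet(alpha, ..., alpha).

    Uses the standard construction: independent Gamma(alpha, 1) draws
    divided by their sum. Small alpha concentrates most of the mass on a
    few states; large alpha gives nearly uniform frequencies.
    """
    rng = rng or random.Random()
    while True:
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
        total = sum(draws)
        if total > 0.0:  # guard against every draw underflowing to zero
            return [d / total for d in draws]


# One hypothetical site with strongly constrained amino acid usage.
site_freqs = sample_symmetric_dirichlet(0.05, rng=random.Random(1))
```

Each simulated site would receive its own draw, so that different sites favor different residues.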
We have proposed a codon (MEDS) and a protein (EDEPS) model of episodic directional selection, and demonstrated their performance on three HIV-1 datasets, where drug-induced directional episodic selection is expected to operate. We have also proposed a model of episodic diversifying selection (FEEDS), to rigorously evaluate the importance of modeling the directional component of natural selection. As expected, on all datasets, our episodic directional tests strongly outperform a test for continuous directional selection (DEPS) for detecting drug resistance sites. The assumptions of DEPS are inappropriate for the analysis of episodic selection, where selection is limited to specific regions of the phylogeny, because DEPS assumes uniform selection over the whole phylogeny. This serves as a caution against using suboptimal models, rather than a criticism of DEPS.
We tested MEDS with extensive simulations. MEDS is a conservative test, even when strong constraints on the amino acid state space are introduced in the form of site-specific equilibrium frequencies. Under the alternative model, good power is achieved even when relatively few substitutions towards target amino acids take place along foreground branches. When we deviate from the alternative model and elevate the substitution rate towards several target residues, the power to detect both targets is lower than it would be assuming independence. This reduction in power is expected: as the number of targets along foreground branches increases, the directional nature of the process is lost.
Hughes
Another interesting property of directional models is exemplified by a substitution in the protease dataset. 93L is a polymorphic mutation selected for by protease inhibitors. Despite L already being the most common residue in subtype C, the model detects selective pressure towards it: the proportion of L residues is indeed lower in naive sequences. At the population level this appears as purifying selection, because the most common amino acid increases in frequency. This is nevertheless detected by our test. Far from being problematic, such information could be useful for directing treatment, if it turns out that patients with I at position 93 are more susceptible to PI therapy. Such observations should, of course, be directly verified with clinical data.
There are clear differences in organism-wide amino acid exchangeabilities in HIV-1
MEDS and EDEPS were designed with HIV-1 drug resistance in mind, but should be applicable wherever episodic directional selection occurs along multiple lineages. To use the models, two specific conditions must be met: 1) Lineages expected to be under directional selection must be known
With HIV-1 drug resistance datasets, the foreground labeling strategy might prove important. On the RT dataset, branch-labeling was straightforward, as we had access to pre-treatment sequences for each patient. This is not the case for most real-world datasets, and other approximate labeling schemes, as well as the robustness of the results to these differences, should be investigated.
Another consideration is the rooting of the tree. With directional models, the expected amino acid frequencies change across the phylogeny, and the position of the root becomes important
Amidst growing concerns about an epidemic of circulating drug resistant HIV-1, the WHO and SATuRN are recommending a scale-up of drug resistance surveillance
The maximum-likelihood phylogeny for the reverse transcriptase dataset. Foreground branches are marked in red. All terminal foreground branches lead to sequences obtained from patients who had been receiving antiretroviral therapy.
(PDF)
The maximum-likelihood phylogeny for the integrase dataset. Foreground branches are marked in red. All terminal foreground branches lead to sequences obtained from patients who had been receiving antiretroviral therapy.
(PDF)
A balanced phylogeny used for simulations. Foreground branches are marked in red. See
(PDF)
Reverse transcriptase results - DEPS.
(PDF)
Protease results - DEPS.
(PDF)
Integrase results - DEPS.
(PDF)
Reverse Transcriptase - MEDS: Maximum likelihood parameter values for the test for episodic directional selection.
(PDF)
Reverse Transcriptase - FEEDS: Maximum likelihood parameter values for the test for episodic diversifying selection.
(PDF)
Protease - MEDS: Maximum likelihood parameter values for the test for episodic directional selection.
(PDF)
Protease - FEEDS: Maximum likelihood parameter values for the test for episodic diversifying selection.
(PDF)
Integrase - MEDS: Maximum likelihood parameter values for the test for episodic directional selection.
(PDF)
Integrase - FEEDS: Maximum likelihood parameter values for the test for episodic diversifying selection.
(PDF)
Simulation details. The variation in nuisance parameters used for our simulations.
(PDF)