R/Medicine on R Views

A Guide to Binge Watching R / Medicine 2021

Thu, 09 Sep 2021 00:00:00 +0000

R / Medicine is a big deal. This year, the conference grew by 13%, with 665 people from over 60 countries signing up for the virtual event, which was held last month. 34% of the registrants were from outside the United States, and 17% identified as physicians.

The conference is now an established international event where experts report on the advanced use of the R language, Machine Learning, and statistical analysis, and discuss the successes and challenges associated with bringing these technologies to day-to-day medical practice.

Almost all of the talks, including keynotes, regular talks, lightning talks, pre-conference workshops, and poster sessions, are available online. Find the links on the R / Medicine site or look through the playlist on the R Consortium YouTube Channel. Note that the posters can be viewed by going to the conference spatial.chat site. (If you and a friend visit at the same time, you should be able to “walk around” the posters and chat about what you see.)

To kick off an evening of binge watching the conference I would begin with the keynotes.

The Keynotes

Dr. Karandeep Singh sets the hook for his talk, Bringing Machine Learning Models to the Bedside at Scale, two minutes into the video when he asks:

Who are the twenty sickest patients in the hospital right now who are not in the ICU?

This straightforward question immediately gets to the promise and the problems of introducing large-scale machine learning algorithms into the hospital, and indicates how medical practice interacts with big-money questions about allocating resources. Both physicians and administrators would like to identify high-risk patients and treat them proactively, while confidently spending less on unnecessary tests for low-risk patients. About five minutes into the talk (5:10), Karandeep begins discussing the challenges associated with introducing machine learning models.

In the remainder of the talk he describes the technical infrastructure and then the governance or “social infrastructure” needed for success.

If you enjoy a good detective story and take pride in your ability to interpret a well-done statistical plot, you are certainly going to want to watch Ziad Obermeyer’s keynote Dissecting Algorithmic Bias. About two minutes into the video, Professor Obermeyer sets the stage with the warning:

The single greatest threat to all of the gains that we can make in using algorithms in medicine is letting them go wrong in increasingly well known ways.

and the observation that due to the focus of the US health care management on “high risk care management” an estimated 150 to 200 million Americans are sorted by algorithms every year. He goes on to work through a case study that illustrates how an algorithm built with good intentions had the effect of scaling up racial bias.

A second case study features an algorithm that “fights against” racial bias. Along the way, Ziad weaves two common themes into his presentation:

So many of the ways that algorithms can go wrong come from training algorithms with the wrong target variables, often “convenient and tempting proxies”.
The necessity of follow-up work to fix underlying problems.

In the remainder of this post, I have organized the talks into six categories that you may find helpful for setting your viewing program: Clinical Practice, Clinical Trials, Medical Data, R in Production, R Tools, and Short Courses. The majority of the talks have a machine learning angle. There is quite a bit of Shiny, and several R packages, not all of them on CRAN, are featured. I have provided links when I could find them. I don’t want to spoil anyone’s fun in searching through the videos for “Easter Eggs”, but the Reproducible Research with R short course contains the first preview of the Quarto Publishing system in a talk from anyone at RStudio. (Note that the video needs some editing. Start watching at 9 minutes.)

Clinical Practice

Building an Interpretable ML Model API for Interpretation of CNVs in Patients with Rare Diseases - Francisco Requena
Subgroup Identification and Precision Medicine with the personalized R Package - Jared Huling
R and Shiny Dashboards to Facilitate Quality Improvement in Anesthesiology and Perioperative Care - Robert Lobato
tidytof: Predicting Patient Outcomes from Single-cell Data using Tidy Data Principles - Timothy Keyes
Assessing ML Model Performance in Diverse Populations and Across Time - Victor Castro, Roy Perlis

Clinical Trials

Designing Early Phase Clinical Trials with ppseq - Emily Zabor
Collaborative, Reproducible Exploration of Clinical Trial Data - Michael Kane
Graphical Displays in R for Clinical Trials - Steven Schwager
ctrialsgov: Access, Visualization, and Discovery of the ClinicalTrials.gov Database - Taylor Arnold

Medical Data

Scaling Up and Deploying Shiny and Text Mining for National Health Decisions - Andreas Soteriade, Chris Beeley
Mapping African Health Data with afrimapr Package, Training & Community - Andy South
You R What You Measure: Digital Biomarkers for Insights in Personalized Health - Irene van den Broek
Shiny and REDCap for a Global Research Consortium - Judith Lewis, Stephany Duda
Diving into Registry Data: Using R for Large Norwegian Health Registries - Julia Romanowska
ReviewR: A Shiny App for Reviewing Clinical Records - Laura Wiley, David Mayer
DOPE: An R package for Processing and Classifying Drug Names - Layla Bouzoubaa
medicaldata for Teaching #Rstats - Peter Higgins
Stem Cell Transplant Outcomes Reporting using R/Shiny - Richard Hanna, Stephan Kadauke

R in Production

Second Server to the Right and Straight On ‘til Production: Deploying a GxP Shiny Application - Marcus Adams
Target Markdown and stantargets for Bayesian model validation pipelines - Will Landau
GENETEX: A Genomics Report Text Mining R Package to Capture Real-world Clinico-genomic Data - David Miller, Sophia Shalhout

R Tools

Generalized Additive Models for Longitudinal Biomedical Data - Ariel Mundo
Multistate Data Using the survival Package - Beth Atkinson
Bayesian Random-Effects Meta-analysis using bayesmeta - Christian Röver
An arsenal of R Functions for Statistical Summaries - Ethan Heinzen, Beth Atkinson, Jason Sinnwell
R Markdown and officedown to Automate Clinical Trial Reporting - Damian Rodziewicz
Creating and Styling PPTX Slides with rmarkdown - Emil Hvitfeldt
runway: an R Package to Visualize Prediction Model Performance - Jie Cao, Karandeep Singh
clinspacy: An R package for Clinical Natural Language Processing - Jie Cao, Karandeep Singh
Data Visualization for Machine Learning Practitioners - Julia Silge
Animated Data Visualizations with gganimate for Science Communication during the Pandemic - Kristen Panthagani
Incorporating Risk-of-Bias Assessments into Evidence Syntheses with robvis - Luke McGuinness, Randall Boyes, Alex Fowler
‘gpmodels’: A Grammar of Prediction Models - Sean Meyer, Karandeep Singh
CONSORT Diagrams in R with ggconsort - Travis Gerke

Short Courses

Secure Medical Data Collection: Best Practices with Excel, and Leveling Up to REDCap and CollaboratoR - Peter Higgins, Will Beasley, Kenneth MacLean, Amanda Miller
Introduction to R for Medical Data - Ted Laderas, Daniel Chen, Mara Alexeev
An Introductory R Guide for Targeted Maximum Likelihood Estimation in Medical Research - Ehsan Karim, Hanna Frank
Mapping Spatial Health Data - Marynia Kolak, Susan Paykin
From SAS to R - Joe Korszun
Reproducible Research with R - Alison Hill, Stephan Kadauke, Paul Villanueva

Analysing the HIV pandemic, Part 4: Classification of lab samples

Thu, 23 May 2019 00:00:00 +0000

Andrie de Vries is the author of “R for Dummies” and a Solutions Engineer at RStudio

Phillip (Armand) Bester is a medical scientist, researcher, and lecturer at the Division of Virology, University of the Free State, and National Health Laboratory Service (NHLS), Bloemfontein, South Africa

In this post we complete our series on analysing the HIV pandemic in Africa. Previously we covered the bigger picture of HIV infection in Africa, and a pipeline for drug resistance testing of samples in the lab.

Then, in part 3 we saw that sometimes the same patient’s genotype must be repeatedly analysed in the lab, from samples taken years apart.

Let’s say we genotyped a patient five years ago and we now have a current genotype sequence. It should be possible to retrieve the previous sequence from a database of sequences without relying only on identifiers, or without using them at all. People may change their surname when they remarry, and transcription errors can occur, which makes finding previous samples tedious and error-prone. So instead of using patient information to look for previous samples to include, we can use the sequence data itself and then confirm that the sequences belong to the same patient or investigate any irregularities. If we suspect mother-to-child transmission from our analysis, we confirm this with the healthcare worker who sent the sample.

In this final part, we discuss how the inter- and intra-patient HIV genetic distances were analyzed using logistic regression to gain insights into the probability distribution of these two classes. In other words, the goal is to find a way to tell whether two genetic samples are from the same person or from two different people.

Samples from the same person can have slightly different genetic sequences, due to mutations and other errors. Telling these cases apart is especially useful when comparing samples of genetic material from retroviruses.

Preliminary analysis

To help answer this question, we downloaded data from the Los Alamos HIV sequence database (specifically, Virus HIV-1, subtype C, genetic region POL CDS).

Each observation is the (dis)similarity distance between different samples.

library(readr)
library(dplyr)
library(ggplot2)

pt_distance <- 
  read_csv("dist_sample_10.csv.zip", col_types = "ccdccf")

head(pt_distance)

## # A tibble: 6 x 6
##   sample1                sample2                 distance sub   area  type 
##   <chr>                  <chr>                      <dbl> <chr> <chr> <fct>
## 1 KI_797.67744.AB874124… KI_481.67593.AB873933.…   0.0644 B     INT   Inter
## 2 502-2794.39696.JF3202… WC3.27170.EF175209.B.U…   0.0418 B     INT   Inter
## 3 KI_882.67653.AB874186… KI_813.67589.AB874131.…   0.0347 B     INT   Inter
## 4 HTM360.13332.DQ322231… C11-2069070.63977.AB87…   0.0487 B     INT   Inter
## 5 O5598.34737.GQ372062.… LM49.4011.AF086817.B.T…   0.0360 B     INT   Inter
## 6 GKN.45901.HQ026515.B.… C11-2069083.65198.AB87…   0.0699 B     INT   Inter

Next, plot a histogram of the distance between samples. This clearly shows that the distance between samples of the same subject (intra-patient) is smaller than the distance between different subjects (inter-patient). This is not surprising.

However, the histogram also shows that there is no sharp demarcation between the two types. Simply eyeballing the data suggests that one could use an arbitrary threshold of around 0.025 to indicate whether the samples are from the same person or from different people.

pt_distance %>% 
  mutate(
    type = forcats::fct_rev(type)
  ) %>% 
  ggplot(aes(x = distance, fill = type)) +
  geom_histogram(binwidth = 0.001) +
  facet_grid(rows = vars(type), scales = "free_y") +
  scale_fill_manual(values = c("red", "blue")) +
  coord_cartesian(xlim = c(0, 0.1)) +
  ggtitle("Histogram of phylogenetic distance by type")
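
To make the idea of an eyeballed threshold concrete, here is a minimal sketch of a fixed-cutoff rule. The data are simulated and deliberately better separated than the real distances, and `classify_by_cutoff()` is a hypothetical helper, not part of the pipeline.

```r
# Simulated distances, for illustration only (not the Los Alamos data).
set.seed(42)
toy <- data.frame(
  distance = c(runif(50, 0, 0.02), runif(50, 0.03, 0.10)),
  type     = rep(c("Intra", "Inter"), each = 50)
)

# Hypothetical helper: call a pair "Intra" when its distance is below the cutoff.
classify_by_cutoff <- function(distance, cutoff = 0.025) {
  ifelse(distance < cutoff, "Intra", "Inter")
}

toy$predicted <- classify_by_cutoff(toy$distance)

# Cross-tabulate the true type against the rule's call.
confusion <- table(actual = toy$type, predicted = toy$predicted)
confusion
```

On real data the two groups overlap, so a table like this would show off-diagonal counts, and it gives no sense of how certain each call is. That is the gap the model in the next section tries to fill.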

Modeling

Since we have two sample types (intra-patient vs inter-patient), this is a binary classification problem.

Logistic regression is a simple algorithm for binary classification, and a special case of a generalized linear model (GLM). In R, you can use the glm() function to fit a GLM, and to specify a logistic regression, use the family = binomial argument.

In this case we want to train a model with distance as the independent variable and type as the dependent variable, i.e. type ~ distance.

We train on 100,000 (n = 1e5) observations purely to reduce computation time:

pt_sample <- 
  pt_distance %>% 
  sample_n(1e5)
model <- glm(type ~ distance, data = pt_sample, family = binomial)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

(Note that sometimes the model throws a warning indicating numerical problems. This happens because the overlap between intra and inter is very small. If there is a very sharp dividing line between classes, the logistic regression algorithm can have trouble converging.)

However, in this case the numerical problems don’t actually cause a practical problem with the model itself.
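
The effect is easy to reproduce on a tiny, perfectly separated data set. The sketch below uses made-up numbers and simply collects whatever warnings glm() raises, so that they can be inspected rather than scrolling past.

```r
# Made-up, perfectly separated data: every small distance is intra (1),
# every large distance is inter (0).
sep <- data.frame(
  distance = c(0.001, 0.002, 0.003, 0.050, 0.060, 0.070),
  intra    = c(1, 1, 1, 0, 0, 0)
)

# Collect warnings instead of printing them.
msgs <- character(0)
fit_sep <- withCallingHandlers(
  glm(intra ~ distance, data = sep, family = binomial),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

msgs           # warnings raised by glm.fit
coef(fit_sep)  # coefficients are very large in magnitude
```

With no overlap at all, the likelihood keeps improving as the curve gets steeper, so the coefficients run off towards infinity. The real data has a little overlap, which is why the fit above still lands on usable estimates.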

The model summary tells us that the distance variable is highly significant (indicated by the ***):

summary(model)

## 
## Call:
## glm(formula = type ~ distance, family = binomial, data = pt_sample)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4035  -0.0050  -0.0010  -0.0002   8.4904  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    5.7887     0.1796   32.23   <2e-16 ***
## distance    -355.1454     9.3247  -38.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 23659.2  on 99999  degrees of freedom
## Residual deviance:  1440.5  on 99998  degrees of freedom
## AIC: 1444.5
## 
## Number of Fisher Scoring iterations: 12

Now we can use the model to compute a prediction for a range of genetic distances (from 0 to 0.05) and create a plot.

newdata <-  data.frame(distance = seq(0, 0.05, by = 0.001))
pred <- predict(model, newdata, type = "response")

plot_inter <- 
  pt_sample %>% 
  filter(distance <= 0.05, type == "Inter") %>% 
  sample_n(2000)
  
plot_intra <- 
  pt_sample %>% 
  filter(distance <= 0.05, type == "Intra") %>% 
  sample_n(2000)

threshold <-  with(newdata, approx(pred, distance, xout = 0.5))$y

ggplot() +
  geom_point(data = plot_inter, aes(x = distance, y = 0), alpha = 0.05, col = "blue") +
  geom_point(data = plot_intra, aes(x = distance, y = 1), alpha = 0.05, col = "red") +
  geom_rug(data = plot_inter, aes(x = distance, y = 0), col = "blue") +
  geom_rug(data = plot_intra, aes(x = distance, y = 0), col = "red") +
  geom_line(data = newdata, aes(x = distance, y = pred)) +
  annotate(x = 0.005, y = 0.9, label = "Type == intra", geom = "text", col = "red") +
  annotate(x = 0.04, y = 0.1, label = "Type == inter", geom = "text", col = "blue") +
  geom_vline(xintercept = threshold, col = "grey50") +
  ggtitle("Model results", subtitle = "Predicted probability that Type == 'Intra'") +
  xlab("Phylogenetic distance") +
  ylab("Probability")

Logistic regression essentially fits an s-curve that indicates the probability. In this case, for small distances (lower than ~0.01) the probability of being the same person (i.e., type is intra) is almost 100%. For distances greater than 0.03 the probability of being type intra is almost zero (i.e., the model predicts type inter).

The model puts the distance threshold at approximately 0.016.
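
The same threshold can be checked by hand from the coefficients. A logistic model predicts a probability of 0.5 exactly where the linear predictor, intercept + slope * distance, equals zero, so the threshold is -intercept / slope. The sketch below uses the estimates printed in the summary above; the variable names are only for illustration.

```r
# Coefficients copied from the summary(model) output above.
b0 <- 5.7887     # (Intercept)
b1 <- -355.1454  # distance

# The predicted probability is 0.5 where b0 + b1 * d is zero.
threshold_analytic <- -b0 / b1
round(threshold_analytic, 4)

# Sanity check: the logistic curve evaluates to 0.5 at that distance.
plogis(b0 + b1 * threshold_analytic)
```

This gives roughly 0.0163, in line with the interpolated value. With the fitted object at hand, the same quantity would be -coef(model)[1] / coef(model)[2].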

The practical value of this work

In part 2, we discussed how researchers developed an automated pipeline of phylogenetic analysis. The project was designed to run on the Raspberry Pi, a very low-cost computing device. This meant that the cost of implementation of the project is low, and the project has been implemented at the National Health Laboratory Service (NHLS) in South Africa.

In this part, we described the very simple logistic regression model that runs as part of the pipeline. In addition to the descriptive analysis, e.g., heat maps and trees (as described in part 3), this logistic regression predicts whether two samples were obtained from the same person or from two different people. This prediction helps the laboratory staff to identify potential contamination of samples, or indeed to match samples from people who weren’t matched properly by their name and other identifying information (e.g., through spelling mistakes or name changes).

Finally, it’s interesting to note that traditionally the decision whether two samples were intra-patient or inter-patient was made on heuristics, instead of modelling. For example, a heuristic might say that if the genetic distance between two samples is less than 0.01, they should be considered a match from a single person.

Heuristics are easy to implement in the lab, but sometimes it can happen that the origin of the original heuristic gets lost. This means that it’s possible that the heuristic is no longer applicable to the sample population.

This modelling gave the researchers a tool to establish confidence intervals around predictions. In addition, it is now possible to repeat the model for many different local sample populations of interest, and thus have a tool that is better able to discriminate given the most recent data.
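
One common way to get such intervals from a glm fit is sketched below. It is not necessarily how the pipeline does it: the data are simulated, the "true" curve is an assumption chosen to resemble the fit above, and the interval is the usual normal approximation on the log-odds scale.

```r
# Simulated data for illustration only; the real pt_sample is not reproduced here.
set.seed(1)
n <- 2000
sim <- data.frame(distance = runif(n, 0, 0.05))
# Assumed "true" curve, loosely similar in shape to the fitted model above.
sim$intra <- rbinom(n, 1, plogis(6 - 350 * sim$distance))

fit <- glm(intra ~ distance, data = sim, family = binomial)

# Predict on the link (log-odds) scale and ask for standard errors.
grid <- data.frame(distance = c(0.010, 0.016, 0.025))
p <- predict(fit, grid, type = "link", se.fit = TRUE)

# Approximate 95% interval, back-transformed to the probability scale.
grid$prob  <- plogis(p$fit)
grid$lower <- plogis(p$fit - 1.96 * p$se.fit)
grid$upper <- plogis(p$fit + 1.96 * p$se.fit)
grid
```

Building the interval on the link scale and transforming back keeps the bounds between 0 and 1, which a symmetric interval on the probability scale would not guarantee.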

Conclusion

In this multi-part series of HIV in Africa we covered four topics:

In part 1, we analysed the incidence of HIV in sub-Saharan Africa, with special mention of the effect of the widespread availability of anti-retroviral (ARV) drugs during 2004. Since then, there was a rapid decline in HIV infection rates in South Africa.
In part 2, we described the PhyloPi project - a phylogenetic pipeline to analyse HIV in the lab, available for the low-cost Raspberry Pi. This work was published in the PLoS ONE journal: “PhyloPi: An affordable, purpose built phylogenetic pipeline for the HIV drug resistance testing facility”
Then, part 3 described the biological mechanism by which HIV mutates, and how this can be modeled using a Markov chain and visualized as heat maps and phylogenetic trees.
This final part covered how we used a very simple logistic regression model to identify if two samples in the lab came from the same person or two different people.

Closing thoughts

Dear readers,

I hope that you enjoyed this series on ‘Analysing the HIV pandemic’ using R and some of the tools available as part of the tidyverse packages. Learning R provided me not only with a tool set to analyse data problems, but also a community. Being a biologist, I was not sure of the best approach for solving the problem of inter- and intra-patient genetic distances. I contacted Andrie from RStudio, and not only did he help us with this, but he was also excited about it. It was a pleasure telling you about our journey on this blog site, and a privilege doing this with experts.

Armand

Analysing the HIV pandemic, Part 3: Genetic diversity

Thu, 16 May 2019 00:00:00 +0000

Andrie de Vries is the author of “R for Dummies” and a Solutions Engineer at RStudio

Recap

In part 2 of this series, we discussed the PhyloPi pipeline for conducting routine HIV phylogenetics in the drug-resistance testing laboratory as a part of quality control. As mentioned, during HIV replication the error-prone viral reverse transcriptase (RT) converts its RNA genome into DNA before it can be integrated into the host cell genome. During this conversion, the enzyme makes random mistakes in the copying process. These mistakes, or mutations, can be deleterious, beneficial or may have no measurable impact on the replicative fitness of the virus. However, the fast rate of mutation provides enough divergence to be useful for phylogenetic analysis.

Introduction

As infections spread from person to person, the virus continues to mutate and become more and more divergent. This allows us to use the genetic information we obtain while doing the drug resistance test and analyse the sequences for abnormalities.

We showed how DNA sequences can be aligned and, based on the composition of ‘columns’ in these strings, a distance matrix can be calculated of each string against each other. In the example we discussed in part 2, we had a very simple method for calculating matches, i.e., we used either a one or zero. We can get closer to the truth by using substitution models, as we will explain below. In many machine learning algorithms, it is required that one first calculate the distances of each observation against each other, and the choice of algorithm is up to the analyst. Phylogenetic inference is very similar in that a distance matrix needs to be constructed on which the tree can be calculated.

If the sequence targeted for phylogenetic inference is very stable with little or no evolution, the distances calculated will be zero or very close to it. This will not allow for differentiation. However, as we mentioned, HIV has a very fast rate of evolution due to its error-prone reverse transcriptase.

Cuevas et al. (2015) published work on the in vivo rate of HIV evolution. Their analysis revealed a mutation rate of $4.1 \cdot 10^{-3}$ ($sd=1.7 \cdot 10^{-3}$) per base per cell, the highest reported for any biological entity. However, the error-prone reverse transcriptase is not the only mechanism of mutation. One defence against HIV infection is an enzyme called apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like, or APOBEC. These enzymes act on RNA and convert or mutate cytidine to uridine (uridine in RNA is the thymidine counterpart in DNA). This results in a G to A mutation on the cDNA.

As Cuevas et al. also showed, these enzymes are not equally active in all people. On the other hand, the viral Vif protein inhibits this hypermutation by ‘tagging’ the APOBEC protein with ubiquitin for degradation by the cytoplasmic ubiquitin-dependent proteasome machinery.

But how does this virus-driven mutation, or APOBEC-driven hypermutation, affect the virus in a negative (or positive) way?

We first need to understand how RNA is translated into proteins. Below is a table showing the codon combinations for each of the 20 amino acids.

Figure 1: Amino acid encoding. Available at https://www.biologyjunction.com/protein-synthesis-worksheet/

As can be seen from the table above, some amino acids are encoded by more than one codon. For example, if we change the codon CGU to AGA, the resulting amino acid stays Arginine or R. This is referred to as a silent mutation, since the resulting protein will look the same. On the other hand, if we mutate AGU to CGU, the resulting mutation is from Serine to Arginine, or in single-letter notation, S to R. A change in the amino acid is referred to as a non-synonymous mutation.
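
The two cases can be checked with a small lookup table. This is only a sketch: the table holds just the three codons mentioned above, and `mutation_type()` is a hypothetical helper.

```r
# Partial codon table: only the codons mentioned in the text (standard genetic code).
codon_to_aa <- c(CGU = "R", AGA = "R", AGU = "S")

# Hypothetical helper: label a codon change as synonymous (silent) or non-synonymous.
mutation_type <- function(from, to, table = codon_to_aa) {
  if (table[[from]] == table[[to]]) "synonymous" else "non-synonymous"
}

mutation_type("CGU", "AGA")  # both codons encode R
mutation_type("AGU", "CGU")  # S becomes R
```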

Example

In reality, the APOBEC enzyme recognizes specific RNA sequence motifs, but just to give an idea of how this works, let’s look at an example.

Load some packages:

library(ape)
library(Biostrings)
library(tibble)
library(tidyr)
library(dplyr)
library(knitr)
library(plotly)
library(RColorBrewer)
library(diagram)

Create an RNA sequence (remember that RNA uses U where DNA uses T):

WT <- c("CGA", "GUU", "AUA", "GAG", "UGG", "AGU")

We have the sequence CGAGUUAUAGAGUGGAGU, which we created in the code block above as separate codons for clarity. We can now translate this sequence using the codon table or some function.

translate_dna_sequence <- function(x){
  x %>% 
    paste0(collapse = "") %>% 
    gsub("U", "T", .) %>% 
    DNAString() %>% 
    as.DNAbin() %>% 
    trans() %>% 
    .[[1]] %>% 
    as.character.AAbin()
}

AA <- WT %>% translate_dna_sequence()

The code block above translated our RNA sequence into a protein sequence: R, V, I, E, W, S.

Now let’s mutate all occurrences of C to U/T:

MUT <- gsub("C", "U", WT)

The resulting mutant sequence is: UGA, GUU, AUA, GAG, UGG, AGU, and if we now translate that, we get …

AA <- MUT %>% translate_dna_sequence()

… the protein sequence: *, V, I, E, W, S.

The * means a stop codon was introduced. Stop codons are responsible for terminating translation from RNA to protein. If one of the viral genes has a premature stop codon in it, the protein will be truncated and will most likely be dysfunctional. Mutations other than stop codons can also have a negative effect on the virus, or they can cause resistance to an ARV.
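
To see what truncation means here, the sketch below translates the same two sequences in base R and cuts the peptide at the first stop codon. The codon table is partial, covering only the codons in this example, and `translate_until_stop()` is a hypothetical helper rather than anything from ape or Biostrings.

```r
# Partial codon table covering only the codons in this example (standard genetic code).
codons <- c(CGA = "R", GUU = "V", AUA = "I", GAG = "E",
            UGG = "W", AGU = "S", UGA = "*")

WT  <- c("CGA", "GUU", "AUA", "GAG", "UGG", "AGU")
MUT <- gsub("C", "U", WT)

# Hypothetical helper: translate codons and stop at the first stop codon, if any.
translate_until_stop <- function(x, table = codons) {
  aa <- unname(table[x])
  stop_at <- match("*", aa)
  if (is.na(stop_at)) aa else aa[seq_len(stop_at - 1)]
}

translate_until_stop(WT)   # full-length peptide
translate_until_stop(MUT)  # empty: the stop codon is the very first codon
```

In this contrived case the stop codon lands in the first position, so nothing at all would be produced from the mutant.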

Calculating genetic distances from a multiple sequence alignment (MSA)

In part 2, we showed the general principle of an MSA. In biology, sequence alignments are used to look at similarities of DNA or protein sequences. For most phylogenetic analysis, a multiple sequence alignment is a requirement, and the more accurate the MSA, the more accurate the phylogenetic inference.

First, we read in the multiple sequence alignment file.

# Read in the alignment file
aln <- read.dna('example.aln', format = 'fasta')

Next, we can calculate the distance matrix using the Kimura two-parameter (K80) model. There are various DNA substitution models to choose from. We will use one based on Markov chains. Remember:

“All models are wrong, but some are useful” - George Box

This is very true when it comes to estimating genetic distances and phylogenetic inference. Consider the image below:

Figure 2: transversions vs transitions. Available at https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/All_transitions_and_transversions.svg/1024px-All_transitions_and_transversions.svg.png

The figure above shows transition and transversion events. Transitions between A and G (the purines) and between C and T (the pyrimidines) are more likely than transversions (indicated by the red arrows). The K80 model takes this into account as one of its parameters, and these rates, or probabilities, are calculated or estimated by maximum likelihood.
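
As a quick illustration of the distinction, a substitution can be classified by checking whether both bases fall in the same chemical class. `substitution_type()` is a hypothetical helper written for this post, not a function from ape.

```r
purines     <- c("A", "G")
pyrimidines <- c("C", "T")

# Hypothetical helper: classify a single-base substitution.
substitution_type <- function(from, to) {
  if (from == to) return("none")
  same_class <- all(c(from, to) %in% purines) || all(c(from, to) %in% pyrimidines)
  if (same_class) "transition" else "transversion"
}

substitution_type("A", "G")  # purine to purine
substitution_type("C", "T")  # pyrimidine to pyrimidine
substitution_type("A", "C")  # purine to pyrimidine
```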

Let’s see what that looks like:

tmDNA <- matrix(c(0.8,0.05,0.1,0.05,
                  0.05,0.8,0.05,0.1,
                  0.1,0.05,0.8,0.05,
                  0.05,0.1,0.05,0.8),
                nrow = 4, byrow = TRUE)
stateNames <- c("A","C","G", "T")
row.names(tmDNA) <- stateNames; colnames(tmDNA) <- stateNames

tmDNA %>% 
  kable(
    caption = "Example K80 probabilities of transitions or transversions"
  )

Table 1: Example K80 probabilities of transitions or transversions
	A	C	G	T
A	0.80	0.05	0.10	0.05
C	0.05	0.80	0.05	0.10
G	0.10	0.05	0.80	0.05
T	0.05	0.10	0.05	0.80

plotmat(tmDNA,pos = c(2,2), 
        lwd = 1, box.lwd = 2, 
        cex.txt = 0.8, 
        box.size = 0.1, 
        box.type = "circle", 
        box.prop = 0.5,
        box.col = "light blue",
        arr.length=.1,
        arr.width=.1,
        self.cex = .6,
        self.shifty = -.01,
        self.shiftx = .14,
        main = "Markov Chain")

This example is contrived, but should explain the concept of a substitution model. The viral reverse transcriptase is not a random sequence generator, but it does make mistakes. Most of the time when it is copying the RNA into DNA, the base (state) stays the same. Then also, the probability of a transversion vs. a transition is different. If you look at the figure above where we introduced transversion and transition, you will notice that A is more similar to G, and T is more similar to C in its chemical structure.
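
Two properties of the Markov chain view are easy to check in base R. The sketch below rebuilds the same contrived matrix so that it runs on its own, confirms that each row is a probability distribution, and multiplies the matrix by itself to get the probabilities after two copying steps.

```r
# Same contrived matrix as above, rebuilt so this snippet runs on its own.
states <- c("A", "C", "G", "T")
tmDNA <- matrix(c(0.80, 0.05, 0.10, 0.05,
                  0.05, 0.80, 0.05, 0.10,
                  0.10, 0.05, 0.80, 0.05,
                  0.05, 0.10, 0.05, 0.80),
                nrow = 4, byrow = TRUE, dimnames = list(states, states))

# Each row is a probability distribution over the next state, so it must sum to 1.
rowSums(tmDNA)

# Two copying steps: multiply the matrix by itself.
two_steps <- tmDNA %*% tmDNA
round(two_steps, 4)
```

After two steps the chance that an A is still an A drops from 0.80 to 0.655, while the A to G transition (0.165) stays more likely than either transversion (0.09 each).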

There are many other substitution models. It is not always trivial to select the best model for phylogenetic inference. One technique is to run multiple maximum likelihood phylogenetic calculations using different models, and then pick the model with the lowest AIC (Akaike Information Criterion). For our pipeline, we selected the rather simple K80 model. Since we are looking at different sets of sequences at each submission, a simple model is probably better in order to avoid the problems caused by overfitting.

We can use the ape package and calculate distances using the K80 model.

# Calculate the genetic distances between sequences using the K80 model; as.matrix = TRUE makes the rest easier
alnDist <- dist.dna(aln, model = "K80", as.matrix = TRUE)
alnDist[1:5, 1:5] %>% 
  kable(caption = "First few rows of our distance matrix")

Table 2: First few rows of our distance matrix
	01_AE.JP.AB253686_INT	B.US.HM450245_INT	B.AU.AF407664_INT	B.CN.KJ820110_INT	B.RU.HM466986_INT
01_AE.JP.AB253686_INT	0.0000000	0.0935626	0.0961965	0.0962887	0.0962887
B.US.HM450245_INT	0.0935626	0.0000000	0.0378446	0.0378167	0.0378748
B.AU.AF407664_INT	0.0961965	0.0378446	0.0000000	0.0454602	0.0494138
B.CN.KJ820110_INT	0.0962887	0.0378167	0.0454602	0.0000000	0.0479955
B.RU.HM466986_INT	0.0962887	0.0378748	0.0494138	0.0479955	0.0000000

The matrix has a shape of 47 by 47, so we just preview the first 5 rows and columns.
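
For readers curious about what goes into one of these numbers, here is a rough base-R sketch of the K80 distance for a single pair of aligned sequences. It uses Kimura's formula, d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q)), where P is the proportion of sites that differ by a transition and Q the proportion that differ by a transversion. `k80_distance()` is a hypothetical helper with simplistic handling of gaps and ambiguous bases; for real work, stick with dist.dna().

```r
# Hypothetical helper: K80 (Kimura 1980) distance between two aligned sequences
# of equal length, ignoring any position that is not A, C, G or T in both.
k80_distance <- function(seq1, seq2) {
  a <- strsplit(toupper(seq1), "")[[1]]
  b <- strsplit(toupper(seq2), "")[[1]]
  stopifnot(length(a) == length(b))
  keep <- a %in% c("A", "C", "G", "T") & b %in% c("A", "C", "G", "T")
  a <- a[keep]
  b <- b[keep]

  is_purine <- function(x) x %in% c("A", "G")
  differs <- a != b
  P <- mean(differs & (is_purine(a) == is_purine(b)))  # proportion of transitions
  Q <- mean(differs & (is_purine(a) != is_purine(b)))  # proportion of transversions

  -0.5 * log((1 - 2 * P - Q) * sqrt(1 - 2 * Q))
}

k80_distance("ACGTACGTAC", "ACGTACGTAC")  # identical sequences
k80_distance("ACGTACGTAC", "GCGTACGTAA")  # one transition, one transversion
```

Identical sequences give a distance of zero, and the correction grows faster than the raw proportion of differing sites because it allows for multiple substitutions hitting the same position.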

Reduction of the heatmap to focus on the important data

The pipeline mentioned uses the Basic Local Alignment Search Tool (BLAST) to retrieve previously sampled sequences, and adds these retrieved sequences to the analysis. BLAST is like a search engine you use on the web, but for protein or DNA sequences. By doing this, important sequences from retrospective samples are included, which enables PhyloPi to be aware of past sequences and not just batch-per-batch aware. Have a look at the paper for some examples.

The data we have is ready to use for heatmap plotting purposes, but since the data also contains previously sampled sequences, comparing those sequences amongst themselves would be a distraction. We are interested in those samples, but only compared to the current batch of samples analysed. The figures below should explain this a bit better.

Figure 3: A diagram of a heatmap with lots of redundant and distracting data.

From the image above you can see that, typical of a heatmap, it is symmetrical on the diagonal. We show submitted vs retrieved samples in both the horizontal and vertical direction. Notice also, annotated as “Distraction”, the previous samples are compared amongst themselves. We are not interested in those samples now, as we would already have acted on any issues then. What we want instead is a heatmap, as depicted in the image below.

Figure 4: A diagram of a more focussed heatmap with the redundant and distracting data removed.

Fortunately, we have a very powerful tool, R, at our disposal, and plenty of really useful and convenient packages like dplyr to fix this.

alnDistLong <- 
  alnDist %>% 
  as.data.frame(stringsAsFactors = FALSE) %>% 
  rownames_to_column(var = "sample_1") %>% 
  gather(key = "sample_2", value = "distance", -sample_1, na.rm = TRUE) %>% 
  arrange(distance)

alnDistLong %>% head()

##                sample_1              sample_2 distance
## 1 01_AE.JP.AB253686_INT 01_AE.JP.AB253686_INT        0
## 2     B.US.HM450245_INT     B.US.HM450245_INT        0
## 3     B.AU.AF407664_INT     B.AU.AF407664_INT        0
## 4     B.CN.KJ820110_INT     B.CN.KJ820110_INT        0
## 5     B.RU.HM466986_INT     B.RU.HM466986_INT        0
## 6     B.US.DQ127546_INT     B.US.DQ127546_INT        0

Final cleanup and removal of distracting data

# get the names of samples originally in the fasta file used for submission
qSample <- names(read.dna("example.fasta", format = "fasta"))

# compute new order of samples, so the new alignment is in the order of the heatmap example
sample_1 <- unique(alnDistLong$sample_1)
new_order <- c(sort(qSample), setdiff(sample_1, qSample))
new_order

##  [1] "01_AE.JP.AB253686_INT"  "01_AE.TH.JX448243_INT" 
##  [3] "01_AE.VN.LC100946_INT"  "38_BF1.UY.FJ213783_INT"
##  [5] "B.AU.AF407664_INT"      "B.CN.KJ820110_INT"     
##  [7] "B.KR.JN417106_INT"      "B.RU.HM466986_INT"     
##  [9] "B.US.DQ127546_INT"      "B.US.GU076504_INT"     
## [11] "B.US.HM450245_INT"      "BC.CN.JQ898256_INT"    
## [13] "C.ZA.KT183056_INT"      "C.ZM.KM049918_INT"     
## [15] "C.ZM.KM050042_INT"      "01_AE.TH.JX448252_INT" 
## [17] "01_AE.TH.JX448250_INT"  "01_AE.TH.JX448249_INT" 
## [19] "C.ZA.KT183058_INT"      "C.ZM.KM049913_INT"     
## [21] "B.KR.JN417120_INT"      "B.KR.JN417117_INT"     
## [23] "B.KR.JN417116_INT"      "57_BC.CN.JX679207_INT" 
## [25] "C.ZM.KM050043_INT"      "C.ZM.KM050041_INT"     
## [27] "01_AE.JP.AB253682_INT"  "01_AE.JP.AB253689_INT" 
## [29] "B.US.KJ704790_INT"      "B.ES.KC238594_INT"     
## [31] "B.AU.AF407665_INT"      "B.AU.AF407667_INT"     
## [33] "B.CN.KC987976_INT"      "B.CN.KT192001_INT"     
## [35] "B.US.AF040369_INT"      "B.US.M38429_INT"       
## [37] "B.US.DQ127547_INT"      "B.US.DQ127543_INT"     
## [39] "C.ZA.KT183062_INT"      "B.US.GU076505_INT"     
## [41] "B.US.GU076507_INT"      "C.ZM.KM049917_INT"     
## [43] "01_AE.CN.JQ302565_INT"  "01_AE.VN.FJ185234_INT" 
## [45] "F1.BR.FJ771006_INT"     "BF.AR.AF408631_INT"    
## [47] "BC.CN.KC898983_INT"

Plot the heatmap using plotly for interactivity

alnDistLong %>% 
  filter(
    sample_1 %in% qSample,
    sample_1 != sample_2
    ) %>% 
  mutate(
    sample_2 = factor(sample_2, levels = new_order)
  ) %>% 
  plot_ly(
    x = ~sample_2,
    y = ~sample_1,
    z = ~distance,
    type = "heatmap", colors = brewer.pal(11, "RdYlBu"), 
    zmin = 0.0, zmax = 0.03,  xgap = 2, ygap = 1
) %>% 
  layout(
    margin = list(l = 100, r = 10, b = 100, t = 10, pad = 4), 
    yaxis = list(tickfont = list(size = 10), showspikes = TRUE),
    xaxis = list(tickfont = list(size = 10), showspikes = TRUE)
  )

Phylogenetic tree

Above we used the package ape to calculate the genetic distances for the heatmap.

Another way of looking at our alignment data is to use phylogenetic inference. The PhyloPi pipeline saves each step of phylogenetic inference to allow the user to intercept at any step. We can use the newick tree file (a text file formatted as newick) and draw our own tree:

tree <- read.tree("example-tree.txt")
plot.phylo(
  tree, cex = 0.8, 
  use.edge.length = TRUE, 
  tip.color = 'blue', 
  align.tip.label = FALSE, 
  show.node.label = TRUE
)
nodelabels("This one", 9, frame = "r", bg = "red", adj = c(-8.2,-46))

We have highlighted a node with a red block, with the text “This one”, which we can now discuss. We have three leaves in this node - KM050043, KM050042, KM050041 - and if you look up these accession numbers at NCBI, you will notice the publication they are tied to:

“HIV transmission. Selection bias at the heterosexual HIV-1 transmission bottleneck”

In this paper, the authors looked at selection bias when the infection is transmitted. They found that in a pool of viral quasi-species, transmission is biased to benefit the fittest viral quasi-species. The node highlighted above shows the kind of clustering one would expect with a study like the one mentioned above. You will also notice plenty of other nodes, which you can explore by searching for their accession numbers at NCBI.

The tree above is much like a dendrogram used when displaying agglomerative or hierarchical clustering. The numbers on the tree indicate the probability that the corresponding clusters are correct. The branch lengths indicate the distances between samples. In conjunction with a properly coloured heatmap, this is very useful for finding relevant clusters to investigate. If the reason for close clustering cannot be explained, the tests are repeated.
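
The dendrogram analogy can be sketched with base R's hclust() on a made-up distance matrix. This is only an illustration of hierarchical clustering, not the phylogenetic inference that PhyloPi performs; the sample names, the distances, and the choice of average linkage are assumptions made for the example.

```r
# Toy symmetric distance matrix; names and values are hypothetical
toy_dist <- matrix(
  c(0.00, 0.02, 0.10,
    0.02, 0.00, 0.09,
    0.10, 0.09, 0.00),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("s1", "s2", "s3"), c("s1", "s2", "s3"))
)

# Agglomerative clustering with average linkage
hc <- hclust(as.dist(toy_dist), method = "average")

hc$merge   # s1 and s2 (the closest pair) are joined first, then s3
hc$height  # heights at which the merges happen
```

Calling plot(hc) draws the dendrogram. The two closest samples sit on the shortest branches, which is the same visual cue we look for in the phylogenetic tree.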

The importance of phylogenetics

Phylogenetics, and thus genetic distance calculations, are used in many branches of biology. For us it is one of the quality-control measures at our disposal, but it has also been used to reconstruct the origin of HIV. You may find the research papers listed below interesting, where the authors used phylogenetics to infer the zoonotic origins of HIV.

As another example, in 1998, six foreign medical workers were accused of deliberately infecting hospitalized children with HIV and were sentenced to death in Libya. In 2006, de Oliveira, et al. used phylogenetics to provide evidence that the origin of the HIV strains that infected the children had an evolutionary history in the mid-90s, which was before the health care workers arrived in 1998. The six medics were released in 2007. There is also a very good writeup on the case by Declan Butler. Although probably very emotional, this would be a great movie.

These techniques are also used in criminal convictions. However, the interpretation of this kind of evidence in court cases can be unsafe. The insights of Pillay, et al. should bring this to light.

Summary

In this post we discussed that as infections spread from person to person, the virus continues to mutate and becomes more and more divergent. This allows us to use the genetic information obtained during drug resistance testing to analyse the sequences for abnormalities.

We then showed how to compute genetic distance using multiple sequence alignment (MSA) and that it’s possible to model this process as a Markov chain. You can then view the resulting model as a heatmap or a phylogenetic tree.

This finds practical application in diverse situations, for example shedding light on the origin of HIV, as well as providing evidence in legal trials.

What’s next

In the fourth and final part of this series, we will show how we analysed the inter- and intra-patient genetic distances of HIV sequences by logistic regression. This was useful in properly colouring our heatmap explained in this series. See you there!

Analysing the HIV pandemic, Part 2: Drug resistance testing

Tue, 07 May 2019 00:00:00 +0000

Dominique Goedhals is a pathologist, researcher, and lecturer at the Division of Virology, University of the Free State, and National Health Laboratory Service (NHLS), Bloemfontein, South Africa

Andrie de Vries is the author of “R for Dummies”, and a Solutions Engineer at RStudio

Introduction

In part 1 of this four-part series about HIV/AIDS, we discussed the HIV pandemic in Sub-Saharan Africa. In this second installment, we cover a recent publication in the PLoS ONE journal: “PhyloPi: An affordable, purpose built phylogenetic pipeline for the HIV drug resistance testing facility”.

The authors described how they used affordable hardware to create a phylogenetic pipeline, tailored for the HIV drug-resistance testing facility.

HIV drug resistance

Natural selection is the process by which some form of selective pressure favours a phenotypic trait or change. These phenotypic traits can be the blood group of a person, whether a pea is wrinkly or not, or whether an infectious organism is susceptible or resistant to a drug. Many times these phenotypic traits, or physical attributes, are caused by genetics.

Genotyping is the process by which one can infer this phenotypic trait from a genotype, and this is used more and more frequently in medicine. For example, in breast cancer treatment, the BRCA (BReast CAncer) genes are genotyped to determine whether these cancer suppressing genes are intact. If there is a deleterious or damaging mutation in one of these genes, it can increase the risk of developing breast cancer, thus a phenotype of increased risk of breast cancer.

For most organisms, the copying of genetic material happens by very precise enzymes or pathways, but occasionally mutations do occur. If a mutation occurs and is sufficiently damaging, it gets removed from the gene pool. However, if the mutation is sufficiently beneficial, it increases the survival of this genetic variant, which may then be selected for.

In the previous post, we discussed ARVs (antiretrovirals) and how these drugs changed the landscape of HIV infection by preventing the development of AIDS. We mentioned that ARVs suppress viral replication. One of the steps in HIV replication is the conversion of its single-stranded RNA to DNA, which can then be incorporated into the DNA of infected cells. The enzyme responsible for this conversion is reverse transcriptase, and it has a high error rate when doing this conversion. One can thus say that HIV has a high evolutionary rate, or mutation rate. These genes are translated into viral proteins, which are required to make more virions (viral particles). Proteins are strings or polymers of amino acid residues with an alphabet of 20 choices of amino acids or letters. The sequence of the DNA or RNA influences the sequence of the protein; thus, mutations in the DNA or RNA can result in changes in the protein, and our targets for stopping HIV replication are proteins/enzymes.

There are various classes of ARVs which interfere with viral replication by inhibition of viral enzymes. If the DNA or RNA sequence encoding this enzyme is changed, the result might be an unfit virus not capable of further infection or replication. On the other hand, if this mutation results in an ARV-resistant virus, replication and infection can still continue in the presence of the ARV in question, possibly causing the ARV to become ineffective in stopping replication.

The question remains, why do people develop resistance? The short answer: it’s a numbers game.

If the patient received the correct regimen of ARVs (known as HAART, or highly active antiretroviral treatment) and is taking the doses correctly, the viral load will suppress. Suppression is caused by stopping viral replication, and if the virus is not replicating, the error-prone reverse transcriptase can’t cause mutations, which in turn cannot be favoured by selective pressure. If the patient is not taking any treatment, the virus is replicating and thus inevitably mutating, but there is no selective pressure to select for these variants. Lastly, if the patient is adhering poorly to the treatment, there are times where the levels of the treatment are too low to effectively suppress viral replication completely. In this scenario, mutants with a mutation which makes them less susceptible to the treatment will replicate more than the wild type counterparts - these are called escape mutants.

The reason why this is a numbers game is that the virus is mutating randomly and one resulting amino acid residue could be replaced by any of 19 other amino acid residues. It is only when this change causes an increase in replicative fitness while there is some form of selective pressure that this mutant can become a dominant quasi-species and the patient develops resistance.

Mutations are expressed using the notation [WT AA][POS][Mutant AA], where:

WT denotes wild type (the typical genotype)
AA denotes amino acid residue
POS denotes the position on the protein
Mutant means the changed genotype
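
As a small illustration of this notation, the sketch below splits mutation strings into their three parts with a regular expression. The function name parse_mutation() is made up for this example, and it assumes simple single-residue substitutions written with one-letter amino acid codes (such as M184V, discussed later in this post); it does not handle insertions, deletions, or mixtures.

```r
# Split "[WT AA][POS][Mutant AA]" strings into their components.
# parse_mutation() is a hypothetical helper, not part of any package.
parse_mutation <- function(x) {
  pattern <- "^([A-Z])([0-9]+)([A-Z])$"
  stopifnot(all(grepl(pattern, x)))
  data.frame(
    mutation  = x,
    wild_type = sub(pattern, "\\1", x),             # wild type residue
    position  = as.integer(sub(pattern, "\\2", x)), # position on the protein
    mutant    = sub(pattern, "\\3", x),             # mutant residue
    stringsAsFactors = FALSE
  )
}

parsed <- parse_mutation(c("M184V", "K65R"))
parsed
```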

We mentioned some classes of ARVs in part 1. To the viral reverse transcriptase, NRTIs (Nucleoside/Nucleotide Reverse Transcriptase Inhibitors) look like the building blocks of DNA called nucleotides. If the reverse transcriptase incorporates one of these ‘fake’ nucleotides, it is not able to further extend the DNA strand, leaving it incomplete, thus interfering with replication. Not all mutations cause the same level of resistance. These levels are:

Level	Total score
Susceptible	0 to 9
Potential low-level resistance	10 to 14
Low-level resistance	15 to 29
Intermediate resistance	30 to 59
High-level resistance	>= 60

Source
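
The table can be turned into a small lookup function. This is a sketch only: resistance_level() is a hypothetical helper, it assumes total scores are whole numbers, and it treats negative totals as susceptible, which the table itself does not specify.

```r
# Map a total resistance score to the level from the table above.
# Assumes integer scores; negative totals are treated as "Susceptible".
resistance_level <- function(total_score) {
  cut(
    total_score,
    breaks = c(-Inf, 9, 14, 29, 59, Inf),  # right-closed: (-Inf, 9], (9, 14], ...
    labels = c(
      "Susceptible",
      "Potential low-level resistance",
      "Low-level resistance",
      "Intermediate resistance",
      "High-level resistance"
    ),
    right = TRUE
  )
}

levels_found <- resistance_level(c(0, 12, 15, 45, 60))
as.character(levels_found)
```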

We can plot resistance scores for five commonly used NRTIs.

suppressPackageStartupMessages({
  library(dplyr)
  library(readr)
  library(stringr)
  library(tidyr)
  library(ggplot2)
  library(knitr)
  library(broom)
})

nrti_dr_scores <- read_tsv("ScoresNRTI_1555579653110.tsv", col_types = "cdcddddddd")

nrti_dr_scores %>% 
  select(Rule, ABC:AZT, FTC:TDF) %>% 
  gather(arv, score, 2:6) %>% 
  filter(!grepl(" ", Rule)) %>% 
  mutate(effect = ifelse(score > 0, "resistance", "hyper-susceptible")) %>% 
  
  ggplot(aes(x = Rule, y = score, fill = effect)) +
  geom_col() +
  coord_flip() +
  theme_bw() +
  facet_grid(. ~ arv)

We can see that 3TC and FTC have the exact same profiles, and they are chemically also very similar, as shown in the figure below.

Figure: The chemical structures of 3TC (left) and FTC (right). Available at http://aras.ab.ca/articles/HAART-Nukes-AIDS-Umber

Also, note that some of the mutations increase susceptibility for AZT and TDF, indicated by a negative value for resistance. This is called hyper-susceptibility, and is used by clinicians treating patients.

For example, the mutation M184V means that the wild type AA at position 184 is a methionine (M) and it has been mutated to valine (V). Although this mutation makes the virus highly resistant to 3TC, it has a crippling effect on viral replication, i.e., the virus can still replicate in the presence of 3TC, but more slowly. This mutation also makes the virus hypersusceptible to AZT and TDF. The way clinicians use this knowledge is to keep patients on 3TC in order to maintain the selective pressure for M184V, and use AZT or TDF as the other NRTI. It is typical to have a patient on two NRTIs, which are sometimes referred to as the “backbone”, and then one drug from another drug class to which the patient is fully susceptible. Knowing the genotype of the virus allows us to infer the phenotype, which in this case is the drug-resistance profile.

PhyloPi: An affordable, purpose built phylogenetic pipeline for the HIV drug resistance testing facility

The goal of HIV drug resistance genotyping is to determine which drugs will produce the best response in the patient, and, as mentioned earlier, we use the viral sequence information for this. Because HIV evolves so rapidly, we can also use this sequence information for quality assurance. PCR (polymerase chain reaction) is very sensitive to contamination, and if gross cross-contamination occurred during this process, the sequences of, say, two unrelated individuals might be very similar. Also, the viral sequences of a patient over time will be more similar than the sequences between different people.

Let’s say we genotyped a patient five years ago and we have a current genotype sequence. It should be possible to retrieve the previous sequence from a database of sequences without relying on identifiers only, or at all. Sometimes when someone remarries they may change their surname or transcription errors can be made, which makes finding previous samples tedious and error-prone. So instead of using patient information to look for previous samples to include, we can instead use the sequence data itself, and then confirm the sequences belong to the same patient, or investigate any irregularities. If we suspect mother-to-child transmission from our analysis, we confirm this with the health care worker who sent the sample.

We recently published an automated pipeline for maintaining a sequence database, automatically retrieving the most similar sequences from previously genotyped viral isolates, calculating genetic distances, and performing phylogenetic inference. Let’s look at each of these steps.

Firstly, we cannot conduct phylogenetic analysis on all past and present sequences; this would be very computationally expensive and time-consuming, and the result will be very difficult to interpret. Rather, we want to focus on the current batch of sequences the laboratory generated, but also the most similar sequences from previous batches stored in our rolling database:

We used a tool called BLAST (Basic Local Alignment Search Tool) for this. This tool is used to add our new submissions to the current rolling database and then also retrieve the most similar previous sequences.
These sequences are aligned using MAFFT.
The resulting multiple sequence alignment is automatically curated with trimAl.

Finally, the sequences are ready for phylogenetic inference.

For this, we used FastTree. As its name implies, it is fast and capable of handling large datasets requiring minimal resources.
The resulting tree is rendered using the ETE3 python API.
R is used to calculate a distance matrix from the multiple sequence alignment using the ape library and plotly for visualization.

In part 3 of this series, we will talk more about the distance matrix calculation and how logistic regression was used to look at inter- and intra-patient genetic distances of HIV sequences by mining the large public Los Alamos HIV sequence database. This was important, as the insights gained here were used to colour the distance matrix so that the user’s attention is drawn to relevant samples.

This is an R for medicine blog post, but there is a lot of jargon in the paragraph above. We can clear things up a bit, but please check out our publication.

How does it work?

Firstly, our DNA sequences are strings consisting of an alphabet: A, C, G, and T. Also, genetic distances are much like Levenshtein or Hamming distances, or other edit distance algorithms.

Raw strings

Consider the following strings, A, B and C:

A: peter kicked the ball really far
B: i think it was yesterday when peter kicked the ball really far
C: pieter kicked the round ball really hard

We can see that there are obvious similarities between these three sentences, but it would be much easier if they were aligned.

Aligned strings

A: ______________________________p eter kicked the _____ ball really far
B: i think it was yesterday when p eter kicked the _____ ball really far
C: ______________________________pieter kicked the round ball really hard

By aligning the strings it is much easier to calculate the similarities or differences.

Curated strings

Next, we remove the overhangs since it is possible that in reality strings A and C also had more text on the left-hand side, but it was not sampled. Depending on your situation, we could also remove the internal ‘gaps’ like the word ‘round’. For our pipeline, insertions and deletions, like the letter ‘i’ in our example and the word ‘round’ are real features we would like to include. We also have a substitution in C, where the ‘f’ in A and B was changed to an ‘h’.

A: p eter kicked the _____ ball really far
B: p eter kicked the _____ ball really far
C: pieter kicked the round ball really har

Calculation

A: p eter kicked the _____ ball really far
B: p eter kicked the _____ ball really far
M: 111111 111111 111 11111 1111 111111 111

We can see that for A and B we have matches for all of the features. If we sum up all the ones, we get 33, so the distance between them is:

\[ d = \frac{33 - 33}{33} = 0\]

B: p eter kicked the _____ ball really far
C: pieter kicked the round ball really har
M: 101111 111111 111 00000 1111 111111 011

\[ d = \frac{33 - 26}{33} = 0.212\]
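
The hand calculation above can be reproduced with a few lines of base R. This is a minimal sketch of the simplified distance used in this example, not the substitution models that ape applies to real DNA alignments. The function name p_distance() is made up, and to keep gaps and word separators apart the gap in “p eter” is written as “-” here.

```r
# Proportion of mismatching positions between two aligned strings.
# Word-separating spaces are dropped; "-" marks a gap and counts as a feature.
p_distance <- function(a, b) {
  a_chars <- strsplit(gsub(" ", "", a, fixed = TRUE), "")[[1]]
  b_chars <- strsplit(gsub(" ", "", b, fixed = TRUE), "")[[1]]
  stopifnot(length(a_chars) == length(b_chars))
  mean(a_chars != b_chars)
}

A <- "p-eter kicked the ----- ball really far"
B <- "p-eter kicked the ----- ball really far"
C <- "pieter kicked the round ball really har"

p_distance(A, B)            # identical strings: 0
round(p_distance(B, C), 3)  # 7 mismatches out of 33 features: 0.212
```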

After the multiple sequence alignment and curation, each sequence is compared with every other sequence in order to calculate a distance matrix. This can then be used to create a phylogenetic tree, which is similar to the kind of dendrogram produced by hierarchical clustering. The above is very simplified, but should give enough background to understand the rest of the post. The resource at EMBL-EBI Train Online is a good place to get started if you want to know more.

The pipeline on a Raspberry Pi

The Raspberry Pi is a small and cheap single-board computer. It is used by many hobbyists for all kinds of projects. One of the motivations behind developing this computer was to teach kids to code or engage in electronics.

These uses are all very important, but the Raspberry Pi has made its way into science and medicine as well. For example, a group developed a cheap instrument to diagnose Ebola virus infection in the field. Researchers can attach various sensors to the Raspberry Pi and use it for data collection.

Benchmarking

For our application, we needed to show that the Pi can handle the problem we wanted it to solve, so we did some benchmarking.

We used Selenium WebDriver to operate the pipeline as a human would, by actually browsing for an input file and submitting it through the button. Time stamps were taken for each step, and the number of blast hits that were included in the phylogenetic inference was also recorded. For this exercise, we set the number of closest sequences to retrieve for each sample to 5, which means the submitted sample and 4 of the genetically closest samples. However, it is possible that different submitted sequences have retrieved a sequence in common; these will be included in the analysis only once. When we start analyzing this data, we will see this.
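
The effect of shared hits can be sketched as follows. The sequence identifiers are invented for illustration, and the list structure is just an assumption about how one might hold the hits per submitted sequence; it is not taken from the pipeline's code.

```r
# Hypothetical BLAST results: each submitted sequence plus its 4 closest hits
hits <- list(
  query_1 = c("query_1", "db_a", "db_b", "db_c", "db_d"),
  query_2 = c("query_2", "db_b", "db_c", "db_e", "db_f")
)

# 2 submissions x 5 sequences each were requested ...
n_requested <- sum(lengths(hits))

# ... but sequences retrieved by both queries are included only once
included <- unique(unlist(hits, use.names = FALSE))
n_included <- length(included)

c(requested = n_requested, included = n_included)
```

Because db_b and db_c are retrieved by both queries, only 8 of the 10 requested sequences end up in the analysis. This is why the number of hits per submitted sequence can fall below 5 in the benchmarking data.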

# Read csv with time data
time_dat <- read_csv(
  "timeFile.csv", 
  col_types = "ccd",
  col_names = c("Run", "Description", "Measure")
)

head(time_dat) %>% 
  kable(caption = "First few lines of the benchmarking data.")

Table 1: First few lines of the benchmarking data.
Run	Description	Measure
final5best_random_1	blastHits	5.000000
final5best_random_1	blast	11.219230
final5best_random_1	mafftTime	13.404623
final5best_random_1	trimalTime	0.111737
final5best_random_1	fasttreeTime	0.986582
final5best_random_1	heatmapTime	2.354820

The Run column shows some info regarding the benchmarking experiment. We know we asked for the five best hits to be included; the sequences were pseudo-randomly selected. We started with one sequence for submission and then incremented this by one up to 50. The above again shows how data is not always in the best format for working with. We need to extract the digits at the end of the Run variable. Previously we used the tidyr::gather() function to pivot data from wide to long. This time we will use the spread() function to make long data wide.

time_dat <- time_dat %>% 
  mutate(nSubmitted = str_extract(Run, "\\d+$") %>% as.numeric) %>% 
  select(-Run ) %>% 
  spread(Description, Measure)

head(time_dat) %>% 
  kable(caption = "First few lines of the benchmarking data after some cleaning.")

Table 2: First few lines of the benchmarking data after some cleaning.
nSubmitted	blast	blastHits	fasttreeTime	heatmapTime	mafftTime	renderTime	trimalTime
1	11.21923	5	0.986582	2.354820	13.40462	1.686239	0.1117370
2	22.08694	10	3.129514	2.369152	30.26920	1.890183	0.2699649
3	33.67705	15	5.480334	2.400223	47.42213	2.107776	0.4849610
4	43.58782	21	4.627502	2.437273	76.47209	2.243336	0.7980120
5	55.43246	25	10.753521	2.476636	105.21836	2.494058	1.0820050
6	65.18629	30	9.688977	2.516058	128.93219	2.653201	1.4656579

We got rid of the useless data in the Run variable and extracted the useful information into the nSubmitted variable.

Below are the explanations for the variables.

nSubmitted: Number of sequences submitted or uploaded to the pipeline
blast: time in seconds for blast to find most similar previously sequenced samples
blastHits: the number of sequences retrieved
mafftTime: the time it took to create a multiple-sequence alignment
trimalTime: the time it took to clean the multiple-sequence alignment
fasttreeTime: the time it took for phylogenetic inference
heatmapTime: the time it took to produce the heatmap
renderTime: the time it took to render the tree

Number of sequences submitted vs. most similar sequences retrieved

time_dat %>%
  ggplot(aes(x = nSubmitted, y = blastHits)) +
  geom_smooth(method = lm, se = FALSE, colour = "black", formula = y ~ x - 1, size = 0.25) +
  geom_point() +
  theme_bw() +
  xlab("Number of sequences submitted") +
  ylab("Number of sequences retrieved using blastn") +
  annotate("text", x = 41, y = 72, label = "y == 4.628 * x", parse = TRUE) +
  annotate("text", x = 40, y = 60, label = "R^2 == 0.998", parse = TRUE)

fit <- lm(blastHits ~ nSubmitted - 1, data = time_dat)
tidy(fit) %>% 
  kable(caption = "Regression analysis of the number of blast hits retrieved.")

Table 3: Regression analysis of the number of blast hits retrieved.
term	estimate	std.error	statistic	p.value
nSubmitted	4.628026	0.0280312	165.1026	0

A straight line fits the data really well. We mentioned that if different sequences retrieve the same sequence from the database, it is used only once. The slope of this line will depend on the genetic diversity of the database. A more diverse database will have a steeper slope, whereas a less diverse database will have a shallower slope. Also, theoretically, at some point, the line will reach an asymptote as the number of requested sequences starts to saturate the number of available sequences. Practically, one would not have to submit more than 16 - 24 samples at a time; thus, we are in the linear part of the rarefaction curve. We can thus see from this that for the Los Alamos data used in the analysis, about 4.5 sequences get retrieved for every sequence submitted.

BLAST time vs. number of sequences submitted

time_dat %>%
  ggplot(aes(x = nSubmitted, y = blast)) +
  geom_smooth(method = lm, se = FALSE, colour = "black", formula = y ~ x, size = 0.25) +
  geom_point(colour = "blue") +
  theme_bw() +
  xlab("Number of input sequences") + ylab("Time in seconds (blastn)") +
  annotate("text", x = 41, y = 90, label = "y == 11.0453 * x", parse = TRUE) +
  annotate("text", x = 40, y = 60, label = "R^2 == 0.9999", parse = TRUE)

fit <- lm(time_dat$blast ~ time_dat$nSubmitted)
tidy(fit) %>% 
  kable(caption = "Regression analysis of blastn time vs. number of sequences.")

Table 4: Regression analysis of blastn time vs. number of sequences.
term	estimate	std.error	statistic	p.value
(Intercept)	-0.8176139	0.5185500	-1.576731	0.121426
time_dat$nSubmitted	11.0453236	0.0176978	624.105409	0.000000

Again, we see a linear relationship between the number of sequences submitted to blastn and the time it takes to complete. For every sequence submitted, it takes about 11 seconds to search a database of about 11,000 sequence entries. We can say that blastn displays linear time complexity, or $O(n)$ time, in the number of submitted sequences. We did not discover anything new here. Remember, the purpose of this is to show off the Pi flexing its muscles. (You can read about the BLAST algorithm here.)

Multiple sequence alignment time vs. number of total sequences, submitted and retrieved

fit <- lm(mafftTime ~ I(blastHits^2) - 1, data = time_dat)

time_dat %>%
  ggplot(aes(x = blastHits, y = mafftTime)) +
  geom_point(colour = "blue") +
  geom_smooth(method = "lm",formula = y ~ I(x^2) - 1, colour = "black", size = 0.25) +
  annotate("text", x = 190, y = 1800, label = "y == 0.09997 * x^2", parse = TRUE) +
  theme_bw() +
  xlab("Number of sequences in alignment") + 
  ylab("Time in seconds (MAFFT)")

tidy(fit) %>% 
  kable(caption = "Regression analysis of multiple sequence alignment.")

Table 5: Regression analysis of multiple sequence alignment.
term	estimate	std.error	statistic	p.value
I(blastHits^2)	0.099974	0.0004048	246.9813	0

Since in multiple sequence alignment, each sequence is aligned with each other sequence, we would expect $O(N^2)$ time complexity. We can see in our regression result that we are very close to what we expect. The fitted coefficient is about a tenth of a second per squared sequence count. Thus, if we were to analyse 16 sequences, we would retrieve about $16 * 4.5 = 72$, and the multiple-sequence alignment would take $0.09997 * 72^2 = 518$ seconds or ~8.6 minutes, which is not bad. Also consider that you can submit your samples and walk away.
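
The same back-of-the-envelope estimate can be wrapped in a small function. This is a sketch using the rounded values quoted in the text (about 4.5 sequences retrieved per submission and the coefficient from Table 5); predict_mafft_seconds() is a made-up helper, and the estimate only applies to this hardware and database.

```r
seqs_per_submission <- 4.5   # approximate sequences retrieved per submission
mafft_coef <- 0.09997        # seconds per squared sequence count (Table 5)

# Estimated MAFFT run time for a given number of submitted sequences
predict_mafft_seconds <- function(n_submitted) {
  n_aligned <- n_submitted * seqs_per_submission
  mafft_coef * n_aligned^2
}

predict_mafft_seconds(16)       # about 518 seconds
predict_mafft_seconds(16) / 60  # about 8.6 minutes
```

Because the time grows with the square of the alignment size, doubling the batch to 32 submissions roughly quadruples the estimate.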

Impact

It is important to mention that PhyloPi is not used for tracking or detecting transmission clusters, but rather offers a way of automating phylogenetic analysis. Some patients will be genotyped more than once, and these sequences will cluster very closely on a phylogenetic tree. This offers a spot check into the quality of the results. Sometimes we find that the patient has two different first names, which they interchangeably use depending on the health care worker and patient language preference. We have also detected sample swaps which otherwise would have gone unnoticed.

What next?

In part 3, we will discuss how the inter- and intrapatient HIV genetic distances were analyzed using logistic regression to gain insights into the probability distribution of these two classes. This is also where we asked Andrie from RStudio for help. It was useful for us biologists and virologists to have someone not just to oversee the analysis we did, but also to implement the correct analysis to get the job done. Hope to see you in the next section!

Analysing the HIV pandemic, Part 1: HIV in sub-Sahara Africa

Tue, 30 Apr 2019 00:00:00 +0000

Sabeehah Vawda is a pathologist, researcher, and lecturer at the Division of Virology, University of the Free State, and National Health Laboratory Service (NHLS), Bloemfontein, South Africa

Andrie de Vries is the author of “R for Dummies” and a Solutions Engineer at RStudio

Introduction

The Human Immunodeficiency Virus (HIV) is the virus that causes acquired immunodeficiency syndrome (AIDS). The virus invades various immune cells, causing loss of immunity and thus increased susceptibility to infections, including tuberculosis, and to cancer. In a recent publication in PLoS ONE, the authors described how they used affordable hardware to create a phylogenetic pipeline, tailored for the HIV drug resistance testing facility. In this series of blog posts we highlight the serious problem of HIV infection in sub-Saharan Africa, with special analysis of the situation in South Africa.

Stages of HIV infection

HIV infection can be divided into three consecutive stages: acute primary infection, the asymptomatic stage, and the symptomatic stage.

The first stage, acute primary infection, has symptoms very much like flu and may last for a week or two. The body reacts with an immune response, which results in the production of antibodies to fight the HIV infection. This process is called seroconversion and can last a couple of months. During this stage, although the patient is infected and the virus is spreading through the body, the patient might not test positive. This initial period of seroconversion is called ‘the window period’ and depends on the type of test used. Rapid tests are done at the point of care. This means that the test can be done at the clinic with a finger prick and the result is ready in 20 minutes. The drawback of this test is a window period of three months and a small false positive rate. The rapid test detects HIV antibodies, and because the immune system needs some time to produce sufficient antibodies to be detected, there is this window period. Most laboratories these days use fourth-generation ELISA (Enzyme-Linked Immunosorbent Assay) for HIV diagnosis and confirmation. This technique detects both HIV antibodies and antigens. Antigens are the foreign objects that the immune system recognizes as ‘non-self’; in this case, it is the viral protein p24. The advantage of this technique is a window period of only one month.

This first stage, including the window period, is then followed by the asymptomatic stage, which may last for as long as ten years. During this stage, the infected person does not experience symptoms and feels healthy. However, the virus is still replicating and destroying immune cells, especially CD4 cells. This damages the immune system and ultimately leads to stage 3 if not treated. This does not mean that people at stage 3 are doomed, but the earlier treatment starts, the better the outcome.

Stage 3 is referred to as symptomatic HIV infection or AIDS (Acquired Immune Deficiency Syndrome). At this stage, the immune system is so weak that it is not able to fight off bacteria or fungi that typically do not cause infections in immunocompetent people. These serious infections are called opportunistic infections, and have high morbidity and mortality rates.

Transmission and epidemiology

Worldwide, approximately 36.9 million people are living with HIV (UNAIDS).

HIV is transmitted mainly by:

Having unprotected sex
Non-sterile needles in drug use or sharing needles
Mother-to-child transmission during birth or breastfeeding
Infected blood transfusions, transplants or other medical procedures (very unlikely)

We mentioned the window period of the HIV infection as well as the asymptomatic stage. During any of the stages, it is possible to transmit the infection. The problem with the window period is an unknown HIV status or falsely assumed negative status, and during the asymptomatic stage, there is no reason for the infected person to seek medical attention. There are obviously behavioural issues in HIV transmission, and due to the long asymptomatic phase, HIV-positive status can be unknown for a long period. For these reasons, it is important that high-risk individuals do frequent HIV tests to determine their status.

Treatment for HIV infection

HIV is treatable but not (yet) curable. The good news, however, is that if a person receives antiretroviral (ARV) treatment, their viral load suppresses (viral replication stops) and the chance of transmitting HIV drastically decreases.

So 30 years into this pandemic, the big question is, why is HIV still a problem?

Not all countries adopted the use of ARVs in an equal manner. Although AZT (Zidovudine) was the first drug to be approved by the FDA in March 1987, it was soon discovered that monotherapy with only AZT was not effective for very long, as the virus developed resistance to the medicine quickly. Since then, ARVs have come a long way, and patients are placed on:

HAART (Highly Active Antiretroviral Treatment), or
cART (combination Antiretroviral Treatment), which typically consists of 3 drugs of different classes.

HIV in Africa

Let’s look at the rates of HIV infection in different African countries. The world factbook by the CIA has some HIV infection rate data.

suppressPackageStartupMessages({
  library(dplyr)
  library(readr)
  library(stringr)
  library(tidyr)
  library(ggplot2)
  library(forcats)
  library(knitr)
  library(maptools)
  library(viridis)
  library(RColorBrewer)
  library(mapproj)
  library(broom)
  library(ggrepel)
  library(sf)
})

# read the HIV data
HIV_rate_2016 <- read_csv(
  file.path(file_path, "HIV rates.csv"), col_names = TRUE, col_types = "cd"
  )

# read the Africa shape file
africa <-
  sf::st_read(
    file.path(file_path, "Africa_SHP/Africa.shp"), 
    stringsAsFactors = FALSE, quiet = TRUE
    ) %>%
  rename(Country = "COUNTRY") %>%
  left_join(HIV_rate_2016, by = "Country")

africa %>%
  ggplot(aes(fill = Rate)) +
  geom_sf() +
  coord_sf() +
  scale_fill_viridis(option = "plasma") +
  theme_minimal()

In the choropleth above, we see that South Africa, Botswana, Lesotho, and Swaziland seem to have the highest rates of infection. The rate is presented as the percentage of people infected, which takes population size into account. It is important to understand that the reported rate of infection tends to be lower where the level of denial is higher. Even in this day and age, denial of stigmatized diseases is an issue.

Cleaning the data

We can also look at the burden of HIV as the number of people infected, and we might get a different picture from what we saw from the choropleth.

Here, we read in the data, and rename the columns to Country, PersCov (percentage ARV coverage), NumberOnARV (Number of patients on ARVs), and NumberInfected (Number of patients infected).

# Read csv with ARV infection data
arv_dat <- read_csv(file.path(file_path, "ARV cov 2017.csv"), 
  col_types = "cccc",
  col_names = c("Country", "PersCov", "NumberOnARV", "NumberInfected"),
  skip = 1
)

head(arv_dat)

## # A tibble: 6 x 4
##   Country             PersCov    NumberOnARV NumberInfected           
##   <chr>               <chr>      <chr>       <chr>                    
## 1 Afghanistan         No data    790         No data                  
## 2 Albania             42 [40-44] 570         1400 [1300-1400]         
## 3 Algeria             80 [75-87] 11000       14 000 [13 000-15 000]   
## 4 Andorra             No data    No data     No data                  
## 5 Angola              26 [22-30] 78700       310 000 [260 000-360 000]
## 6 Antigua and Barbuda No data    No data     No data

This data has several symptoms of being very messy:

Very long variable names, descriptive, but difficult to work with; this was changed during import
The values contain confidence intervals in brackets; this will be difficult to work with as-is
We might want to transform “No data” to NA
We are interested in Sub-Saharan Africa, but the data is for the whole world

# A list of Sub-Saharan countries
sub_sahara <- readLines(file.path(file_path, "Sub-Saharan.txt"))

clean_column <- function(x){
  # Remove the ranges in brackets and convert the values to numeric
  x %>% 
    str_replace_all("\\[.*?\\]", "") %>% 
    str_replace_all("<", "") %>%
    str_replace_all(" ", "") %>% 
    as.numeric()
}

arv_dat <- 
  arv_dat %>% 
  filter(Country %in% sub_sahara) %>% 
  na_if("No data") %>% 
  mutate_at(2:4, clean_column)

head(arv_dat)

## # A tibble: 6 x 4
##   Country      PersCov NumberOnARV NumberInfected
##   <chr>          <dbl>       <dbl>          <dbl>
## 1 Angola            26       78700         310000
## 2 Benin             55       38400          70000
## 3 Botswana          84      318000         380000
## 4 Burkina Faso      65       61400          94000
## 5 Burundi           77       60100          78000
## 6 Cameroon          49      254000         510000

We use a regular expression to get rid of all the square bracket ranges. We also remove the “<” sign and spaces within numbers, change “No data” to NA, and convert the characters to numbers. We filter out the countries we don’t want. (Note that some countries are not available in the ARV data, e.g., Swaziland and Reunion.)
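
To make the cleaning steps concrete, here is a minimal base-R sketch of the same idea applied to a few hand-typed strings. The function name `clean_column_base` and the sample values are illustrative assumptions, not part of the original analysis; it mirrors `clean_column()` but also handles “No data” itself.

```r
# Base-R sketch of the cleaning steps (illustrative values only)
clean_column_base <- function(x) {
  x[x %in% "No data"] <- NA                    # treat the placeholder as missing
  x <- gsub("\\[.*?\\]", "", x, perl = TRUE)   # drop the bracketed ranges
  x <- gsub("<", "", x, fixed = TRUE)          # drop the "<" sign
  x <- gsub(" ", "", x, fixed = TRUE)          # drop spaces inside numbers
  as.numeric(x)
}

raw <- c("310 000 [260 000-360 000]", "42 [40-44]", "<100", "No data")
cleaned <- clean_column_base(raw)
print(cleaned)
```

The bracketed range is removed first, so that the spaces and digits inside it never reach the numeric conversion.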

Highest infected countries

Now look at the countries with the highest number of infected people of all ages.

arv_dat %>% 
  top_n(4, wt = NumberInfected) %>% 
  arrange(-NumberInfected) %>% 
  kable(
    caption = "Countries with the highest number of HIV infections"
  )

Table 1: Countries with the highest number of HIV infections
Country	PersCov	NumberOnARV	NumberInfected
South Africa	61	4359000	7200000
Nigeria	33	1040000	3100000
Mozambique	54	1156000	2100000
Kenya	75	1122000	1500000

We can see that South Africa has the highest number of HIV-infected people in Sub-Saharan Africa.

HIV in Southern Africa

In South Africa, the first AIDS-related death occurred in 1985. Not all patients were eligible to receive ARVs, and it was only in 2004 that ARVs became available in the public sector in South Africa. Eligibility restrictions still applied, so not all HIV-infected patients received treatment.

Ideally, a country would have all its HIV-infected people on treatment, but due to financial constraints, this is not always possible. In South Africa, patients were only initiated on ARVs when their CD4 counts dropped below a certain level. This threshold was initially 200 cells/µL in 2004, and was later raised to 350 cells/µL and then 500 cells/µL. These recommendations were a compromise between the availability of funds and getting ARVs to the people needing them the most. CD4 cells are a major component of the immune system; the lower the CD4 cell count, the higher the chance of opportunistic infections. Thus, the idea is to support the patients who are most likely to contract an opportunistic infection.

The problem with this was that only about a third of the HIV-infected people in South Africa were receiving HAART treatment. In 2017, the guidelines changed to test and treat; i.e., any newly diagnosed patient will receive HAART treatment. This is a big improvement for many reasons, most notably a lower infection rate. If a patient is taking HAART treatment and it is effective in suppressing viral replication, the chances of the patient transmitting the virus are very close to zero.

However, these treatments are not without side effects, which in some cases cause very poor adherence to the treatment. There are numerous factors to blame here, specifically socio-economic factors and depression. There is also ignorance and the “fear of knowing”, which causes people not to know their status. Finally, human nature brings with it various other complexities, such as conspiracy theories, and religious and personal beliefs. This will be a very long post if we delve into all the issues, but the take-home message is: the situation is complicated.

ARV coverage by country

We looked at the rate of HIV infections, and also the number of people infected, in the most endemic countries. We have talked about treatment. It would be interesting to look at ARV coverage by country.

Let’s see how these countries rank by ARV coverage:

arv_dat %>%
  na.omit() %>%  # drop rows with any missing value
  ggplot(aes(x = reorder(Country, PersCov), y = PersCov)) +
  geom_point(aes(colour = NumberInfected), size = 3) +
  scale_colour_viridis(
    name = "Number of people infected", 
    trans = "log10",
    option = "plasma"
  ) +
  coord_flip() +
  ylab("% ARV coverage") + xlab("Country") +
  theme_bw()

This shows that Zimbabwe, Namibia, Botswana, and Rwanda have the highest ARV coverage (above 80%). South Africa has the highest number of infections (as we saw before), and coverage of just above 60%.

Botswana rolled out their treatment program in 2002, and by mid-2005, about half of the eligible population received ARV treatment. South Africa, on the other hand, only started treatment in 2004, as discussed above.

When talking about treatment, we should also look at the changes in mortality.

HIV related deaths

Read in the data:

hiv_mort <- 
  read_csv(file.path(file_path, "HIV deaths.csv"), col_types = "ccccc") %>% 
  na_if("No data") %>% 
  mutate_at(vars(starts_with("Deaths")), clean_column) %>% 
  filter(Country %in% sub_sahara)

head(hiv_mort)

## # A tibble: 6 x 5
##   Country      Deaths_2017 Deaths_2010 Deaths_2005 Deaths_2000
##   <chr>              <dbl>       <dbl>       <dbl>       <dbl>
## 1 Angola             13000       10000        7900        3900
## 2 Benin               2500        2600        4300        2600
## 3 Botswana            4100        5900       13000       15000
## 4 Burkina Faso        2900        5400       12000       15000
## 5 Burundi             1700        5400        8600        8500
## 6 Cameroon           24000       25000       26000       17000

summary(hiv_mort)

##    Country           Deaths_2017      Deaths_2010      Deaths_2005    
##  Length:43          Min.   :   100   Min.   :   100   Min.   :   100  
##  Class :character   1st Qu.:  1900   1st Qu.:  1975   1st Qu.:  2050  
##  Mode  :character   Median :  4400   Median :  5400   Median :  8250  
##                     Mean   : 15442   Mean   : 23483   Mean   : 33227  
##                     3rd Qu.: 16250   3rd Qu.: 27250   3rd Qu.: 48250  
##                     Max.   :150000   Max.   :200000   Max.   :260000  
##                     NA's   :3        NA's   :3        NA's   :3       
##   Deaths_2000    
##  Min.   :   100  
##  1st Qu.:  1150  
##  Median :  6500  
##  Mean   : 26496  
##  3rd Qu.: 41500  
##  Max.   :130000  
##  NA's   :3

The 2017 mean for the dataset as a whole (15,442) is a little under half of the 2005 mean (33,227). It would be interesting to plot this data, but it will probably be too busy as it is. We can instead have a look at the countries which had the most change.

hiv_mort <- hiv_mort %>% 
  mutate(
    # columns 2:5 hold the four yearly death counts (2017, 2010, 2005, 2000)
    min = apply(hiv_mort[, 2:5], 1, FUN = min),
    max  = apply(hiv_mort[, 2:5], 1, FUN = max),
    Change = max - min
  )
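
As a small illustration of the range calculation, the sketch below computes the same `Change` measure on a made-up data frame, selecting the yearly columns by name rather than by position so that all four years are included. The data frame `toy` and its values are hypothetical.

```r
# Hypothetical data with the same column layout as hiv_mort
toy <- data.frame(
  Country     = c("A", "B", "C"),
  Deaths_2017 = c(100, 500, NA),
  Deaths_2010 = c(300, 400, 50),
  Deaths_2005 = c(900, 450, 70),
  Deaths_2000 = c(200, 1200, 60)
)

# Select every yearly column by name
death_cols <- grep("^Deaths_", names(toy), value = TRUE)
year_list  <- unname(as.list(toy[death_cols]))

# Row-wise minimum and maximum; a row with a missing year stays NA
toy$min    <- do.call(pmin, year_list)
toy$max    <- do.call(pmax, year_list)
toy$Change <- toy$max - toy$min
print(toy[, c("Country", "min", "max", "Change")])
```

Selecting by name avoids silently dropping a year if the column order changes.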

Next, we can create a plot of the data, and look at the top five countries with the biggest change in HIV-related mortality.

hiv_mort %>%
  top_n(5, wt = Change) %>%
  gather(Year, Deaths, Deaths_2017:Deaths_2000) %>% 
  na.omit() %>%
  mutate(
    Year = str_replace(Year, "Deaths_", "") %>% as.numeric(),
    Country = fct_reorder(Country, Deaths)
  ) %>% 
  ggplot(aes(x = Year, y = Deaths, color = Country)) +
  geom_line(size = 1) +
  geom_vline(xintercept = 2004, color = "black", linetype = "dotted", size = 1.5) +
  scale_color_viridis(option = "D", discrete = TRUE) +
  theme_bw() +
  theme(legend.position = "bottom")

Remember, we mentioned that HAART (Highly Active Antiretroviral Treatment) was introduced in 2004 in South Africa, depicted here by the black dotted line. It is easy to appreciate the dramatic effect the introduction of ARVs had in South Africa.

Although the picture above is positive, the fight is not over. The target is to get at least 90% of HIV-infected patients on treatment. Adherence to ARV regimens stays crucial not only to suppress viral replication, but also to minimize the development of drug resistance.

Infection rates

As mentioned earlier, if a patient is taking and responding to treatment, the viral load is suppressed and the chances of transmitting the infection become very close to zero. Thus, the more patients with an undetectable viral load, the lower the transmission rate.

Read the data:

new_infections <- 
  read_csv(file.path(file_path, 
    "Epidemic transition metrics_Trend of new HIV infections.csv"), 
    na = "...", 
    col_types = cols(
      .default = col_character(),
      `2017_1` = col_double()
    )
  ) %>% 
  select(
    -ends_with("_upper"), 
    -ends_with("lower"), 
    -ends_with("_1")
  ) %>% 
  mutate_at(-1, clean_column) %>%
  na.omit()

## Warning: Duplicated column names deduplicated: '2017' => '2017_1' [26]

new_infections %>% 
  gather(Year, NewInfections, 2:9) %>% 
  ggplot(aes(x = Year, y = NewInfections, color = Country)) +
  geom_point() +
  theme_classic() +
  theme(legend.position = "none") +
  xlab("Year") + 
  ylab("Number of new infections")

This is a bit busy. Countries that are highly endemic with good ARV coverage and prevention of infection programs should have a steeper decline in the newly infected people. At first glance, it looks like some of the data points are fairly linear. Let’s go with that assumption, and apply linear regression to each country.

rates_modeled <- 
  new_infections %>% 
  filter(Country %in% sub_sahara) %>% 
  na.omit() %>% 
  gather(Year, NewInfections, 2:9) %>% 
  mutate(Year = as.numeric(Year)) %>% 
  group_by(Country) %>% 
  do(tidy(lm(NewInfections ~ Year, data = .))) %>% 
  filter(term == "Year") %>% 
  ungroup() %>% 
  mutate(
    Country = fct_reorder(Country, estimate, .desc = TRUE)
  ) %>% 
  arrange(desc(estimate)) %>% 
  select(-one_of("term", "statistic"))

rates_modeled %>% 
  head() %>% 
  kable(
    caption = "Results of linear regression: Rate of new infections per year"
  )

Table 2: Results of linear regression: Rate of new infections per year
Country	estimate	std.error	p.value
Madagascar	469.04762	12.56126	0.0000000
Côte d’Ivoire	190.47619	153.99689	0.2623441
Botswana	130.95238	92.46968	0.2064860
Mali	108.33333	23.21683	0.0034452
Congo	103.57143	16.45271	0.0007486
Eritrea	89.28571	23.05347	0.0082374
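
To show what the `estimate` column means, here is a minimal base-R sketch that fits one regression per country on made-up data and extracts the slope. The countries, years, and counts are hypothetical; the slope is the estimated change in new infections per year.

```r
# Hypothetical long-format data: two countries, four years each
toy <- data.frame(
  Country       = rep(c("A", "B"), each = 4),
  Year          = rep(2010:2013, times = 2),
  NewInfections = c(1000, 900, 800, 700,   # A: falls by 100 per year
                    200, 250, 300, 350)    # B: rises by 50 per year
)

# One linear model per country; keep only the slope on Year
slopes <- sapply(split(toy, toy$Country), function(d) {
  unname(coef(lm(NewInfections ~ Year, data = d))["Year"])
})
print(slopes)
```

A negative slope indicates fewer new infections each year, and a positive slope indicates more.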

rates_modeled %>%
  na.omit() %>% 
  ggplot(aes(x = Country, y = estimate, fill = p.value >= 0.05)) +
  geom_col() +
  coord_flip() +
  theme_bw() +
  ylab("Estimated change in HIV infection (people/year)")

With a quick look at the plot shown above, we can see that for most countries, the estimated slope of the linear model is statistically significant at the 0.05 level.

It is important to note here that the data we have at hand is from 2010 to 2017. This shows that some countries - notably, South Africa - are on a good trajectory. Botswana, being the “Poster Child” of a good HIV treatment and prevention program, seems to have stabilized in terms of rate of infection, with a positive but insignificant estimate of the rate of infection. This could be explained by the following reasons:

First African country to introduce HAART, 2002
Progressive in terms of prevention programs
Looking only from 2010, we are missing the dramatic decline in infection
The WHO goal is to get 90% of a country’s infected people on HAART, but the last 5-7% might be the hardest to convince

We can combine the ARV and estimated rates of infection data.

arv_on_infection <- 
  arv_dat %>% 
  left_join(rates_modeled, by = "Country") %>% 
  # a p-value below 0.05 is labelled significant
  mutate(p_interpretation = if_else(p.value < 0.05, "Significant", "Insignificant"))

## Warning: Column `Country` joining character vector and factor, coercing
## into character vector

arv_on_infection %>% 
  na.omit() %>% 
  ggplot(aes(x = PersCov, y = estimate, 
             shape = p_interpretation)) +
  geom_point(aes(color = NumberInfected), size = 2) +
  geom_text_repel(aes(label = Country), size = 3) +
  scale_color_gradient(high = "red", low = "blue") +
  theme_grey() +
  xlab("% ARV coverage") + 
  ylab("Estimated change in HIV infection\n(people/year)") +
  ggtitle("Antiretroviral (ARV) coverage")

South Africa has the highest number of infected people, but on the positive side, has a downward trajectory of about 15000 fewer people newly infected each year. Although ARVs do play a crucial role in controlling this epidemic, they are not the only factor involved. Prevention of mother-to-child transmission has been very successful in South Africa. Awareness campaigns and education are playing a big role as well. The plot above shows our linearly modeled rates.

The laboratory, HIV diagnosis and monitoring

HIV-related laboratory tests are not the only diagnostics done in a Virology department, but in endemic countries, they account for the majority of tests performed. The first HIV-related test done would be for diagnosis. This is done differently in adults than in infants. As we discussed earlier, after HIV infection, the immune system develops antibodies. We can use a field of study called serology to detect antibodies and antigens, and in most cases, an ELISA test is performed to confirm HIV seroconversion or status. Since the mother’s antibodies will be present in the infant, an ELISA will tell us the baby is positive even if the baby is not infected. Infants are therefore diagnosed by detecting viral RNA or DNA in their blood. This is done by PCR (Polymerase Chain Reaction).

Once a patient is diagnosed as HIV-positive, the patient will be initiated on HAART, and in most cases, the viral load will be suppressed. In the South African public sector treatment program, after HAART initiation, the patient gets two six-monthly viral load tests to make sure viral replication is suppressed. To keep an eye out for trouble, a yearly viral load is done to confirm adherence and effectiveness of the treatment.

When an unsuppressed viral load is detected, action is taken and adherence counselling is performed. If this does not solve the problem, drug-resistance testing is performed to assess the resistance profile of the infection in order to adjust the ARV regimen accordingly. This is done by isolating the viral RNA, converting it to DNA, and amplifying the DNA to sufficient quantities to enable sequencing. In our laboratory, we use Sanger sequencing, but other sequencing technologies also exist.

Figure 1: HIV Genome as depicted by the Los Alamos HIV sequence database. Available at https://www.hiv.lanl.gov/content/sequence/HIV/MAP/landmark.html

This diagram depicts the genome of HIV. The most common targets for interfering with viral replication are located in the pol gene. Specifically:

prot: The viral protease. Many of the viral proteins are translated as longer polypeptides, which are then cleaved into mature proteins by the protease.
p51 RT: The viral reverse transcriptase. Each virion contains two copies of viral RNA. The reverse transcriptase converts the RNA to DNA.
p31 int: The viral integrase. This enzyme integrates the reverse transcribed viral DNA into host genomes of the infected cells, and establishes chronic infection.

Essentially, ARVs interfere with these viral enzymes by inhibiting their action:

Protease inhibitors prevent the maturation of viral proteins.
Reverse transcriptase inhibitors prevent the formation of a DNA copy of the viral genome, which then gives the integrase nothing to work with.
Integrase inhibitors prevent the integration of viral DNA into the host genome, which is a crucial part of replication and infection.

Combining these ARVs in clever ways results in HAART or cART. By sequencing the viral RNA, we can detect mutations that cause resistance to specific ARVs. This information is then used to adjust the ARV regimen to once again effectively suppress viral replication.

The viral reverse transcriptase has a high error rate when doing the conversion of RNA to DNA, and introduces random mutations in the viral genome. In the presence of selective pressure like ARVs, these random mutations might give advantageous phenotypic traits to the replicating virus, like drug resistance. On the other hand, if the patient is properly adhering to the treatment, the viral replication is suppressed, replication does not occur, thus mutations can’t occur.

This high rate of mutation can be used in the laboratory as a quality-control tool. The polymerase chain reaction is prone to contamination, so one sample might contaminate another during these reactions. Contamination gives rise to false mutations in the contaminated sample and an erroneous result for the treating clinician, with a direct negative impact on the patient. Because the virus mutates so quickly, sequences from different patients are expected to differ, so unexpectedly similar sequences from unrelated samples can point to possible contamination.
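
One way this quality-control idea could be sketched is to compare sequences pairwise and flag pairs from different patients that are suspiciously similar. The sequences, the helper `p_distance()`, and the 0.01 cutoff below are all hypothetical and chosen only for illustration; they are not the laboratory's actual pipeline.

```r
# Hypothetical aligned sequences of equal length
seqs <- c(
  patient_A = "ATGGCTAAGCTTACGGAT",
  patient_B = "ATGACTAAGTTTACGCAT",
  patient_C = "ATGGCTAAGCTTACGGAT"   # identical to patient_A: suspicious
)

# Proportion of positions at which two aligned sequences differ
p_distance <- function(a, b) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  stopifnot(length(a) == length(b))
  mean(a != b)
}

# All pairs of samples and their distances
pairs <- t(combn(names(seqs), 2))
dists <- apply(pairs, 1, function(p) p_distance(seqs[[p[1]]], seqs[[p[2]]]))

result <- data.frame(seq1 = pairs[, 1], seq2 = pairs[, 2], distance = dists)
result$flag <- result$distance < 0.01   # hypothetical cutoff for "too similar"
print(result)
```

A flagged pair is only a prompt for further investigation, since closely related sequences can also arise for legitimate reasons.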

What next?

In a recent publication in PLoS ONE, the authors described how they used affordable hardware to create a phylogenetic pipeline, tailored for the HIV drug resistance testing facility.

In Part 2 of this four-part series, we discuss this pipeline.
In Part 3, we will discuss genetic distances and phylogenetics.
Finally, in Part 4, we will look at the application of logistic regression in analyzing inter- and intra-patient genetic distance of viral sequences.

See you in the next section!

Statistics in Glaucoma: Part II

Fri, 07 Dec 2018 00:00:00 +0000

Samuel Berchuck is a Postdoctoral Associate in Duke University’s Department of Statistical Science and Forge-Duke’s Center for Actionable Health Data Science.

Joshua L. Warren is an Assistant Professor of Biostatistics at Yale University.

Analyzing Visual Field Data

In Part I of this series on statistics in glaucoma, we detailed the use of visual fields for understanding functional vision loss in glaucoma patients. Before discussing a new method for modeling visual field data that accounts for the anatomy of the eye, we discussed how visual field data is typically analyzed by introducing a common diagnostic metric, point-wise linear regression (PLR). PLR is a trend-based diagnostic that uses slope p-values from the location-specific linear regressions to discriminate progression status. The motivation for PLR is straightforward: large negative slopes at numerous visual field locations are assumed to be indicative of progression. This is characteristic of a large class of methods for analyzing visual field data that attempt to discriminate progression based on changes in the differential light sensitivity (DLS) across time. This technique is simple, intuitive, and effective; however, it is often limited by naive modeling assumptions, including the independence of visual field locations.

Ocular Anatomy in the Neighborhood Structure of the Visual Field

To properly account for the spatial dependencies on the visual field, Berchuck et al. 2018 introduce a neighborhood model that incorporates anatomical information through a dissimilarity metric. Details of the method can be found in Berchuck et al. 2018, but we provide a quick introduction. The key development is the specification of the neighborhood structure through a new definition of adjacency weights. Typically in areal data, the adjacency for two locations $i$ and $j$ is defined as $w_{ij} = 1(i \sim j)$, where $i \sim j$ is the event that locations $i$ and $j$ are neighbors. As discussed in Part I, this assumption is not sufficient due to the complex anatomy of the eye. To account for this additional structure, a more general adjacency is introduced that is a function of a dissimilarity metric, $w_{ij}(\alpha_t) = 1(i \sim j)\exp\{-z_{ij}\alpha_t\}$. Here, $z_{ij}$ is a dissimilarity metric that represents the absolute difference between the Garway-Heath angles of locations $i$ and $j$.

The parameter $\alpha_t$ dictates the importance of the dissimilarity metric at each visual field exam $t$. When $\alpha_t$ becomes large, the model reduces to an independent process, and as $\alpha_t$ goes to zero, the process becomes the standard spatial model for areal data. Based on the specification of the adjacency weights, $\alpha_t$ has a useful interpretation with respect to deterioration of visual ability. In particular, $\alpha_t$ changing over exams indicates that the neighborhood structure on the visual field is changing, which in turn implies damage to the underlying retinal ganglion cell structure. This observation motivates a diagnostic of progression that quantifies variability in $\alpha_t$ across time. We choose the coefficient of variation (CV) and demonstrate that it is a highly significant predictor of progression, and furthermore, independent of trend-based methods such as PLR.
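
The limiting behaviour of the adjacency weights can be checked numerically. The sketch below evaluates $w_{ij}(\alpha_t) = 1(i \sim j)\exp\{-z_{ij}\alpha_t\}$ on a made-up three-location example; the adjacency matrix, the angles, and the rescaling by 100 are illustrative assumptions rather than values from the package.

```r
# Toy 3-location example; the adjacency matrix and angles are made up
W <- matrix(c(0, 1, 0,
              1, 0, 1,
              0, 1, 0), nrow = 3, byrow = TRUE)
angles <- c(10, 40, 100)                     # hypothetical angles (degrees)
Z <- abs(outer(angles, angles, "-")) / 100   # dissimilarity z_ij, rescaled for illustration

# w_ij(alpha) = 1(i ~ j) * exp(-z_ij * alpha)
adjacency_weights <- function(alpha) W * exp(-Z * alpha)

w_zero  <- adjacency_weights(0)    # equals W: the standard binary adjacency
w_mid   <- adjacency_weights(2)    # neighbours with similar angles keep more weight
w_large <- adjacency_weights(50)   # all weights near zero: nearly independent locations
print(round(w_mid, 3))
```

Locations 1 and 2 have closer angles than locations 2 and 3, so their weight decays more slowly as $\alpha_t$ grows.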

Navigating the `womblR` Package

To make the method available to clinicians, the R package womblR was developed. The package provides a suite of functions that walk a user through the full process of analyzing a series of visual fields from beginning to end. The user interface was modeled after other impactful R packages for Bayesian spatial analysis, including spBayes and CARBayes. The package name combines Hadley’s naming convention for R packages (i.e., ending a package with the letter R) with the name of the author of the seminal paper on boundary detection, originally referred to as areal wombling (Womble 1951).

We will now walk through the process of analyzing visual field data, estimating the $\alpha_t$ parameters, and assessing progression status. The main function in womblR is the Spatiotemporal Boundary Detection with Dissimilarity Metric model function (STBDwDM). Inference for the method is obtained through Markov chain Monte Carlo (MCMC), which is a computationally intensive method that iterates between updating individual model parameters until enough posterior samples have been collected post-convergence for making accurate posterior inference. Because of the iterative nature of MCMC, the majority of computation is performed within a for loop, so the package is built on C++ through the packages Rcpp and RcppArmadillo. Because of the increased complexity of writing in C++, the pre- and post-processing of the model are done in R with the for loop implemented in C++. The MCMC method employed in womblR is a Metropolis-Hastings within Gibbs algorithm.
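
For readers unfamiliar with the term, the following toy sampler sketches the general idea of Metropolis-Hastings within Gibbs in plain R. It is not the womblR sampler: the model (a normal sample with unknown mean and standard deviation, flat priors on the mean and on the log standard deviation), the proposal scale, and the chain length are all assumptions chosen for brevity.

```r
set.seed(1)
y <- rnorm(50, mean = 2, sd = 1.5)   # simulated data
n <- length(y)

n_sims    <- 5000
mu        <- numeric(n_sims)
log_sigma <- numeric(n_sims)
mu_cur <- 0
ls_cur <- 0                           # current value of log(sigma)

# Log full conditional of log(sigma) given mu (flat prior on log(sigma))
log_post <- function(ls, mu) sum(dnorm(y, mean = mu, sd = exp(ls), log = TRUE))

for (s in seq_len(n_sims)) {
  # Gibbs step: mu | sigma, y is normal with a closed form under a flat prior
  mu_cur <- rnorm(1, mean = mean(y), sd = exp(ls_cur) / sqrt(n))

  # Metropolis step: symmetric random-walk proposal for log(sigma)
  ls_prop <- ls_cur + rnorm(1, mean = 0, sd = 0.2)
  log_ratio <- log_post(ls_prop, mu_cur) - log_post(ls_cur, mu_cur)
  if (log(runif(1)) < log_ratio) ls_cur <- ls_prop

  mu[s] <- mu_cur
  log_sigma[s] <- ls_cur
}

keep <- 1001:n_sims                   # discard the first 1000 scans as burn-in
post_mu    <- mean(mu[keep])
post_sigma <- exp(mean(log_sigma[keep]))
print(c(post_mu = post_mu, post_sigma = post_sigma))
```

The parameter with a closed-form full conditional is drawn directly, while the other is updated with an accept-reject step inside the same loop.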

Just as a quick aside, with the more recent advent of probabilistic programming, this model could have been implemented using the Hamiltonian Monte Carlo methods used in software like Stan or PyMC3. These programs do not require the derivation of full conditionals, and push the MCMC algorithm to the background. There is undoubtedly a huge market for this type of software, and it is clearly playing a significant role in the popularization of Bayesian modeling. At the same time, implementing MCMC samplers using Rcpp with traditional MCMC algorithms can be instructive, and for those with experience, nearly as quick of a coding experience.

We now begin by formatting the visual field data for analysis. According to the manual, the observed data Y must first be ordered spatially and then temporally. Furthermore, we will remove all locations that correspond to the natural blind spot (which, in the Humphrey Field Analyzer-II, correspond to locations 26 and 35).

###Load package
library(womblR)

###Format data
blind_spot <- c(26, 35) # define blind spot
VFSeries <- VFSeries[order(VFSeries$Location), ] # sort by location
VFSeries <- VFSeries[order(VFSeries$Visit), ] # sort by visit
VFSeries <- VFSeries[!VFSeries$Location %in% blind_spot, ] # remove blind spot locations
Y <- VFSeries$DLS # define observed outcome data
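
The ordering requirement can be illustrated on a tiny made-up series. In this sketch the data frame `toy`, its values, and the pretend blind spot at location 2 are hypothetical; a single `order()` call with two keys gives the same result as the two sequential sorts, with locations sorted within each visit.

```r
# Hypothetical series: 2 visits x 3 locations, deliberately shuffled
toy <- data.frame(
  Visit    = c(2, 1, 2, 1, 1, 2),
  Location = c(3, 2, 1, 3, 1, 2)
)
toy$DLS <- toy$Visit * 100 + toy$Location   # encodes visit and location in the value

# Sort by visit, then by location within visit
toy <- toy[order(toy$Visit, toy$Location), ]

# Drop the pretend blind spot (location 2) and extract the outcome vector
toy_blind_spot <- 2
toy <- toy[!toy$Location %in% toy_blind_spot, ]
Y_toy <- toy$DLS
print(Y_toy)
```

Because each value encodes its visit and location, the printed vector makes the resulting order easy to verify by eye.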

Now that we have assigned the observed outcomes to Y, we move onto the temporal variable Time. For visual field data, we define this to be the time from the baseline visit. We obtain the unique days from the baseline visit and scale them to be on the year scale.

Time <- unique(VFSeries$Time) / 365 # years since baseline visit
print(Time)

## [1] 0.0000000 0.3452055 0.6520548 1.1123288 1.3808219 1.6109589 2.0712329
## [8] 2.3780822 2.5698630

Next, we assign the adjacency matrix and dissimilarity metric (both discussed in Part I).

W <- HFAII_Queen[-blind_spot, -blind_spot] # visual field adjacency matrix
DM <- GarwayHeath[-blind_spot] # Garway-Heath angles

Now that we have specified the data objects Y, DM, W, and Time, we will customize the objects that characterize Bayesian MCMC methods, in particular, hyperparameters, starting values, Metropolis tuning values, and MCMC inputs. These objects have been detailed previously in the womblR package vignette, so we will not spend time going over their definitions. We will only note that they are each list objects similar to the spBayes package. We begin by specifying the hyperparameters.

###Bounds for temporal tuning parameter phi
TimeDist <- abs(outer(Time, Time, "-"))
TimeDistVec <- TimeDist[lower.tri(TimeDist)]
minDiff <- min(TimeDistVec)
maxDiff <- max(TimeDistVec)
PhiUpper <- -log(0.01) / minDiff # shortest diff goes down to 1%
PhiLower <- -log(0.95) / maxDiff # longest diff goes up to 95%

###Hyperparameter object
Hypers <- list(Delta = list(MuDelta = c(3, 0, 0), OmegaDelta = diag(c(1000, 1000, 1))),
               T = list(Xi = 4, Psi = diag(3)),
               Phi = list(APhi = PhiLower, BPhi = PhiUpper))
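
The bounds on phi can be sanity-checked directly. The sketch below assumes an exponential temporal correlation of the form exp(-phi * d) for two visits d years apart (an assumption about the model's form), and reuses the visit times printed earlier.

```r
# Visit times in years, copied from the printed output above
Time <- c(0.0000000, 0.3452055, 0.6520548, 1.1123288, 1.3808219,
          1.6109589, 2.0712329, 2.3780822, 2.5698630)

TimeDist    <- abs(outer(Time, Time, "-"))
TimeDistVec <- TimeDist[lower.tri(TimeDist)]

PhiUpper <- -log(0.01) / min(TimeDistVec)
PhiLower <- -log(0.95) / max(TimeDistVec)

# Implied correlations at the two extremes (assuming exp(-phi * d))
cor_shortest <- exp(-PhiUpper * min(TimeDistVec))   # closest pair of visits at PhiUpper
cor_longest  <- exp(-PhiLower * max(TimeDistVec))   # most distant pair of visits at PhiLower
print(c(PhiLower = PhiLower, PhiUpper = PhiUpper,
        cor_shortest = cor_shortest, cor_longest = cor_longest))
```

Under this assumption the bounds span everything from near independence between adjacent visits to strong correlation across the whole follow-up period.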

Then we specify the starting values for the parameters, Metropolis tuning variances, and MCMC details.

###Starting values
Starting <- list(Delta = c(3, 0, 0), T = diag(3), Phi = 0.5)

###Metropolis tuning variances
Nu <- length(Time) # calculate number of visits
Tuning <- list(Theta2 = rep(1, Nu), Theta3 = rep(1, Nu), Phi = 1)

###MCMC inputs
MCMC <- list(NBurn = 10000, NSims = 250000, NThin = 25, NPilot = 20)

We specify that our model will run for a burn-in period of 10,000 scans, followed by 250,000 scans post burn-in. In the burn-in period there will be 20 iterations of pilot adaptation evenly spaced out over the period. The final number of samples to be used for inference will be thinned down to 10,000 based on the thinning number of 25. We can now run the MCMC sampler. Details of the various options available in the sampler can be found in the documentation, help(STBDwDM).

reg.STBDwDM <- STBDwDM(Y = Y, DM = DM, W = W, Time = Time,
                       Starting = Starting, Hypers = Hypers, Tuning = Tuning, MCMC = MCMC,
                       Family = "tobit", 
                       TemporalStructure = "exponential",
                       Distance = "circumference",
                       Weights = "continuous",
                       Rho = 0.99,
                       ScaleY = 10, 
                       ScaleDM = 100,
                       Seed = 54)
## Burn-in progress: |*************************************************|
## Sampler progress: 0%.. 10%.. 20%.. 30%.. 40%.. 50%.. 60%.. 70%.. 80%.. 90%.. 100%..

We quickly assess convergence by checking the traceplots of $\alpha_t$ (note that further MCMC convergence diagnostics should be used in practice).

###Load coda package
library(coda)

###Convert alpha to an MCMC object
Alpha <- as.mcmc(reg.STBDwDM$alpha)

###Create traceplot
par(mfrow = c(3, 3))
for (t in 1:Nu) traceplot(Alpha[, t], ylab = bquote(alpha[.(t)]), main = bquote(paste("Posterior of " ~ alpha[.(t)])))

Converting MCMC Samples into Clinical Statements

Now we calculate the posterior distribution of the CV of $\alpha_t$ and print its moments.

CVAlpha <- apply(Alpha, 1, function(x) sd(x) / mean(x))
plot(density(CVAlpha, adjust = 2), main = expression("Posterior of CV"~(alpha[t])), xlab = expression("CV"~(alpha[t])))

STCV <- c(mean(CVAlpha), sd(CVAlpha), quantile(CVAlpha, probs = c(0.025, 0.975)))
names(STCV)[1:2] <- c("Mean", "SD")
print(STCV)

##       Mean         SD       2.5%      97.5% 
## 0.19121622 0.10205826 0.04636219 0.42744656

For this information to be useful clinically, we convert it into a probability of progression based on a model trained on a large cohort of glaucoma patients (Berchuck et al. 2019). Because the information from $\alpha_t$ is independent of trend-based methods, we show that the optimal use of $\alpha_t$ is combining it with a basic global metric that includes the slope and p-value (and their interaction) of the overall mean at each visual field exam. The posterior mean and standard deviation of the CV of $\alpha_t$, along with their interaction, are also included. The trained model coefficients are publicly available and are used below. The probability of progression can be calculated as follows.

###Calculate the global metric slope and p-value
MeanSens <- apply(t(matrix(VFSeries$DLS, ncol = Nu)) / 10, 1, mean) # scaled mean DLS
reg.global <- lm(MeanSens ~ Time) # global regression
GlobalS <- summary(reg.global)$coef[2, 1] # global slope
GlobalP <- summary(reg.global)$coef[2, 4] # global p-value

###Obtain probability of progression using estimated parameters from Berchuck et al. 2019
input <- c(1, GlobalP, GlobalS, STCV[1], STCV[2], GlobalS * GlobalP, STCV[1] * STCV[2])
coef <- c(-1.7471655, -0.2502131, -13.7317622, 7.4746348, -8.9152523, 18.6964153, -13.3706058)
fit <- input %*% coef
exp(fit) / (1 + exp(fit))

##           [,1]
## [1,] 0.4355997
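
The last two lines of the code above apply the inverse logit to a linear predictor. A minimal sketch that wraps the same calculation in a function is given below. Here `progression_prob` is a hypothetical helper name, the coefficients are the ones listed above, and the slope and p-value passed in the example are placeholder values, not the ones computed for this patient.

```r
# Coefficients from the post, in the order:
# intercept, global p-value, global slope, CV mean, CV SD,
# slope x p-value interaction, CV mean x CV SD interaction
beta <- c(-1.7471655, -0.2502131, -13.7317622, 7.4746348,
          -8.9152523, 18.6964153, -13.3706058)

# Hypothetical helper: probability of progression from the four inputs
progression_prob <- function(global_p, global_s, cv_mean, cv_sd) {
  x <- c(1, global_p, global_s, cv_mean, cv_sd,
         global_s * global_p, cv_mean * cv_sd)
  plogis(sum(x * beta))   # plogis(z) = exp(z) / (1 + exp(z))
}

# Placeholder slope and p-value; CV moments as printed earlier
p <- progression_prob(global_p = 0.05, global_s = -0.02,
                      cv_mean = 0.19121622, cv_sd = 0.10205826)
print(p)
```

Using `plogis` instead of `exp(fit) / (1 + exp(fit))` gives the same value and avoids overflow when the linear predictor is very large.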

The probability of progression is calculated to be 0.44, which can be compared to the threshold cutoff for the trained model of 0.325. This cutoff for the probability of progression was determined using operating characteristics, so that the specificity was forced to be in the clinically meaningful range of 85%. Based on this derived threshold, the probability of progression is high enough to indicate that this patient’s disease shows evidence of visual field progression (which is reassuring, because we know this patient has progression as determined by clinicians).
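
A cutoff with a fixed specificity can be read off the predicted probabilities of the non-progressing eyes: it is the quantile below which the desired share of those eyes fall. The sketch below illustrates the idea on simulated scores. The data and variable names are invented for illustration, and this is not the procedure or the cohort from Berchuck et al. (2019).

```r
set.seed(42)

# Simulated predicted probabilities (hypothetical, for illustration only)
p_stable      <- rbeta(1000, 2, 6)   # eyes that did not progress
p_progressing <- rbeta(300, 5, 3)    # eyes that progressed

# Cutoff such that 85% of stable eyes fall at or below it
cutoff <- unname(quantile(p_stable, probs = 0.85))

# Operating characteristics when "progressing" means probability > cutoff
specificity <- mean(p_stable <= cutoff)
sensitivity <- mean(p_progressing > cutoff)
print(c(cutoff = cutoff, specificity = specificity, sensitivity = sensitivity))
```

Fixing the specificity first and then reporting the resulting sensitivity keeps the false-positive rate at a level clinicians can accept.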

Looking ahead: The third installment will wrap up the discussion on the womblR package and ponder future directions for the role of statistics in glaucoma research. Furthermore, the role of open-source software in medicine will be discussed.

References

Berchuck, S.I., Mwanza, J.C., & Warren, J.L. (2018). Diagnosing Glaucoma Progression with Visual Field Data Using a Spatiotemporal Boundary Detection Method, In press at Journal of the American Statistical Association.
Womble, W. H. (1951). Differential Systematics. Science, 114(2961), 315-322.
Berchuck, S.I., Mwanza, J.C., Tanna, A.P., Budenz, D.L., Warren, J.L. (2019). Improved Detection of Visual Field Progression Using a Spatiotemporal Boundary Detection Method. In press at Scientific Reports (Available upon request).

Statistics in Glaucoma: Part I

Mon, 03 Dec 2018 00:00:00 +0000

Samuel Berchuck is a Postdoctoral Associate in Duke University’s Department of Statistical Science and Forge-Duke’s Center for Actionable Health Data Science.

Joshua L. Warren is an Assistant Professor of Biostatistics at Yale University.

Introduction

Glaucoma is a leading cause of blindness worldwide, with a prevalence of 4% in the population aged 40-80. The disease is characterized by retinal ganglion cell death and corresponding damage to the optic nerve head. Since visual impairment caused by glaucoma is irreversible and efficient treatments exist, early detection of the disease is essential. Determining if the disease is progressing remains one of the most challenging aspects of glaucoma management, since it is difficult to distinguish true progression from variability due to natural degradation or noise. In practice, clinicians monitor progression using a multifactorial approach that relies on various measurements of the disease. In this series of blog posts, we focus on the use of visual fields. Visual field examinations obtain levels of a patient’s actual vision, and the practice is thus referred to as a functional measurement. As such, visual fields are a proxy for a patient’s quality of life, and therefore are typically prioritized in practice.

Visual Field Data

Visual fields are complex spatiotemporal data generated from an intricate anatomical system, which is important to understand for modeling purposes. To illustrate visual field data, we load an example data set from the womblR package on CRAN. The package womblR was developed specifically for analyzing visual field data, and uses a Bayesian hierarchical model that accounts for the complex nature of the data (more details will be provided in Part II). The specific data set comes from the Vein Pulsation Study Trial in Glaucoma and the Lions Eye Institute trial registry, Perth, Western Australia. We begin by loading the package.

library(womblR)

The data set of interest is loaded lazily and can be accessed as follows; we also view the first six rows for illustration.

data(VFSeries)
head(VFSeries)

##   Visit DLS Time Location
## 1     1  25    0        1
## 2     2  23  126        1
## 3     3  23  238        1
## 4     4  23  406        1
## 5     5  24  504        1
## 6     6  21  588        1

The data object VFSeries contains a longitudinal series of visual fields for a glaucoma patient that we will use throughout the three blog posts to exemplify the study of visual fields. This patient has been determined to be progressing, based on the expertise of two clinicians. VFSeries has four variables: Visit, DLS, Time, and Location. The variable Visit represents the visual field test visit number, DLS the observed measure, Time the time of the visual field test (in days from baseline visit), and Location the spatial location on the visual field where the observation occurred. There are 9 visual field exams contained in this data set, and on average 117.25 days between visits.
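
Summaries such as the number of exams and the average gap between visits can be read directly off the `Visit` and `Time` columns. The sketch below uses only the six rows printed above (re-entered in a stand-in data frame named `vf_head`), so its result differs from the figure quoted for the full nine-exam series.

```r
# The six rows shown by head(VFSeries), re-entered by hand
vf_head <- data.frame(Visit    = 1:6,
                      DLS      = c(25, 23, 23, 23, 24, 21),
                      Time     = c(0, 126, 238, 406, 504, 588),
                      Location = 1)

# One time stamp per visit, in visit order
visit_times <- sort(unique(vf_head$Time))

n_visits <- length(visit_times)        # number of exams
mean_gap <- mean(diff(visit_times))    # average days between exams
print(c(n_visits = n_visits, mean_gap = mean_gap))   # 6 visits, 117.6 days
```

Applying the same three lines to the full `VFSeries` object should reproduce the number of exams and the average gap reported in the text.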

To help visualize the data frame, we can use the PlotVfTimeSeries function, which plots a patient’s observed visual field sensitivity (DLS) over time at each location on the visual field.

PlotVfTimeSeries(Y = VFSeries$DLS,
                 Location = VFSeries$Location,
                 Time = VFSeries$Time,
                 main = "Visual field sensitivity time series \n at each location",
                 xlab = "Days from baseline visit",
                 ylab = "Differential light sensitivity (dB)",
                 line.reg = FALSE)

The above figure demonstrates the visual field from a Humphrey Field Analyzer-II (HFA-II) testing machine, which generates 54 spatial locations (only 52 informative locations; note the 2 blank spots corresponding to the blind spot). The visual field map is constructed by assessing a patient’s response to varying levels of light. Patients are instructed to focus on a central fixation point as light is introduced randomly in a preceding manner over a grid on the visual field. As light is observed, the patient presses a button and the current light intensity is recorded. The process is repeated until the entire visual field is tested. The light intensity is measured in differential light sensitivity (DLS), which quantifies the difference between the HFA-II background and the observed light intensity. Smaller values indicate worsening vision.

Spatial Anatomy on the Visual Field

The spatial surface of the visual field is observed on a lattice (i.e., uniform areal data); however, it is a complex projection of the underlying optic nerve head and exhibits anatomically induced spatial dependencies. In particular, localized damage to the optic disc can result in clinically deterministic deterioration across the visual field. Incorporating this non-standard spatial dependence structure into our methodology is a priority for properly analyzing these data, although it is commonly ignored. Translating this into math lingo, this means that a naive modeling of the spatial surface of the visual field would be inappropriate (i.e., neighbors defined through adjacent locations). Instead, the definition of a neighbor when considering vision loss on the visual field must depend on the underlying anatomical proximities.

To illustrate this concept, we begin by displaying the visual field neighborhood structure. The adjacency matrix for the HFA-II is available in the womblR package. In this analysis, we use a queen specification, meaning that an adjacency is defined as any location that shares an edge or corner on the lattice. We now load this adjacency matrix and remove the two locations that correspond to the blind spot.

blind_spot <- c(26, 35) # define blind spot
W <- HFAII_Queen[-blind_spot, -blind_spot] # HFA-II visual field adjacency matrix

This adjacency structure can be displayed using the graph.adjacency function in the igraph package.

library(igraph)
adj.graph <- graph.adjacency(W, mode = "undirected") 
plot(adj.graph)
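
To make the queen specification concrete, the sketch below builds a queen adjacency matrix for a small rectangular lattice in base R. The function `queen_adjacency` is a hypothetical helper written for this illustration; it is not part of womblR, and the real HFA-II lattice is not a full rectangle.

```r
# Queen adjacency for a small n_row x n_col lattice (base R only).
queen_adjacency <- function(n_row, n_col) {
  # Grid coordinates of every cell, numbered column by column
  cells <- expand.grid(row = seq_len(n_row), col = seq_len(n_col))
  # Absolute row and column offsets between every pair of cells
  d_row <- abs(outer(cells$row, cells$row, "-"))
  d_col <- abs(outer(cells$col, cells$col, "-"))
  # Queen neighbors are exactly one step away, edges and corners alike
  (pmax(d_row, d_col) == 1) * 1
}

W_toy <- queen_adjacency(3, 3)
print(W_toy)
rowSums(W_toy)   # corners have 3 neighbors, edges 5, the center 8
```

The resulting matrix is symmetric with a zero diagonal, which is the form expected by `graph.adjacency` with `mode = "undirected"`.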

As mentioned above, naively assuming that all of these adjacencies are equal ignores the important underlying anatomy that enforces these dependencies. This anatomical relationship of the visual field test points and the underlying optic nerve head was studied by Garway-Heath et al. (2000), in which they estimated the angle at which each test location’s underlying retinal ganglion cells enter the optic disc, measured in degrees. These angles are the missing link that will allow the visual field adjacency structure to be dictated by the underlying anatomy. These angles can be visualized using the function PlotAdjacency from womblR, which displays neighborhood structures across the visual field. Before using this function, we need to load the angles measured in Garway-Heath et al. (2000). These are available from womblR; again, we remove the blind spot before using.

Angles <- GarwayHeath[-blind_spot] # Garway-Heath angles
summary(Angles)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   80.75  192.50  177.35  275.75  329.00

We are now ready to visualize the neighborhood structure of the visual field using the PlotAdjacency function.

###Plot the angles on the visual field
PlotAdjacency(W = W, 
              DM = Angles, 
              zlim = c(0, 180), 
              Visit = NA, 
              edgewidth = 3.75,
              cornerwidth = 0.33,
              lwd.border = 3.75,
              main = "Garway-Heath angles\n across the visual field")

The angles measured by Garway-Heath et al. are presented at each location on the visual field. More interestingly, the distances between these angles are presented for each of the neighbor pairs. This figure is equivalent to the adjacency plot displayed above, but allows the adjacencies to vary as a function of the anatomy. In particular, if two visual field locations are anatomically similar, the dependency is strengthened (i.e., more white), and if the locations are close to anatomically independent, the dependency is weaker (i.e., more black). Here the edge adjacencies are represented by lines, while the diagonal adjacencies are represented as two triangles. This view of the visual field details the anatomical importance in modeling visual field data, as neighboring locations can have underlying retinal ganglion cells that enter the optic nerve head with a large degree of separation. In particular, locations on either side of the equator, although adjacent, are close to anatomically independent.
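
As a rough illustration of a distance between two angles, the sketch below measures the shorter arc between them on the circle. This is an assumption: it is consistent with the `zlim = c(0, 180)` range used in the plot, but the text does not state how womblR computes the dissimilarity, and `angle_dist` is a hypothetical helper.

```r
# Shorter-arc distance between two angles given in degrees (range 0 to 180).
# Assumption: this mimics the dissimilarity between Garway-Heath angles.
angle_dist <- function(a, b) {
  d <- abs(a - b) %% 360
  pmin(d, 360 - d)
}

# Smallest and largest angles from the summary above: 11 and 329 degrees
angle_dist(11, 329)    # 42: close on the circle despite the large raw gap
angle_dist(90, 270)    # 180: directly opposite each other

# Pairwise distances for a few hypothetical locations
toy_angles <- c(11, 80, 192, 329)
outer(toy_angles, toy_angles, angle_dist)
```

Under this measure, two adjacent locations whose nerve fibers enter the optic disc on opposite sides receive a distance near 180, which corresponds to the near-independent (dark) adjacencies in the figure.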

How to Model Visual Field Data?

If you have gotten this far in the post, hopefully you have the sense that the study of visual field data is statistically interesting and clinically important for properly assessing a glaucoma patient’s risk of vision loss. In the next two blog posts, we will explore how visual field data are currently analyzed and new methods that account for the anatomical structure detailed above. To accomplish this, we will break down the algorithm and software used to build the womblR package, and will attempt to illustrate the importance of R packages for open-source clinical research.

Reference

Garway-Heath, David F., Darmalingum Poinoosawmy, Frederick W. Fitzke, and Roger A. Hitchings. “Mapping the visual field to the optic disc in normal tension glaucoma eyes” Ophthalmology 107, no. 10 (2000): 1809-1815.

Serendipity at R / Medicine

Tue, 16 Oct 2018 00:00:00 +0000

We knew we were on to something important early on in the process of organizing R / Medicine 2018. Even during our initial attempts to articulate the differences between this conference and R / Pharma 2018, it became clear that the focus on the use of R and statistics in clinical settings was going to be a richer topic than just the design of clinical trials. However, it wasn’t until the conference got underway that we realized there was magic in the mix of attendees. R / Medicine attracted quite a few clinicians who were themselves using R in their work, or were in the process of teaching themselves R. This group catalyzed the discussions that continued throughout the conference, enabling high-bandwidth exchanges that would have otherwise suffered from the effort to translate between the two cultures. The small, single-track nature of the conference helped to keep the conversations going, with the questions and answers at the end of a given talk helping to enrich the quality of successive discussions.

Rob Tibshirani set the collaborative tone for the conference with his opening keynote talk describing the clinical forecasting system he and his collaborators have built to predict platelet usage for the Stanford hospitals. Big-league and big-impact, the system shows the promise of delivering real clinical and financial benefits. Tibshirani’s presentation of the modeling process also set the bar for clarity.

The other keynotes were also “top shelf”. Michael Lawrence spoke about Scientific Software In-the-Large. He laid out three challenges for scientific programming at this scale:
     * Integration of independently developed modules
     * Translation of analyses and prototypes into software
     * Scalability
and addressed these issues using examples from the Bioconductor project.

Victoria Stodden’s Keynote, Computational Reproducibility in Medical Research: Toward Open Code and Data, was a meditation on the need to reassess scientific transparency in an age where big data and computational power are driving medical research, and deep intellectual contributions are encoded in software. I was particularly struck by the idea that progress towards computational reproducibility depends on the coordination of stakeholders.

Perhaps the highest-energy talk of the conference (and maybe all of the conferences I have attended this year) was given by Yale’s Dr. Harlan Krumholz. Unfortunately, we have neither video nor slides from this keynote, but to give you some idea of Dr. Krumholz’s iconoclastic work, look at the 2010 Forbes article and this more recent article published in Health Affairs. The following are some notes I managed to take at the talk between moments of mesmerization. With respect to medicine in general, Dr. Krumholz said that:

There could not be a more exciting era in medicine. Medicine is emerging as an information science and the clinician’s role is changing to be a guide or interpreter, not a shaman.

Commenting on evidence-based medicine:

More than half of the guidelines in cardiology are not based on evidence.

With respect to medical data, he said:

The goal should be to take high-dimensional data and make it low-dimensional. Instead of thinking that everyone should have the same data, we should move towards thinking: How do we use the data that we do have? There should be no missing data.

I took these statements to mean that teams of clinicians, statisticians, and data scientists should be working towards building predictive models for individual patients based on whatever data is available for them and whatever big data is relevant. This was clearly the music the crowd wanted to dance to.

The slides for most of the rest of the talks are available on the website. One talk I would like to highlight here is Nathaniel Phillips’ talk on Fast and Frugal Trees.

This talk addressed a recurring theme throughout the conference: the difference in decision making between the two cultures of statisticians and physicians. Probabilistic estimates to characterize risk and to inform decision making are central to a statistician’s worldview. Physicians, on the other hand, are in general not comfortable with probabilities, and when push comes to shove, prefer unambiguous guidelines and thresholds, such as blood pressure ranges, to inform treatment decisions. A vexing cultural problem is to identify effective decision models that have a chance of actually being used by clinicians.

The conference finished with a roundtable discussion with the theme Bridging the Two Cultures, with panelists Beth Atkinson, Joseph Chou, Peter Higgins, Stephan Kadauke, Chinonyerem Madu, and Jack Wasey representing both the statistical and clinical points of view. The moderator (me) began by asking three questions:
     1. How do clinicians engage with statisticians and data scientists?
     2. What are some key ideas you should know about collaborating?
     3. In your experience, what kinds of engagements have been the most successful?

Panelists were free to respond as they felt inclined to any or all of the questions. As I recall, a consensus emerged around three key ideas: make an effort to empathize with colleagues, meet frequently and go out of your way to interact with colleagues, and carefully select projects and then cultivate them.

Planning is already underway for R / Medicine 2019. Mark the week of September 23rd, and stay tuned!
