Equal Size kmeans

Harrison Schramm and Carol DeZwarte 2019-06-13

We were recently presented with a problem where the decision maker wanted to understand how their data would naturally group together. The classic technique of k-means clustering was a natural choice; it’s well known, computationally efficient, and implemented in base R via the kmeans() function. Our problem had a slight wrinkle: the decision maker wished to see the data grouped into clusters of (nearly) equal size. Now, a ‘true’ statistician would tell the client that the right thing to do from a theoretical perspective is to use the native k-means results, because some centers simply have more nearby points than others.

reticulate, virtualenv, and Python in Linux

Roland Stevenson 2019-06-10

Roland Stevenson is a data scientist and consultant who may be reached on LinkedIn. reticulate is an R package that allows us to use Python modules from within RStudio. I recently found this functionality useful while trying to compare the results of different uplift models. Though I did have R’s uplift package producing Qini charts and metrics, I also wanted to see how things looked with Wayfair’s promising pylift package.

Introducing DeclareDesign, a Platform for Research Design

Graeme Blair, Jasper Cooper, Alexander Coppock and Macartan Humphreys 2019-06-04

Research design consists of a set of choices about what research procedures to use: for example, how many subjects to interview, which questions to ask them, and what to do in the analysis phase with the data that result from these choices. We do not have good tools for assessing whether the chosen procedures are good ones. DeclareDesign is an R package for learning about, implementing, and communicating research procedures, from data collection to data analysis.

April 2019: "Top 40" New CRAN Packages

Joseph Rickert 2019-05-30

One hundred eighty-seven new packages made it to CRAN in April. Here are my picks for the “Top 40”, organized into ten categories: Biotechnology, Data, Econometrics, Machine Learning, Medicine, Science, Statistics, Time Series, Utilities, and Visualization. Biotechnology: genpwr v1.00 provides functions for power and sample size calculations for genetic association studies, allowing for mis-specification of the model of genetic susceptibility. The methods employed are extensions of Gauderman (2002) and Gauderman (2002).

Momentum Investing with R

Jonathan Regenstein 2019-05-29

After an extended hiatus, Reproducible Finance is back! We’ll celebrate by changing focus a bit and coding up an investment strategy called Momentum. Before we even tiptoe in that direction, please note that this is not intended as investment advice and it’s not intended to be a script that can be implemented for trading.

Analysing the HIV pandemic, Part 4: Classification of lab samples

Andrie de Vries and Armand Bester 2019-05-23

This is part 4 of a four-part series about the HIV epidemic in Africa. In this final part, we discuss how genetic diversity can be used to classify laboratory samples into either inter-patient or intra-patient classes, using logistic regression. This helps with quality control in the lab, since it makes it possible to match new samples with earlier samples from the same patient, even when they were taken years apart, while allowing for mutation of the HIV genomic sequence.

Analysing the HIV pandemic, Part 3: Genetic diversity

Armand Bester and Andrie de Vries 2019-05-16

This is part 3 of a four-part series about the HIV epidemic in Africa. In a recent publication in PLoS ONE, the authors described how they used affordable hardware to create a phylogenetic pipeline, tailored for the HIV drug-resistance testing facility. In this part, we discuss genetic diversity and how it can be analysed using Markov chains and heatmaps.

Virtual Morel Foraging with R

Bryan Lewis 2019-05-13

Enjoy a virtual mushroom hunt with R and RSelenium, which allows R to use a web browser as a human would, including clicking on buttons, etc.

Analysing the HIV pandemic, Part 2: Drug resistance testing

Armand Bester, Dominique Goedhals and Andrie de Vries 2019-05-07

This is part 2 of a four-part series about the HIV epidemic in Africa. In a recent publication in PLoS ONE, the authors described how they used affordable hardware to create a phylogenetic pipeline, tailored for the HIV drug-resistance testing facility. Part 2 discusses drug-resistance testing of HIV isolates in sub-Saharan Africa.

Analysing the HIV pandemic, Part 1: HIV in sub-Sahara Africa

Armand Bester, Sabeehah Vawda and Andrie de Vries 2019-04-30

The Human Immunodeficiency Virus (HIV) is the virus that causes acquired immunodeficiency syndrome (AIDS). The virus invades various immune cells, causing loss of immunity, and thus increases susceptibility to infections such as tuberculosis, as well as to cancer. In a recent publication in PLoS ONE, the authors described how they used affordable hardware to create a phylogenetic pipeline, tailored for the HIV drug-resistance testing facility. In this first of a series of four posts, we highlight the serious problem of HIV infection in sub-Saharan Africa, with special analysis of the situation in South Africa. The subsequent posts will describe the phylogenetic pipeline (running on a Raspberry Pi) and the analysis of viral sequences using R.