R Views
https://rviews.rstudio.com/
An Alternative to the Correlation Coefficient That Works For Numeric and Categorical Variables
Thu, 15 Apr 2021
https://rviews.rstudio.com/2021/04/15/an-alternative-to-the-correlation-coefficient-that-works-for-numeric-and-categorical-variables/
<p><em>Dr. Rama Ramakrishnan is Professor of the Practice at MIT Sloan School of Management where he teaches courses in Data Science, Optimization and applied Machine Learning.</em></p>
<p>When starting to work with a new dataset, it is useful to quickly pinpoint which pairs of variables appear to be <em>strongly related</em>. It helps you spot data issues, make better modeling decisions, and ultimately arrive at better answers.</p>
<p>The <a href="https://en.wikipedia.org/wiki/Correlation_coefficient"><em>correlation coefficient</em></a> is used widely for this purpose, but it is well-known that it can’t detect non-linear relationships. Take a look at this scatterplot of two variables <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span>.</p>
<pre class="r"><code>library(ggplot2)
set.seed(42)
x <- seq(-1, 1, 0.01)
y <- sqrt(1 - x^2) + rnorm(length(x), mean = 0, sd = 0.05)
ggplot(mapping = aes(x, y)) +
  geom_point()</code></pre>
<p><img src="/2021/04/15/an-alternative-to-the-correlation-coefficient-that-works-for-numeric-and-categorical-variables/index_files/figure-html/unnamed-chunk-1-1.png" width="672" /></p>
<p>It is obvious to the human eye that <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> have a strong relationship but the correlation coefficient between <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> is only -0.01.</p>
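<p>As a quick check (a sketch that re-generates the same data as above), the near-zero value can be computed directly:</p>
<pre class="r"><code># Recreate the semicircle data and compute the Pearson correlation
set.seed(42)
x <- seq(-1, 1, 0.01)
y <- sqrt(1 - x^2) + rnorm(length(x), mean = 0, sd = 0.05)
cor(x, y)  # near zero despite the obvious relationship</code></pre>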
<p>Further, if either variable of the pair is <em>categorical</em>, we can’t use the correlation coefficient. We will have to turn to other metrics. If <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> are <strong>both</strong> categorical, we can try <a href="https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V">Cramer’s V</a> or <a href="https://en.wikipedia.org/wiki/Phi_coefficient">the phi coefficient</a>. If <span class="math inline">\(x\)</span> is continuous and <span class="math inline">\(y\)</span> is binary, we can use the <a href="https://en.wikipedia.org/wiki/Point-biserial_correlation_coefficient">point-biserial correlation coefficient.</a></p>
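<p>For reference, Cramer's V can be computed in a few lines from the chi-squared statistic (a minimal sketch, not part of the original post; <code>cramers_v</code> is a hypothetical helper name):</p>
<pre class="r"><code># Cramer's V for two categorical vectors, via the chi-squared statistic
cramers_v <- function(u, v) {
  tab <- table(u, v)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  sqrt(as.numeric(chi2) / (sum(tab) * (min(dim(tab)) - 1)))
}</code></pre>
<p>A perfectly associated pair yields 1; an independent pair yields (close to) 0.</p>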
<p>But using different metrics is problematic. Since they are derived from different assumptions, we can’t <strong>compare the resulting numbers with one another</strong>. If the correlation coefficient between continuous variables <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> is 0.6 and the phi coefficient between categorical variables <span class="math inline">\(u\)</span> and <span class="math inline">\(v\)</span> is also 0.6, can we safely conclude that the relationships are equally strong? According to <a href="https://en.wikipedia.org/wiki/Phi_coefficient">Wikipedia</a>,</p>
<blockquote>
<p>The correlation coefficient ranges from −1 to +1, where ±1 indicates perfect agreement or disagreement, and 0 indicates no relationship. The phi coefficient has a maximum value that is determined by the distribution of the two variables if one or both variables can take on more than two values.</p>
</blockquote>
<p>A phi coefficient value of 0.6 between <span class="math inline">\(u\)</span> and <span class="math inline">\(v\)</span> may not mean much if its maximum possible value in this particular situation is much higher. Perhaps we can normalize the phi coefficient to map it to the 0-1 range? But what if that modification introduces biases?</p>
<p>Wouldn’t it be nice if we had <strong>one</strong> uniform approach that was easy to understand, worked for continuous <strong>and</strong> categorical variables alike, and could detect linear <strong>and</strong> nonlinear relationships?</p>
<p>(By the way, when <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> are continuous, looking at a scatterplot of <span class="math inline">\(x\)</span> vs. <span class="math inline">\(y\)</span> can be very effective, since the human brain detects linear and non-linear patterns very quickly. But even if you are lucky and <em>all</em> your variables are continuous, examining scatterplots of <em>all</em> pairs of variables is impractical when your dataset has many variables: with just 100 predictors, say, you would need to look through 4,950 scatterplots.)</p>
<p><br></p>
<div id="a-potential-solution" class="section level3">
<h3>A Potential Solution</h3>
<p>To devise a metric that satisfies the requirements we listed above, let’s <em>invert</em> the problem: What does it mean to say that <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> <strong>don’t</strong> have a strong relationship?</p>
<p>Intuitively, if there’s no relationship between <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span>, we would expect to see no patterns in a scatterplot of <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> - no lines, curves, groups etc. It will be a cloud of points that appears to be randomly scattered, perhaps something like this:</p>
<pre class="r"><code>x <- seq(-1, 1, 0.01)
y <- runif(length(x), min = -1, max = 1)
ggplot(mapping = aes(x, y)) +
  geom_point()</code></pre>
<p><img src="/2021/04/15/an-alternative-to-the-correlation-coefficient-that-works-for-numeric-and-categorical-variables/index_files/figure-html/unnamed-chunk-2-1.png" width="672" /></p>
<p>In this situation, does knowing the value of <span class="math inline">\(x\)</span> give us any information on <span class="math inline">\(y\)</span>?</p>
<p>Clearly not. <span class="math inline">\(y\)</span> seems to be somewhere between -1 and 1 with no particular pattern, regardless of the value of <span class="math inline">\(x\)</span>. Knowing <span class="math inline">\(x\)</span> does not seem to help <em>reduce our uncertainty</em> about the value of <span class="math inline">\(y\)</span>.</p>
<p>In contrast, look at the first picture again.</p>
<p><img src="/2021/04/15/an-alternative-to-the-correlation-coefficient-that-works-for-numeric-and-categorical-variables/index_files/figure-html/unnamed-chunk-3-1.png" width="672" /></p>
<p>Here, knowing the value of <span class="math inline">\(x\)</span> <em>does</em> help. If we know that <span class="math inline">\(x\)</span> is around 0.0, for example, from the graph we will guess that <span class="math inline">\(y\)</span> is likely near 1.0 (the red dots). We can be confident that <span class="math inline">\(y\)</span> is <strong>not</strong> between 0 and 0.8. Knowing <span class="math inline">\(x\)</span> helps us eliminate certain values of <span class="math inline">\(y\)</span>, <strong>reducing our uncertainty</strong> about the values <span class="math inline">\(y\)</span> might take.</p>
<p>This notion - that knowing something reduces our uncertainty about something else - is exactly the idea behind <a href="https://en.wikipedia.org/wiki/Mutual_information">mutual information</a> from <a href="https://en.wikipedia.org/wiki/Information_theory">Information Theory</a>.</p>
<p>According to <a href="https://en.wikipedia.org/wiki/Mutual_information">Wikipedia</a> (emphasis mine),</p>
<blockquote>
<p>Intuitively, mutual information measures the information that <span class="math inline">\(X\)</span> and <span class="math inline">\(Y\)</span> share: It measures <strong>how much knowing one of these variables reduces uncertainty about the other</strong>. For example, if <span class="math inline">\(X\)</span> and <span class="math inline">\(Y\)</span> are independent, then knowing <span class="math inline">\(X\)</span> does not give any information about <span class="math inline">\(Y\)</span> and vice versa, so their mutual information is zero.</p>
</blockquote>
<p>Furthermore,</p>
<blockquote>
<p><strong>Not limited to real-valued random variables and linear dependence like the correlation coefficient</strong>, MI is more general and determines how different the joint distribution of the pair <span class="math inline">\((X,Y)\)</span> is from the product of the marginal distributions of <span class="math inline">\(X\)</span> and <span class="math inline">\(Y\)</span>.</p>
</blockquote>
<p>This is very promising!</p>
<p>As it turns out, however, implementing mutual information is not so simple. We first need to estimate the joint probabilities (i.e., the joint probability density/mass function) of <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> before we can calculate their mutual information. If <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> are categorical, this is easy, but if one or both of them are continuous, it is more involved.</p>
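<p>To make the easy categorical case concrete, here is a minimal sketch (not from the post; <code>mutual_info</code> is a hypothetical helper) that estimates mutual information from the empirical joint distribution of two categorical vectors:</p>
<pre class="r"><code># Mutual information (in nats) from empirical joint and marginal probabilities
mutual_info <- function(u, v) {
  p_uv <- table(u, v) / length(u)  # joint probability estimates
  p_u <- rowSums(p_uv)             # marginal distribution of u
  p_v <- colSums(p_uv)             # marginal distribution of v
  terms <- p_uv * log(p_uv / outer(p_u, p_v))
  sum(terms[p_uv > 0])             # skip zero-probability cells
}</code></pre>
<p>For independent variables this is (close to) zero; for identical variables it equals the entropy of the variable.</p>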
<p>But we can use the basic insight behind mutual information – that knowing <span class="math inline">\(x\)</span> may reduce our uncertainty about <span class="math inline">\(y\)</span> – in a different way.</p>
<p><br></p>
</div>
<div id="the-x2y-metric" class="section level3">
<h3>The X2Y Metric</h3>
<p>Consider three variables <span class="math inline">\(x\)</span>, <span class="math inline">\(y\)</span> and <span class="math inline">\(z\)</span>. If knowing <span class="math inline">\(x\)</span> reduces our uncertainty about <span class="math inline">\(y\)</span> by 70% but knowing <span class="math inline">\(z\)</span> reduces our uncertainty about <span class="math inline">\(y\)</span> by only 40%, we will intuitively expect that the association between <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> will be stronger than the association between <span class="math inline">\(z\)</span> and <span class="math inline">\(y\)</span>.</p>
<p>So, if we can <em>quantify</em> the reduction in uncertainty, that can be used as a measure of the strength of the association. One way to do so is to measure <span class="math inline">\(x\)</span>’s ability to <em>predict</em> <span class="math inline">\(y\)</span> - after all, <strong>if <span class="math inline">\(x\)</span> reduces our uncertainty about <span class="math inline">\(y\)</span>, knowing <span class="math inline">\(x\)</span> should help us predict <span class="math inline">\(y\)</span> better than if we didn’t know <span class="math inline">\(x\)</span></strong>.</p>
<p>Stated another way, we can think of reduction in prediction error <span class="math inline">\(\approx\)</span> reduction in uncertainty <span class="math inline">\(\approx\)</span> strength of association.</p>
<p>This suggests the following approach:</p>
<ol style="list-style-type: decimal">
<li>Predict <span class="math inline">\(y\)</span> <em>without using</em> <span class="math inline">\(x\)</span>.
<ul>
<li>If <span class="math inline">\(y\)</span> is continuous, we can simply use the average value of <span class="math inline">\(y\)</span>.</li>
<li>If <span class="math inline">\(y\)</span> is categorical, we can use the most frequent value of <span class="math inline">\(y\)</span>.</li>
<li>Such simple predictions are sometimes referred to as a <em>baseline</em> model.</li>
</ul></li>
<li>Predict <span class="math inline">\(y\)</span> <em>using</em> <span class="math inline">\(x\)</span>
<ul>
<li>We can take any of the standard predictive models out there (Linear/Logistic Regression, CART, Random Forests, SVMs, Neural Networks, Gradient Boosting etc.), set <span class="math inline">\(x\)</span> as the independent variable and <span class="math inline">\(y\)</span> as the dependent variable, fit the model to the data, and make predictions. More on this below.</li>
</ul></li>
<li>Calculate the <strong>% decrease in prediction error</strong> when we go from (1) to (2)
<ul>
<li>If <span class="math inline">\(y\)</span> is continuous, we can use any of the familiar error metrics like RMSE, SSE, MAE etc. I prefer mean absolute error (MAE) since it is less susceptible to outliers and is in the same units as <span class="math inline">\(y\)</span> but this is a matter of personal preference.</li>
<li>If <span class="math inline">\(y\)</span> is categorical, we can use Misclassification Error (= 1 - Accuracy) as the error metric.</li>
</ul></li>
</ol>
<blockquote>
<p>In summary, the % reduction in error when we go from a baseline model to a predictive model measures the strength of the relationship between <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span>. We will call this metric <code>x2y</code> since it measures the ability of <span class="math inline">\(x\)</span> to predict <span class="math inline">\(y\)</span>.</p>
</blockquote>
<p>(This definition is similar to <a href="https://en.wikipedia.org/wiki/Coefficient_of_determination"><em>R-Squared</em></a> from Linear Regression. In fact, if <span class="math inline">\(y\)</span> is continuous and we use the Sum of Squared Errors as our error metric, the <code>x2y</code> metric is equal to R-Squared.)</p>
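<p>The three steps translate almost directly into code. Below is a minimal sketch for the case where <span class="math inline">\(y\)</span> is continuous, using MAE and <code>rpart</code> (the full implementation, covering all variable types, is in the linked R script; <code>x2y_continuous</code> is a hypothetical name):</p>
<pre class="r"><code>library(rpart)

# A sketch of the x2y computation when y is continuous, using MAE
x2y_continuous <- function(x, y) {
  d <- data.frame(x = x, y = y)
  baseline_mae <- mean(abs(y - mean(y)))                     # step 1: baseline model
  preds <- predict(rpart(y ~ x, data = d), type = "vector")  # step 2: predict y from x
  model_mae <- mean(abs(y - preds))
  100 * (baseline_mae - model_mae) / baseline_mae            # step 3: % error reduction
}</code></pre>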
<p>To implement (2) above, we need to pick a predictive model to use. Let’s remind ourselves of what the requirements are:</p>
<ul>
<li>If there’s a non-linear relationship between <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span>, the model should be able to detect it</li>
<li>It should be able to handle all possible <span class="math inline">\(x\)</span>-<span class="math inline">\(y\)</span> variable types: continuous-continuous, continuous-categorical, categorical-continuous and categorical-categorical</li>
<li>We may have hundreds (if not thousands) of pairs of variables we want to analyze so we want this to be quick</li>
</ul>
<p><a href="https://en.wikipedia.org/wiki/Decision_tree_learning">Classification and Regression Trees (CART)</a> satisfies these requirements very nicely and that’s the one I prefer to use. That said, you can certainly use other models if you like.</p>
<p>Let’s try this approach on the ‘semicircle’ dataset from above. We use CART to predict <span class="math inline">\(y\)</span> using <span class="math inline">\(x\)</span> and here’s how the fitted values look:</p>
<pre class="r"><code># Let's generate the data again
library(ggplot2)
library(rpart)
set.seed(42)
x <- seq(-1, 1, 0.01)
d <- data.frame(x = x,
                y = sqrt(1 - x^2) + rnorm(length(x), mean = 0, sd = 0.05))
preds <- predict(rpart(y ~ x, data = d, method = "anova"), type = "vector")
# Set up a chart
ggplot(data = d, mapping = aes(x = x)) +
  geom_point(aes(y = y), size = 0.5) +
  geom_line(aes(y = preds, color = '2')) +
  scale_color_brewer(name = "", labels = 'CART', palette = "Set1")</code></pre>
<p><img src="/2021/04/15/an-alternative-to-the-correlation-coefficient-that-works-for-numeric-and-categorical-variables/index_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
<p>Visually, the CART predictions seem to approximate the semi-circular relationship between <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span>. To confirm, let’s calculate the <code>x2y</code> metric step by step.</p>
<ul>
<li>The MAE from using the average of <span class="math inline">\(y\)</span> to predict <span class="math inline">\(y\)</span> is 0.19.</li>
<li>The MAE from using the CART predictions to predict <span class="math inline">\(y\)</span> is 0.06.</li>
<li>The % reduction in MAE is 68.88%.</li>
</ul>
<p>Excellent!</p>
<p>If you are familiar with CART models, it is straightforward to implement the <code>x2y</code> metric in the Machine Learning environment of your choice. An R implementation is <a href="x2y.R">here</a> and details can be found in the <a href="#appendix">appendix</a> but, for now, I want to highlight two functions from the R script that we will use in the examples below:</p>
<ul>
<li><code>x2y(u, v)</code> calculates the <code>x2y</code> metric between two vectors <span class="math inline">\(u\)</span> and <span class="math inline">\(v\)</span></li>
<li><code>dx2y(d)</code> calculates the <code>x2y</code> metric between all pairs of variables in a dataframe <span class="math inline">\(d\)</span></li>
</ul>
<p><br></p>
</div>
<div id="two-caveats" class="section level3">
<h3>Two Caveats</h3>
<p>Before we demonstrate the <code>x2y</code> metric on a couple of datasets, I want to highlight two aspects of the <code>x2y</code> approach.</p>
<p>Unlike metrics like the correlation coefficient, the <code>x2y</code> metric is <strong>not</strong> symmetric with respect to <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span>. The extent to which <span class="math inline">\(x\)</span> can predict <span class="math inline">\(y\)</span> can be different from the extent to which <span class="math inline">\(y\)</span> can predict <span class="math inline">\(x\)</span>. For the semi-circle dataset, <code>x2y(x,y)</code> is 68.88% but <code>x2y(y,x)</code> is only 10.2%.</p>
<p>This shouldn’t come as a surprise, however. Let’s look at the scatterplot again but with the axes reversed.</p>
<pre class="r"><code>ggplot(data = d, mapping = aes(x = y)) +
  geom_point(aes(y = x), size = 0.5) +
  geom_point(data = d[abs(d$x) < 0.05, ], aes(x = y, y = x), color = "orange") +
  geom_point(data = d[abs(d$y - 0.6) < 0.05, ], aes(x = y, y = x), color = "red")</code></pre>
<p><img src="/2021/04/15/an-alternative-to-the-correlation-coefficient-that-works-for-numeric-and-categorical-variables/index_files/figure-html/unnamed-chunk-5-1.png" width="672" /></p>
<p>When <span class="math inline">\(x\)</span> is around 0.0, for instance, <span class="math inline">\(y\)</span> is near 1.0 (the orange dots). But when <span class="math inline">\(y\)</span> is around 0.6, <span class="math inline">\(x\)</span> can be in the (-1.0, -0.75) range <em>or</em> in the (0.5, 0.75) range (the red dots). Knowing <span class="math inline">\(x\)</span> reduces the uncertainty about the value of <span class="math inline">\(y\)</span> a lot more than knowing <span class="math inline">\(y\)</span> reduces the uncertainty about the value of <span class="math inline">\(x\)</span>.</p>
<p>But there’s an easy solution if you <em>must</em> have a symmetric metric for your application: just take the average of <code>x2y(x,y)</code> and <code>x2y(y,x)</code>.</p>
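<p>Such a symmetric wrapper takes one line (a sketch; <code>x2y_fn</code> stands in for whichever <code>x2y</code> implementation you use):</p>
<pre class="r"><code># Average the two directional values to get a symmetric measure
symmetric_x2y <- function(u, v, x2y_fn) {
  (x2y_fn(u, v) + x2y_fn(v, u)) / 2
}</code></pre>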
<p>The second aspect worth highlighting is the comparability of the <code>x2y</code> metric across variable pairs. All <code>x2y</code> values where the <span class="math inline">\(y\)</span> variable is continuous measure a % reduction in MAE; all <code>x2y</code> values where the <span class="math inline">\(y\)</span> variable is categorical measure a % reduction in Misclassification Error. Is a 30% reduction in MAE equivalent to a 30% reduction in Misclassification Error? That is problem-dependent; there is no universal right answer.</p>
<p>On the other hand, since (1) <em>all</em> <code>x2y</code> values are on the same 0-100% scale, (2) they all conceptually measure the same thing, i.e., reduction in prediction error, and (3) our objective is to quickly scan and identify strongly-related pairs (rather than conduct an in-depth investigation), the <code>x2y</code> approach may be adequate.</p>
<p><br></p>
</div>
<div id="application-to-the-iris-dataset" class="section level3">
<h3>Application to the Iris Dataset</h3>
<p>The <a href="https://en.wikipedia.org/wiki/Iris_flower_data_set">iris flower dataset</a> is iconic in the statistics/ML communities and is widely used to illustrate basic concepts. The dataset consists of 150 observations in total and each observation has four continuous variables - the length and the width of petals and sepals - and a categorical variable indicating the species of iris.</p>
<p>Let’s take a look at 10 randomly chosen rows.</p>
<pre class="r"><code>library(dplyr)
library(pander)
iris %>% sample_n(10) %>% pander</code></pre>
<table>
<colgroup>
<col width="20%" />
<col width="19%" />
<col width="20%" />
<col width="19%" />
<col width="19%" />
</colgroup>
<thead>
<tr class="header">
<th align="center">Sepal.Length</th>
<th align="center">Sepal.Width</th>
<th align="center">Petal.Length</th>
<th align="center">Petal.Width</th>
<th align="center">Species</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="center">5.9</td>
<td align="center">3</td>
<td align="center">5.1</td>
<td align="center">1.8</td>
<td align="center">virginica</td>
</tr>
<tr class="even">
<td align="center">5.5</td>
<td align="center">2.6</td>
<td align="center">4.4</td>
<td align="center">1.2</td>
<td align="center">versicolor</td>
</tr>
<tr class="odd">
<td align="center">6.1</td>
<td align="center">2.8</td>
<td align="center">4</td>
<td align="center">1.3</td>
<td align="center">versicolor</td>
</tr>
<tr class="even">
<td align="center">5.9</td>
<td align="center">3.2</td>
<td align="center">4.8</td>
<td align="center">1.8</td>
<td align="center">versicolor</td>
</tr>
<tr class="odd">
<td align="center">7.7</td>
<td align="center">2.6</td>
<td align="center">6.9</td>
<td align="center">2.3</td>
<td align="center">virginica</td>
</tr>
<tr class="even">
<td align="center">5.7</td>
<td align="center">4.4</td>
<td align="center">1.5</td>
<td align="center">0.4</td>
<td align="center">setosa</td>
</tr>
<tr class="odd">
<td align="center">6.5</td>
<td align="center">3</td>
<td align="center">5.2</td>
<td align="center">2</td>
<td align="center">virginica</td>
</tr>
<tr class="even">
<td align="center">5.2</td>
<td align="center">2.7</td>
<td align="center">3.9</td>
<td align="center">1.4</td>
<td align="center">versicolor</td>
</tr>
<tr class="odd">
<td align="center">5.6</td>
<td align="center">2.7</td>
<td align="center">4.2</td>
<td align="center">1.3</td>
<td align="center">versicolor</td>
</tr>
<tr class="even">
<td align="center">7.2</td>
<td align="center">3.2</td>
<td align="center">6</td>
<td align="center">1.8</td>
<td align="center">virginica</td>
</tr>
</tbody>
</table>
<p>We can calculate the <code>x2y</code> values for all pairs of variables in <code>iris</code> by running <code>dx2y(iris)</code> in R (details of how to use the <code>dx2y()</code> function are in the <a href="#appendix">appendix</a>).</p>
<pre class="r"><code>dx2y(iris) %>% pander</code></pre>
<table style="width:72%;">
<colgroup>
<col width="20%" />
<col width="20%" />
<col width="19%" />
<col width="11%" />
</colgroup>
<thead>
<tr class="header">
<th align="center">x</th>
<th align="center">y</th>
<th align="center">perc_of_obs</th>
<th align="center">x2y</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="center">Petal.Width</td>
<td align="center">Species</td>
<td align="center">100</td>
<td align="center">94</td>
</tr>
<tr class="even">
<td align="center">Petal.Length</td>
<td align="center">Species</td>
<td align="center">100</td>
<td align="center">93</td>
</tr>
<tr class="odd">
<td align="center">Petal.Width</td>
<td align="center">Petal.Length</td>
<td align="center">100</td>
<td align="center">80.73</td>
</tr>
<tr class="even">
<td align="center">Species</td>
<td align="center">Petal.Length</td>
<td align="center">100</td>
<td align="center">79.72</td>
</tr>
<tr class="odd">
<td align="center">Petal.Length</td>
<td align="center">Petal.Width</td>
<td align="center">100</td>
<td align="center">77.32</td>
</tr>
<tr class="even">
<td align="center">Species</td>
<td align="center">Petal.Width</td>
<td align="center">100</td>
<td align="center">76.31</td>
</tr>
<tr class="odd">
<td align="center">Sepal.Length</td>
<td align="center">Petal.Length</td>
<td align="center">100</td>
<td align="center">66.88</td>
</tr>
<tr class="even">
<td align="center">Sepal.Length</td>
<td align="center">Species</td>
<td align="center">100</td>
<td align="center">62</td>
</tr>
<tr class="odd">
<td align="center">Petal.Length</td>
<td align="center">Sepal.Length</td>
<td align="center">100</td>
<td align="center">60.98</td>
</tr>
<tr class="even">
<td align="center">Sepal.Length</td>
<td align="center">Petal.Width</td>
<td align="center">100</td>
<td align="center">54.36</td>
</tr>
<tr class="odd">
<td align="center">Petal.Width</td>
<td align="center">Sepal.Length</td>
<td align="center">100</td>
<td align="center">48.81</td>
</tr>
<tr class="even">
<td align="center">Species</td>
<td align="center">Sepal.Length</td>
<td align="center">100</td>
<td align="center">42.08</td>
</tr>
<tr class="odd">
<td align="center">Sepal.Width</td>
<td align="center">Species</td>
<td align="center">100</td>
<td align="center">39</td>
</tr>
<tr class="even">
<td align="center">Petal.Width</td>
<td align="center">Sepal.Width</td>
<td align="center">100</td>
<td align="center">31.75</td>
</tr>
<tr class="odd">
<td align="center">Petal.Length</td>
<td align="center">Sepal.Width</td>
<td align="center">100</td>
<td align="center">30</td>
</tr>
<tr class="even">
<td align="center">Sepal.Width</td>
<td align="center">Petal.Length</td>
<td align="center">100</td>
<td align="center">28.16</td>
</tr>
<tr class="odd">
<td align="center">Sepal.Width</td>
<td align="center">Petal.Width</td>
<td align="center">100</td>
<td align="center">23.02</td>
</tr>
<tr class="even">
<td align="center">Species</td>
<td align="center">Sepal.Width</td>
<td align="center">100</td>
<td align="center">22.37</td>
</tr>
<tr class="odd">
<td align="center">Sepal.Length</td>
<td align="center">Sepal.Width</td>
<td align="center">100</td>
<td align="center">18.22</td>
</tr>
<tr class="even">
<td align="center">Sepal.Width</td>
<td align="center">Sepal.Length</td>
<td align="center">100</td>
<td align="center">12.18</td>
</tr>
</tbody>
</table>
<p>The first two columns in the output are self-explanatory. The third column - <code>perc_of_obs</code> - is the % of observations in the dataset that was used to calculate that row’s <code>x2y</code> value. When a dataset has missing values, only observations that have values present for both <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> will be used to calculate the <code>x2y</code> metrics for that variable pair. The <code>iris</code> dataset has no missing values so this value is 100% for all rows. The fourth column is the value of the <code>x2y</code> metric and the results are sorted in descending order of this value.</p>
<p>Looking at the numbers, both <code>Petal.Length</code> and <code>Petal.Width</code> seem to be highly associated with <code>Species</code> (and with each other). In contrast, it appears that <code>Sepal.Length</code> and <code>Sepal.Width</code> are very weakly associated with each other.</p>
<p>Note that even though <code>Species</code> is categorical and the other four variables are continuous, we could simply “drop” the <code>iris</code> dataframe into the <code>dx2y()</code> function and calculate the associations between all the variables.</p>
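<p>To see how a mixed pair works under the hood, here is the categorical-<span class="math inline">\(y\)</span> calculation done by hand for <code>Petal.Width</code> predicting <code>Species</code> (a sketch; the exact value depends on implementation details such as how the tree is fit and validated, so it need not match the table exactly):</p>
<pre class="r"><code>library(rpart)

# Baseline: always predict the most frequent species
baseline_err <- 1 - max(table(iris$Species)) / nrow(iris)

# Model: a classification tree on Petal.Width
fit <- rpart(Species ~ Petal.Width, data = iris, method = "class")
preds <- predict(fit, newdata = iris, type = "class")
model_err <- mean(preds != iris$Species)

# % reduction in misclassification error
100 * (baseline_err - model_err) / baseline_err</code></pre>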
<p><br></p>
</div>
<div id="application-to-a-covid-19-dataset" class="section level3">
<h3>Application to a COVID-19 Dataset</h3>
<p>Next, we examine a <a href="https://github.com/rama100/x2y/blob/main/covid19.csv">COVID-19 dataset</a> that was downloaded from the <a href="https://github.com/mdcollab/covidclinicaldata/">COVID-19 Clinical Data Repository</a> in April 2020. This dataset contains clinical characteristics and COVID-19 test outcomes for 352 patients. Since it has a good mix of continuous and categorical variables, having something like the <code>x2y</code> metric that can work for any type of variable pair is convenient.</p>
<p>Let’s read in the data and take a quick look at the columns.</p>
<pre class="r"><code>df <- read.csv("covid19.csv", stringsAsFactors = FALSE)
str(df) </code></pre>
<pre><code>## 'data.frame': 352 obs. of 45 variables:
## $ date_published : chr "2020-04-14" "2020-04-14" "2020-04-14" "2020-04-14" ...
## $ clinic_state : chr "CA" "CA" "CA" "CA" ...
## $ test_name : chr "Rapid COVID-19 Test" "Rapid COVID-19 Test" "Rapid COVID-19 Test" "Rapid COVID-19 Test" ...
## $ swab_type : chr "" "Nasopharyngeal" "Nasal" "" ...
## $ covid_19_test_results : chr "Negative" "Negative" "Negative" "Negative" ...
## $ age : int 30 77 49 42 37 23 71 28 55 51 ...
## $ high_risk_exposure_occupation: logi TRUE NA NA FALSE TRUE FALSE ...
## $ high_risk_interactions : logi FALSE NA NA FALSE TRUE TRUE ...
## $ diabetes : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ chd : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ htn : logi FALSE TRUE FALSE TRUE FALSE FALSE ...
## $ cancer : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ asthma : logi TRUE TRUE FALSE TRUE FALSE FALSE ...
## $ copd : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ autoimmune_dis : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ temperature : num 37.1 36.8 37 36.9 37.3 ...
## $ pulse : int 84 96 79 108 74 110 78 NA 97 66 ...
## $ sys : int 117 128 120 156 126 134 144 NA 160 98 ...
## $ dia : int 69 73 80 89 67 79 85 NA 97 65 ...
## $ rr : int NA 16 18 14 16 16 15 NA 16 16 ...
## $ sats : int 99 97 100 NA 99 98 96 97 99 100 ...
## $ rapid_flu : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ rapid_flu_results : chr "" "" "" "" ...
## $ rapid_strep : logi FALSE TRUE FALSE FALSE FALSE TRUE ...
## $ rapid_strep_results : chr "" "Negative" "" "" ...
## $ ctab : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ labored_respiration : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ rhonchi : logi FALSE FALSE FALSE TRUE FALSE FALSE ...
## $ wheezes : logi FALSE FALSE FALSE TRUE FALSE FALSE ...
## $ cough : logi FALSE NA TRUE TRUE TRUE TRUE ...
## $ cough_severity : chr "" "" "" "Mild" ...
## $ fever : logi NA NA NA FALSE FALSE TRUE ...
## $ sob : logi FALSE NA FALSE FALSE TRUE TRUE ...
## $ sob_severity : chr "" "" "" "" ...
## $ diarrhea : logi NA NA NA TRUE NA NA ...
## $ fatigue : logi NA NA NA NA TRUE TRUE ...
## $ headache : logi NA NA NA NA TRUE TRUE ...
## $ loss_of_smell : logi NA NA NA NA NA NA ...
## $ loss_of_taste : logi NA NA NA NA NA NA ...
## $ runny_nose : logi NA NA NA NA NA TRUE ...
## $ muscle_sore : logi NA NA NA TRUE NA TRUE ...
## $ sore_throat : logi TRUE NA NA NA NA TRUE ...
## $ cxr_findings : chr "" "" "" "" ...
## $ cxr_impression : chr "" "" "" "" ...
## $ cxr_link : chr "" "" "" "" ...</code></pre>
<p>There are lots of missing values (denoted by ‘NA’) and lots of blanks as well - for example, see the first few values of the <code>rapid_flu_results</code> field above. We will convert the blanks to NAs so that all the missing values can be treated consistently. Also, the rightmost three columns are free-text fields so we will remove them from the dataframe.</p>
<pre class="r"><code>df <- read.csv("covid19.csv",
stringsAsFactors = FALSE,
na.strings=c("","NA") # read in blanks as NAs
)%>%
select(-starts_with("cxr")) # remove the chest x-ray note fields
str(df) </code></pre>
<pre><code>## 'data.frame': 352 obs. of 42 variables:
## $ date_published : chr "2020-04-14" "2020-04-14" "2020-04-14" "2020-04-14" ...
## $ clinic_state : chr "CA" "CA" "CA" "CA" ...
## $ test_name : chr "Rapid COVID-19 Test" "Rapid COVID-19 Test" "Rapid COVID-19 Test" "Rapid COVID-19 Test" ...
## $ swab_type : chr NA "Nasopharyngeal" "Nasal" NA ...
## $ covid_19_test_results : chr "Negative" "Negative" "Negative" "Negative" ...
## $ age : int 30 77 49 42 37 23 71 28 55 51 ...
## $ high_risk_exposure_occupation: logi TRUE NA NA FALSE TRUE FALSE ...
## $ high_risk_interactions : logi FALSE NA NA FALSE TRUE TRUE ...
## $ diabetes : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ chd : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ htn : logi FALSE TRUE FALSE TRUE FALSE FALSE ...
## $ cancer : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ asthma : logi TRUE TRUE FALSE TRUE FALSE FALSE ...
## $ copd : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ autoimmune_dis : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ temperature : num 37.1 36.8 37 36.9 37.3 ...
## $ pulse : int 84 96 79 108 74 110 78 NA 97 66 ...
## $ sys : int 117 128 120 156 126 134 144 NA 160 98 ...
## $ dia : int 69 73 80 89 67 79 85 NA 97 65 ...
## $ rr : int NA 16 18 14 16 16 15 NA 16 16 ...
## $ sats : int 99 97 100 NA 99 98 96 97 99 100 ...
## $ rapid_flu : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ rapid_flu_results : chr NA NA NA NA ...
## $ rapid_strep : logi FALSE TRUE FALSE FALSE FALSE TRUE ...
## $ rapid_strep_results : chr NA "Negative" NA NA ...
## $ ctab : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ labored_respiration : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ rhonchi : logi FALSE FALSE FALSE TRUE FALSE FALSE ...
## $ wheezes : logi FALSE FALSE FALSE TRUE FALSE FALSE ...
## $ cough : logi FALSE NA TRUE TRUE TRUE TRUE ...
## $ cough_severity : chr NA NA NA "Mild" ...
## $ fever : logi NA NA NA FALSE FALSE TRUE ...
## $ sob : logi FALSE NA FALSE FALSE TRUE TRUE ...
## $ sob_severity : chr NA NA NA NA ...
## $ diarrhea : logi NA NA NA TRUE NA NA ...
## $ fatigue : logi NA NA NA NA TRUE TRUE ...
## $ headache : logi NA NA NA NA TRUE TRUE ...
## $ loss_of_smell : logi NA NA NA NA NA NA ...
## $ loss_of_taste : logi NA NA NA NA NA NA ...
## $ runny_nose : logi NA NA NA NA NA TRUE ...
## $ muscle_sore : logi NA NA NA TRUE NA TRUE ...
## $ sore_throat : logi TRUE NA NA NA NA TRUE ...</code></pre>
<p>Now, let’s run it through the <code>x2y</code> approach. We are particularly interested in non-zero associations between the <code>covid_19_test_results</code> field and the other fields so we zero in on those by running <code>dx2y(df, target = "covid_19_test_results")</code> in R (details in the <a href="#appendix">appendix</a>) and filtering out the zero associations.</p>
<pre class="r"><code>dx2y(df, target = "covid_19_test_results") %>%
filter(x2y >0) %>%
pander</code></pre>
<table style="width:86%;">
<colgroup>
<col width="33%" />
<col width="22%" />
<col width="19%" />
<col width="11%" />
</colgroup>
<thead>
<tr class="header">
<th align="center">x</th>
<th align="center">y</th>
<th align="center">perc_of_obs</th>
<th align="center">x2y</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="center">covid_19_test_results</td>
<td align="center">loss_of_smell</td>
<td align="center">21.88</td>
<td align="center">18.18</td>
</tr>
<tr class="even">
<td align="center">covid_19_test_results</td>
<td align="center">loss_of_taste</td>
<td align="center">22.73</td>
<td align="center">12.5</td>
</tr>
<tr class="odd">
<td align="center">covid_19_test_results</td>
<td align="center">sats</td>
<td align="center">92.9</td>
<td align="center">2.24</td>
</tr>
</tbody>
</table>
<p>Only <em>three</em> of the 41 variables have a non-zero association with <code>covid_19_test_results</code>. Disappointingly, the highest <code>x2y</code> value is an unimpressive 18%. It is based on just 22% of the observations (the other 78% had missing values), which makes one wonder whether this modest association is real or just due to chance.</p>
<p>If we were working with the correlation coefficient, we could easily calculate a <em>confidence interval</em> for it and gauge if what we are seeing is real or not. Can we do the same thing for the <code>x2y</code> metric?</p>
<p>We can, by using <a href="https://en.wikipedia.org/wiki/Bootstrapping_(statistics)">bootstrapping</a>. Given <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span>, we can sample with replacement 1,000 times (say) and calculate the <code>x2y</code> metric each time. With these 1,000 numbers, we can easily construct a confidence interval (this is available as an optional <code>confidence</code> argument in the R functions we have been using; please see the <a href="#appendix">appendix</a>).</p>
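The bootstrap loop itself is only a few lines of R. Here is a minimal sketch of the idea, using `cor()` as a stand-in statistic since the `x2y()` implementation lives in the author's script; the `boot_ci()` helper and its defaults are illustrative, not part of that script:

```r
# Bootstrap a 95% confidence interval for any association statistic:
# resample (x, y) pairs with replacement, recompute the statistic each
# time, and take the 2.5th and 97.5th percentiles of the results.
boot_ci <- function(x, y, stat = cor, n_boot = 1000) {
  n <- length(x)
  stats <- replicate(n_boot, {
    idx <- sample(n, n, replace = TRUE)  # resample row indices
    stat(x[idx], y[idx])
  })
  quantile(stats, c(0.025, 0.975))
}

set.seed(42)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
boot_ci(x, y)  # a tight interval well above zero for this strong relationship
```

If an analogous interval for `x2y` contains zero, the observed association could plausibly be noise, which is exactly the check performed with `confidence = TRUE` below.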
<p>Let’s re-do the earlier calculation with “confidence intervals” turned on by running <code>dx2y(df, target = "covid_19_test_results", confidence = TRUE)</code> in R.</p>
<pre class="r"><code>dx2y(df, target = "covid_19_test_results", confidence = TRUE) %>%
filter(x2y >0) %>%
pander(split.tables = Inf)</code></pre>
<table>
<colgroup>
<col width="26%" />
<col width="17%" />
<col width="15%" />
<col width="8%" />
<col width="15%" />
<col width="15%" />
</colgroup>
<thead>
<tr class="header">
<th align="center">x</th>
<th align="center">y</th>
<th align="center">perc_of_obs</th>
<th align="center">x2y</th>
<th align="center">CI_95_Lower</th>
<th align="center">CI_95_Upper</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="center">covid_19_test_results</td>
<td align="center">loss_of_smell</td>
<td align="center">21.88</td>
<td align="center">18.18</td>
<td align="center">-8.08</td>
<td align="center">36.36</td>
</tr>
<tr class="even">
<td align="center">covid_19_test_results</td>
<td align="center">loss_of_taste</td>
<td align="center">22.73</td>
<td align="center">12.5</td>
<td align="center">-11.67</td>
<td align="center">25</td>
</tr>
<tr class="odd">
<td align="center">covid_19_test_results</td>
<td align="center">sats</td>
<td align="center">92.9</td>
<td align="center">2.24</td>
<td align="center">-1.85</td>
<td align="center">4.48</td>
</tr>
</tbody>
</table>
<p><em>The 95% confidence intervals all contain 0.0</em>, so none of these associations appear to be real.</p>
<p>Let’s see what the top 10 associations are, between <em>any</em> pair of variables.</p>
<pre class="r"><code>dx2y(df) %>% head(10) %>% pander</code></pre>
<table style="width:75%;">
<colgroup>
<col width="22%" />
<col width="22%" />
<col width="19%" />
<col width="11%" />
</colgroup>
<thead>
<tr class="header">
<th align="center">x</th>
<th align="center">y</th>
<th align="center">perc_of_obs</th>
<th align="center">x2y</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="center">loss_of_smell</td>
<td align="center">loss_of_taste</td>
<td align="center">20.17</td>
<td align="center">100</td>
</tr>
<tr class="even">
<td align="center">loss_of_taste</td>
<td align="center">loss_of_smell</td>
<td align="center">20.17</td>
<td align="center">100</td>
</tr>
<tr class="odd">
<td align="center">fatigue</td>
<td align="center">headache</td>
<td align="center">40.06</td>
<td align="center">90.91</td>
</tr>
<tr class="even">
<td align="center">headache</td>
<td align="center">fatigue</td>
<td align="center">40.06</td>
<td align="center">90.91</td>
</tr>
<tr class="odd">
<td align="center">fatigue</td>
<td align="center">sore_throat</td>
<td align="center">27.84</td>
<td align="center">89.58</td>
</tr>
<tr class="even">
<td align="center">headache</td>
<td align="center">sore_throat</td>
<td align="center">30.4</td>
<td align="center">89.36</td>
</tr>
<tr class="odd">
<td align="center">sore_throat</td>
<td align="center">fatigue</td>
<td align="center">27.84</td>
<td align="center">88.89</td>
</tr>
<tr class="even">
<td align="center">sore_throat</td>
<td align="center">headache</td>
<td align="center">30.4</td>
<td align="center">88.64</td>
</tr>
<tr class="odd">
<td align="center">runny_nose</td>
<td align="center">fatigue</td>
<td align="center">25.57</td>
<td align="center">84.44</td>
</tr>
<tr class="even">
<td align="center">runny_nose</td>
<td align="center">headache</td>
<td align="center">25.57</td>
<td align="center">84.09</td>
</tr>
</tbody>
</table>
<p>Interesting. <code>loss_of_smell</code> and <code>loss_of_taste</code> are <em>perfectly</em> associated with each other. Let’s look at the raw data.</p>
<pre class="r"><code>with(df, table(loss_of_smell, loss_of_taste))</code></pre>
<pre><code>## loss_of_taste
## loss_of_smell FALSE TRUE
## FALSE 55 0
## TRUE 0 16</code></pre>
<p>They agree for <em>every</em> observation in the dataset and, as a result, their <code>x2y</code> is 100%.</p>
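To see why perfect agreement yields an `x2y` of 100%, consider a small sketch of the error-reduction idea behind a metric of this kind (this mirrors the intuition, not necessarily the script's exact implementation; the toy vectors are invented):

```r
# With no predictor, the best guess for y is its most common value;
# the baseline error is the share of observations that guess misses.
# If x predicts y perfectly, the error drops to zero: a 100% reduction.
smell <- c(TRUE, TRUE, FALSE, FALSE, FALSE)
taste <- c(TRUE, TRUE, FALSE, FALSE, FALSE)  # agrees with smell everywhere

baseline_error <- 1 - max(table(taste)) / length(taste)  # guess the mode
model_error    <- mean(taste != smell)                   # predict taste from smell
reduction      <- (baseline_error - model_error) / baseline_error
reduction  # 1, i.e. a 100% reduction in prediction error
```

The baseline guess (always predict the most common value of `taste`) is wrong 40% of the time here; predicting from `smell` is never wrong, so the error reduction is 100%.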
<p>Moving down the <code>x2y</code> ranking, we see a number of variables - <code>fatigue</code>, <code>headache</code>, <code>sore_throat</code>, and <code>runny_nose</code> - that are <em>all strongly associated with each other</em>, as if they are all connected by a common cause.</p>
<p>When the number of variable combinations is high and there are lots of missing values, it can be helpful to make a scatterplot of <code>x2y</code> vs. <code>perc_of_obs</code>.</p>
<pre class="r"><code>ggplot(data = dx2y(df), aes(y=x2y, x = perc_of_obs)) +
geom_point()</code></pre>
<pre><code>## Warning: Removed 364 rows containing missing values (geom_point).</code></pre>
<p><img src="/2021/04/15/an-alternative-to-the-correlation-coefficient-that-works-for-numeric-and-categorical-variables/index_files/figure-html/unnamed-chunk-14-1.png" width="672" /></p>
<p>Unfortunately, the top-right quadrant is empty: there are no strongly-related variable pairs that are based on at least 50% of the observations. There <em>are</em> some variable pairs with <code>x2y</code> values > 75% but none of them are based on more than 40% of the observations.</p>
<p><br></p>
</div>
<div id="conclusion" class="section level3">
<h3>Conclusion</h3>
<p>Using an insight from Information Theory, we devised a new metric - the <code>x2y</code> metric - that quantifies the strength of the association between pairs of variables.</p>
<p>The <code>x2y</code> metric has several advantages:</p>
<ul>
<li>It works for all types of variable pairs (continuous-continuous, continuous-categorical, categorical-continuous and categorical-categorical)</li>
<li>It captures linear and non-linear relationships</li>
<li>Perhaps best of all, it is easy to understand and use.</li>
</ul>
<p>I hope you give it a try in your work.</p>
<p>(If you found this note helpful, you may find <a href="https://rama100.github.io/lecture-notes/">these</a> of interest)</p>
<p><br></p>
</div>
<div id="acknowledgements" class="section level3">
<h3>Acknowledgements</h3>
<p>Thanks to <a href="https://mitsloan.mit.edu/faculty/directory/amr-farahat">Amr Farahat</a> for helpful feedback on an earlier draft.</p>
<p><br></p>
</div>
<div id="appendix" class="section level3">
<h3>Appendix: How to use the R script</h3>
<p>The <a href="https://github.com/rama100/x2y/blob/main/x2y.R">R script</a> depends on two R packages - <code>rpart</code> and <code>dplyr</code> - so please ensure that they are installed in your environment.</p>
<p>The script has two key functions: <code>x2y()</code> and <code>dx2y()</code>.</p>
<p><br></p>
<div id="using-the-x2y-function" class="section level4">
<h4>Using the <code>x2y()</code> function</h4>
<p><em>Usage</em>: <code>x2y(u, v, confidence = FALSE)</code></p>
<p><em>Arguments</em>:</p>
<ul>
<li><code>u</code>, <code>v</code>: two vectors of equal length</li>
<li><code>confidence</code>: (OPTIONAL) a boolean that indicates if a confidence interval is needed. Default is FALSE.</li>
</ul>
<p><em>Value</em>: A list with the following elements:</p>
<ul>
<li><code>perc_of_obs</code>: the % of total observations that were used to calculate <code>x2y</code>. If some observations are missing for either <span class="math inline">\(u\)</span> or <span class="math inline">\(v\)</span>, this will be less than 100%.</li>
<li><code>x2y</code>: the <code>x2y</code> metric for using <span class="math inline">\(u\)</span> to predict <span class="math inline">\(v\)</span></li>
</ul>
<p>Additionally, if <code>x2y()</code> was called with <code>confidence = TRUE</code>:</p>
<ul>
<li><code>CI_95_Lower</code>: the lower end of a 95% confidence interval for the <code>x2y</code> metric estimated by <a href="https://en.wikipedia.org/wiki/Bootstrapping_(statistics)">bootstrapping</a> 1000 samples</li>
<li><code>CI_95_Upper</code>: the upper end of a 95% confidence interval for the <code>x2y</code> metric estimated by bootstrapping 1000 samples</li>
</ul>
<p><br></p>
</div>
<div id="using-the-dx2y-function" class="section level4">
<h4>Using the <code>dx2y()</code> function</h4>
<p><em>Usage</em>: <code>dx2y(d, target = NA, confidence = FALSE)</code></p>
<p><em>Arguments</em>:</p>
<ul>
<li><code>d</code>: a dataframe</li>
<li><code>target</code>: (OPTIONAL) if you are only interested in the <code>x2y</code> values between a <em>particular variable</em> in <code>d</code> and all other variables, set <code>target</code> equal to the name of the variable you are interested in. Default is NA.</li>
<li><code>confidence</code>: (OPTIONAL) a boolean that indicates if a confidence interval is needed. Default is FALSE.</li>
</ul>
<p><em>Value</em>: A dataframe with each row containing the output of running <code>x2y(u, v, confidence)</code> for <code>u</code> and <code>v</code> chosen from the dataframe. Since this is just a standard R dataframe, it can be sliced, sorted, filtered, plotted etc.</p>
<p><strong>Update on April 16, 2021</strong>: I learned from a commenter that a <a href="https://paulvanderlaken.com/2020/05/04/predictive-power-score-finding-patterns-dataset/">similar approach</a> was proposed in April 2020, and that the R package <a href="https://cran.r-project.org/package=ppsr">ppsr</a> which implements that approach is now available on CRAN.</p>
</div>
</div>
<script>window.location.href='https://rviews.rstudio.com/2021/04/15/an-alternative-to-the-correlation-coefficient-that-works-for-numeric-and-categorical-variables/';</script>
COVID-19 Data Forum: Data Journalism
https://rviews.rstudio.com/2021/04/06/covid-19-data-forum-data-journalism/
Tue, 06 Apr 2021 00:00:00 +0000https://rviews.rstudio.com/2021/04/06/covid-19-data-forum-data-journalism/
<p>The <a href="https://covid19-data-forum.org/">COVID-19 Data Forum</a>, a joint project of the Stanford Data Science Institute and the R Consortium, is an ongoing series of multidisciplinary webinars where topic experts discuss data-related aspects of the scientific response to the pandemic. The most recent event, held on March 18, 2021, explored the role of data journalism in the pandemic. This was a bit of a departure from previous forum events<sup>1</sup> because it focused on issues relating to using and interpreting COVID-19 data, and not on the particular kinds of COVID-19 related data that are available.</p>
<p>I think you will find the <a href="https://www.youtube.com/watch?v=Wh-GynBeEsQ">webinar video</a> worth watching. If you are a statistician or epidemiologist working on COVID-19, you may find the data journalists’ accounts of difficulties they faced working with COVID data and statistical models instructive. But, even if you are not directly working on COVID, you may find that listening to the journalists fills in some gaps between what you know about statistics and data visualizations and what you see in the news.</p>
<p>The data journalism event was moderated by <a href="https://twitter.com/irenatfh?lang=en">Dr. Irena Hwang</a>, a data reporter at ProPublica. Speakers included
<a href="https://journalism.columbia.edu/faculty/mark-hansen">Dr. Mark Hansen</a>, David and Helen Gurley Brown Professor of Journalism and Innovation at Columbia University; <a href="https://twitter.com/anarina?lang=en">Ana Carolina Moreno</a>, a senior data journalist at TV Globo in São Paulo, Brazil; and
<a href="https://twitter.com/meghanhoyer?lang=en">Meghan Hoyer</a>, Director of Data Reporting at the Washington Post.</p>
<p>The video of the data journalism event is <a href="https://www.youtube.com/watch?v=Wh-GynBeEsQ">available here</a>. The following short time map and the times referenced in my comments below should be helpful for browsing the ninety minute event.</p>
<ul>
<li><strong>2:37</strong> Irena Hwang introduces Mark Hansen</li>
<li><strong>3:50</strong> Start of Mark’s talk</li>
<li><strong>19:30</strong> Irena introduces Ana Carolina Moreno (Carol)</li>
<li><strong>21:10</strong> Start of Carol’s talk</li>
<li><strong>39:20</strong> Irena introduces Meghan Hoyer</li>
<li><strong>40:00</strong> Start of Meghan’s talk</li>
<li><strong>1:01:40</strong> Start of discussion</li>
</ul>
<h3 id="mark-hansen">Mark Hansen</h3>
<p>In his talk, Mark offers an overview of the profession of data journalism that provides some historical context and emphasizes the hybrid nature of the practice, which blends a hard-nosed detective’s drive to uncover facts with the empathy to tell stories “about who we are and how we live”.</p>
<p><strong>7:00</strong> Mark introduces Joseph Pulitzer’s 1904 paper <a href="https://www.jstor.org/stable/25119561?refreqid=excelsior%3A9216a1bfa7873dae49d35beff9b2b01d&seq=33#metadata_info_tab_contents">The College of Journalism</a> in which Pulitzer includes Statistics as a subject journalists should study. On page 673, Pulitzer writes:</p>
<blockquote>
<p>You want statistics to tell you the truth. You can find truth there if you know how to get at it, and romance, human interest, humor and fascinating revelations as well.</p>
</blockquote>
<p><strong>10:19</strong> Mark describes a piece by Philip Eil, <a href="https://www.cjr.org/first_person/journalism-notebooks.php"><em>An ode to reporter’s notebooks</em></a>, published in the <em>Columbia Journalism Review</em>, that offers a personal account of reporting. Eil writes:</p>
<blockquote>
<p>To report is to be alert and alive at a particular time and place.</p>
</blockquote>
<p><img src="mark.png" height = "400" width="600"></p>
<p><strong>11:00</strong> Mark remarks:</p>
<blockquote>
<p>when we’re thinking about bringing computation to journalism we are taking that basic curiosity that we are cultivating in our students minds … and adding computational lines of inquiry to that habit of mind, that questioning why things look the way they do…</p>
</blockquote>
<p><strong>12:08</strong> Mark calls attention to the report by Charles Berret and Cheryl Phillips <a href="https://journalism.columbia.edu/system/files/content/teaching_data_and_computational_journalism.pdf"><em>Teaching Data And Computational Journalism</em></a> and describes some recent activities of the <a href="https://brown.columbia.edu/">Brown Institute</a> at the Columbia School of Journalism.</p>
<h3 id="ana-carolina-moreno">Ana Carolina Moreno</h3>
<p><strong>22:32</strong> Carol introduces Brazil’s universal healthcare system and shows a schematic of the available official and unofficial COVID-19 data sources.</p>
<p><strong>26:00</strong> Carol notes that a platform originally built to track SARS data was adapted to track COVID.</p>
<p><strong>27:38</strong> Carol explains that, in practice, there are many obstacles making it difficult to obtain the data necessary to understand how the pandemic is developing. Some of these are called out in the following slide:</p>
<p><img src="c2.png" height = "400" width="600"></p>
<p><strong>30:37</strong> Carol remarks that hospital data seems to be the most reliable.</p>
<p><strong>31:06</strong> Carol describes how the government changed its policy for reporting deaths. The new scheme of only reporting deaths that have been confirmed in the past twenty-four hours vastly undercounts the current death rate.</p>
<p><strong>31:57</strong> In an effort to obtain more reliable data, a consortium of competing journalists at local news organizations began cooperating by sharing information directly obtained from hospitals every day.</p>
<p><strong>32:35</strong> Carol provides a view of day-to-day journalism at the local news organizations and describes how the data journalists scrape data on a daily basis to populate dashboards showing rolling averages and daily indicators. By focusing on the more reliable hospitalization data, journalists are doing their best to track the spread of the pandemic and expose inequities in the health care system.</p>
<h3 id="meghan-hoyer">Meghan Hoyer</h3>
<p><strong>40:07</strong> Meghan begins her walk-through of what last year was like for data journalists who were trying to tell the story of the pandemic in real time as it was happening.</p>
<p><strong>41:22</strong> Meghan recounts her experiences trying to make sense of COVID-19 models and expresses the frustration she and other data journalists felt with the multitude of contradictory predictive models.</p>
<p><strong>44:03</strong> In a memorable quote, Meghan remarks:</p>
<blockquote>
<p>Models were inherently problematic and yet they were being forced upon us by society…</p>
</blockquote>
<p><img src="models.png" height = "400" width="600"></p>
<p>Consequently, journalists at the AP decided that they were not going to base stories on models.</p>
<p>In the absence of reliable case data, and wanting nothing to do with the models, Meghan explains that data journalists turned to whatever data they could get their hands on to quantify the story of the pandemic.</p>
<p><strong>46:00</strong> Meghan recounts how journalists used garbage pickup data as a proxy for population density to estimate where people were living in NYC and correlate it with case data.</p>
<p><img src="garbage.png" height = "400" width="600"></p>
<p><strong>47:30</strong> Journalists struggled to find data to verify the anecdotal stories they were hearing about the disparities in who was being affected by the virus. Finding that one quarter to one third of the COVID case data was missing information on race, data journalists “hand collected” data by looking city by city to find the missing data.</p>
<p><strong>50:30</strong> Meghan recounts how they turned to age adjusted data to determine the impact of the virus on communities of color.</p>
<p><strong>52:10</strong> Data journalists found that excess deaths are a reliable metric for determining the impact of what is happening on the ground.</p>
<p><strong>54:18</strong> Journalists developed a survey which was returned by seven hundred schools to investigate how going back to school might be affecting students. Among their findings was that districts serving students of color were more likely to start online.</p>
<p><strong>56:35</strong> Meghan discusses the <a href="https://covidtracking.com/">COVID-19 Tracking Project</a> and the effort to sort out the impact of test positivity rates. She reports that because not all states measure the number of people who test in the same way, correctly comparing test positivity rates among states remains an unsolved problem.</p>
<p><strong>58:33</strong> Meghan shares the need to “flip the numbers” to help people understand the meaning of statistics stated in terms of very large numbers. For example, saying that “Since January of last year at least 1 in 15 people who live in Alexandria, Virginia have been infected by the virus” is easier for people to understand than something like: “On March 17th there were 14 cases per 100,000 in Alexandria”.</p>
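The “flip” itself is simple arithmetic: a rate of k cases per 100,000 people is about 1 person in 100,000 / k. A tiny illustration (the helper function name is mine, not from the talk):

```r
# Convert "k cases per 100,000" into the more intuitive "1 in N people"
flip_rate <- function(cases_per_100k) round(100000 / cases_per_100k)

flip_rate(14)  # "14 cases per 100,000" is roughly 1 in 7143 people
```

Note that the two Alexandria figures quoted above are not flips of each other: “1 in 15 since January” is a cumulative share of residents ever infected, while “14 cases per 100,000 on March 17” is a single day’s rate.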
<p><strong>59:53</strong> Vaccination tracking is another problematic data reporting area. Not only are vaccinations reported differently from state-to-state, but the data that is reported is changing from day-to-day. The CDC is apparently still adding new fields to the vaccination data sets.</p>
<h3 id="the-q-a-discussion">The Q & A Discussion</h3>
<p><strong>1:02</strong> The question and answer discussion begins.</p>
<p><strong>1:02:56</strong> Mark talks about how visualizations evolved over the course of the pandemic.</p>
<p><strong>1:06:08</strong> Carol and then Meghan talk about the lessons the pandemic taught data journalists about competition and collaboration.</p>
<p><strong>1:10:04</strong> Meghan describes how during the pandemic data journalists became advocates for public data.</p>
<p><strong>1:11:21</strong> Carol answers a question about the opportunities for data journalism in Brazil.</p>
<p><strong>1:15:50</strong> The speakers answer a question about how academia is supporting data journalism during the pandemic and mention an effort to have statistical and scientific experts collaborate with data journalists.</p>
<p><strong>1:20:19</strong> Meghan responds to a question about technical and social challenges for data journalists during the pandemic.</p>
<p><strong>1:23:10</strong> Carol talks about the difference between reporting online news and television news.</p>
<p><strong>1:26:01</strong> Mark answers a question about communicating emotional impact in COVID reporting and ends with emphasizing the importance of communicating honestly about what we do, and do not know.</p>
<p><sup>1</sup>The <a href="https://www.youtube.com/watch?v=6N1p99bLXjk">first forum</a> on May 14, 2020 focused on the data needs and challenges of modeling and controlling the spread of COVID-19. The <a href="https://www.youtube.com/watch?v=mEsDzwIMDz8">second forum</a> on August 13, 2020 explored what was being done to make clinical data available and useful. The <a href="https://www.youtube.com/watch?v=Blab8omzrb8">third forum</a> on December 10, 2020 discussed the role of mobility data.</p>
<script>window.location.href='https://rviews.rstudio.com/2021/04/06/covid-19-data-forum-data-journalism/';</script>
What does it take to do a t-test?
https://rviews.rstudio.com/2021/03/29/what-does-it-take-to-do-a-t-test/
Mon, 29 Mar 2021 00:00:00 +0000https://rviews.rstudio.com/2021/03/29/what-does-it-take-to-do-a-t-test/
<script src="/2021/03/29/what-does-it-take-to-do-a-t-test/index_files/header-attrs/header-attrs.js"></script>
<p>In this post, I examine the fundamental assumption of independence underlying the basic <a href="https://en.wikipedia.org/wiki/Student%27s_t-test">independent two-sample t-test</a> for comparing the means of two random samples. In addition to independence, we assume that both samples are draws from normal distributions where the population means and common variance are unknown. I am going to assume that you are familiar with this kind of test, but even if you are not, you are still in the right place. The references at the end of the post all provide rigorous but gentle explanations that should be very helpful.</p>
<div id="the-two-sample-t-test" class="section level3">
<h3>The two sample t-test</h3>
<p>Typically, we have independent samples for some numeric variable of interest (say the concentration of a drug in the blood stream) from two different groups, and we would like to know whether it is likely that the two groups differ with respect to this variable. The formal test of the null hypothesis, <span class="math inline">\(H_0\)</span>, that the means of the underlying populations from which the samples are drawn are equal, proceeds by making some assumptions:</p>
<ol style="list-style-type: decimal">
<li><span class="math inline">\(H_0\)</span> is true</li>
<li>The samples are independent</li>
<li>The data are normally distributed</li>
<li>The variances of the two samples are equal (This is the simplest test.)</li>
</ol>
<p>Next, a test statistic that includes the difference between the two sample means is calculated, and a decision is made to establish a “rejection region” for the test statistic. This region depends on the particular circumstances of the test, and is selected to balance the error of rejecting <span class="math inline">\(H_0\)</span> when it is true against the error of not rejecting <span class="math inline">\(H_0\)</span> when it is false. If we compute the test statistic and its value does not fall in the rejection region, then we do not reject <span class="math inline">\(H_0\)</span> and we conclude that we have found nothing. On the other hand, if the test statistic does fall in the rejection region, then we reject <span class="math inline">\(H_0\)</span> and conclude that our data, along with the bundle of assumptions we made in setting up the test and the “steel trap” logic of the t-test itself, provide some evidence that the population means are different. (Page 6 of the MIT Open Courseware notes <a href="https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading18.pdf">Null Hypothesis Significance Testing II</a> contains an elegantly concise mathematical description of the t-test.)</p>
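For the concrete two-sided, equal-variance case, the rejection region can be computed directly from the quantiles of the t distribution. A minimal sketch, with illustrative sample sizes:

```r
# Critical value for a two-sided, equal-variance two-sample t-test
alpha <- 0.05
n1 <- 20; n2 <- 20
deg_free <- n1 + n2 - 2                  # pooled test uses n1 + n2 - 2 df
t_crit <- qt(1 - alpha / 2, deg_free)    # upper critical value, about 2.02

# The rejection region is {t : |t| > t_crit}; H0 is rejected when the
# observed test statistic falls outside (-t_crit, t_crit).
t_crit
```

With 20 observations per group, an observed |t| greater than roughly 2.02 lands in the rejection region at the 5% level.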
<p>All of the above assumptions must hold, or be pretty close to holding, for the test to give an accurate result. However, in my opinion, from the point of view of statistical practice, assumption 2. is fundamental. There are other tests and workarounds for the situations where 4. doesn’t hold. Assumption 3. is very important, but it is relatively easy to check, and the t-test is robust enough to deal with some deviation from normality. Of course, assumption 1. is important. The whole test depends on it, but this assumption is baked into the software that will run the test.</p>
</div>
<div id="independence" class="section level3">
<h3>Independence</h3>
<p>Independence, on the other hand, can be a showstopper. Checking for independence is the difference between doing statistics and carrying out a mathematical, or maybe just mechanical, exercise. It often involves considerable creative thinking and tedious legwork.</p>
<p>So, what do we mean by independent samples or independent data, and how do we go about verifying it? Independence is a mathematical idea, an abstraction from probability theory. Two events A and B are said to be independent if the probability of both A and B happening equals the product of the probabilities of A and B happening. That is: P(<span class="math inline">\(A\bigcap B\)</span>) = P(A)P(B).</p>
<p>A more intuitive way to think about it is in terms of conditional probability. In general, the probability of A happening given that B happens is defined to be:</p>
<blockquote>
<p>P(A|B) = P(<span class="math inline">\(A\bigcap B\)</span>) / P(B)</p>
</blockquote>
<p>If A and B are independent then P(A|B) = P(A). That is: B has no influence on whether A happens.</p>
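The product rule for independence can be checked empirically: simulate two unrelated events and compare the joint frequency with the product of the marginal frequencies. A quick sketch (the probabilities 0.5 and 0.3 are arbitrary):

```r
# For independent events, the joint probability should be close to
# the product of the marginals (up to sampling noise).
set.seed(1)
n <- 100000
A <- runif(n) < 0.5        # event A with probability 0.5
B <- runif(n) < 0.3        # event B with probability 0.3, unrelated to A
p_joint   <- mean(A & B)   # estimate of P(A and B)
p_product <- mean(A) * mean(B)
c(p_joint, p_product)      # both close to 0.5 * 0.3 = 0.15
```

If A and B were dependent, the two numbers would diverge systematically rather than agreeing up to sampling noise.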
<p>“Independent data” or “independent samples” are both shorthand for data sampled or otherwise resulting from independent probability distributions. Relating the mathematical concept to a real-world situation requires a clear idea of the population of interest, considerable domain expertise, and a mental sleight of hand that is nicely exposed in the short article <a href="https://support.minitab.com/en-us/minitab/19/help-and-how-to/statistics/basic-statistics/supporting-topics/tests-of-means/what-are-independent-samples/">What are independent samples?</a>, by the Minitab® folks. They write:</p>
<blockquote>
<p>Independent samples are samples that are selected randomly so that its observations do not depend on the values of other observations.</p>
</blockquote>
<p>Notice what is happening here: what started out as a property of probability distributions has now become a prescription for obtaining data in a way that makes it plausible that we can assume independence for the probability distributions that we imagine govern our data. This is a real magic trick. No procedure for selecting data is ever going to guarantee the mathematical properties of our models. Nevertheless, the statement does show the way to proceed. By systematically tracking down all possibilities for interaction within the sampling process and eliminating the possibilities for one sample to influence another, it may be possible to become confident that it is plausible to assume the samples are independent. Because the math says that <a href="http://athenasc.com/Bivariate-Normal.pdf">independent data are not correlated</a>, much of the exploratory data analysis involves looking for correlations that would signal dependent data. The Minitab® authors make this clear in the <a href="https://support.minitab.com/en-us/minitab/19/help-and-how-to/statistics/basic-statistics/supporting-topics/tests-of-means/what-are-independent-samples/">example</a> they offer to illustrate their definition.</p>
<blockquote>
<p>For example, suppose quality inspectors want to compare two laboratories to determine whether their blood tests give similar results. They send blood samples drawn from the same 10 children to both labs for analysis. Because both labs tested blood specimens from the same 10 children, the test results are not independent. To compare the average blood test results from the two labs, the inspectors would need to do a paired t-test, which is based on the assumption that samples are dependent.</p>
</blockquote>
<blockquote>
<p>To obtain independent samples, the inspectors would need to randomly select and test 10 children using Lab A and then randomly select and test a different group of 10 different children using Lab B. Then they could compare the average blood test results from the two labs using a 2-sample t-test, which is based on the assumption that samples are independent.</p>
</blockquote>
<p>Nicely said, and to further make their point, I am sure that the authors would agree that if it somehow turned out that the children tested by Lab B happened to be the identical twins of the children in Lab A, they still would not have independent samples.</p>
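<p>The distinction is easy to see in code. Here is a small R sketch (my own, with invented numbers, not from the Minitab article) contrasting the paired design with the independent design:</p>

```r
# Hypothetical version of the Minitab example: the same 10 "children"
# measured by two labs (dependent samples) vs. a different random
# group of 10 children for the second lab (independent samples).
set.seed(1)
child <- rnorm(10, mean = 100, sd = 15)  # each child's true blood value
lab_a <- child + rnorm(10, sd = 2)       # Lab A measures the 10 children
lab_b <- child + rnorm(10, sd = 2)       # Lab B measures the SAME children
cor(lab_a, lab_b)                        # strongly correlated: not independent
t.test(lab_a, lab_b, paired = TRUE)      # so a paired t-test is appropriate
# Independent design: a different random group of 10 children for Lab B
lab_b_indep <- rnorm(10, mean = 100, sd = 15) + rnorm(10, sd = 2)
t.test(lab_a, lab_b_indep, var.equal = TRUE)  # 2-sample t-test is appropriate
```

<p>The large correlation between the two labs’ measurements comes entirely from the shared children, which is exactly the kind of dependence that exploratory checks should catch.</p>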
</div>
<div id="what-happens-when-samples-are-not-independent" class="section level3">
<h3>What happens when samples are not independent</h3>
<p>The following example illustrates the consequences of performing a t-test when the independence assumption does not hold. We adapt a method for <a href="https://blog.revolutionanalytics.com/2016/08/simulating-form-the-bivariate-normal-distribution-in-r-1.html">simulating a bivariate normal distribution</a> with a specified covariance matrix to produce two dependent samples with a specified correlation.</p>
<pre class="r"><code>library(tidyverse)
library(ggfortify)
set.seed(9999)</code></pre>
<p>First, we simulate two uncorrelated samples with 20 observations each and run a two-sided t-test with equal variances. The test output shows the expected 38 degrees of freedom. (Note that even though the samples are independent, this particular draw happens to produce a p-value just under 0.05; Type I errors like this occur at the nominal rate.)</p>
<pre class="r"><code>rbvn_t<-function (n=20, mu1=1, s1=4, mu2=1, s2=4, rho=0)
{
X <- rnorm(n, mu1, s1)
Y <- rnorm(n, mu2 + (s2/s1) * rho *
(X - mu1), sqrt((1 - rho^2)*s2^2))
t.test(X,Y, mu=0, alternative = "two.sided", var.equal = TRUE)
}
rbvn_t()</code></pre>
<pre><code>##
## Two Sample t-test
##
## data: X and Y
## t = 2.1, df = 38, p-value = 0.04
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.06266 5.06516
## sample estimates:
## mean of x mean of y
## 2.9333 0.3694</code></pre>
<p>Now we simulate 10,000 two-sided t-tests with independent samples having 20 observations in each sample.</p>
<pre class="r"><code>ts <- replicate(10000,rbvn_t(n=20, mu1=1, s1=4, mu2=1, s2=4, rho=0)$statistic)</code></pre>
<p>Plotting the simulated samples shows that the empirical density curve nicely overlays the theoretical density for the t-distribution.</p>
<pre class="r"><code>p <- ggdistribution(dt, df = 38, seq(-4, 4, 0.1))
autoplot(density(ts), colour = 'blue', p = p, fill = 'blue') +
ggtitle("When variables are independent")</code></pre>
<p><img src="/2021/03/29/what-does-it-take-to-do-a-t-test/index_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
<p>Moreover, the 0.975 quantile, the value that marks the upper boundary of the acceptance region for an <span class="math inline">\(\alpha\)</span> of 0.05, is very close to the theoretical value of 2.024.</p>
<pre class="r"><code>quantile(ts,.975)</code></pre>
<pre><code>## 97.5%
## 1.996</code></pre>
<pre class="r"><code>qt(.975,38)</code></pre>
<pre><code>## [1] 2.024</code></pre>
<p>Next, we simulate 10,000 small samples of 20 with a correlation of 0.3.</p>
<pre class="r"><code>ts_d <- replicate(10000,rbvn_t(n=20, mu1=1, s1=4, mu2=1, s2=4, rho=.3)$statistic)</code></pre>
<p>We see that now the fit is not so good. The simulated distribution has noticeably less probability in the tails.</p>
<pre class="r"><code>pd <- ggdistribution(dt, df = 38, seq(-4, 4, 0.1))
autoplot(density(ts_d), colour = 'blue', p = pd, fill = 'blue') +
ggtitle("When variables are NOT independent")</code></pre>
<p><img src="/2021/03/29/what-does-it-take-to-do-a-t-test/index_files/figure-html/unnamed-chunk-7-1.png" width="672" />
The 0.975 quantile is much lower than the theoretical value of 2.024, showing that dependent data would lead to very misleading p-values.</p>
<pre class="r"><code>quantile(ts_d,.975)</code></pre>
<pre><code>## 97.5%
## 1.73</code></pre>
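<p>As a rough check of how misleading, the following self-contained sketch (adapted from the <code>rbvn_t()</code> function above) estimates how often the test actually rejects at a nominal alpha of 0.05 when the samples are correlated with rho = 0.3. With the tails this thin, p-values computed from the t-distribution come out systematically too large:</p>

```r
# Under the null (equal means), with rho = 0.3, how often does the
# standard two-sample t-test reject at nominal alpha = 0.05?
bvn_pvalue <- function(n = 20, mu1 = 1, s1 = 4, mu2 = 1, s2 = 4, rho = 0.3) {
  X <- rnorm(n, mu1, s1)
  Y <- rnorm(n, mu2 + (s2 / s1) * rho * (X - mu1), sqrt((1 - rho^2) * s2^2))
  t.test(X, Y, var.equal = TRUE)$p.value
}
set.seed(123)
pvals <- replicate(5000, bvn_pvalue())
mean(pvals < 0.05)  # well below the nominal 0.05: the test is conservative here
```

<p>For positively correlated samples the test rejects far less often than 5%; for negatively correlated samples the error would run in the opposite direction. Either way, the reported p-values no longer mean what they claim to mean.</p>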
</div>
<div id="summary" class="section level3">
<h3>Summary</h3>
<p>Properly performing a t-test on data obtained from an experiment could mean doing a whole lot of up-front work to design the experiment in a way that makes the assumptions plausible. One could argue that the real practice of statistics begins even before making exploratory plots. Doing statistics with found data is much more problematic. At a minimum, doing a simple t-test means acquiring more than a superficial understanding of how the data were generated.</p>
<p>Finally, when all is said and done, and you have a well constructed t-test that results in a sufficiently small p-value to reject the null hypothesis, you will have attained what most people call a statistically significant result. However, I think this language misleadingly emphasizes the mechanical grinding of the “steel trap” logic of the test that I mentioned above. Maybe instead we should emphasize the work that went into checking assumptions, and think about hypothesis tests as producing “plausibly significant” results.</p>
</div>
<div id="some-resources-for-doing-t-tests-in-r" class="section level3">
<h3>Some resources for doing t-tests in R</h3>
<ul>
<li><p>Holmes and Huber (2019) <a href="https://web.stanford.edu/class/bios221/book/">Modern Statistics for Modern Biology</a>, Chapter 6,</p></li>
<li><p>Orloff and Bloom (2014) <a href="https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading18.pdf">Null Hypothesis Significance Testing II</a></p></li>
<li><p>Poldrack (2018) <a href="https://web.stanford.edu/group/poldracklab/statsthinking21/">Statistical Thinking for the 21st century</a>, Chapter 9</p></li>
<li><p>Spector <a href="https://statistics.berkeley.edu/computing/r-t-tests">Using t-tests in R</a></p></li>
<li><p>Wetherill (2015) <a href="https://datascienceplus.com/t-tests/">How to Perform T-tests in R</a></p></li>
</ul>
</div>
<script>window.location.href='https://rviews.rstudio.com/2021/03/29/what-does-it-take-to-do-a-t-test/';</script>
February 2021: "Top 40" New CRAN Packages
https://rviews.rstudio.com/2021/03/19/february-2021-top-40-new-cran-packages/
Fri, 19 Mar 2021 00:00:00 +0000https://rviews.rstudio.com/2021/03/19/february-2021-top-40-new-cran-packages/
<p>In February, two hundred forty-three new packages made it to CRAN, many of them very interesting and at least one entertaining. It was exceptionally difficult to pick the “Top 40”, but here they are, more or less, in twelve categories: Computational Methods, Data, Finance, Games, Genomics, Machine Learning, Mathematics, Medicine, Networks and Graphs, Statistics, Utilities, and Visualization. <code>iconr</code> in the Networks and Graphs section is a package for doing computational archaeology, a relatively new field that I hope will dig R. I also hope that <code>sassy</code> in the Statistics section helps some statisticians find their way to R.</p>
<h3 id="computational-methods">Computational Methods</h3>
<p><a href="https://cran.r-project.org/package=blaster">blaster</a> v1.0.3: Implements an efficient BLAST-like sequence comparison algorithm, written in C++11 and using native R data types. See <a href="https://www.biorxiv.org/content/10.1101/399782v1">Schmid et al. (2018)</a> for background and <a href="https://cran.r-project.org/web/packages/blaster/readme/README.html">README</a> for an example.</p>
<p><a href="https://cran.r-project.org/package=rando">rando</a> v0.2.0: Provides random number generating functions that are much more context aware than the built-in functions. The functions are also safer, as they check for incompatible values, and they make results easier to reproduce.</p>
<h3 id="data">Data</h3>
<p><a href="https://cran.r-project.org/package=AWAPer">AWAPer</a> 0.1.46: Provides catchment area weighted climate data NetCDF files from the Bureau of Meteorology <a href="http://www.bom.gov.au/jsp/awap/">Australian Water Availability Project</a> for all of Australia. There is a vignette on <a href="https://cran.r-project.org/web/packages/AWAPer/vignettes/Catchment_avg_ET_rainfall.html">Daily Area Weighted PET and Precipitation</a> and another on <a href="https://cran.r-project.org/web/packages/AWAPer/vignettes/Point_rainfall.html">Daily Point Precipitation</a></p>
<p><img src="AWAPer.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=caRecall">caRecall</a> v0.1.0: Provides API access to the Government of Canada <a href="https://tc.api.canada.ca/en/detail?api=VRDB">Vehicle Recalls Database</a> used by the Defect Investigations and Recalls Division for vehicles, tires, and child car seats. See the <a href="https://cran.r-project.org/web/packages/caRecall/vignettes/vrd_vignette.html">vignette</a>.</p>
<p><img src="caRecall.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=geofi">geofi</a> v1.0.0: Provides tools for reading Finnish open geospatial data in R. There are vignettes on <a href="https://cran.r-project.org/web/packages/geofi/vignettes/geofi_datasets.html">Datasets</a>, <a href="https://cran.r-project.org/web/packages/geofi/vignettes/geofi_joining_attribute_data.html">Joining Attributes</a>, <a href="https://cran.r-project.org/web/packages/geofi/vignettes/geofi_making_maps.html">Making Maps</a>, <a href="https://cran.r-project.org/web/packages/geofi/vignettes/geofi_spatial_analysis.html">Data Manipulation</a>, and <a href="https://cran.r-project.org/web/packages/geofi/vignettes/tricolore_tutorial.html">Color-coded Maps</a>.</p>
<p><img src="geofi.png" height = "400" width="200"></p>
<p><a href="https://cran.r-project.org/package=hockeystick">hockeystick</a> v0.4.0: Provides easy access to essential climate change data sets for non-climate experts. Users can download the latest raw data from authoritative sources and view it via pre-defined <code>ggplot2</code> charts. Data sets include atmospheric CO2, instrumental and proxy temperature records, sea levels, Arctic/Antarctic sea-ice, and Paleoclimate data. Sources include: <a href="https://www.esrl.noaa.gov/gmd/ccgg/trends/data.html">NOAA Mauna Loa Laboratory</a>, <a href="https://data.giss.nasa.gov/gistemp/">NASA GISTEMP</a>, <a href="https://nsidc.org/data/seaice_index/archives">National Snow and Sea Ice Data Center</a>, <a href="http://www.cmar.csiro.au/sealevel/sl_data_cmar.htm">CSIRO</a>, <a href="https://www.star.nesdis.noaa.gov/socd/lsa/SeaLevelRise/">NOAA Laboratory for Satellite Altimetry</a>, and <a href="https://cdiac.ess-dive.lbl.gov/trends/co2/vostok.html">Vostok Paleo</a> carbon dioxide and temperature data. See <a href="https://cran.r-project.org/web/packages/hockeystick/readme/README.html">README</a> for examples.</p>
<p><img src="hockeystick.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=votesmart">votesmart</a> v0.1.0: Implements a wrapper to the <a href="https://justfacts.votesmart.org/">Project VoteSmart</a> API. See the <a href="https://cran.r-project.org/web/packages/votesmart/vignettes/votesmart.html">vignette</a>.</p>
<h3 id="finance">Finance</h3>
<p><a href="https://cran.r-project.org/package=PriceIndices">PriceIndices</a> v0.0.3: Provides functions to compute bilateral and multilateral indexes. For details, see: <a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/roiw.12304">de Haan and Krsinich (2017)</a> and <a href="https://www.tandfonline.com/doi/abs/10.1080/07350015.2020.1816176?journalCode=ubes20">Diewert and Fox (2020)</a>. The <a href="https://cran.r-project.org/web/packages/PriceIndices/vignettes/PriceIndices.html">vignette</a> offers examples.</p>
<p><a href="https://cran.r-project.org/package=treasuryTR">treasuryTR</a> v0.1.1: Generates Total Returns (TR) from bond yield data with fixed maturity (e.g. reported treasury yields) which may provide an alternative to commercial products. See <a href="https://www.mdpi.com/2306-5729/4/3/91">Swinkels (2019)</a> for background and the <a href="https://cran.r-project.org/web/packages/treasuryTR/vignettes/treasuryTR.html">vignette</a> for examples.</p>
<p><img src="treasuryTR.png" height = "200" width="400"></p>
<h3 id="games">Games</h3>
<p><a href="https://cran.r-project.org/package=pixelpuzzle">pixelpuzzle</a> v1.0.0: Implements a puzzle game that can be played in the R console. Restore the pixel art by shifting rows. Learn how to play <a href="https://github.com/rolkra/pixelpuzzle">here</a>.</p>
<p><img src="pixelpuzzle.png" height = "200" width="400"></p>
<h3 id="genomics">Genomics</h3>
<p><a href="https://cran.r-project.org/package=CDSeq">CDSeq</a> v1.0.8: Provides functions to estimate cell-type-specific gene expression profiles and sample-specific cell-type proportions simultaneously using bulk sequencing data. See <a href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007510">Kang et al. (2019)</a> for the theory and the <a href="https://cran.r-project.org/web/packages/CDSeq/vignettes/CDSeq-vignette.html">vignette</a> for examples.</p>
<p><img src="CDSeq.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=ClusTorus">ClusTorus</a> v0.0.1: Provides various tools for clustering multivariate angular data on the torus, including angular adaptations of usual clustering methods such as k-means clustering based on pairwise angular distances. See the <a href="https://cran.r-project.org/web/packages/ClusTorus/vignettes/ClusTorus.html">vignette</a> for examples.</p>
<p><img src="ClusTorus.png" height = "200" width="400"></p>
<p><a href="https://CRAN.R-project.org/package=dsb">dsb</a> v0.1.0: Provides a method for normalizing and denoising protein expression data from droplet based single cell experiments. See the <a href="https://cran.r-project.org/web/packages/dsb/vignettes/dsb_normalizing_CITEseq_data.html">vignette</a> for tutorials on how to integrate <code>dsb</code> with Seurat, Bioconductor and the AnnData class in Python. The preprint <a href="https://www.biorxiv.org/content/10.1101/2020.02.24.963603v1">Mulè et al. (2020)</a> describes the details.</p>
<p><img src="dsb.png" height = "200" width="400"></p>
<h3 id="machine-learning">Machine Learning</h3>
<p><a href="https://cran.r-project.org/package=bestridge">bestridge</a> v1.0.4: Provides functions to perform ridge regression in complex situations on high dimensional data using the primal dual active set algorithm proposed in <a href="https://www.jstatsoft.org/article/view/v094i04">Wen et al. (2020)</a>. Functions support regression, classification, count regression and censored regression, group variable selection, and nuisance variable selection. See the <a href="https://cran.r-project.org/web/packages/bestridge/vignettes/An-introduction-to-bestridge.html">vignette</a> for examples.</p>
<p><img src="bestridge.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=ROCket">ROCket</a> v1.0.1: Provides functions for estimating receiver operating characteristic (ROC) curves and area under the curve (AUC) calculation which distinguish two types of ROC curve representations: 1) parametric curves - the true positive rate (TPR) and the false positive rate (FPR) are functions of a score parameter and 2) function curves - TPR is a function of FPR. See <a href="https://www.ine.pt/revstat/pdf/rs140101.pdf">Gonçalves et al. (2014)</a> and <a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/j.0006-341X.2004.00200.x">Cai & Pepe (2004)</a> for background and <a href="https://cran.r-project.org/web/packages/ROCket/readme/README.html">README</a> to get started.</p>
<p><img src="ROCket.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=wordpiece">wordpiece</a> v1.0.2: Provides functions to apply <a href="https://arxiv.org/abs/1609.08144">Wordpiece</a> tokenization to input text, given an appropriate vocabulary. The <a href="https://arxiv.org/abs/1810.04805">BERT</a> tokenization conventions are used by default. See the <a href="https://cran.r-project.org/web/packages/wordpiece/vignettes/basic_usage.html">vignette</a> for an example.</p>
<h3 id="mathematics">Mathematics</h3>
<p><a href="https://cran.r-project.org/package=fractD">fractD</a> v0.1.0: Estimates the fractal dimension of a black area in 2D and 3D (slices) images using the box-counting method. See <a href="https://link.springer.com/article/10.1007%2FBF02065874">Klinkenberg (1994)</a> for background and the <a href="https://cran.r-project.org/web/packages/fractD/vignettes/Calculates_the_fractal_dimension_of_2D_and_3D_images.html">vignette</a> for examples.</p>
<p><img src="fractD.svg" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=spacefillr">spacefillr</a> v0.2.0: Generates random and quasi-random space-filling sequences including <a href="https://en.wikipedia.org/wiki/Halton_sequence">Halton</a>, <a href="https://en.wikipedia.org/wiki/Sobol_sequence">Sobol</a> and other sequences with errors distributed as various types of jittered blue noise. See <a href="https://epubs.siam.org/doi/10.1137/070709359">Joe and Kuo (2018)</a>, <a href="https://graphics.pixar.com/library/ProgressiveMultiJitteredSampling/paper.pdf">Christensen et al. (2018)</a> and <a href="https://dl.acm.org/doi/10.1145/3306307.3328191">Heitz et al. (2019)</a> for background and look <a href="https://github.com/tylermorganwall/spacefillr">here</a> for examples.</p>
<p><img src="spacefillr.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=tensorsign">tensorsign</a> v0.1.0: Provides an efficient algorithm for nonparametric tensor completion via sign series. The algorithm, which employs an alternating optimization approach to solve the weighted classification problem, is described in <a href="https://arxiv.org/abs/2102.00384">Lee and Wang (2021)</a>.</p>
<h3 id="medicine">Medicine</h3>
<p><a href="https://cran.r-project.org/package=bhmbasket">bhmbasket</a> v0.9.1: Provides functions to evaluate basket trial designs with binary endpoints using Bayesian hierarchical models and Bayesian decision rules. See <a href="https://journals.sagepub.com/doi/10.1177/1740774513497539">Berry et al. (2013)</a>, <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/pst.1730">Neuenschwander et al. (2016)</a> and <a href="https://link.springer.com/article/10.1177%2F2168479014533970">Fisch et al. (2015)</a> for background and the <a href="https://cran.r-project.org/web/packages/bhmbasket/vignettes/reproduceExNex.html">vignette</a> for an example.</p>
<p><a href="https://cran.r-project.org/package=bp">bp</a> v1.0.1: Provides functions to aid in the analysis of blood pressure data of all forms by providing both descriptive and visualization tools for researchers. There is a <a href="https://cran.r-project.org/web/packages/bp/vignettes/bp.html">vignette</a>.</p>
<p><img src="blood.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=CHOIRBM">CHOIRBM</a> v0.0.2: Provides functions for visualizing body map data collected with the Collaborative Health Outcomes Information Registry (<a href="https://choir.stanford.edu/">CHOIR</a>). See the <a href="https://cran.r-project.org/web/packages/CHOIRBM/vignettes/plot-one-patient.html">vignette</a>.</p>
<p><img src="CHOIRBM.png" height = "300" width="300"></p>
<p><a href="https://cran.r-project.org/package=QDiabetes">QDiabetes</a> v1.0-2: Calculates the risk of developing type 2 diabetes using risk prediction algorithms derived by <a href="https://clinrisk.co.uk/ClinRisk/Welcome.html">ClinRisk</a>. Look <a href="https://github.com/Feakster/qdiabetes">here</a> for information and examples.</p>
<p><a href="https://cran.r-project.org/package=SteppedPower">SteppedPower</a> v0.1.0: Provides tools for power and sample size calculations and design diagnostics for longitudinal mixed models with a focus on stepped wedge designs using methods introduced in <a href="https://www.sciencedirect.com/science/article/pii/S1551714406000632?via%3Dihub">Hussey and Hughes (2007)</a> and extensions discussed in <a href="https://journals.sagepub.com/doi/10.1177/0962280220932962">Li et al. (2020)</a>. See the <a href="https://cran.r-project.org/web/packages/SteppedPower/vignettes/Getting_Started.html">vignette</a> to get started.</p>
<p><img src="SteppedPower.png" height = "200" width="400"></p>
<h3 id="networks-and-graphs">Networks and Graphs</h3>
<p><a href="https://cran.r-project.org/package=bnmonitor">bnmonitor</a> v0.1.0. Implements sensitivity and robustness methods for Bayesian networks including methods to perform parameter variations via a variety of co-variation schemes, to compute sensitivity functions and to quantify the dissimilarity of two Bayesian networks via distances and divergences. See <a href="https://www.jair.org/index.php/jair/article/view/10307">Chan and Darwiche (2002)</a>, <a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1539-6975.2007.00235.x">Cowell et al. (2007)</a>, and <a href="https://arxiv.org/abs/1809.10794">Goergen and Leonell (2020)</a> for background and <a href="https://cran.r-project.org/web/packages/bnmonitor/readme/README.html">README</a> for examples.</p>
<p><img src="bnmonitor.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=iconr">iconr</a> v0.1.0: Provides formal methods for studying archaeological iconographic data sets (rock-art, pottery decoration, stelae, etc.) using network and spatial analysis. See <a href="http://archiv.ub.uni-heidelberg.de/propylaeumdok/512/">Alexander (2008)</a> and <a href="https://hal.archives-ouvertes.fr/hal-02913656">Huet (2018)</a> for background and the <a href="https://cran.r-project.org/web/packages/iconr/vignettes/index.html">vignette</a> for examples.</p>
<p><img src="iconr.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=MLVSBM">MLVSBM</a> 0.2.1: Provides functions for simulation, inference and clustering of multilevel networks using a stochastic block model framework as described in <a href="https://www.sciencedirect.com/science/article/abs/pii/S016794732100013X?via%3Dihub">Chabert-Liddell et al. (2021)</a>. There is a <a href="https://cran.r-project.org/web/packages/MLVSBM/vignettes/vignette.html">tutorial</a>.</p>
<p><img src="MLVSBM.png" height = "300" width="300"></p>
<p><a href="https://cran.r-project.org/package=motifr">motifr</a> v1.0.0: Provides tools to analyze motifs (small configurations of nodes and edges) in multi-level networks (networks which combine multiple networks in one, e.g., social-ecological networks). See <a href="https://cran.r-project.org/web/packages/motifr/vignettes/motif_zoo.html">The motif zoo</a> and <a href="https://cran.r-project.org/web/packages/motifr/vignettes/random_baselines.html">Baseline model comparisons</a>.</p>
<p><img src="motifr.svg" height = "200" width="400"></p>
<h3 id="statistics">Statistics</h3>
<p><a href="https://cran.r-project.org/package=cfda">cfda</a> v0.9.9: Provides functions to encode categorical data as functional data and perform basic statistical analysis. See <a href="https://hal.inria.fr/hal-02973094/document">Preda et al. (2020)</a> for background and the <a href="https://cran.r-project.org/web/packages/cfda/vignettes/cfda.html">vignette</a> to get started.</p>
<p><img src="cfda.png" height = "350" width="350"></p>
<p><a href="https://cran.r-project.org/package=cvCovEst">cvCovEst</a> v0.3.4: Implements an efficient cross-validated approach for covariance matrix estimation, particularly useful in high-dimensional settings. See the <a href="https://cran.r-project.org/web/packages/cvCovEst/vignettes/using_cvCovEst.html">vignette</a> for background and examples.</p>
<p><img src="cvCovEst.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=flipr">flipr</a> v0.2.1: Implements a permutation framework for point estimation, confidence intervals, and hypothesis testing for multiple data types. There is a <a href="https://cran.r-project.org/web/packages/flipr/vignettes/flipr.html">Tour of Permutation Inference</a>, and vignettes on <a href="https://cran.r-project.org/web/packages/flipr/vignettes/alternative.html">Alternative Hypothesis Testing</a>, the <a href="https://cran.r-project.org/web/packages/flipr/vignettes/exactness.html">Exactness of Permutation Tests</a>, and <a href="https://cran.r-project.org/web/packages/flipr/vignettes/pvalue-function.html">Calculating p-value Functions</a>.</p>
<p><img src="flipr.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=ipmr">ipmr</a> v0.0.1: Implements integral projection models using an expression-based framework that handles density dependence and environmental stochasticity, and provides tools for diagnostics, plotting, simulations, and analysis. See <a href="https://esajournals.onlinelibrary.wiley.com/doi/abs/10.1890/0012-9658%282000%29081%5B0694%3ASSSAAN%5D2.0.CO%3B2">Easterling et al. (2000)</a> for an in-depth description of integral projection models. There is an <a href="https://cran.r-project.org/web/packages/ipmr/vignettes/ipmr-introduction.html">Introduction</a> and vignettes on <a href="https://cran.r-project.org/web/packages/ipmr/vignettes/age_x_size.html">Age-Size IPMS</a>, <a href="https://cran.r-project.org/web/packages/ipmr/vignettes/density-dependence.html">Density Dependent IPMS</a>, <a href="https://cran.r-project.org/web/packages/ipmr/vignettes/hierarchical-notation.html">Hierarchical Notation</a>, and <a href="https://cran.r-project.org/web/packages/ipmr/vignettes/proto-ipms.html">Data Structures</a>.</p>
<p><a href="https://cran.r-project.org/package=metapack">metapack</a> v0.1.1: Provides functions for performing Bayesian inference for meta-analytic and network meta-analytic models through a Markov chain Monte Carlo algorithm. See <a href="https://www.tandfonline.com/doi/full/10.1080/01621459.2015.1006065">Yao et al. (2015)</a> for the theory, the <a href="https://cran.r-project.org/web/packages/metapack/vignettes/intro-to-metapack.html">vignette</a> for an introduction, and the <a href="http://merlot.stat.uconn.edu/packages/metapack/">online documentation</a>.</p>
<p><a href="https://cran.r-project.org/package=sassy">sassy</a> v1.0.4: Loads a collection of packages that collectively aim to make R easier for SAS® programmers. Functions bring many familiar SAS® concepts to R, including data libraries, data dictionaries, formats and format catalogs, a data step, and a traceable log. There is an <a href="https://cran.r-project.org/web/packages/sassy/vignettes/sassy.html">Introduction</a>, and vignettes with example <a href="https://cran.r-project.org/web/packages/sassy/vignettes/sassy-figure.html">Figures</a>, <a href="https://cran.r-project.org/web/packages/sassy/vignettes/sassy-listing.html">Listings</a>, and <a href="https://cran.r-project.org/web/packages/sassy/vignettes/sassy-table.html">Tables</a>, as well as a few <a href="https://cran.r-project.org/web/packages/sassy/vignettes/sassy-disclaimers.html">Disclaimers</a> which include a statement indicating that the packages were developed in the context of the pharmaceutical industry but should be generally helpful.</p>
<h3 id="utilities">Utilities</h3>
<p><a href="https://cran.r-project.org/package=gargoyle">gargoyle</a> v0.0.1: Implements an event-based framework for building <code>Shiny</code> apps. Instead of relying on standard <code>Shiny</code> reactive objects, this package allows you to rely on a lighter set of triggers, so that reactive contexts can be invalidated with more control. See the <a href="https://cran.r-project.org/web/packages/gargoyle/vignettes/gargoyle.html">vignette</a>.</p>
<p><a href="https://cran.r-project.org/package=multidplyr">multidplyr</a> Provides simple multicore parallelism through functions that partition a data frame across multiple worker processes. See the <a href="https://cran.r-project.org/web/packages/multidplyr/vignettes/multidplyr.html">vignette</a>.</p>
<p><a href="https://cran.r-project.org/package=quarto">quarto</a> v0.1: Provides an interface to the <a href="https://github.com/avdi/quarto">Quarto</a> markdown publishing system and allows converting R Markdown documents and <a href="https://jupyter.org/">Jupyter Notebooks</a> to a variety of output formats.</p>
<p><a href="https://cran.r-project.org/package=vmr">vmr</a> v0.0.2: Provides functions to manage, provision, and use virtual machines pre-configured for R, and to develop, test, and build packages in a clean environment. <a href="https://www.vagrantup.com/intro">Vagrant</a> and a provider such as <a href="https://www.virtualbox.org/">Virtualbox</a> must be installed.</p>
<h3 id="visualization">Visualization</h3>
<p><a href="https://cran.r-project.org/package=ggh4x">ggh4x</a> v0.1.2.1: Extends <code>ggplot2</code> facets by setting individual scales per panel, resizing panels, providing nested facets, and allowing multiple colour and fill scales per plot. See the <a href="https://cran.r-project.org/web/packages/ggh4x/vignettes/ggh4x.html">Introduction</a>, and the vignettes <a href="https://cran.r-project.org/web/packages/ggh4x/vignettes/Facets.html">Facets</a>, <a href="https://cran.r-project.org/web/packages/ggh4x/vignettes/Miscellaneous.html">Misc</a>, <a href="https://cran.r-project.org/web/packages/ggh4x/vignettes/PositionGuides.html">Position Guides</a>, and <a href="https://cran.r-project.org/web/packages/ggh4x/vignettes/Statistics.html">Statistics</a>.</p>
<p><img src="ggh4x.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=tastypie">tastypie</a> v0.0.3: Provides functions and templates for making pie charts even though you probably shouldn’t. See the vignettes <a href="https://cran.r-project.org/web/packages/tastypie/vignettes/available_templates.html">available templates</a> and <a href="https://cran.r-project.org/web/packages/tastypie/vignettes/your_favourite_template.html">Your favorite template</a>, and look <a href="https://paolodalena.github.io/tastypie/">here</a> for examples.</p>
<p><img src="tastypie.png" height = "350" width="350"></p>
<p><a href="https://cran.r-project.org/package=terrainr">terrainr</a> v0.3.1: Provides functions to retrieve, manipulate, and visualize geospatial data, with an aim towards producing ‘3D’ landscape visualizations in the <a href="https://unity.com/">Unity 3D</a> rendering engine. Functions are also provided for retrieving elevation data and base map tiles from the <a href="https://apps.nationalmap.gov/services/">USGS National Map</a>. There is an <a href="https://cran.r-project.org/web/packages/terrainr/vignettes/overview.html">Introduction</a> and a <a href="https://cran.r-project.org/web/packages/terrainr/vignettes/unity_instructions.html">vignette</a> on importing terrain tiles.</p>
<p><img src="terrainr.jpeg" height = "300" width="300"></p>
<script>window.location.href='https://rviews.rstudio.com/2021/03/19/february-2021-top-40-new-cran-packages/';</script>
Cheat Sheets
https://rviews.rstudio.com/2021/03/10/rstudio-open-source-resorurces/
Wed, 10 Mar 2021 00:00:00 +0000https://rviews.rstudio.com/2021/03/10/rstudio-open-source-resorurces/
<p>In a <a href="https://rviews.rstudio.com/2020/12/02/learn-and-teach-r/">previous post</a>, I described how I was captivated by the virtual landscape imagined by the RStudio education team while looking for resources on the <a href="https://rstudio.com/">RStudio</a> website. In this post, I’ll take a look at
<a href="https://rstudio.com/resources/cheatsheets/"><em>Cheatsheets</em></a> another amazing resource hiding in plain sight.</p>
<p><img src="cs.png" height = "400" width="100%"></p>
<p>Apparently, some time ago when I wasn’t paying much attention, cheat sheets evolved from the homemade study notes of students with highly refined visual cognitive skills, but a relatively poor grasp of algebra or history or whatever, into an essential software learning tool. I don’t know how this happened in general, but master cheat sheet artist Garrett Grolemund has passed along some of the lore of the cheat sheet at RStudio. Garrett writes:</p>
<blockquote>
<p>One day I put two and two together and realized that our Winston Chang, who I had known for a couple of years, was the same “W Chang” that made the LaTeX cheatsheet that I’d used throughout grad school. It inspired me to do something similarly useful, so I tried my hand at making a cheatsheet for Winston and Joe’s Shiny package. The Shiny cheatsheet ended up being the first of many. A funny thing about the first cheatsheet is that I was working next to Hadley at a co-working space when I made it. In the time it took me to put together the cheatsheet, he wrote the entire first version of the tidyr package from scratch.</p>
</blockquote>
<p>It is now hard to imagine getting by without cheat sheets. It seems as if they are becoming an expected adjunct to the documentation. But, as Garrett explains in the <a href="https://github.com/rstudio/cheatsheets">README</a> for the cheat sheets GitHub repository, <strong>they are not documentation!</strong></p>
<blockquote>
<p>RStudio cheat sheets are not meant to be text or documentation! They are scannable visual aids that use layout and visual mnemonics to help people zoom to the functions they need. … Cheat sheets fall squarely on the human-facing side of software design.</p>
</blockquote>
<p>Cheat sheets live in the space where <a href="https://psnet.ahrq.gov/primer/human-factors-engineering">human factors</a> engineering gets a boost from artistic design. If R packages were airplanes then pilots would want cheat sheets to help them master the controls.</p>
<p>The RStudio site contains sixteen RStudio-produced cheat sheets and nearly forty contributed efforts, some of which are displayed in the graphic above. The <a href="https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf"><em>Data Transformation cheat sheet</em></a> is a classic example of a straightforward mnemonic tool.
It is likely that even someone who is just beginning to work with <code>dplyr</code> will immediately grok that it organizes functions that manipulate tidy data. The cognitive load, then, is to remember how functions are grouped by task. The cheat sheet offers a canonical set of classes: “manipulate cases”, “manipulate variables”, etc. to facilitate the process. Users who work with <code>dplyr</code> on a regular basis will probably just need to glance at the cheat sheet after a relatively short time.</p>
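<p>The groupings on the sheet map directly onto code. A minimal <code>dplyr</code> sketch touching the “manipulate cases” and “manipulate variables” classes (using the built-in <code>mtcars</code> data):</p>
<pre class="r"><code>library(dplyr)

mtcars %>%
  filter(cyl == 4) %>%          # manipulate cases: keep 4-cylinder cars
  arrange(desc(mpg)) %>%        # manipulate cases: order the rows
  mutate(kpl = mpg * 0.425) %>% # manipulate variables: compute a new column
  select(mpg, kpl, wt)          # manipulate variables: keep a subset</code></pre>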
<p>The <a href="https://github.com/rstudio/cheatsheets/raw/master/shiny.pdf"><em>Shiny cheat sheet</em></a> is a little more ambitious. It works on multiple levels, going beyond categories to also suggest process and workflow.</p>
<p><img src="shiny.png" height = "400" width="100%"></p>
<p>The <a href="https://github.com/rstudio/cheatsheets/raw/master/purrr.pdf"><em>Apply functions cheat sheet</em></a> takes on an even more difficult task. For most of us, internally visualizing multi-level data structures is difficult enough; imagining how data elements flow under transformations is a serious cognitive load. I, for one, really appreciate the help.</p>
<p><img src="purrr.png" height = "400" width="100%"></p>
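<p>The flows that the sheet diagrams correspond to code like the following: mapping a function over a nested list, with the suffix of the <code>map</code> variant determining the shape of the output.</p>
<pre class="r"><code>library(purrr)

scores <- list(alice = c(80, 90, 85), bob = c(70, 75))

map(scores, range)     # returns a list, preserving the input structure
map_dbl(scores, mean)  # flattens to a named numeric vector: 85, 72.5</code></pre>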
<p>Cheat sheets are immensely popular. Even in this ebook age, when nearly everything you can look at is online and conference-attending digital natives travel light, cheat sheets as artifacts retain considerable appeal. Not only are they useful tools and geek art for decorating a workplace (take a look at <a href="https://github.com/rstudio/cheatsheets/raw/master/cartography.pdf"><em>cartography</em></a>); my guess is that they are perceived as <em>runes of power</em> enabling the cognoscenti to grasp essential knowledge and project it into the world.</p>
<p>When in-person conferences resume again, I fully expect the heavy paper copies to disappear soon after we put them out at the RStudio booth.</p>
2021 R Conferences
https://rviews.rstudio.com/2021/03/03/2021-r-conferences/
Wed, 03 Mar 2021 00:00:00 +0000https://rviews.rstudio.com/2021/03/03/2021-r-conferences/
<p><img src="conf2021.png" height = "400" width="100%"></p>
<p>It is not yet clear what lasting impact the Covid-19 pandemic will ultimately have on R conferences. We are still adapting to our inability to attend large events, and trying to make the best of the “silver lining” of virtual events which permit worldwide participation. The following is an attempt to list 2021 conferences that are likely to have interesting R content. I suspect that it is incomplete. If you know of an R Conference that is not mentioned, please add it to the comments section for this post.</p>
<h3 id="upcoming-events">Upcoming Events</h3>
<p><a href="https://www.ire.org/training/conferences/nicar-2021/">NICAR 2021</a> (March 3 - 5), the Investigative Reporters & Editors Conference on data journalism should be well attended by data journalists using R for their everyday reporting.</p>
<p><a href="https://cascadiarconf.com/">CascadiaRConf 2021</a> (June 4 - 5), a jewel of a regional R conference for its first three years, was canceled in 2020. It is back this year as a virtual event. The <a href="https://cascadiarconf.com/speakers/">Call for Presentations</a> is open.</p>
<p><a href="https://www.phuse-events.org/attend/frontend/reg/thome.csp?pageID=2283&eventID=6&traceRedir=2">PHUSE US Connect 2021</a> (June 14 - 18) - PHUSE is a non-profit organization with the mission: “Sharing ideas, tools and standards around data, statistical and reporting technologies to advance the future of life sciences.” The conference, which is focused on clinical data science, is likely to have some interesting R content this year. The Call for Papers is open.</p>
<p><a href="https://psiweb.org/conferences/about-the-conference">PSI 2021 Online</a> (June 21 - 23) usually attracts six hundred or so statisticians from the pharmaceutical industry when the conference is held in person. <a href="https://www.psiweb.org/">PSI</a> statisticians bring you <a href="https://rviews.rstudio.com/2021/01/11/wonderful-wednesdays/">Wonderful Wednesdays</a>.</p>
<p><a href="https://user2021.r-project.org/">useR! 2021</a> (July 5 - 9) has an outstanding lineup of <a href="https://user2021.r-project.org/program/keynotes/">keynote speakers</a>. The <a href="https://user2021.r-project.org/program/overview/">program</a> is very likely to make US-based attendees night owls.</p>
<p><a href="https://bioc2021.bioconductor.org/">BioC 2021</a> (August 4 - 6) is the must-attend event for anyone doing computational biology. Peruse the <a href="https://bioc2021.bioconductor.org/conferences/">slides</a> of past events to get a “rear view preview” of what to expect.</p>
<p><a href="https://ww2.amstat.org/meetings/jsm/2021/">JSM 2021</a> Seattle (August 7 - 12), the mother of all statistics conferences, usually draws between 4,000 and 6,000 statisticians to in-person events. The organizers appear to be following some pretty optimistic Covid-19 vaccination-rate models.</p>
<p><a href="https://events.linuxfoundation.org/r-medicine/">R/Medicine 2021</a> (August 27 - 29) has the dates, but no website yet. Don’t worry: the clinicians are big come-from-behind organizers. <a href="https://rviews.rstudio.com/2020/09/16/some-thoughts-on-r-medicine-2020/">Last year’s</a> conference was outstanding, and I expect an amazing event again this year.</p>
<p><a href="https://rinpharma.com/">R/Pharma 2021</a> organizers like to give R/Medicine organizers a head start, but a well-placed source tells me that the conference will take place in Q3 or Q4. For the past three years, <a href="https://rviews.rstudio.com/2018/10/03/some-thoughts-on-r-pharma-2018/">R/Pharma</a> has been a bright star among R conferences, where some of the best Shiny developers in the world meet and discuss their work.</p>
<p><a href="https://info.mango-solutions.com/earl-2021#:~:text=EARL%202021%206%2D10th%20September,of%20the%20world%27s%20leading%20practitioners">EARL Conference 2021</a> (September 6 - 10), the premier R in industry event, will be online this year. The call for abstracts is already open.</p>
<p><a href="https://rstats.ai/">NY R Conference 2021</a> is usually the perfect way to spend a couple of Manhattan spring days. This year, the organizers are hoping for an in-person event in August or September if things go really well, but are planning to surpass their spectacular 2020 virtual event if they don’t.</p>
<p><a href="https://ww2.amstat.org/meetings/biop/2021/workshopinfo.cfm">BIOP 2021</a> Rockville, MD (September 21 - 23) may be an in-person event. This workshop was originally an event for FDA statisticians but is now open to all statisticians interested in statistical practices for all areas regulated by the FDA.</p>
<p><a href="https://rnorthconference.github.io/">noRth 2021</a> (September 29 - 30) is a regional conference out of the “Twin Cities” that is looking to virtually expand its reach within the R community. Gabriela de Queiroz heads the list of confirmed speakers, which includes new faces from IBM, Google, and the Federal Reserve.</p>
<p><a href="https://2021.foss4g.org/">FOSS4G 2021</a> Buenos Aires (September 27 - October 2) is the annual conference of <a href="https://www.osgeo.org/">OSGeo</a>, the Open Source Geospatial Foundation. Given the prominence of R in geospatial analysis, this is sure to be an R-heavy event. The conference will be online.</p>
<p>PHUSE EU Connect 21 (November 15 - 19): see the PHUSE US Connect entry above.</p>
<p><a href="https://rstats.ai/">R Government</a> has a reasonable chance of pulling off an in-person event (at least for people in the DC area) sometime in December if the region gets a break from Covid.</p>
<h3 id="earlier-events">Earlier Events</h3>
<p><a href="https://rstudio.com/resources/rstudioglobal-2021/">rstudio::global</a> (January 21) - The <a href="https://rviews.rstudio.com/2021/02/04/some-thoughts-on-rstudio-global/">talks</a> from this unique 24-hour, worldwide event are online.</p>
<p><a href="https://www.eshackathon.org/events/2021-01-ESMAR.html">Evidence Synthesis and Meta-Analysis in R</a> - The talks from this conference and hackathon which attracted 514 participants from 26 countries are online <a href="https://www.youtube.com/channel/UCqoKd8CCBInvyDMqeqGs0YQ">here</a>.</p>
January 2021: "Top 40" New CRAN Packages
https://rviews.rstudio.com/2021/02/24/january-2020-top-40-new-cran-packages/
Wed, 24 Feb 2021 00:00:00 +0000https://rviews.rstudio.com/2021/02/24/january-2020-top-40-new-cran-packages/
<p>Two hundred thirty new packages made it to CRAN in January. Here are my “Top 40” selections in ten categories: Data, Finance, Genomics, Machine Learning, Medicine, Science, Statistics, Time Series, Utilities, and Visualization.</p>
<h3 id="data">Data</h3>
<p><a href="https://cran.r-project.org/package=igoR">igoR</a> v0.1.1: Provides tools to extract information from the Intergovernmental Organizations (‘IGO’) Database, version 3, provided by the <a href="https://correlatesofwar.org/">Correlates of War Project</a>. See <a href="https://correlatesofwar.org/">Pevehouse et al. (2020)</a> for information from 1815 to 2014, and get started with the <a href="https://cran.r-project.org/web/packages/igoR/vignettes/igoR.html">vignette</a>.</p>
<p><img src="igoR.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=OTrecod">OTrecod</a> v0.1.0: Uses optimal transportation theory as described in <a href="https://www.degruyter.com/document/doi/10.1515/ijb-2018-0106/html">Gares, Guernec & Savy (2019)</a> and <a href="https://www.tandfonline.com/doi/abs/10.1080/01621459.2020.1775615?journalCode=uasa20">Gares & Omer (2020)</a> to solve recoding problems. Given two databases that share a subset of variables, package functions assist users in obtaining a unique synthetic database with complete information. See the <a href="https://cran.r-project.org/web/packages/OTrecod/vignettes/an-application-of-the-OTrecod-package.html">vignette</a> for examples.</p>
<p><a href="https://cran.r-project.org/package=pwt10">pwt10</a> v10.0-0: Interfaces to the <a href="http://www.ggdc.net/pwt/">Penn World Table 10.x</a> which provides information on relative levels of income, output, input, and productivity for 183 countries between 1950 and 2019.</p>
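<p>A quick way to pull the table into a session. The dataset and column names below are assumptions: <code>pwt10.0</code> follows the naming convention of the earlier <code>pwt9</code> releases (<code>pwt9.0</code>, <code>pwt9.1</code>), and <code>rgdpe</code>/<code>pop</code> follow the published Penn World Table variable list.</p>
<pre class="r"><code>library(pwt10)

data("pwt10.0")  # assumed dataset name, following the pwt9.x convention

# Expenditure-side real GDP and population for one country slice
subset(pwt10.0, country == "Germany" & year >= 2015,
       select = c(country, year, rgdpe, pop))</code></pre>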
<p><a href="https://cran.r-project.org/package=trainR">trainR</a> v0.0.1: Interfaces to the <a href="https://www.nationalrail.co.uk/46391.aspx">National Rail Enquiries</a> systems, including Darwin, which provides real-time arrival and departure predictions, platform numbers, delay estimates, schedule changes, and cancellations. Look <a href="https://villegar.github.io/trainR/">here</a> for examples.</p>
<h3 id="finance">Finance</h3>
<p><a href="https://cran.r-project.org/package=LSMRealOptions">LSMRealOptions</a> v0.1.0: Provides an implementation of the <a href="https://academic.oup.com/rfs/article-abstract/14/1/113/1587472?redirectedFrom=fulltext">least-squares Monte Carlo</a> simulation method to value American option products and capital investment projects through real options analysis. Cash flows are modeled as being dependent upon underlying state variables that evolve stochastically. See the <a href="https://cran.r-project.org/web/packages/LSMRealOptions/vignettes/LSMRealOptions.html">vignette</a> for examples.</p>
<h3 id="genomics">Genomics</h3>
<p><a href="https://cran.r-project.org/package=AlleleShift">AlleleShift</a> v0.9-2: Provides methods for calibrating and predicting shifts in allele frequencies through redundancy analysis (<code>vegan::rda()</code>) and generalized additive models (<code>mgcv::gam()</code>) and functions to visualize the predicted changes in frequencies. See <a href="https://cran.r-project.org/web/packages/AlleleShift/readme/README.html">README</a> for examples.</p>
<p><img src="AlleleShift.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=GenomeAdmixR">GenomeAdmixR</a> v1.1.3: Provides tools to simulate how patterns in ancestry along the genome change after admixture. See <a href="https://www.biorxiv.org/content/10.1101/2020.10.19.343491v1">Janzen (2020)</a> for the details and the vignettes <a href="https://cran.r-project.org/web/packages/GenomeAdmixR/vignettes/Demonstrate_isofemales.html">Isofemales</a>, <a href="https://cran.r-project.org/web/packages/GenomeAdmixR/vignettes/Joyplots.html">Joyplot</a>, <a href="https://cran.r-project.org/web/packages/GenomeAdmixR/vignettes/Visualization.html">Visualization</a>, and <a href="https://cran.r-project.org/web/packages/GenomeAdmixR/vignettes/Walkthrough.html">Walkthrough</a>.</p>
<p><img src="GenomeAdmixR.svg" height = "400" width="300"></p>
<p><a href="https://cran.r-project.org/package=MOSS">MOSS</a> v0.1.0: Implements an omics integration method based on sparse singular value decomposition to deal with the challenges of high dimensionality, noise and heterogeneity among samples and features in omics data. See <a href="https://www.nature.com/articles/s41598-020-65119-5">(Gonzalez-Reymundez & Vazquez, 2020)</a> for background and the <a href="https://cran.r-project.org/web/packages/MOSS/vignettes/MOSS_working_example.pdf">vignette</a> for examples.</p>
<p><img src="MOSS.png" height = "300" width="300"></p>
<h3 id="machine-learning">Machine Learning</h3>
<p><a href="https://cran.r-project.org/package=autoMrP">autoMrP</a> v0.98: Implements a tool that improves the prediction performance of multilevel regression with post-stratification (MrP) by combining a number of machine learning methods. For information on the method, refer to <a href="https://lucasleemann.files.wordpress.com/2020/07/automrp-r2pa.pdf">Broniecki, Wüest, Leemann (2020)</a> and the <a href="https://cran.r-project.org/web/packages/autoMrP/vignettes/autoMrP_vignette.pdf">vignette</a>.</p>
<p><a href="https://cran.r-project.org/package=aweSOM">aweSOM</a> v1.1: Implements self-organizing maps, a method for dimensionality reduction and clustering of continuous data, along with interactive graphics to assist analysis. See <a href="https://link.springer.com/book/10.1007%2F978-3-642-56927-2">Kohonen (2001)</a> for background and the <a href="https://cran.r-project.org/web/packages/aweSOM/vignettes/aweSOM.html">vignette</a> for an overview of the package.</p>
<p><img src="aweSOM.png" height = "300" width="300"></p>
<p><a href="https://cran.r-project.org/package=RandomForestsGLS">RandomForestsGLS</a> v0.1.2: Fits non-linear generalized least square regression models with Random Forests as described in <a href="https://arxiv.org/abs/2007.15421">Saha, Basu & Datta (2020)</a>.</p>
<p><img src="rf.png" height = "300" width="300"></p>
<p><a href="https://cran.r-project.org/package=torchaudio">torchaudio</a> v0.1.1.0: Provides access to datasets, models and preprocessing facilities for deep learning in audio. See the <a href="https://cran.r-project.org/web/packages/torchaudio/vignettes/audio_preprocessing_tutorial.html">vignette</a>.</p>
<p><img src="torchaudio.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=vimpclust">vimpclust</a> v0.1.0: Implements functions to perform sparse k-means clustering with a group penalty and variable selection on mixed categorical and numeric data. See <a href="https://www.esann.org/sites/default/files/proceedings/2020/ES2020-103.pdf">Chavet et al. (2020)</a> for background. There are vignettes on <a href="https://cran.r-project.org/web/packages/vimpclust/vignettes/groupsparsewkm.html">numeric</a> and <a href="https://cran.r-project.org/web/packages/vimpclust/vignettes/sparsewkm.html">mixed data</a> sparse k-means clustering.</p>
<p><img src="vimpclust.png" height = "200" width="400"></p>
<h3 id="medicine">Medicine</h3>
<p><a href="https://cran.r-project.org/package=cmprskcoxmsm">cmprskcoxmsm</a> v0.2.0: Provides functions to estimate the treatment effect under a marginal structural model for the cause-specific hazard of competing risk events. Functions also estimate the risk of the potential outcomes, the risk difference, and the risk ratio. See <a href="https://www.tandfonline.com/doi/abs/10.1198/016214501753168154">Hernan et al. (2001)</a> for the theory and the <a href="https://cran.r-project.org/web/packages/cmprskcoxmsm/vignettes/weight_cause_cox.pdf">vignette</a> for examples.</p>
<p><img src="cmprskcoxmsm.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=coder">coder</a> v0.13.5: Provides functions to classify individuals or items based on external code data identified by regular expressions. A typical use case considers patients with medically coded data, such as codes from the International Classification of Diseases. There is an <a href="https://cran.r-project.org/web/packages/coder/vignettes/coder.html">overview</a> and vignettes on <a href="https://cran.r-project.org/web/packages/coder/vignettes/classcodes.html">class codes</a>, <a href="https://cran.r-project.org/web/packages/coder/vignettes/Interpret_regular_expressions.html">interpreting regular expressions</a>, and <a href="https://cran.r-project.org/web/packages/coder/vignettes/ex_data.html">example data</a>.</p>
<p><img src="coder.png" height = "150" width="350"></p>
<p><a href="https://cran.r-project.org/package=dataquieR">dataquieR</a> v1.0.4: Provides functions to assess data quality issues in studies. See the <a href="https://www.tmf-ev.de/EnglishSite/Home.aspx">TMF Guideline</a> and the <a href="https://dfg-qa.ship-med.uni-greifswald.de">DFG Project</a> for background, and the <a href="https://cran.r-project.org/web/packages/dataquieR/vignettes/DQ-report-example.html">vignette</a> for examples.</p>
<p><img src="dataQuieR.png" height = "250" width="450"></p>
<p><a href="https://cran.r-project.org/package=NHSDataDictionaRy">NHSDataDictionaRy</a> v1.2.1: Provides a common set of simplified web scraping tools for working with the <a href="https://datadictionary.nhs.uk/data_elements_overview.html">NHS Data Dictionary</a>. The package was commissioned by the <a href="https://nhsrcommunity.com/">NHS-R community</a> to provide consistent lookups. See the <a href="https://cran.r-project.org/web/packages/NHSDataDictionaRy/vignettes/introduction.html">vignette</a> to get started.</p>
<h3 id="science">Science</h3>
<p><a href="https://cran.r-project.org/package=LPDynR">LPDynR</a> v1.0.1: Implements methods that use phenological and productivity-related variables derived from time series of vegetation indexes to assess ecosystem dynamics. Functions compute an indicator with five classes of land productivity dynamics. Look <a href="https://github.com/xavi-rp/LPD/blob/master/ATBD/LPD_ATBD.pdf">here</a> for background. See the <a href="https://cran.r-project.org/web/packages/LPDynR/vignettes/LPD_PartialTimeSeries_example.html">vignette</a> for an example.</p>
<p><a href="https://cran.r-project.org/package=rgee">rgee</a> v1.0.8: Provides an <a href="https://earthengine.google.com/">Earth Engine</a> client library for R that includes all <code>Earth Engine</code> API classes, modules, and functions, as well as additional functions for importing spatial objects, extracting time series, and displaying metadata and interactive maps. Look <a href="https://r-spatial.github.io/rgee/">here</a> for further details. Read the <a href="https://cran.r-project.org/web/packages/rgee/vignettes/rgee01.html">Introduction</a> and the vignette on <a href="https://cran.r-project.org/web/packages/rgee/vignettes/rgee03.html">Best Practices</a> to get started.</p>
<p><img src="rgee.png" height = "250" width="450"></p>
<p><a href="https://cran.r-project.org/package=SAMtool">SAMtool</a> v1.1.1: Provides tools for simulating the <code>MSEtool</code> operating model to inform data-rich fisheries. It includes a conditioning model, tools for assessing models of varying complexity and comparing models, and diagnostic tools for evaluating assessments inside closed-loop simulations. There is a <a href="https://cran.r-project.org/web/packages/SAMtool/vignettes/SAMtool.html">User Guide</a> and a series of seven more vignettes including an <a href="https://cran.r-project.org/web/packages/SAMtool/vignettes/RCM.html">overview</a> of the Rapid Conditioning Model (RCM) for conditioning <code>MSEtool</code> operating models, and a <a href="https://cran.r-project.org/web/packages/SAMtool/vignettes/RCM_eq.html">mathematical description</a> of RCM.</p>
<h3 id="statistics">Statistics</h3>
<p><a href="https://cran.r-project.org/package=circularEV">circularEV</a> v0.1.0: Provides functions for performing extreme value analysis on a circular domain. See the <a href="https://cran.r-project.org/web/packages/circularEV/vignettes/localMethods.html">local methods example</a> and the <a href="https://cran.r-project.org/web/packages/circularEV/vignettes/splineML.html">spline example</a>.</p>
<p><img src="circularEV.png" height = "300" width="300"></p>
<p><a href="https://cran.r-project.org/package=ghcm">ghcm</a> v1.0.0: Implements a statistical hypothesis test for conditional independence which can be applied to both discretely observed functional data and multivariate data. See <a href="https://arxiv.org/abs/2101.07108">Lundborg et al. (2020)</a> for details and the <a href="https://cran.r-project.org/web/packages/ghcm/vignettes/ghcm.html">vignette</a> for an overview of the generalized Hilbert Covariance measure with examples.</p>
<p><img src="ghcm.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=gplite">gplite</a> v0.11.1: Implements the most common Gaussian process models using Laplace and expectation propagation approximations, maximum marginal likelihood inference for the hyperparameters, and sparse approximations for larger datasets. See the <a href="https://cran.r-project.org/web/packages/gplite/vignettes/quickstart.html">vignette</a> for a quick start.</p>
<p><img src="gplite.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=multibridge">multibridge</a> v1.0.0: Implements functions to evaluate hypotheses concerning the distribution of multinomial proportions using bridge sampling. Functions are able to compute Bayes factors for hypotheses that entail inequality constraints, equality constraints, free parameters, and mixtures of all three. See <a href="https://psyarxiv.com/bux7p/">Sarafoglou et al. (2020)</a> for background and the examples: <a href="https://cran.r-project.org/web/packages/multibridge/vignettes/MemoryOfLifestresses.html">Memory of Lifestresses</a>, <a href="https://cran.r-project.org/web/packages/multibridge/vignettes/MendelianLawsOfInheritance.html">Mendelian Laws of Inheritance</a>, and <a href="https://cran.r-project.org/web/packages/multibridge/vignettes/PrevalenceOfStatisticalReportingErrors.html">Prevalence of Statistical Reporting Errors</a>.</p>
<p><a href="https://cran.r-project.org/package=partR2">partR2</a> v0.9.1: Provides functions to partition the variance explained in generalized linear mixed models (GLMMs) into variation unique to predictors and variation shared among predictors. This can be done using semi-partial <em>R<sup>2</sup></em> and inclusive <em>R<sup>2</sup></em>. See <a href="https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/j.2041-210x.2012.00261.x">Nakagawa & Schielzeth (2013)</a> and <a href="https://royalsocietypublishing.org/doi/10.1098/rsif.2017.0213">Nakagawa, Johnson & Schielzeth (2017)</a> for the theory and the <a href="https://cran.r-project.org/web/packages/partR2/vignettes/Using_partR2.html">vignette</a> for examples.</p>
<p><img src="partR2.jpeg" height = "350" width="500"></p>
<p><a href="https://cran.r-project.org/package=spNetwork">spNetwork</a> v0.1.1: Provides tools to perform spatial analysis on networks, including estimating network kernel densities and building spatial weight matrices. See <a href="https://www.tandfonline.com/doi/abs/10.1080/13658810802475491?journalCode=tgis20">Okabe et al. (2019)</a> for background and the vignettes: <a href="https://cran.r-project.org/web/packages/spNetwork/vignettes/KNetworkFunctions.html">Network k Functions</a>, <a href="https://cran.r-project.org/web/packages/spNetwork/vignettes/NKDE.html">Network Kernel Density Estimate</a>,
<a href="https://cran.r-project.org/web/packages/spNetwork/vignettes/NKDEdetailed.html">Details about NKDE</a>, and <a href="https://cran.r-project.org/web/packages/spNetwork/vignettes/SpatialWeightMatrices.html">Spatial Weight Matrices</a>.</p>
<p><img src="spNetwork.png" height = "300" width="350"></p>
<p><a href="https://cran.r-project.org/package=ubms">ubms</a> v1.0.2: Provides functions to fit Bayesian hierarchical models, including single-season occupancy, dynamic occupancy, and N-mixture abundance models, of animal abundance and occurrence with the <code>rstan</code> package. See <a href="https://www.jstatsoft.org/article/view/v076i01">Carpenter et al. (2017)</a> and <a href="https://www.jstatsoft.org/article/view/v043i10">Fiske and Chandler (2011)</a> for background. There is a package <a href="https://cran.r-project.org/web/packages/ubms/vignettes/ubms.html">Overview</a>, a vignette on <a href="https://cran.r-project.org/web/packages/ubms/vignettes/random-effects.html">Random Effects</a>, and another on <a href="https://cran.r-project.org/web/packages/ubms/vignettes/JAGS-comparison.html">Comparing ubms with JAGS</a>.</p>
<p><img src="ubms.png" height = "200" width="400"></p>
<h3 id="time-series">Time Series</h3>
<p><a href="https://cran.r-project.org/package=autostsm">autostsm</a> v1.2: Provides functions to automate the decomposition of structural time series into trend, cycle, and seasonal components using the Kalman filter. See <a href="https://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780195398649.001.0001/oxfordhb-9780195398649-e-6">Koopman et al. (2012)</a> for the theory and the <a href="https://cran.r-project.org/web/packages/autostsm/vignettes/autostsm_vignette.html">vignette</a> for an overview with examples.</p>
<p><a href="https://cran.r-project.org/package=bayesforecast">bayesforecast</a> v0.0.1: Provides functions to fit Bayesian time series models using <code>Stan</code> for full Bayesian inference. It includes seasonal ARIMA, ARIMAX, dynamic harmonic regression, GARCH, Student-t innovation GARCH, asymmetric GARCH, random walk, and stochastic volatility models for univariate time series. See <a href="https://www.jstatsoft.org/article/view/v027i03">Hyndman (2017)</a> and <a href="https://www.jstatsoft.org/article/view/v076i01">Carpenter et al. (2017)</a> for background and the <a href="https://cran.r-project.org/web/packages/bayesforecast/readme/README.html">README</a> for examples.</p>
<p><img src="bayesforecast.png" height = "450" width="500"></p>
<h3 id="utilities">Utilities</h3>
<p><a href="https://cran.r-project.org/package=autoharp">autoharp</a> v0.0.5: Implements customizable tools for assessing and grading R or R Markdown scripts from students, allowing checks of code-output correctness, runtime statistics, and static code analysis. There is a <a href="https://cran.r-project.org/web/packages/autoharp/vignettes/user-manual.html">User Manual</a> and a vignette on the S4 class <a href="https://cran.r-project.org/web/packages/autoharp/vignettes/treeharp.html">treeharp</a>.</p>
<p><img src="autoharp.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=cachem">cachem</a> v1.0.4: Provides functions to cache R objects with automated pruning. Caches can limit either their total size or the age of the oldest object (or both), automatically pruning objects to maintain the constraints. See <a href="https://cran.r-project.org/web/packages/cachem/readme/README.html">README</a> for examples.</p>
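<p>A minimal sketch of the <code>cachem</code> API, along the lines of the examples in its README:</p>
<pre class="r"><code>library(cachem)

m <- cache_mem(max_size = 10 * 1024^2)  # in-memory cache capped at ~10 MB

m$set("fit", lm(mpg ~ wt, data = mtcars))
m$get("fit")                   # returns the cached model
m$exists("fit")                # TRUE

# A miss returns a sentinel value rather than throwing an error
is.key_missing(m$get("nope"))  # TRUE</code></pre>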
<p><a href="https://cran.r-project.org/package=eList">eList</a> v0.2.0: Provides list comprehension functions to convert for loops into vectorized <code>lapply()</code> calls that support loops with multiple variables, parallelization, and loops across non-standard objects. See the <a href="https://cran.r-project.org/web/packages/eList/vignettes/VectorComprehension.html">vignette</a> for examples.</p>
<p><a href="https://cran.r-project.org/package=Microsoft365R">Microsoft365R</a> v1.0.0: Builds on <code>AzureGraph</code> to implement an interface to <a href="https://www.microsoft.com/en-us/microsoft-365">Microsoft 365</a>, enabling access to data stored in SharePoint Online and OneDrive. See the <a href="https://cran.r-project.org/web/packages/Microsoft365R/vignettes/Microsoft365R.html">vignette</a> to get started.</p>
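<p>A sketch of the OneDrive workflow; authentication happens interactively in the browser, and the file paths below are placeholders:</p>
<pre class="r"><code>library(Microsoft365R)

od <- get_personal_onedrive()  # or get_business_onedrive() for work accounts

od$list_items()                     # top-level files and folders
od$upload_file("report.html")       # push a local file up
od$download_file("data/input.csv")  # pull a remote file down</code></pre>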
<p><a href="https://cran.r-project.org/package=rtables">rtables</a> v0.3.6: Provides a framework for declaring complex multi-level tabulations and then applying them to data. Tables are modeled as hierarchical, tree-like objects which support sibling sub-tables, arbitrary splitting or grouping of data in row and column dimensions, cells containing multiple values, and the concept of contextual summary computations. There is an <a href="https://cran.r-project.org/web/packages/rtables/vignettes/introduction.html">Introduction</a> and a series of vignettes on <a href="https://cran.r-project.org/web/packages/rtables/vignettes/baseline.html">comparing</a> against baseline or control, a clinical trials <a href="https://cran.r-project.org/web/packages/rtables/vignettes/clinical_trials.html">example</a>, constructing tables <a href="https://cran.r-project.org/web/packages/rtables/vignettes/manual_table_construction.html">manually</a>, <a href="https://cran.r-project.org/web/packages/rtables/vignettes/sorting_pruning.html">pruning and sorting</a>
tables, <a href="https://cran.r-project.org/web/packages/rtables/vignettes/subsetting_tables.html">subsetting</a> tables, <a href="https://cran.r-project.org/web/packages/rtables/vignettes/tabulation_concepts.html">Tabulation concepts</a>, and a <a href="https://cran.r-project.org/web/packages/rtables/vignettes/tabulation_dplyr.html">comparison with dplyr</a> tabulation.</p>
<p><img src="rtables.png" height = "250" width="450"></p>
<p><a href="https://CRAN.R-project.org/package=targets">targets</a> v0.1.0: Brings together function-oriented programming and <code>make</code>- like declarative workflows in toolkit for building statistics and data science pipelines in R. The methodology borrows from <a href="https://www.gnu.org/software/make/manual/make.html">GNU make</a> and <a href="https://joss.theoj.org/papers/10.21105/joss.00550">drake</a>. See the <a href="https://cran.r-project.org/web/packages/targets/vignettes/overview.html">vignette</a> and the <a href="https://docs.ropensci.org/targets/">reference website</a>.</p>
<h3 id="visualization">Visualization</h3>
<p><a href="https://cran.r-project.org/package=ggmulti">ggmulti</a> v0.1.0: Provides tools such as serial axes objects, Andrew’s plot, various scatter plot glyphs to visualize high dimensional data. There are vignettes on visualizing <a href="https://cran.r-project.org/web/packages/ggmulti/vignettes/highDim.html">high dimensional data</a>, <a href="https://cran.r-project.org/web/packages/ggmulti/vignettes/glyph.html">adding glyphs to scatter plots</a>, and creating <a href="https://cran.r-project.org/web/packages/ggmulti/vignettes/histogram-density-.html">histograms with density</a>.</p>
<p><img src="ggmulti.png" height = "500" width="500"></p>
<p><a href="https://cran.r-project.org/package=ggOceanMaps">ggOceanMaps</a> v1.0.9: Allows plotting data on bathymetric maps using <code>ggplot2</code> using data that contain geographic information from anywhere around the globe. There is a <a href="https://cran.r-project.org/web/packages/ggOceanMaps/vignettes/ggOceanMaps.html">User Manual</a> and a <a href="https://cran.r-project.org/web/packages/ggOceanMaps/vignettes/premade-shapefiles.html">vignette</a> on pre-made shape files.</p>
<p><img src="ggOceanMaps.png" height = "400" width="400"></p>
<p><a href="https://cran.r-project.org/package=pacviz">pacviz</a> v1.0.0.5: Provides functions to map data onto a radial coordinate system and visualize the residual values of linear regression and Cartesian data in the defined radial scheme. See the <a href="https://spencerriley.me/pacviz/book/">pacviz documentation</a> for more information.</p>
<p><img src="pacviz.png" height = "400" width="400"></p>
<p><a href="https://cran.r-project.org/package=parallelPlot">parallelPlot</a> v0.1.0: Provides functions to create parallel coordinates plots using the <code>htmlwidgets</code> package and <code>d3.js</code>. The <a href="https://cran.r-project.org/web/packages/parallelPlot/vignettes/introduction-to-parallelplot.html">vignette</a> provides multiple examples.</p>
<p><img src="parallelPlot.png" height = "300" width="500"></p>
<p><a href="https://cran.r-project.org/package=thematic">thematic</a> v0.1.1: Provides tools to “theme” <code>ggplot2</code>, <code>lattice</code>, and <code>base</code> graphics using a small set of choices that include foreground color, background color, accent color, and font family. See <a href="https://cran.r-project.org/web/packages/thematic/readme/README.html">README</a> for examples.</p>
<p><img src="theme.png" height = "200" width="400"></p>
R Interface for MiniZinc
https://rviews.rstudio.com/2021/02/15/r-interface-for-minizinc/
Mon, 15 Feb 2021 00:00:00 +0000https://rviews.rstudio.com/2021/02/15/r-interface-for-minizinc/
<script src="/2021/02/15/r-interface-for-minizinc/index_files/header-attrs/header-attrs.js"></script>
<p><em>Akshit Achara is a medical device engineer and computer science enthusiast based in Bengaluru, Karnataka, India. You can connect with Akshit on <a href="https://in.linkedin.com/in/akshit-achara-737589163">LinkedIn</a>.</em></p>
<div id="introduction" class="section level2">
<h2>Introduction</h2>
<p><a href="https://en.wikipedia.org/wiki/Constraint_programming">Constraint programming</a> is a paradigm for solving combinatorial problems that draws on a wide range of techniques from artificial intelligence, computer science, and operations research. <a href="https://www.minizinc.org/">MiniZinc</a> is a free and open-source constraint modeling language. <a href="https://en.wikipedia.org/wiki/Constraint_satisfaction">Constraint satisfaction</a> and <a href="https://en.wikipedia.org/wiki/Discrete_optimization">discrete optimization</a> problems can be formulated in a high-level modeling language. Models are compiled into an intermediate representation that is understood by a <a href="https://www.minizinc.org/software.html#flatzinc">wide range of solvers</a>. MiniZinc itself provides several solvers, for instance GeCode. The existing packages in R are not powerful enough to solve even mid-sized problems in combinatorial optimization.</p>
<p>Until recently, there were MiniZinc interfaces for Python (MiniZinc Python and pymzn) and for Java (JMiniZinc), but none for R.</p>
</div>
<div id="rminizinc-as-a-gsoc-project" class="section level2">
<h2>rminizinc as a GSOC project</h2>
<p><a href="https://cran.r-project.org/web/packages/rminizinc/index.html">rminizinc</a> started as a <a href="https://summerofcode.withgoogle.com/archive/2020/projects/6235019934171136/">Google Summer of Code Project</a> in 2020. Initially, the goal was to provide infrastructure/support for creating 15-20 commonly used MiniZinc problems by creating classes and functions for providing the basic syntax/constructs used in those MiniZinc models. However, it was decided that the libminizinc (MiniZinc C++ API) library can be leveraged using Rcpp for parsing various MiniZinc models and a mirror API can be created to construct the MiniZinc models. Using the library helped me in understanding MiniZinc more which in turn also helped to to provide more features and test the package on larger problems. The following objectives were achieved at the end of the GSOC period:</p>
<ul>
<li>Parse a MiniZinc model into R.</li>
<li>Find the model parameters which have not been assigned a value yet.</li>
<li>Set the values of unassigned parameters. (Scope needs to be extended)</li>
<li>Solve a model and get parsed solutions as a named list in R.</li>
<li>Create a MiniZinc model in R using the R6 classes from MiniZinc API mirror.</li>
<li>Manipulate a model.</li>
</ul>
<p>Development continued through the end of the GSOC period, but the package had not yet been submitted to CRAN.</p>
</div>
<div id="post-gsoc" class="section level2">
<h2>Post GSOC</h2>
<p>The package submission to CRAN was very challenging. <a href="https://github.com/MiniZinc/libminizinc">Libminizinc</a> and solver binaries were required in order to use the package, because the Rcpp functions use them to parse and solve models. To tackle this, we used <a href="https://opensource.com/article/19/7/introduction-gnu-autotools">autotools</a> to create a configure script that lets users supply custom paths and configure the package during installation, and used #ifdef macros to provide alternative definitions in case libminizinc and/or the solvers are not present on the system.</p>
</div>
<div id="examples" class="section level2">
<h2>Examples</h2>
<p>The package provides many features, but let’s start with something simple: solving a knapsack problem, which is familiar to anyone with an interest in constraint programming. The knapsack problem is a problem in combinatorial optimization: given a set of items, each with a weight and a value, determine the number of each item to include in a collection so that the total weight is less than or equal to a given limit and the total value is as large as possible. It derives its name from the problem faced by someone who is constrained by a fixed-size knapsack and must fill it with the most valuable items. The problem often arises in resource allocation, where decision makers must choose from a set of non-divisible projects or tasks under a fixed budget or time constraint.</p>
<p>The <code>knapsack()</code> function can be used to solve the knapsack problem directly. Here, <code>n</code> is the number of items, <code>capacity</code> is the total weight-carrying capacity, <code>profit</code> is the profit corresponding to each item, and <code>size</code> is the weight/size of each item. The goal is to maximize the total profit. The result is returned as a named list with all the solutions found, along with the optimal solution if one is found.</p>
<p>Please find the installation instructions in the <a href="https://cran.r-project.org/web/packages/rminizinc/vignettes/R_MiniZinc.html">vignette</a> or the <a href="https://github.com/acharaakshit/RMiniZinc">github readme</a>.</p>
<pre class="r"><code>library(rminizinc)
# knapsack problem
print(knapsack(n = 3, capacity = 9, profit = c(15,10,7), size = c(4,3,2)))</code></pre>
<pre><code>$SOLUTION0
$SOLUTION0$x
[1] 0 0 0
$SOLUTION1
$SOLUTION1$x
[1] 0 0 1
$SOLUTION2
$SOLUTION2$x
[1] 0 0 2
$SOLUTION3
$SOLUTION3$x
[1] 0 0 3
$SOLUTION4
$SOLUTION4$x
[1] 0 0 4
$SOLUTION5
$SOLUTION5$x
[1] 0 1 3
$OPTIMAL_SOLUTION
$OPTIMAL_SOLUTION$x
[1] 1 1 1
</code></pre>
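<p>As a sanity check, a brute-force enumeration in plain R (independent of <code>rminizinc</code>) confirms that <code>x = (1, 1, 1)</code> is optimal for this instance, with a total profit of 32:</p>

```r
# Brute-force check of the knapsack instance above, base R only.
profit <- c(15, 10, 7)
size <- c(4, 3, 2)
capacity <- 9

# Enumerate every combination of item counts that could possibly fit.
counts <- expand.grid(lapply(size, function(s) 0:(capacity %/% s)))
weight <- drop(as.matrix(counts) %*% size)
feasible <- counts[weight <= capacity, ]
value <- drop(as.matrix(feasible) %*% profit)

feasible[which.max(value), ]  # item counts 1 1 1
max(value)                    # 32
```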
<p>A function to solve the assignment problem has also been provided. More common examples will be added in future package releases based on user feedback. This will be especially useful for users without any knowledge of MiniZinc.</p>
<p>Basic knowledge of MiniZinc is required in order to understand the next examples. The <a href="https://www.minizinc.org/doc-2.5.3/en/part_2_tutorial.html">MiniZinc tutorial</a> is a good place to start.</p>
<p>Users can also create a MiniZinc model with the API. Let’s compute the base of a right-angled triangle given the height and the hypotenuse. The Pythagorean theorem states that, in a right-angled triangle, the square of the hypotenuse is equal to the sum of the squares of the other two sides, i.e. <span class="math inline">\(a^2 + b^2 = c^2\)</span>. The theorem gives us three formulas: <span class="math inline">\(c = \sqrt{a^2 + b^2}\)</span>, <span class="math inline">\(a = \sqrt{c^2 - b^2}\)</span>, and <span class="math inline">\(b = \sqrt{c^2 - a^2}\)</span>.</p>
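<p>These rearrangements are easy to verify numerically in plain R before building the model (the variable names mirror the MiniZinc model; the hypotenuse is named <code>cc</code> here only to avoid masking R’s <code>c()</code> function):</p>

```r
# Check the Pythagorean rearrangements with a = 4, hypotenuse = 5.
a <- 4
cc <- 5                 # hypotenuse
b <- sqrt(cc^2 - a^2)
b                       # 3
sqrt(a^2 + b^2)         # recovers the hypotenuse, 5
```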
<p>MiniZinc Representation of the model:</p>
<pre><code>int: a = 4;
int: c = 5;
var int: b;
constraint b>0;
constraint a^2 + b^2 = c^2;
solve satisfy;</code></pre>
<p>Let’s create and solve the model using rminizinc.</p>
<pre class="r"><code>a = IntDecl(name = "a", kind = "par", value = 4)
c = IntDecl(name = "c", kind = "par", value = 5)
b = IntDecl(name = "b", kind = "var")
# declaration items
a_item = VarDeclItem$new(decl = a)
b_item = VarDeclItem$new(decl = b)
c_item = VarDeclItem$new(decl = c)
# b > 0 is a binary operation
b_0 = BinOp$new(lhs = b$getId(), binop = ">", rhs = Int$new(0))
constraint1 = ConstraintItem$new(e = b_0)
# a ^ 2 is a binary operation
# a$getId() gives the variable identifier
a_2 = BinOp$new(lhs = a$getId(), binop = "^", rhs = Int$new(2))
b_2 = BinOp$new(lhs = b$getId(), binop = "^", rhs = Int$new(2))
a2_b2 = BinOp$new(lhs = a_2, binop = "+", rhs = b_2)
c_2 = BinOp$new(lhs = c$getId(), binop = "^", rhs = Int$new(2))
a2_b2_c2 = BinOp$new(lhs = a2_b2, binop = "=", rhs = c_2)
constraint2 = ConstraintItem$new(e = a2_b2_c2)
solve = SolveItem$new(solve_type = "satisfy")
model = Model$new(items = c(a_item, b_item, c_item, constraint1, constraint2, solve))
cat(model$mzn_string())</code></pre>
<pre><code>int: a;
var int: b;
int: c;
constraint (b > 0);
constraint (((a ^ 2) + (b ^ 2)) = (c ^ 2));
solve satisfy;
</code></pre>
<p>Creating the model required a lot of code, which is more cumbersome than writing the model in MiniZinc itself. However, the point was to give users a taste of the various classes that can be used to create items and expressions. This will be especially useful when modifying an existing model.</p>
<p>The items can directly be provided as strings.</p>
<pre class="r"><code>a = VarDeclItem$new(mzn_str = "int: a = 4;")
c = VarDeclItem$new(mzn_str = "int: c = 5;")
b = VarDeclItem$new(mzn_str = "var int: b;")
constraint1 = ConstraintItem$new(mzn_str = "constraint b > 0;")
constraint2 = ConstraintItem$new(mzn_str = "constraint a^2 + b^2 = c^2;")
solve = SolveItem$new(mzn_str = "solve satisfy;")
model = Model$new(items = c(a, b, c, constraint1, constraint2, solve))
cat(model$mzn_string())</code></pre>
<pre><code>int: a = 4;
var int: b;
int: c = 5;
constraint (b > 0);
constraint (((a ^ 2) + (b ^ 2)) = (c ^ 2));
solve satisfy;</code></pre>
<p>The model can also be parsed directly by providing the string representation as the argument to <code>mzn_parse()</code>, which uses libminizinc to parse the model. The included mzn files appear in the output because the parsed model is serialized back by libminizinc.</p>
<pre class="r"><code>pythagoras_string =
" int: a = 4;
int: c = 5;
var int: b;
constraint b > 0;
constraint a^2 + b^2 = c^2;
solve satisfy;
"
model = mzn_parse(model_string = pythagoras_string)
cat(model$mzn_string())</code></pre>
<pre><code>int: a = 4;
int: c = 5;
var int: b;
constraint (b > 0);
constraint (((a ^ 2) + (b ^ 2)) = (c ^ 2));
solve satisfy;
include "solver_redefinitions.mzn";
include "stdlib.mzn";</code></pre>
<p>Let’s solve the model now.</p>
<pre class="r"><code>solution = mzn_eval(r_model = model)
print(solution)</code></pre>
<pre><code>$SOLUTION_STRING
[1] "{\n \"b\" : 3\n}\n----------\n==========\n"
$SOLUTIONS
$SOLUTIONS$OPTIMAL_SOLUTION
$SOLUTIONS$OPTIMAL_SOLUTION$b
[1] 3</code></pre>
<p>In the next post, we will try another problem and/or use some other features of rminizinc. Please <a href="mailto:acharaakshit@gmail.com">let me know</a> what you think.</p>
</div>
Some thoughts on rstudio::global talks
https://rviews.rstudio.com/2021/02/04/some-thoughts-on-rstudio-global/
Thu, 04 Feb 2021 00:00:00 +0000https://rviews.rstudio.com/2021/02/04/some-thoughts-on-rstudio-global/
<p><img src="global.png" height = "400" width="100%"></p>
<p>The fifty-five <a href="https://rstudio.com/resources/rstudioglobal-2021/">videos</a> from last month’s rstudio::global conference are now available online. You will find them at the link above, arranged in ten categories (Keynotes, Data for Good, Language Interop, Learning, Modeling, Organizational Tooling, Package Dev, Programming, Teaching, and Visualization) that reflect fundamental areas of technical and community infrastructure and R applications. These talks were selected from hundreds of submissions, many of which were very good. I participated in the first selection round and found it impossible to make some choices, so I am certain that it must have been agonizingly difficult for the program committee to pare down to the final selections.</p>
<p>I believe that you will find the content of most of these talks to be nothing less than compelling. The themes and moods of the talks range from informative and deeply technical R issues to data science, journalism, art, education and public service. A few talks transcend the parochial concerns of the R community and address issues that are important to society at large. It is gratifying to see that in the hands of committed people R is helping to make the world just a little bit better. The videos themselves are high quality and a pleasure to watch. Unlike typical conference videos recorded in real time, all of these were produced with excellent lighting, good audio, and were rehearsed, pre-recorded, and edited. Except for the keynotes, the talks are shorter than twenty minutes.</p>
<p>In the remainder of this post, I will highlight just five talks that I personally found compelling. I have arranged them in an order that I think makes sense to view them. But, you might do just as well to sample talks from the categories listed above that organize the talks on the conference page. My selections do not cover the whole range of topics submitted, and they certainly do not include all of the good stuff. I do think, however, that they reflect the quality of the talks, and I hope that if you watch these five you will be motivated to watch the rest too.</p>
<p><img src="global2.png" height = "400" width="600"></p>
<p>My first three selections are by data journalists who are out to make the world a better place. <a href="https://rstudio.com/resources/rstudioglobal-2021/the-opioid-files-turning-big-pharmacy-data-over-to-the-public/">The Opioid Files: Turning big pharmacy data over to the public</a> by Washington Post data journalist <a href="https://www.washingtonpost.com/people/andrew-ba-tran/">Andrew Ba Tran</a> demonstrates the scale of the opioid scandal. I had the opportunity to meet Andrew at the <a href="https://www.ire.org/training/conferences/nicar-2021/">NICAR conference</a> for Investigative Reporters and Editors in 2019 where he was speaking and teaching R. NICAR opened my eyes to the discipline and tradition of <a href="https://gijn.org/2015/11/12/fifty-years-of-journalism-and-data-a-brief-history/#:~:text=As%20of%202015%2C%20and%20after,a%20driving%20force%20for%20stories.&text=The%20use%20of%20computers%20for,data%20analysis%20to%20societal%20issues.">Data Journalism</a> and the efforts of data journalists to harness technology for the public good. Andrew’s conference talk represents this tradition and illustrates the data crunching skills, persistence, and unvarnished storytelling necessary to illuminate a dark topic.</p>
<p>Next, I recommend watching <a href="https://rstudio.com/resources/rstudioglobal-2021/trial-and-error-in-data-viz-at-the-aclu/">Trial and Error in Data Viz at the ACLU</a>. <a href="http://sophiebeiers.com/about/">Sophie Beiers</a> is a data journalist whose work for the ACLU involves discovering and visualizing data with sufficient rigor and clarity to support arguments that will hold up in court. Sophie describes the messy work of iterating through visualizations in a process built around candid feedback from colleagues and stakeholders. Driving the process is a determination to make charts that effectively communicate key points to the intended audience. Sophie’s talk hints at the emotional toll caused by striving to see the people behind the data, and the satisfaction that comes from making a difference. The ACLU analytics team coined a word for expressing the excitement at being able to prove terrible news with quantitative evidence.</p>
<p><img src="terr.png" height = "350" width="600"></p>
<p>The third talk on my list, by
<a href="https://www.ft.com/stream/e191658e-c66a-45bc-9bad-343bdc4210b3">John Burn-Murdoch</a>, on
<a href="https://rstudio.com/resources/rstudioglobal-2021/reporting-on-and-visualising-the-pandemic/">Reporting on and visualizing the pandemic</a>, continues the theme of polishing visualizations until they work for the target audience. John is a data journalist with the Financial Times who has garnered quite a following for producing data visualizations that command attention. John’s talk dives deeply into his process of evolving a visualization until it not only illustrates what he wants to show but also resonates with his mass audience.</p>
<p>In thinking about Sophie and John’s work, the Japanese word <a href="https://kbjanderson.com/the-real-meaning-of-kaizen/">kaizen</a> (making something good for the good of other people) comes to mind. There are many R visualization experts who can lay down the basic principles of making a good data visualization, but few I think, that have Sophie and John’s empathy and capacity to listen and process criticism.</p>
<p>The final two talks on my list are by R developers who are concerned with the big picture of sustaining the R package ecosystem. Hadley Wickham’s keynote
<a href="https://rstudio.com/resources/rstudioglobal-2021/reporting-on-and-visualising-the-pandemic/">Maintaining the house the tidyverse built</a> addresses a fundamental challenge encountered by all complex software projects that develop over time: how do you maintain functional stability while coping with growth and change? Hadley’s solution for the tidyverse, encapsulated in the following figure,</p>
<p><img src="tidy.png" height = "400" width="600"></p>
<p>may not be the right solution for all subsystems within the R ecosystem, but surely something like it must evolve to reach into all of the corners of the R Universe. For example, consider the <a href="https://cran.r-project.org/web/views/">CRAN Task Views</a>. When they were assembled, each of these curated lists of R packages represented the cutting edge of software for a particular functional area. The Task View maintainers do a mostly unacknowledged, essential service in keeping these up to date. Nevertheless, it is not difficult to discover Task View packages that have not changed significantly in five or ten years or more. To my knowledge, there are no standards for retiring packages and integrating new work. The future of R depends on systems thinking and the development of new ideas and tools for open source development.</p>
<p>These considerations lead to Jeroen Ooms’ talk:
<a href="https://rstudio.com/resources/rstudioglobal-2021/monitoring-health-and-impact-of-open-source-projects/">Monitoring health and impact of open-source projects</a> which describes how <a href="https://ropensci.org/">ROpenSci</a> is taking on the immense challenge of measuring the quality of R packages according to technical, social and scientific indicators, while building out the infrastructure to improve the entire R package ecosystem. Some of the tools for monitoring the status and “health” of open source software are already in place. ROpenSci is offering <a href="https://r-universe.dev/organizations/">R-universe</a>, a platform based in git for managing personal R repositories. Once a “universe” of packages is registered with R-universe every time an author pushes an update the platform will automatically build binaries and documentation.</p>
<p>Enjoy the videos!!</p>
Dec 2020: "Top 40" New CRAN Packages
https://rviews.rstudio.com/2021/01/29/dec-2020-top-40-new-cran-packages/
Fri, 29 Jan 2021 00:00:00 +0000https://rviews.rstudio.com/2021/01/29/dec-2020-top-40-new-cran-packages/
<p>One hundred twenty-three new packages made it to CRAN in December. Here are my “Top 40” selections in nine categories: Computational Methods, Data, Genomics, Machine Learning, Medicine, Science, Statistics, Utilities, and Visualization.</p>
<h3 id="computational-methods">Computational Methods</h3>
<p><a href="https://cran.r-project.org/package=FKF.SP">FKF.SP</a> v0.1.0: Provides a fast and flexible Kalman filtering implementation utilizing sequential processing, designed for efficient parameter estimation through maximum likelihood estimation. See the <a href="https://cran.r-project.org/web/packages/FKF.SP/vignettes/FKFSP.html">vignette</a>.</p>
<p><a href="https://cran.r-project.org/package=rminizinc">rminizinc</a> v0.0.4: Implements an interface to <a href="https://www.minizinc.org/">MiniZinc</a>, a free and open-source constraint modeling language which is used to identify feasible solutions out of a very large set of candidates when the problem can be modeled in terms of arbitrary constraints. See the <a href="https://cran.r-project.org/web/packages/rminizinc/vignettes/R_MiniZinc.html">vignette</a>.</p>
<p><a href="https://cran.r-project.org/package=noisySBM">nosiySBM</a> v0.1.4: Implements the variational expectation-maximization algorithm to fit a noisy stochastic block model to an observed dense graph and to perform node clustering. See <a href="https://arxiv.org/abs/1907.10176">Rebafka & Villers (2020)</a> for background and the <a href="https://cran.r-project.org/web/packages/noisySBM/vignettes/UserGuide.html">vignette</a> to get started.</p>
<p><img src="noisySBM.png" height = "300" width="300"></p>
<p><a href="https://cran.r-project.org/package=qsimulatR">qsimulatR</a> v1.0: Implements a quantum computer simulator with up to 24 qubits which provides many common gates and allows users to define general single qubit gates and general controlled single qubit gates. The package supports plotting circuits and exporting circuits to <a href="https://qiskit.org/"><code>Qiskit</code></a>, a Python package which can be used to run on <a href="https://quantum-computing.ibm.com/">IBM’s Quantum hardware</a>. There is an <a href="https://cran.r-project.org/web/packages/qsimulatR/vignettes/qsimulatR.html">Introduction</a>, and vignettes on <a href="https://cran.r-project.org/web/packages/qsimulatR/vignettes/ExponentiateModN.pdf">Exponentiation modulo n</a>, <a href="https://cran.r-project.org/web/packages/qsimulatR/vignettes/addbyqft.pdf">Addition by Fourier transform</a>, the <a href="https://cran.r-project.org/web/packages/qsimulatR/vignettes/deutsch-jozsa.pdf">Deutsch-Sozsa Algorithm</a>, the <a href="https://cran.r-project.org/web/packages/qsimulatR/vignettes/phase_estimation.pdf">Phase Estimation Algorithm</a> and <a href="https://cran.r-project.org/web/packages/qsimulatR/vignettes/qft.pdf">Quantum Fourier Trafo</a>.</p>
<p><img src="quantum.png" height = "250" width="450"></p>
<h3 id="data">Data</h3>
<p><a href="https://cran.r-project.org/package=eyedata">eyedata</a> v0.1.0: Contains anonymized real life, open source data sets from patients treated in <a href="https://en.wikipedia.org/wiki/Moorfields_Eye_Hospital">Moorfields Eye Hospital</a>, London and includes data about people who received intravitreal injections with anti-vascular endothelial growth factor due to age-related macular degeneration or diabetic macular edema. See <a href="https://cran.r-project.org/web/packages/eyedata/readme/README.html#ref-fu2">README</a> for the list of medical publications associated with the data sets.</p>
<p><a href="https://cran.r-project.org/package=rgugik">rgugik</a> v0.2.1: Automates open data acquisition including raster and vector data from the <a href="www.gugik.gov.pl">Polish Head Office of Geodesy and Cartography</a>. See the vignettes <a href="https://cran.r-project.org/web/packages/rgugik/vignettes/DEM.html">Digital Eelvation Model</a>, <a href="https://cran.r-project.org/web/packages/rgugik/vignettes/orthophotomap.html">Orthophotomap</a>, and <a href="https://cran.r-project.org/web/packages/rgugik/vignettes/topodb.html">Topographic Database</a>.</p>
<p><img src="rgugik.png" height = "300" width="300"></p>
<p><a href="https://cran.r-project.org/package=readrba">readrba</a> v0.1.0: Provides tools to download current and historical <a href="https://www.rba.gov.au/statistics/tables/">statistical tables</a> and <a href="https://www.rba.gov.au/publications/smp/forecasts-archive.html">forecasts</a> from the Reserve Bank of Australia Data which comprise a broad range of Australian macroeconomic and financial time series. See the <a href="https://cran.r-project.org/web/packages/readrba/vignettes/readrba.html">vignette</a> to get started.</p>
<p><img src="readrba.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=threesixtygiving">threesixtygiving</a> v0.2.2: Provides access to open data from <a href="https://www.threesixtygiving.org">360Giving</a>, a database of charitable grant giving in the UK. See the <a href="https://cran.r-project.org/web/packages/threesixtygiving/threesixtygiving.pdf">vignette</a>.</p>
<p><a href="https://cran.r-project.org/package=USgas">USgas</a> Links to the <a href="https://www.eia.gov/">US Energy Information Administration</a> to provide and overview of natural gas demand at the county level. See the <a href="https://cran.r-project.org/web/packages/USgas/vignettes/introduction.html">vignette</a>.</p>
<p><img src="USgas.png" height = "300" width="500"></p>
<h3 id="genomics">Genomics</h3>
<p><a href="https://cran.r-project.org/package=polyqtlR">polyqtlR</a> v0.0.4: Provides functions for quantitative trait loci (QTL) analysis in polyploid bi-parental F1 populations. See the <a href="https://cran.r-project.org/web/packages/polyqtlR/vignettes/polyqtlR_vignette.html">vignette</a> for background and examples.</p>
<p><img src="polyqtlR.png" height = "300" width="500"></p>
<p><a href="https://cran.r-project.org/package=RPPASPACE">RPPASPACE</a> v1.0.7: Provides tools for the analysis of reverse-phase protein arrays (RPPAs), which are also known as <em>tissue lysate arrays</em> or simply <em>lysate arrays</em>. See <a href="https://academic.oup.com/bioinformatics/article/23/15/1986/205819">Hu (2007)</a> for background and the <a href="https://cran.r-project.org/web/packages/RPPASPACE/vignettes/Guide_to_RPPASPACE.pdf">Guide</a> to for examples.</p>
<p><a href="https://cran.r-project.org/package=RVA">RVA</a> v0.0.3: Provides functions to automate downstream visualization & pathway analysis in RNAseq analysis. See the <a href="https://cran.r-project.org/web/packages/RVA/vignettes/RVA.html">vignette</a>.</p>
<p><img src="RVA.png" height = "300" width="500"></p>
<h3 id="machine-learning">Machine Learning</h3>
<p><a href="https://cran.r-project.org/package=comparator">comparator</a> v0.0.1: Implements functions for comparing strings, sequences and numeric vectors for clustering and record linkage applications. It includes generalized edit distances for comparing sequences/strings, Monge-Elkan similarity for fuzzy comparison of token sets, and L-p distances for comparing numeric vectors. See <a href="https://cran.r-project.org/web/packages/comparator/readme/README.html">README</a> to get started.</p>
<p><a href="https://cran.r-project.org/package=DoubleML">DoubleML</a> v0.1.1: Implements the double/debiased machine learning framework of <a href="https://academic.oup.com/ectj/article/21/1/C1/5056401">Chernozhukov et al. (2018)</a> for partially linear regression models, partially linear instrumental variable regression models, interactive regression models and interactive instrumental variable regression models. There are guides on <a href="https://cran.r-project.org/web/packages/DoubleML/vignettes/install.html">Installation</a> and <a href="https://cran.r-project.org/web/packages/DoubleML/vignettes/DoubleML.html">Getting Started</a>.</p>
<p><a href="https://cran.r-project.org/package=functClust">functClust</a> v0.1.6: Provides functions to cluster the components that make up an interactive system on the basis of their functional redundancy for one or more collective, systemic performances. There are six vignettes including and <a href="https://cran.r-project.org/web/packages/functClust/vignettes/a.Overview.html">Overview</a>, a simple <a href="https://cran.r-project.org/web/packages/functClust/vignettes/b.Simplest_use.html">Use Case</a>, and <a href="https://cran.r-project.org/web/packages/functClust/vignettes/e.Multi_functionality.html">Multi Fuctionality</a>.</p>
<p><img src="functClust.png" height = "300" width="500"></p>
<p><a href="https://cran.r-project.org/package=mlpack">mlpack</a> v3.4.2.1: Implements bindings to the mlpack C++ machine learning library. See <a href="https://joss.theoj.org/papers/10.21105/joss.00726">Curtin et al (2018)</a> for background and look <a href="https://www.mlpack.org/doc/mlpack-3.4.2/r_documentation.html">here</a> for documentation.</p>
<p><a href="https://cran.r-project.org/package=RFCCA">RFCCA</a> v1.0.3: Implements Random Forest with Canonical Correlation Analysis, a method for estimating the canonical correlations between two sets of variables depending on the subject-related covariates. The method is described in <a href="https://arxiv.org/abs/2011.11555">Alakus et al. (2020)</a>. See the <a href="https://cran.r-project.org/web/packages/RFCCA/vignettes/RFCCA.html">vignette</a> for examples.</p>
<h3 id="medicine">Medicine</h3>
<p><a href="https://cran.r-project.org/package=babsim.hospital">babsim.hospital</a> v11.5.14: Implements a discrete-event simulation model for a hospital resource planning. Motivated by the challenges faced by health care institutions in the current COVID-19 pandemic, it can be used by health departments to forecast demand for intensive care beds, ventilators, and staff resources. See <a href="https://www.jstatsoft.org/article/view/v090i02">Ucar, Smeets & Azcorra (2019)</a>, <a href="https://www.rcpjournals.org/content/futurehosp/6/1/17">Lawton & McCooe (2019)</a> and the <a href="https://www.th-koeln.de/informatik-und-ingenieurwissenschaften/babsimhospital_78996.php">website</a> for background, and the <a href="https://cran.r-project.org/web/packages/babsim.hospital/vignettes/babsim-vignette-introduction.pdf">vignette</a> to get started.</p>
<p><img src="sim.png" height = "300" width="500"></p>
<p><a href="https://cran.r-project.org/package=healthyR">healthyR</a> v0.1.1: Implements hospital data analysis workflow tools, including modeling tools and tools to review common administrative hospital data such as average length of stay, readmission rates, average net pay amounts by service lines, and more. See the <a href="https://cran.r-project.org/web/packages/healthyR/vignettes/getting-started.html">vignette</a>.</p>
<p><img src="healthyR.png" height = "300" width="500"></p>
<p><a href="https://cran.r-project.org/package=metaSurvival">metaSurvival</a> v0.1.0: Provides a function to assess information from a summary survival curve and test the between-strata heterogeneity. See the <a href="https://github.com/shubhrampandey/metaSurvival">GitHub repo</a> for an example.</p>
<p><img src="metaSurvival.png" height = "300" width="500"></p>
<h3 id="science">Science</h3>
<p><a href="https://cran.r-project.org/package=cmcR">cmcR</a> v0.1.3: Implements the congruent matching cells method for cartridge case identification as proposed by <a href="https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=911193">Song (2013)</a> as well as an extension of the method proposed by <a href="https://nvlpubs.nist.gov/nistpubs/jres/120/jres.120.008.pdf">Tong et al. (2015)</a>. There is a vignette on <a href="https://cran.r-project.org/web/packages/cmcR/vignettes/decisionRuleDescription.html">Decision Rules</a> and <a href="https://cran.r-project.org/web/packages/cmcR/vignettes/cmcR_plotReproduction.html">another vignette</a> reproducing the study by Song et al.</p>
<p><img src="cmcR.png" height = "300" width="500"></p>
<p><a href="https://cran.r-project.org/package=envi">envi</a> v0.1.6: Provides tools for environmental interpolation using occurrence data, covariates, kernel density-based estimation, and spatial relative risk. See <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.7577">Davies et al. (2018)</a> for details on spatial relative risk, <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.4780090616">Bithell (1990)</a> for kernel density estimation and <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.4780101112">Bithell (1991)</a> for estimating relative risk. The <a href="https://cran.r-project.org/web/packages/envi/vignettes/vignette.html">vignette</a> provides background and examples.</p>
<p><img src="envi.png" height = "300" width="500"></p>
<p><a href="https://cran.r-project.org/package=PAMpal">PAMpal</a> v0.9.14: Provides tools for loading and processing passive acoustic data, including functions to read <a href="https://www.pamguard.org/">Pamguard</a> data, process, and export data. See <a href="https://asa.scitation.org/doi/10.1121/1.2743157">Oswald et al (2007)</a>, <a href="https://asa.scitation.org/doi/10.1121/10.0001229">Griffiths et al (2020)</a>, and <a href="https://asa.scitation.org/doi/full/10.1121/1.3479549">Baumann-Pickering et al (2010)</a> for background. Look <a href="https://taikisan21.github.io/PAMpal/">here</a> for the installation guide and tutorial.</p>
<p><img src="PAMpal.png" height = "300" width="500"></p>
<h3 id="statistics">Statistics</h3>
<p><a href="https://cran.r-project.org/package=bpcs">bpcs</a> v1.0.0: Implements models for the analysis of paired comparison data using <code>Stan</code>, including Bayesian versions of the Bradley-Terry model with random effects, generalized predictors, and order effects. See <a href="https://www.jstor.org/stable/2334029?origin=crossref&seq=1">Bradley & Terry (1952)</a>, <a href="https://www.tandfonline.com/doi/abs/10.1080/01621459.1970.10481082">Davidson (1970)</a>, and <a href="https://www.jstatsoft.org/article/view/v076i01">Carpenter et al. (2017)</a> for background and the <a href="https://cran.r-project.org/web/packages/bpcs/vignettes/a_get_started.html">vignette</a> for an overview.</p>
<p><a href="https://cran.r-project.org/package=brolgar">brolgar</a> v0.1.0: Provides a framework of tools to summarise, visualise, and explore longitudinal data, including methods for calculating features and summary statistics and for sampling individual series. See <a href="https://arxiv.org/abs/2012.01619">Tierney, Cook & Prvan</a> and the <a href="https://cran.r-project.org/web/packages/brolgar/vignettes/getting-started.html">Getting Started Guide</a> to get going. There are also vignettes on <a href="https://cran.r-project.org/web/packages/brolgar/vignettes/exploratory-modelling.html">exploratory modelling</a>, finding <a href="https://cran.r-project.org/web/packages/brolgar/vignettes/finding-features.html">features</a>, identifying <a href="https://cran.r-project.org/web/packages/brolgar/vignettes/id-interesting-obs.html">interesting observations</a>, <a href="https://cran.r-project.org/web/packages/brolgar/vignettes/longitudinal-data-structures.html">data structures</a>, <a href="https://cran.r-project.org/web/packages/brolgar/vignettes/mixed-effects-models.html">mixed effects models</a>, and <a href="https://cran.r-project.org/web/packages/brolgar/vignettes/visualisation-gallery.html">visualisation</a>.</p>
<p><img src="brolgar.png" height = "300" width="300"></p>
<p><a href="https://cran.r-project.org/package=MASSExtra">MASSExtra</a> v1.0.2: Provides enhancements, extensions and additions (such as Gram-Schmidt orthogonalisation and generalised eigenvalue problems) to the <code>MASS</code> package with convenient default settings and user interfaces. See the <a href="https://cran.r-project.org/web/packages/MASSExtra/vignettes/rationale.pdf">vignette</a>.</p>
<p><img src="MASSExtra.png" height = "300" width="500"></p>
<p><a href="https://cran.r-project.org/package=motifr">motifr</a> v1.0.0: Provides tools for motif analysis in multi-level networks: visualize the networks, count multi-level network motifs, and compare motif occurrences to baseline models. See the <a href="https://cran.r-project.org/web/packages/motifr/vignettes/motif_zoo.html">motif zoo</a> and <a href="https://cran.r-project.org/web/packages/motifr/vignettes/random_baselines.html">Baseline model comparisons</a> to get started.</p>
<p><img src="motifr.svg" height = "300" width="500"></p>
<p><a href="https://cran.r-project.org/package=OptCirClust">OptCirClust</a> v0.0.3: Provides fast (runtime O(K N log<sup>2</sup> N)), optimal, and reproducible clustering algorithms for circular, periodic, or framed data based on a core algorithm for optimal framed clustering. There are vignettes on <a href="https://cran.r-project.org/web/packages/OptCirClust/vignettes/CircularGenomes.html">Circular genome clustering</a>, <a href="https://cran.r-project.org/web/packages/OptCirClust/vignettes/Performance.html">Performance</a>, <a href="https://cran.r-project.org/web/packages/OptCirClust/vignettes/Tutorial_CirClust.html">Circular Clustering</a>, and <a href="https://cran.r-project.org/web/packages/OptCirClust/vignettes/Tutorial_FramedClust.html">Framed Clustering</a>.</p>
<p><img src="OptCirClust.png" height = "300" width="300"></p>
<p><a href="https://cran.r-project.org/package=pflamelet">pflamelet</a> v0.1.1: Provides functions to compute the persistence flamelets, a statistical tool for exploring the Topological Invariants of Scale-Space families introduced in <a href="https://arxiv.org/abs/1709.07097">Padellini and Brutti (2017)</a>.</p>
<p><a href="https://cran.r-project.org/package=PRDA">PRDA</a> v1.0.0: Implements the <em>Design Analysis</em> proposed by <a href="https://journals.sagepub.com/doi/10.1177/1745691614551642">Gelman & Carlin (2014)</a>, which combines the evaluation of power analysis with other inferential risks. See also <a href="https://www.frontiersin.org/articles/10.3389/fpsyg.2019.02893/full">Altoè et al. (2020)</a> and <a href="https://psyarxiv.com/q9f86/">Bertoldo et al. (2020)</a> for background and the vignettes <a href="https://cran.r-project.org/web/packages/PRDA/vignettes/PRDA.html">PRDA</a>, <a href="https://cran.r-project.org/web/packages/PRDA/vignettes/prospective.html">Prospective</a> and <a href="https://cran.r-project.org/web/packages/PRDA/vignettes/retrospective.html">Retrospective</a>.</p>
<p><img src="PRDA.png" height = "300" width="300"></p>
<p><a href="https://cran.r-project.org/package=puls">puls</a> v0.1.1: Supplements the <code>fda</code> and <code>fda.usc</code> packages by providing a method for clustering functional data using subregion information of the curves. See the <a href="https://cran.r-project.org/package=puls">vignette</a> for an example and references.</p>
<h3 id="utilities">Utilities</h3>
<p><a href="https://cran.r-project.org/package=coro">coro</a> v1.0.1: Provides <em>coroutines</em>, a family of functions that can be suspended and resumed later on. This includes async functions (which await) and generators (which yield). See the <a href="https://cran.r-project.org/web/packages/coro/vignettes/generator.html">vignette</a>.</p>
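<p>To make the suspend/resume idea concrete, here is a minimal sketch using coro's documented <code>generator()</code>/<code>yield()</code> API (assuming the package is installed; the generator below is a made-up example, not from the package docs):</p>

```r
library(coro)

# generator() builds a factory; each call to an instantiated generator
# resumes the function body right after the previous yield().
squares <- generator(function(n) {
  for (i in seq_len(n)) {
    yield(i^2)
  }
})

g <- squares(3)
g()  # 1
g()  # 4
g()  # 9
```

<p>Once the loop finishes, further calls return coro's exhaustion sentinel, which <code>is_exhausted()</code> can detect.</p>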
<p><a href="https://cran.r-project.org/package=dataReporter">dataReporter</a> v1.0.0: Provides functions to auto generate a customizable data report showing potential errors in a data set. See <a href="https://www.jstatsoft.org/article/view/v090i06">Petersen & Ekstrøm (2019)</a> for background, and the <a href="https://cran.r-project.org/web/packages/dataReporter/vignettes/extending_dataReporter.html">vignette</a> for examples.</p>
<p><a href="https://cran.r-project.org/package=DescrTab2">DescrTab2</a> v2.0.3: Provides functions to create descriptive statistics tables for continuous and categorical variables. There are vignettes on <a href="https://cran.r-project.org/web/packages/DescrTab2/vignettes/maintenance_guide.html">Maintenance</a>, <a href="https://cran.r-project.org/web/packages/DescrTab2/vignettes/usage_guide.html">Usage</a>, and <a href="https://cran.r-project.org/web/packages/DescrTab2/vignettes/validation_report.html">Validation</a>.</p>
<p><a href="https://cran.r-project.org/package=libr">libr</a> v1.1.1: Provides functions to create data libraries, generate data dictionaries, and simulate a data step. There is an <a href="https://cran.r-project.org/web/packages/libr/vignettes/libr.html">Introduction</a>, and vignettes on library <a href="https://cran.r-project.org/web/packages/libr/vignettes/libr-basics.html">operations</a> and <a href="https://cran.r-project.org/web/packages/libr/vignettes/libr-management.html">management</a>, and <a href="https://cran.r-project.org/web/packages/libr/vignettes/libr-datastep.html">Data Step</a> operations and the <a href="https://cran.r-project.org/web/packages/libr/vignettes/libr-management.html">enhanced equality</a> operator.</p>
<p><a href="https://cran.r-project.org/package=outsider">outsider</a> v0.1.1: Allows users to install and run external command-line programs in R through use of <a href="https://www.docker.com/">Docker</a> and online repositories. Look <a href="https://docs.ropensci.org/outsider/">here</a> for package information.</p>
<p><img src="outsider.png" height = "300" width="500"></p>
<p><a href="https://cran.r-project.org/package=srcr">srcr</a> v1.0.0: Provides a simple tool to abstract connection details, including secret credentials, out of your source code and manage configurations for frequently-used database connections. See the <a href="https://cran.r-project.org/web/packages/srcr/vignettes/Managing_data_sources.html">vignette</a>.</p>
<h3 id="visualization">Visualization</h3>
<p><a href="https://cran.r-project.org/package=ComplexUpset">ComplexUpset</a> v1.0.3: Provides functions to create UpSet plots, which offer improvements over Venn diagrams for visualizing set overlaps.</p>
<p><img src="upset.png" height = "300" width="500"></p>
<p><a href="https://cran.r-project.org/package=nmaplateplot">nmaplateplot</a> v1.0.0: Provides a graphical display of results from network meta-analysis (NMA) which is suitable for outcomes like odds ratios, risk ratios, risk differences, and standardized mean differences. See the <a href="https://cran.r-project.org/web/packages/nmaplateplot/vignettes/nmaplateplot-intro.html">vignette</a> for examples.</p>
<p><img src="nmaplateplot.svg" height = "300" width="500"></p>
<p><a href="https://cran.r-project.org/package=PantaRhei">PantaRhei</a> v0.1.2: Provides functions to produce <a href="https://en.wikipedia.org/wiki/Sankey_diagram">Sankey diagrams</a> which are used to visualize the flow of conservative substances through a system. See the <a href="https://cran.r-project.org/web/packages/PantaRhei/vignettes/panta-rhei.html">vignette</a>.</p>
<p><img src="PantaRhei.png" height = "300" width="500"></p>
SEM Time Series Modeling
https://rviews.rstudio.com/2021/01/22/sem-time-series-modeling/
Fri, 22 Jan 2021 00:00:00 +0000https://rviews.rstudio.com/2021/01/22/sem-time-series-modeling/
<script src="/2021/01/22/sem-time-series-modeling/index_files/header-attrs/header-attrs.js"></script>
<!--
tested on Win10: R-4.0.3, rstudio-1.3.911, bimets-1.5.2
tested on Redhat7: R-3.5.3, rstudio-1.1.463, bimets-1.5.2
-->
<p><em>Andrea Luciani is a Technical Advisor for the Directorate General for Economics, Statistics and Research at the Bank of Italy, and co-author of the bimets package.</em></p>
<p>Structural Equation Models <a href="https://en.wikipedia.org/wiki/Structural_equation_modeling">(SEM)</a>, which are common in many economic modeling efforts, require fitting and simulating whole system of equations where each equation may depend on the results of other equations. Moreover, they often require combining time series and regression equations in ways that are well beyond what the <code>ts()</code> and <code>lm()</code> functions were designed to do. For example, one might want to account for an error auto-correlation of some degree in the regression, or force linear restrictions modeling coefficients.</p>
<p>In this post, we will show how to do structural equation modeling in R by working through the <a href="http://www.ipe.ro/rjef/rjef1_14/rjef1_2014p5-14.pdf">Klein Model</a> of the United States economy, one of the oldest and most elementary models of its kind.</p>
<p>These equations define the model:</p>
<p><span class="math inline">\(CN_t = \alpha_1 + \alpha_2 * P_t + \alpha_3 * P_{t-1} + \alpha_4 * ( WP_t + WG_t )\)</span></p>
<p><span class="math inline">\(I_t = \beta_1 + \beta_2 * P_t + \beta_3 * P_{t-1} - \beta_4 * K_{t-1}\)</span></p>
<p><span class="math inline">\(WP_t = \gamma_1 + \gamma_2 * ( Y_t + T_t - WG_t ) + \gamma_3 * ( Y_{t-1} + T_{t-1} - WG_{t-1} ) + \gamma_4 * Time\)</span></p>
<p><span class="math inline">\(P_t = Y_t - ( WP_t + WG_t )\)</span></p>
<p><span class="math inline">\(K_t = K_{t-1} + I_t\)</span></p>
<p><span class="math inline">\(Y_t = CN_t + I_t + G_t - T_t\)</span></p>
<p>Given:</p>
<p><span class="math inline">\(CN\)</span> as private consumption expenditure;<br />
<span class="math inline">\(I\)</span> as investment;<br />
<span class="math inline">\(WP\)</span> as wage bill of the private sector (demand for labor);<br />
<span class="math inline">\(P\)</span> as profits;<br />
<span class="math inline">\(K\)</span> as stock of capital goods;<br />
<span class="math inline">\(Y\)</span> as gross national product;<br />
<span class="math inline">\(WG\)</span> as wage bill of the government sector;<br />
<span class="math inline">\(Time\)</span> as an index of the passage of time, e.g. 1931 = zero;<br />
<span class="math inline">\(G\)</span> as government expenditure plus net exports;<br />
<span class="math inline">\(T\)</span> as business taxes.</p>
<p><span class="math inline">\(\alpha_i, \beta_j, \gamma_k\)</span> are the coefficients to be estimated.</p>
<p>This system has only 6 equations, three of which must be fitted in order to estimate the coefficients. Solving the system may not seem difficult, but the real complexity emerges in the incidence graph shown in the following figure, in which endogenous variables are plotted in blue and exogenous variables in pink.</p>
<p><img src="/2021/01/22/sem-time-series-modeling/index_files/figure-html/incidence_graph-1.png" width="672" /></p>
<p>Each edge indicates a simultaneous dependence of one variable on another, e.g. the <code>WP</code> equation depends on the current value of the <code>TIME</code> time series; complexity arises because this model contains several circular dependencies, one of which is plotted in dark blue.</p>
<p>A circular dependency in the incidence graph of a model implies that the model is a “simultaneous” equations model: it must be estimated using ad-hoc procedures, and it can be simulated, i.e. used to produce a forecast, only with an iterative algorithm.</p>
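<p>To see why iteration is needed, consider a toy two-equation system in plain R (a hypothetical sketch with made-up numbers, not bimets code and not the Klein model): consumption depends on income while income simultaneously depends on consumption, so we sweep through the equations repeatedly, Gauss-Seidel style, until the values stop changing.</p>

```r
# Toy simultaneous system:
#   cn = 10 + 0.5 * y   (consumption depends on income)
#   y  = cn + i + g     (income depends on consumption)
# Neither equation can be evaluated on its own, so we iterate.
i <- 5; g <- 5
y <- 0                          # initial guess for the endogenous variable
for (iter in 1:100) {
  cn    <- 10 + 0.5 * y         # evaluate each equation in turn...
  y_new <- cn + i + g           # ...feeding updated values forward
  if (abs(y_new - y) < 1e-8) break
  y <- y_new
}
c(cn = cn, y = y_new)           # converges to cn = 30, y = 40
```

<p>The iteration converges here because the feedback coefficient is below one; this is the kind of loop that bimets automates for full models, with explicit convergence and iteration-limit controls.</p>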
<p>If we search for “simultaneous equations” inside the <a href="https://cran.r-project.org/web/views/Econometrics.html">Econometrics Task View</a> web page we can find two results: the <a href="https://cran.r-project.org/web/packages/systemfit/index.html">systemfit</a> and the <a href="https://cran.r-project.org/web/packages/bimets/index.html">bimets</a> packages.</p>
<p>The <a href="https://cran.r-project.org/web/packages/systemfit/index.html">systemfit</a> package is a powerful tool for econometric estimation of simultaneous systems of linear and nonlinear equations, but it only provides fitting procedures, so it cannot be used to produce the forecast in our example.</p>
<p>On the other hand, the <a href="https://cran.r-project.org/web/packages/bimets/index.html">bimets</a> package implements, among others, simulation and forecasting procedures; as stated in the <a href="https://cran.r-project.org/web/packages/bimets/vignettes/bimets.pdf">vignette</a>, it allows users to write down the model in a natural way, test several strategies, and focus on the econometric analysis without excessive coding.</p>
<p>Time series projection, linear restrictions and error auto-correlation can be triggered directly in the model definition, so let us try to define a similar but more complex Klein model by using a <a href="https://cran.r-project.org/web/packages/bimets/index.html">bimets</a> compliant syntax:</p>
<pre class="r"><code>#load library
library(bimets)
#define the Klein model
kleinModelDef <- "
MODEL
COMMENT> Modified Klein Model 1 of the U.S. Economy with PDL,
COMMENT> autocorrelation on errors, restrictions and conditional equation evaluations
COMMENT> Consumption with autocorrelation on errors
BEHAVIORAL> cn
TSRANGE 1923 1 1940 1
EQ> cn = a1 + a2*p + a3*TSLAG(p,1) + a4*(wp+wg)
COEFF> a1 a2 a3 a4
ERROR> AUTO(2)
COMMENT> Investment with restrictions
BEHAVIORAL> i
TSRANGE 1923 1 1940 1
EQ> i = b1 + b2*p + b3*TSLAG(p,1) + b4*TSLAG(k,1)
COEFF> b1 b2 b3 b4
RESTRICT> b2 + b3 = 1
COMMENT> Demand for Labor with PDL
BEHAVIORAL> wp
TSRANGE 1923 1 1940 1
EQ> wp = c1 + c2*(y+t-wg) + c3*TSLAG(y+t-wg,1) + c4*time
COEFF> c1 c2 c3 c4
PDL> c3 1 2
COMMENT> Gross National Product
IDENTITY> y
EQ> y = cn + i + g - t
COMMENT> Profits
IDENTITY> p
EQ> p = y - (wp+wg)
COMMENT> Capital Stock with IF switches
IDENTITY> k
EQ> k = TSLAG(k,1) + i
IF> i > 0
IDENTITY> k
EQ> k = TSLAG(k,1)
IF> i <= 0
END
"
#load the model
kleinModel <- LOAD_MODEL(modelText = kleinModelDef)</code></pre>
<pre><code>## Analyzing behaviorals...
## Analyzing identities...
## Optimizing...
## Loaded model "kleinModelDef":
## 3 behaviorals
## 3 identities
## 12 coefficients
## ...LOAD MODEL OK</code></pre>
<p>The code is quite intuitive and uses explicit keywords in order to define equations, coefficients, parameters, etc. Users can easily:</p>
<ul>
<li><p>change the <code>TSRANGE</code> in order to fit the model in a custom time range per equation;</p></li>
<li><p>modify an equation <code>EQ</code> without changing any user procedure or code;</p></li>
<li><p>add or remove one or more linear restriction on the coefficients by using the keyword <code>RESTRICT</code>, e.g.<br />
<code>RESTRICT> -1.23*b2 + 8.9*b3 = 0.34</code><br />
<code>RESTRICT> b4 - 1.2*b1 = 5</code></p></li>
<li><p>add or remove an error auto-correlation structure with an arbitrary order by using the keyword:<br />
<code>ERROR></code></p></li>
</ul>
<p>Equations can contain advanced expressions, e.g.:</p>
<p><code>EQ> TSDELTA(i) = b1 + b2*EXP(p/1000) + b3*TSDELTALOG(TSLAG(p,1)) + b4*MOVAVG(TSLAG(k,1),5)</code></p>
<div id="model-estimation" class="section level3">
<h3>Model estimation</h3>
<p>Now, we define time series to be used in our example, and then we perform an estimation of the whole <code>kleinModel</code> by using the command <code>ESTIMATE()</code>:</p>
<pre class="r"><code>#define data
kleinModelData <- list(
cn =TIMESERIES(39.8,41.9,45,49.2,50.6,52.6,55.1,56.2,57.3,57.8,
55,50.9,45.6,46.5,48.7,51.3,57.7,58.7,57.5,61.6,65,69.7,
START=c(1920,1),FREQ=1),
g =TIMESERIES(4.6,6.6,6.1,5.7,6.6,6.5,6.6,7.6,7.9,8.1,9.4,10.7,
10.2,9.3,10,10.5,10.3,11,13,14.4,15.4,22.3,
START=c(1920,1),FREQ=1),
i =TIMESERIES(2.7,-.2,1.9,5.2,3,5.1,5.6,4.2,3,5.1,1,-3.4,-6.2,
-5.1,-3,-1.3,2.1,2,-1.9,1.3,3.3,4.9,
START=c(1920,1),FREQ=1),
k =TIMESERIES(182.8,182.6,184.5,189.7,192.7,197.8,203.4,207.6,
210.6,215.7,216.7,213.3,207.1,202,199,197.7,199.8,
201.8,199.9,201.2,204.5,209.4,
START=c(1920,1),FREQ=1),
p =TIMESERIES(12.7,12.4,16.9,18.4,19.4,20.1,19.6,19.8,21.1,21.7,
15.6,11.4,7,11.2,12.3,14,17.6,17.3,15.3,19,21.1,23.5,
START=c(1920,1),FREQ=1),
wp =TIMESERIES(28.8,25.5,29.3,34.1,33.9,35.4,37.4,37.9,39.2,41.3,
37.9,34.5,29,28.5,30.6,33.2,36.8,41,38.2,41.6,45,53.3,
START=c(1920,1),FREQ=1),
y =TIMESERIES(43.7,40.6,49.1,55.4,56.4,58.7,60.3,61.3,64,67,57.7,
50.7,41.3,45.3,48.9,53.3,61.8,65,61.2,68.4,74.1,85.3,
START=c(1920,1),FREQ=1),
t =TIMESERIES(3.4,7.7,3.9,4.7,3.8,5.5,7,6.7,4.2,4,7.7,7.5,8.3,5.4,
6.8,7.2,8.3,6.7,7.4,8.9,9.6,11.6,
START=c(1920,1),FREQ=1),
time=TIMESERIES(NA,-10,-9,-8,-7,-6,-5,-4,-3,-2,-1,0,
1,2,3,4,5,6,7,8,9,10,
START=c(1920,1),FREQ=1),
wg =TIMESERIES(2.2,2.7,2.9,2.9,3.1,3.2,3.3,3.6,3.7,4,4.2,4.8,
5.3,5.6,6,6.1,7.4,6.7,7.7,7.8,8,8.5,
START=c(1920,1),FREQ=1)
);
#load time series into the model object
kleinModel <- LOAD_MODEL_DATA(kleinModel,kleinModelData)</code></pre>
<pre><code>## Load model data "kleinModelData" into model "kleinModelDef"...
## ...LOAD MODEL DATA OK</code></pre>
<pre class="r"><code>#estimate the model
kleinModel <- ESTIMATE(kleinModel, quietly=TRUE)</code></pre>
<p>To keep this post short, we show the output for a single estimation only; the output for each estimated equation is similar to the following:</p>
<pre class="r"><code>kleinModel <- ESTIMATE(kleinModel, eqList='cn')</code></pre>
<pre><code>##
## Estimate the Model kleinModelDef:
## the number of behavioral equations to be estimated is 1.
## The total number of coefficients is 4.
##
## _________________________________________
##
## BEHAVIORAL EQUATION: cn
## Estimation Technique: OLS
## Autoregression of Order 2 (Cochrane-Orcutt procedure)
##
## Convergence was reached in 6 / 20 iterations.
##
##
## cn = 14.83
## T-stat. 7.608 ***
##
## + 0.2589 p
## T-stat. 2.96 *
##
## + 0.01424 TSLAG(p,1)
## T-stat. 0.1735
##
## + 0.839 (wp+wg)
## T-stat. 14.68 ***
##
## ERROR STRUCTURE: AUTO(2)
##
## AUTOREGRESSIVE PARAMETERS:
## Rho Std. Error T-stat.
## 0.2542 0.2589 0.9817
## -0.05251 0.2594 -0.2024
##
##
## STATs:
## R-Squared : 0.9827
## Adjusted R-Squared : 0.9755
## Durbin-Watson Statistic : 2.256
## Sum of squares of residuals : 8.072
## Standard Error of Regression : 0.8201
## Log of the Likelihood Function : -18.32
## F-statistic : 136.2
## F-probability : 3.874e-10
## Akaike's IC : 50.65
## Schwarz's IC : 56.88
## Mean of Dependent Variable : 54.29
## Number of Observations : 18
## Number of Degrees of Freedom : 12
## Current Sample (year-period) : 1923-1 / 1940-1
##
##
## Signif. codes: *** 0.001 ** 0.01 * 0.05
##
##
## ...ESTIMATE OK</code></pre>
<p>The <code>ESTIMATE()</code> function can also fit non-simultaneous systems and single equations. Several predefined time series transformations are available in <a href="https://cran.r-project.org/web/packages/bimets/index.html">bimets</a>:</p>
<p>– Time series extension <code>TSEXTEND()</code><br />
– Time series merging <code>TSMERGE()</code><br />
– Time series projection <code>TSPROJECT()</code><br />
– Lag <code>TSLAG()</code><br />
– Lag differences: standard, percentage and logarithmic, i.e. <code>TSDELTA()</code>, <code>TSDELTAP()</code>, <code>TSDELTALOG()</code><br />
– Cumulative product <code>CUMPROD()</code><br />
– Cumulative sum <code>CUMSUM()</code><br />
– Moving average <code>MOVAVG()</code><br />
– Moving sum <code>MOVSUM()</code><br />
– Parametric (Dis)Aggregation <code>YEARLY()</code>, <code>QUARTERLY()</code>, <code>MONTHLY()</code>, <code>DAILY()</code><br />
– Time series data presentation <code>TABIT()</code></p>
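<p>For instance, a few of these helpers combine as follows (a small sketch with made-up numbers, assuming bimets is loaded; the calls mirror those used elsewhere in this post):</p>

```r
library(bimets)

# An annual toy series starting in 2000 (values are made up)
ts1 <- TIMESERIES(10, 12, 15, 19, 24, START = c(2000, 1), FREQ = 1)

lagged <- TSLAG(ts1, 1)     # series shifted back one period
diffed <- TSDELTA(ts1)      # period-over-period differences

TABIT(ts1, lagged, diffed)  # print the series side by side
```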
</div>
<div id="forecasting" class="section level3">
<h3>Forecasting</h3>
<p>The <code>predict()</code> function of the <code>lm()</code> or <code>dyn$lm</code> linear model framework produces predicted values obtained by evaluating the regression function with new data; it is a popular function among R users.</p>
<p>Unfortunately, it does not help in our example: as we said before, to forecast a simultaneous model whose incidence graph contains circular dependencies, we cannot merely evaluate the right-hand side of the equations, as the <code>predict.lm</code> function does; we need an iterative algorithm.</p>
<p>The <code>predict.lm</code> equivalent that allows us to forecast our simultaneous model is the <code>SIMULATE()</code> function. <code>SIMULATE()</code> can also solve non-simultaneous models, in which case it gives the same results as <code>predict.lm</code>.</p>
<p>In addition, as in the Capital Stock <code>k</code> equation in our example, the <code>SIMULATE()</code> function can conditionally evaluate an identity during a simulation, depending on the value of a logical expression (e.g. for each simulation period the <code>k</code> equation changes depending on the <code>i</code> current value). Thus, it is possible to have a model alternating between two or more equation specifications for each simulation period, depending upon results from other equations.</p>
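<p>The switching logic of the <code>k</code> identity can be illustrated in a few lines of plain R (a conceptual sketch, not bimets code, with made-up investment values): per period, capital grows by investment only when investment is positive.</p>

```r
# Conditional capital-stock identity, evaluated period by period
i <- c(2, -1, 3)               # made-up investment values
k <- numeric(length(i) + 1)
k[1] <- 100                    # initial capital stock
for (t in seq_along(i)) {
  # switch between the two equation specifications, as the IF> clauses do
  k[t + 1] <- if (i[t] > 0) k[t] + i[t] else k[t]
}
k  # 100 102 102 105
```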
<p>Structural stability, multiplier analysis and endogenous targeting are additional capabilities coded in <a href="https://cran.r-project.org/web/packages/bimets/index.html">bimets</a> but not described in this post.</p>
<p>In order to forecast the model up to 1944, we need to extend exogenous time series by using the <code>TSEXTEND()</code> function. In this example, we perform simple extensions:</p>
<pre class="r"><code>#we need to extend exogenous variables up to 1944
kleinModel$modelData <- within(kleinModel$modelData,{
wg = TSEXTEND(wg, UPTO=c(1944,1),EXTMODE='CONSTANT')
t = TSEXTEND(t, UPTO=c(1944,1),EXTMODE='LINEAR')
g = TSEXTEND(g, UPTO=c(1944,1),EXTMODE='CONSTANT')
k = TSEXTEND(k, UPTO=c(1944,1),EXTMODE='LINEAR')
time = TSEXTEND(time,UPTO=c(1944,1),EXTMODE='LINEAR')
})</code></pre>
<p>A call to the <code>SIMULATE()</code> function will solve our simultaneous system of equations:</p>
<pre class="r"><code>#forecast model
kleinModel <- SIMULATE(kleinModel
,simType='FORECAST'
,TSRANGE=c(1941,1,1944,1)
,simConvergence=0.00001
,simIterLimit=100
,quietly=TRUE
)</code></pre>
<p>The historical GNP (originally referred to as “Net national income, measured in billions of 1934 dollars”, pg. 141 in “<a href="https://cowles.yale.edu/sites/default/files/files/pub/mon/m11-all.pdf">Economic Fluctuations in the United States 1921-1941</a>” by L. R. Klein, Wiley and Sons Inc., New York, 1950) is shown in the figure below, along with the simulation and the forecast.</p>
<pre class="r"><code>#get forecasted GNP
TABIT(kleinModel$simulation$y)</code></pre>
<pre><code>##
## DATE, PER, kleinModel$simulation$y
##
## 1941, 1 , 125.3
## 1942, 1 , 172.5
## 1943, 1 , 185.6
## 1944, 1 , 141.1</code></pre>
<p><img src="/2021/01/22/sem-time-series-modeling/index_files/figure-html/plot_ts-1.png" width="672" /></p>
<p>Disclaimer: <em>The views and opinions expressed in this page are those of the author and do not necessarily reflect the official policy or position of the Bank of Italy. Examples of analysis performed within these pages are only examples. They should not be utilized in real-world analytic products as they are based only on very limited and dated open source information. Assumptions made within the analysis are not reflective of the position of the Bank of Italy.</em></p>
</div>
A Custom Forest Plot from Wonderful Wednesdays
https://rviews.rstudio.com/2021/01/15/wonderful-wednesdays-forest-plot/
Fri, 15 Jan 2021 00:00:00 +0000https://rviews.rstudio.com/2021/01/15/wonderful-wednesdays-forest-plot/
<p><em>Waseem Medhat is a Statistical Programmer and Computational Experimentalist who resides in Alexandria, Egypt.</em></p>
<p>This post takes a closer look at the forest plot mentioned in a <a href="https://rviews.rstudio.com/2021/01/11/wonderful-wednesdays/">previous post</a> introducing PSI’s Wonderful Wednesdays events. It describes a custom forest plot with additional bands that visualize the heterogeneity between studies in a meta-analysis; the plot was part of a project submitted to the Wonderful Wednesdays challenge hosted by PSI and reviewed by statisticians in the organization. Find more information <a href="https://vis-sig.github.io/blog/posts/2020-12-03-wonderful-wednesdays-december-2020/">here</a>. The plot is built with JavaScript using the <a href="https://d3js.org/">D3.js</a> library and wrapped in a <a href="https://shiny.rstudio.com/">Shiny</a> app with the help of the <a href="https://rstudio.github.io/r2d3/">R2D3</a> package.</p>
<h2 id="background-problem">Background problem</h2>
<p>The problem behind this visualization is specific to meta-analysis, the statistical pooling of the results of multiple studies (e.g. a multi-center clinical trial) to obtain a single, more powerful estimate. The choice of pooling model (fixed effect vs. random effects) depends on the heterogeneity of effect sizes between studies. So, the main question that I wanted to answer with this visualization is:</p>
<p><em>“What graphical tools can be used to assess heterogeneity?”</em></p>
<p>Like any statistical graphic, the purpose of the visualization is to complement the statistical measures of heterogeneity, like I<sup>2</sup>, to give a more complete picture.</p>
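<p>For reference, I<sup>2</sup> is straightforward to compute from per-study effect estimates and standard errors; the sketch below uses made-up numbers (not the challenge data) and the standard Higgins-Thompson formula based on Cochran's Q.</p>

```r
# Effect estimates (e.g. log odds ratios) and standard errors per study
y  <- c(0.10, 0.30, 0.50)
se <- c(0.10, 0.10, 0.10)

w      <- 1 / se^2                 # inverse-variance (fixed-effect) weights
pooled <- sum(w * y) / sum(w)      # fixed-effect pooled estimate
Q      <- sum(w * (y - pooled)^2)  # Cochran's Q statistic
I2     <- max(0, (Q - (length(y) - 1)) / Q) * 100

round(c(pooled = pooled, Q = Q, I2 = I2), 2)  # pooled = 0.3, Q = 8, I2 = 75
```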
<h2 id="plot-description">Plot description</h2>
<p><img src="https://waseem-medhat.netlify.app/post/forest-plot-with-heterogeneity-bands_files/forest_plot_with_bands.png" alt="" /></p>
<p>A lot of the plot components are directly comparable to the typical <a href="https://en.wikipedia.org/wiki/Forest_plot">forest plot</a>, which is very popular in the medical field as a visualization tool in meta-analyses. Its main features are:</p>
<ul>
<li>A square for each study’s point estimate. The square size is proportional to the sample size (i.e. weight) of that study.</li>
<li>A line for each confidence interval of the effect size in a study.</li>
<li>Diamonds that represent the pooled estimate using either fixed-effect or random-effects model. The diamond width represents the confidence interval around the pooled estimate.</li>
<li>The plot is usually combined with a tabular display of the numbers represented by the plot.</li>
</ul>
<p>My own additions are:</p>
<ul>
<li>Colored bands to give a better visualization of the heterogeneity between studies. There is a band for each study, with a width equal to its confidence interval. All the bands are semi-transparent and overlayed over each other so that more overlapping produces darker areas.</li>
<li>More attention to annotations than in typical plots: a title and a subtitle give the interventions and the outcome, respectively. Another label shows which direction represents the “positive” effect.</li>
</ul>
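<p>As a rough ggplot2 sketch of these components (squares sized by weight, confidence-interval lines, and semi-transparent full-height bands whose overlap darkens), using made-up study results rather than the challenge data:</p>

```r
library(ggplot2)

# Hypothetical study-level results (log odds ratios); not the challenge data
studies <- data.frame(
  study = paste("Study", 1:5),
  est   = c(-0.6, -0.2, -0.5, -0.9, -0.3),
  se    = c(0.20, 0.30, 0.25, 0.35, 0.22)
)
studies$lo <- studies$est - 1.96 * studies$se   # 95% CI lower bound
studies$hi <- studies$est + 1.96 * studies$se   # 95% CI upper bound
studies$w  <- 1 / studies$se^2                  # inverse-variance weight

p <- ggplot(studies, aes(y = study)) +
  # semi-transparent full-height bands: overlapping CIs darken the region
  geom_rect(data = studies,
            aes(xmin = lo, xmax = hi, ymin = -Inf, ymax = Inf),
            fill = "steelblue", alpha = 0.08, inherit.aes = FALSE) +
  geom_segment(aes(x = lo, xend = hi, yend = study)) +  # CI lines
  geom_point(aes(x = est, size = w), shape = 15) +      # squares sized by weight
  geom_vline(xintercept = 0, linetype = "dashed") +     # line of no effect
  labs(x = "Log odds ratio", y = NULL) +
  theme_minimal()
p
```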
<h2 id="shiny-app-description">Shiny app description</h2>
<p><img src="https://waseem-medhat.netlify.app/post/forest-plot-with-heterogeneity-bands_files/forest_plot_with_bands_shiny.png" alt="" /></p>
<p>As a proof of concept for a viable product, I wrapped the plot in a Shiny app which provides additional interactive features:</p>
<ul>
<li>Selection of summary measure, which can be expanded to include more than the
odds ratio and risk ratio.</li>
<li>Control over the plot dimensions. Giving this control to the user allows the plot to be conveniently visible in different screen sizes and deliverable forms (e.g. a report or a dashboard).</li>
<li>A help button that shows a guide for interpretation. This makes the information available on demand instead of relegating it to a separate tab.</li>
</ul>
<h2 id="technologies-and-packages">Technologies and packages</h2>
<h3 id="d3-js-javascript">D3.js (JavaScript)</h3>
<p>The plot itself (and associated tabular display) was built using D3. Being a JavaScript library, D3 works with web technologies: HTML, CSS, and especially SVG. It has a lot of low-level tools that bind data to SVG shapes and change the shape properties accordingly. One particular advantage of using web technologies in this visualization is that CSS allows semi-transparent elements to “blend” colors in multiple ways, which allowed me to choose a blend mode that emphasizes the overlap.</p>
<h3 id="r2d3">R2D3</h3>
<p>R2D3 was the main wrapper around the D3 visualization. Besides the obvious advantage of providing an interface between R and D3 and allowing the visualization to be rendered in Shiny, it simplifies steps such as passing the data to the plot and making the plot fill as much of its container as possible. Because of this, variables like <code>data</code>, <code>width</code>, <code>height</code>, and (the container) <code>svg</code> are provided by R2D3 and are not declared in the JavaScript code.</p>
<h3 id="shiny">Shiny</h3>
<p>Shiny needs no introduction at this point: it is the de facto standard for R-based web applications. The R2D3 package provides <code>d3Output()</code> and <code>renderD3()</code> to render the D3 plot just like any typical Shiny output. Other Shiny packages I used are
<a href="https://github.com/cwthom/shinyhelper">shinyhelper</a>, which provides a help button and rich modal dialogs for help content, and <a href="http://rstudio.github.io/shinythemes/">shinythemes</a> to change the appearance of the app.</p>
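<p>A minimal skeleton of how these pieces fit together (the file name <code>forest_plot.js</code>, the input choices, and the sample data are hypothetical, not the app’s actual code):</p>

```r
library(shiny)
library(r2d3)

ui <- fluidPage(
  selectInput("measure", "Summary measure",
              choices = c("Odds ratio", "Risk ratio")),
  d3Output("forest", height = "500px")
)

server <- function(input, output) {
  output$forest <- renderD3({
    # r2d3() hands `data` (plus `width`, `height`, and `svg`) to the D3 code
    df <- data.frame(study = paste("Study", 1:3),
                     est   = c(-0.6, -0.2, -0.5))
    r2d3(data = df, script = "forest_plot.js")
  })
}

# shinyApp(ui, server)  # uncomment to launch the app
```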
<p>By wrapping the visualization in a Shiny app, this project became a prototype that can be taken in many different directions: it could include more effect sizes and other meta-analysis techniques, or even gain another module that imports the data, performs the meta-analysis, and sends the results to this module for visualization.</p>
<h2 id="links">Links</h2>
<ul>
<li>Live version of the Shiny app: <a href="https://waseem-medhat.shinyapps.io/forest_plot_with_bands/">https://waseem-medhat.shinyapps.io/forest_plot_with_bands/</a></li>
<li>GitHub: <a href="https://github.com/waseem-medhat/forest_plot_with_bands">https://github.com/waseem-medhat/forest_plot_with_bands</a></li>
<li>Wonderful Wednesdays December submissions: <a href="https://vis-sig.github.io/blog/posts/2020-12-03-wonderful-wednesdays-december-2020/">https://vis-sig.github.io/blog/posts/2020-12-03-wonderful-wednesdays-december-2020/</a></li>
</ul>
Wonderful Wednesdays
https://rviews.rstudio.com/2021/01/11/wonderful-wednesdays/
Mon, 11 Jan 2021 00:00:00 +0000https://rviews.rstudio.com/2021/01/11/wonderful-wednesdays/
<p>For almost a year now, the PSI Visualization Special Interest Group <a href="https://www.psiweb.org/sigs-special-interest-groups/visualisation">(VIS SIG)</a> has been conducting a monthly graduate-level seminar on creating effective statistical visualizations that is open to everyone. <a href="https://www.psiweb.org/sigs-special-interest-groups/visualisation/welcome-to-wonderful-wednesdays">Wonderful Wednesdays</a> is a unique collegial event. Every month the SIG publishes a link to a new data set and issues a challenge to produce visualizations that effectively illustrate some specific aspect of the data. Anyone can submit an entry coded in the language of their choice. Submissions received by the deadline are then critiqued in a free webinar that takes place roughly thirty days later. You don’t have to make a submission to attend the webinar.</p>
<p>The process is well organized and straightforward. The figure below illustrates the process and timeline for the webinar that will happen this week on Wednesday, January 13th.</p>
<p><img src="ww1.png" height = "300" width="500"></p>
<p>Click <a href="https://attendee.gotowebinar.com/register/3242063276946783247">here</a> to register for the webinar.</p>
<p>Here is the <a href="https://www.psiweb.org/vod/item/psi-vissig-wonderful-wednesday-10-meta-analysis#video_490750250">link</a> to the December webinar where the challenge was to visualize the heterogeneity among the data used for a meta-analysis of seven studies undertaken to show a reduction in hypertension.</p>
<p>The following image is a traditional forest plot showing a comparison of the odds ratios for the seven studies. (If you are not familiar with this plot type look <a href="https://s4be.cochrane.org/blog/2016/07/11/tutorial-read-forest-plot/">here</a> for some tips on how to interpret it.)</p>
<p><img src="forest.png" height = "300" width="500"></p>
<p>This image comes from a Shiny app that is critiqued by the VIS-SIG statisticians towards the beginning of the webinar. The experts liked the clean look and labeling of the plot, but had mixed feelings about the colored bands, which are meant to show regions where the studies overlap. (Darker color indicates more overlap.) Here are the links to the <a href="https://waseem-medhat.shinyapps.io/forest_plot_with_bands/">Shiny App</a> and the <a href="https://vis-sig.github.io/blog/posts/2020-12-03-wonderful-wednesdays-december-2020/#example1%20code">code</a> and also to the <a href="https://vis-sig.github.io/blog/posts/2020-12-03-wonderful-wednesdays-december-2020/">blog post</a> that reviews all of the submissions for the December challenge.</p>
<p>Visit the <a href="https://vis-sig.github.io/blog/">VIS-SIG Blog</a> to find posts and code for the submissions to all of the Wonderful Wednesdays events so far.</p>
COVID-19 Data: The Long Run
https://rviews.rstudio.com/2021/01/06/covid-19-data-the-long-run/
Wed, 06 Jan 2021 00:00:00 +0000https://rviews.rstudio.com/2021/01/06/covid-19-data-the-long-run/
<p>The world seems to have moved to a new phase of paying attention to COVID-19. We have gone from pondering daily plots of case counts, to puzzling through models and forecasts, and are now moving on to the vaccines and the science behind them. For data scientists, however, the focus needs to remain on the data and the myriad issues and challenges that efforts to collect and curate COVID data have uncovered. My intuition is that not only will COVID-19 data continue to be important for quite some time in the future, but that efforts to improve the quality of this data will be crucial for successfully dealing with the next pandemic.</p>
<p>An incredible amount of work has been done by epidemiologists, universities, government agencies and data journalists to collect, organize, and reconcile data from thousands of sources. Nevertheless, the experts caution that there is much yet to be done.</p>
<p>Roni Rosenfeld, head of the Machine Learning Department of the School of Computer Science at Carnegie Mellon University and project lead for the <a href="https://delphi.cmu.edu/about/">Delphi Group</a>, put it this way in a recent <a href="https://www.niss.org/news/copss-niss-webinar-delphi%E2%80%99s-covidcast-project-featured">COPSS-NISS webinar</a>:</p>
<blockquote>
<p>Data is a big problem in this pandemic. Availability of high quality, comprehensive, geographically detailed data is very far from where it should be.</p>
</blockquote>
<p>There are over <a href="https://www.aha.org/statistics/fast-facts-us-hospitals">6,000</a> hospitals in the United States, and over <a href="https://en.wikipedia.org/wiki/Lists_of_hospitals#:~:text=These%20are%20links%20to%20lists,164%2C500%20hospitals%20worldwide%20in%202015">160,000</a> hospitals worldwide. Many of these are collecting COVID-19 data, yet there are few standards for recording cases, dealing with missing data, updating case count data, and coping with the time lag between recording and reporting cases. <a href="https://journals.lww.com/epidem/Fulltext/2019/09000/Nowcasting_the_Number_of_New_Symptomatic_Cases.16.aspx">Nowcasting</a> epidemiological and health care data has become a vital field of statistical research.</p>
<p>The following <a href="https://cmu-delphi.github.io/covidcast/talks/copss-niss/talk.html#(4)">slide</a> from the COPSS-NISS webinar shows a hierarchy of relevant COVID data organized on the Severity Pyramid that epidemiologists use to study disease progression.</p>
<p><img src="pyramid.svg" height = "400" width="600"></p>
<p>The Delphi Group is making fundamental contributions to the long term improvement of COVID data by archiving the data shown in such a way that versions can be retrieved by date, and also by collecting <a href="https://rviews.rstudio.com/2020/10/13/delphi-s-covidcast-project/">massive data sets</a> of leading indicators.</p>
<p>The webinar is well worth watching, and I highly recommend listening through the Q&A session at the end. The speakers explain the importance of nowcasting and Professor Rosenfeld presents a vision of making epidemic forecasting comparable to weather forecasting. It seems to me that this would be a worthwhile project to help advance.</p>
<p>Note that Delphi’s <a href="https://cmu-delphi.github.io/delphi-epidata/api/covidcast_signals.html">COVID-19 indicators</a>, probably the nation’s largest public repository of diverse, geographically detailed, real-time indicators of COVID activity in the US, are freely available through a <a href="https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html">public API</a> that is easily accessible to R and Python users.</p>
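<p>For example, here is a sketch of pulling one indicator with Delphi’s <code>covidcast</code> R package, which wraps the API. Assumptions: the package is installed, a network connection is available, and the signal name shown is one of the many listed in the COVIDcast documentation.</p>

```r
library(covidcast)

# Fetch a week of state-level 7-day-average confirmed case counts
# from the JHU CSSE source via the COVIDcast API (requires network access).
cases <- covidcast_signal(
  data_source = "jhu-csse",
  signal      = "confirmed_7dav_incidence_num",
  start_day   = "2021-01-01",
  end_day     = "2021-01-07",
  geo_type    = "state"
)
head(cases[, c("geo_value", "time_value", "value")])
```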
<p>Also note that R users can contribute to <a href="https://www.r-consortium.org/">R Consortium</a> sponsored COVID-related projects, which include the <a href="https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html">COVID-19 Data Hub</a>, an organized archive of global COVID-19 case count data, and the <a href="https://tasks.repidemicsconsortium.org/#/">RECON COVID-19 Challenge</a>, an open source project to improve epidemiological tools.</p>
November: "Top 40" New CRAN Packages
https://rviews.rstudio.com/2020/12/22/november-top-40-new-cran-packages/
Tue, 22 Dec 2020 00:00:00 +0000https://rviews.rstudio.com/2020/12/22/november-top-40-new-cran-packages/
<p>Two hundred ninety-two new packages made it to CRAN in November. Picking forty was unusually difficult. Nevertheless, here are my “Top 40” selections in twelve categories: Archaeology, Computational Methods, Data, Epidemiology, Games, Machine Learning, Mathematics, Medicine, Statistics, Time Series, Utilities, and Visualization. R developers continue to extend the reach of R. November featured a new package on Archaeology, one of only seventeen I could find on CRAN (via <code>pkgsearch::pkg_search(query = "Archaeology", size = 200)</code>), as well as a package that wraps Python’s <code>chess</code> package.</p>
<p>Looking back over the last twelve months my impression is that R continues to grow in the life sciences. Packages that I have classified as belonging to the categories Epidemiology, Genomics, or Medicine have comprised between ten and fourteen percent of the packages I have reviewed each month.</p>
<h3 id="archaeology">Archaeology</h3>
<p><a href="https://cran.r-project.org/package=archeofrag">archeofrag</a> v0.6.0: Implements methods based on graphs and graph theory for the stratigraphic analysis of fragmented objects in archaeology using “refitting” relationships between fragments scattered in stratigraphic layers. See the <a href="https://cran.r-project.org/web/packages/archeofrag/vignettes/archeofrag-vignette.html">vignette</a>.</p>
<h3 id="computational-methods">Computational Methods</h3>
<p><a href="https://cran.r-project.org/package=ADtools">ADtools</a> v0.5.4: Implements the forward-mode automatic differentiation for multivariate functions using the matrix-calculus notation from <a href="https://onlinelibrary.wiley.com/doi/book/10.1002/9781119541219">Magnus and Neudecker (2019)</a>. See the <a href="https://cran.r-project.org/web/packages/ADtools/vignettes/introduction-to-ADtools.html">vignette</a> for an introduction.</p>
<p><a href="https://cran.r-project.org/package=ML2Pvae">ML2Pvae</a> v1.0.0: Provides functions to create a variational autoencoder (VAE) for parameter estimation in Item Response Theory (IRT), allowing straightforward construction, training, and evaluation. Only minimal knowledge of <code>tensorflow</code> or <code>keras</code> is required. See <a href="https://ieeexplore.ieee.org/document/8852333">Curi et al. (2019)</a> for background and the <a href="https://cran.r-project.org/web/packages/ML2Pvae/vignettes/ml2p_vae_vignette.pdf">vignette</a> for an overview.</p>
<h3 id="data">Data</h3>
<p><a href="https://cran.r-project.org/package=campfin">campfin</a> v1.0.4: Provides tools to explore and normalize American campaign finance data. This package was created by the Investigative Reporting Workshop to facilitate work on <a href="https://publicaccountability.org/">The Accountability Project</a>. See the <a href="https://cran.r-project.org/web/packages/campfin/vignettes/normalize-geography.html">vignette</a> to get started.</p>
<p><a href="https://cran.r-project.org/package=cpsvote">cpsvote</a> v0.1.0: Provides automated methods for downloading, recoding, and merging selected years of the Current Population Survey’s Voting and Registration Supplement, a large national survey about registration, voting, and non-voting in <a href="http://www.electproject.org/home/voter-turnout/voter-turnout-data">United States federal elections</a>. There are vignettes on <a href="https://cran.r-project.org/web/packages/cpsvote/vignettes/basics.html">basics</a>, <a href="https://cran.r-project.org/web/packages/cpsvote/vignettes/background.html">background</a>, <a href="https://cran.r-project.org/web/packages/cpsvote/vignettes/voting.html">voting</a>, and <a href="https://cran.r-project.org/web/packages/cpsvote/vignettes/add-variables.html">adding variables</a>.</p>
<p><a href="https://cran.r-project.org/package=geogenr">geogenr</a> v1.0.0: Allows users to access geodatabases and obtain information from the American Community Survey <a href="https://www.census.gov/programs-surveys/acs">(ACS)</a>. See the <a href="https://cran.r-project.org/web/packages/geogenr/vignettes/geogenr.html">vignette</a> to get started.</p>
<p><a href="https://cran.r-project.org/package=openSkies">openSkies</a> v0.99.8: Provides a client interface to the <a href="https://opensky-network.org">OpenSky</a> API that allows retrieval of flight information, as well as aircraft state vectors. See the <a href="https://cran.r-project.org/web/packages/openSkies/vignettes/openSkies.html">vignette</a>.</p>
<p><img src="openSkies.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=salem">salem</a> v0.2.0: Access data on all 152 accused witches from the 1692 <a href="https://www.tulane.edu/~salem/index.html">Salem Witch Trials</a>. There is an <a href="https://cran.r-project.org/web/packages/salem/vignettes/introduction.html">Introduction</a> and a <a href="https://cran.r-project.org/web/packages/salem/vignettes/recreating_analyses.html">vignette</a> reproducing an analysis.</p>
<p><img src="salem.png" height = "400" width="600"></p>
<h3 id="epidemiology">Epidemiology</h3>
<p><a href="https://cran.r-project.org/package=oxcgrt">oxcgrt</a> v0.1.0: Implements an interface to the Oxford COVID-19 Government Response Tracker <a href="https://covidtracker.bsg.ox.ac.uk/">(OxCGRT)</a>. There are vignettes on <a href="https://cran.r-project.org/web/packages/oxcgrt/vignettes/calculate.html">calculating indices</a> and <a href="https://cran.r-project.org/web/packages/oxcgrt/vignettes/retrieve.html">retrieving data</a>.</p>
<p><a href="https://cran.r-project.org/package=PandemicLP">PandemicLP</a> v0.2.0: Implements the <a href="http://est.ufmg.br/covidlp/home/pt/">CovidLP</a> methodology for long-term epidemic and pandemic prediction. There is an <a href="https://cran.r-project.org/web/packages/PandemicLP/vignettes/PandemicLP.html">Introduction</a> and a <a href="https://cran.r-project.org/web/packages/PandemicLP/vignettes/PandemicLP_SumRegions.html">case study</a>.</p>
<p><a href="https://cran.r-project.org/package=SEIRfansy">SEIRfansy</a> v1.1.0: Implements the Extended Susceptible-Exposed-Infected-Recovery Model for handling high false negative rate and symptom based administration of diagnostic tests. See <a href="https://www.medrxiv.org/content/10.1101/2020.09.24.20200238v1">Bhaduri et al. (2020)</a> and the <a href="https://github.com/umich-biostatistics/SEIRfans">GitHub site</a> for examples.</p>
<p><img src="SEIRfansy.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=trendeval">trendeval</a> v0.0.1: Provides a coherent interface for evaluating models fit with the <a href="https://www.repidemicsconsortium.org/">RECON</a> <a href="https://CRAN.R-project.org/package=trending"><code>trending</code></a> package. See <a href="https://cran.r-project.org/web/packages/trendeval/readme/README.html">README</a> to get started.</p>
<p><img src="trendeval.png" height = "300" width="500"></p>
<h3 id="games">Games</h3>
<p><a href="https://cran.r-project.org/package=chess">chess</a> v1.0.1: Implements an “opinionated” wrapper around the <code>python-chess</code> library allowing users to read and write PGN files as well as create and explore game trees such as the ones seen in chess books. See the vignettes <a href="https://cran.r-project.org/web/packages/chess/vignettes/chess.html">chess</a>, <a href="https://cran.r-project.org/web/packages/chess/vignettes/games.html">games</a>, and <a href="https://cran.r-project.org/web/packages/chess/vignettes/advanced.html">advanced</a>.</p>
<p><a href="https://cran.r-project.org/package=codebreaker">codebreaker</a> v0.0.2: Inspired by <a href="https://www.archimedes-lab.org/mastermind.html">Mastermind</a>, the package implements a logic game in the style of the early 1980s home computers that can be played in the R console. Can you break the code? See <a href="https://cran.r-project.org/web/packages/codebreaker/readme/README.html">README</a> to start playing.</p>
<p><img src="codebreaker.png" height = "200" width="400"></p>
<h3 id="machine-learning">Machine Learning</h3>
<p><a href="https://cran.r-project.org/package=fastai">fastai</a> v2.0.2: Implements functions to simplify training neural networks based on best practices developed at <a href="https://www.fast.ai/">fast.ai</a>. See the <a href="https://github.com/henry090/fastai">website</a> to get started and the twenty-three vignettes which include <a href="https://cran.r-project.org/web/packages/fastai/vignettes/audio.html">Audio Classification</a>, <a href="https://cran.r-project.org/web/packages/fastai/vignettes/multilabel.html">Multilabel Classification</a> and <a href="https://cran.r-project.org/web/packages/fastai/vignettes/medical_dcm.html">Medical Images</a>.</p>
<p><img src="fastai.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=mikropml">mikropml</a> v0.0.2: Implements the ML pipeline described in <a href="https://mbio.asm.org/content/11/3/e00434-20">Topçuoğlu et al. (2020)</a> for building machine learning models for classification and regression problems. There is an <a href="https://cran.r-project.org/web/packages/mikropml/vignettes/introduction.html">Introduction</a> and an <a href="https://cran.r-project.org/web/packages/mikropml/vignettes/paper.html">Overview</a>.</p>
<p><img src="mikropml.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=stacks">stacks</a> v0.1.0: Implements a grammar of model stacking for <code>tidymodels</code>. There is a <a href="https://cran.r-project.org/web/packages/stacks/vignettes/basics.html">Getting Started Guide</a> and a <a href="https://cran.r-project.org/web/packages/stacks/vignettes/classification.html">vignette</a> on classification.</p>
<p><img src="stacks.png" height = "400" width="600"></p>
<h3 id="mathematics">Mathematics</h3>
<p><a href="https://cran.r-project.org/package=BaseSet">BaseSet</a> v0.0.14: Implements a class and methods to work with sets, doing intersection, union, complementary sets, power sets, Cartesian products and other set operations in a “tidy” way. See the <a href="https://cran.r-project.org/web/packages/BaseSet/vignettes/basic.html">Introduction</a>, and the vignettes <a href="https://cran.r-project.org/web/packages/BaseSet/vignettes/advanced.html">Advanced Examples</a> and <a href="https://cran.r-project.org/web/packages/BaseSet/vignettes/fuzzy.html">Fuzzy Sets</a>.</p>
<p><a href="https://cran.r-project.org/src/contrib/Archive/viscomplexr">viscomplexr</a> v1.1.0: Provides functions to create phase portraits of functions in the complex number plane. See the <a href="https://cran.r-project.org/web/packages/viscomplexr/vignettes/viscomplexr-vignette.html">vignette</a> to get started.</p>
<p><img src="viscomplexr.png" height = "400" width="600"></p>
<h3 id="medicine">Medicine</h3>
<p><a href="https://cran.r-project.org/package=causalCmprsk">causalCmprsk</a> v1.0.0: Provides functions to estimate average treatment effects of two static treatment regimes on time-to-event outcomes with competing events. The method uses propensity-score weighting for emulation of baseline randomization. See the <a href="https://cran.r-project.org/web/packages/causalCmprsk/vignettes/cmp_rsk_RHC.html">vignette</a>.</p>
<p><img src="causalCmprisk.png" height = "250" width="450"></p>
<p><a href="https://cran.r-project.org/package=eventglm">eventglm</a> v1.0.2: Implements methods for event history regression for marginal estimands, including the cumulative incidence and the restricted mean survival, following the methodology reviewed in <a href="https://journals.sagepub.com/doi/10.1177/0962280209105020">Andersen & Perme (2010)</a>. See the <a href="https://cran.r-project.org/web/packages/eventglm/vignettes/example-analysis.html">vignette</a> for examples.</p>
<p><img src="eventglm.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=IPDfromKM">IPDfromKM</a> v0.1.10: Implements a method to reconstruct individual patient data from Kaplan-Meier (KM) survival curves, visualize and assess the accuracy of the reconstruction, and perform secondary analysis on the reconstructed data. The package also implements the iterative KM estimation algorithm proposed in <a href="https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-12-9">Guyot (2012)</a>.</p>
<p><a href="https://cran.r-project.org/package=packDAMipd">packDAMipd</a> v0.1.2: Provides functions to construct both time-homogenous and time-dependent Markov models for cost-effectiveness analyses, perform decision analyses, and conduct deterministic and probabilistic sensitivity analyses. There are vignettes on <a href="https://cran.r-project.org/web/packages/packDAMipd/vignettes/Deterministic_sensitivity_analysis.html">deterministic</a> and <a href="https://cran.r-project.org/web/packages/packDAMipd/vignettes/Probabilstic_sensitivity_analysis.html">probabilistic</a> sensitivity analyses, <a href="https://cran.r-project.org/web/packages/packDAMipd/vignettes/Simple_sick_sicker.html">simple</a> “sick-sicker” models, <a href="https://cran.r-project.org/web/packages/packDAMipd/vignettes/Sick_sicker_age_dependent.html">age-dependent</a> “sick-sicker” models, and <a href="https://cran.r-project.org/web/packages/packDAMipd/vignettes/cycle_dependent.html">cycle dependent</a> models.</p>
<p><img src="packDAMipd.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=reconstructKM">reconstructKM</a> v0.3.0: Provides functions for reconstructing individual-level data (time, status, arm) from Kaplan-Meier curves published in academic journals. See <a href="https://www.nejm.org/doi/10.1056/NEJMc1808567">Sun et al. (2018)</a> for background and the <a href="https://cran.r-project.org/web/packages/reconstructKM/vignettes/introduction.html">vignette</a> for the reconstruction procedure.</p>
<p><img src="reconstructKM.png" height = "400" width="600"></p>
<h3 id="statistics">Statistics</h3>
<p><a href="https://cran.r-project.org/package=ceser">ceser</a> v1.0.0: Implements the Cluster Estimated Standard Errors method proposed in <a href="https://www.cambridge.org/core/journals/political-analysis/article/abs/corrected-standard-errors-with-clustered-data/F2332E494290725256181955B9BC7428">Jackson (2020)</a> to compute clustered standard errors of linear coefficients in regression models with grouped data. See the <a href="https://cran.r-project.org/web/packages/ceser/vignettes/ceser.html">vignette</a>.</p>
<p><a href="https://cran.r-project.org/package=gfilmm">gfilmm</a> v2.0.2: Implements generalized Fiducial inference for normal linear mixed models. Fiducial inference is similar to Bayesian inference in the sense that it represents the uncertainty about the parameters with a probability distribution. However, it does not require a prior. See <a href="https://projecteuclid.org/euclid.aos/1351602538">Cisewski and Hannig (2012)</a> for background and the <a href="https://cran.r-project.org/web/packages/gfilmm/vignettes/the-gfilmm-package.html">vignette</a> for examples.</p>
<p><img src="gfilmm.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=hdpGLM">hdpGLM</a> v1.0.0: Implements MCMC algorithms to estimate the Hierarchical Dirichlet Process Generalized Linear Model presented in paper <a href="https://www.cambridge.org/core/journals/political-analysis/article/abs/modeling-contextdependent-latent-effect-heterogeneity/B7B0AF067DF97A1A8F0B50646EF64F24">Ferrari (2020)</a>. See the <a href="https://cran.r-project.org/web/packages/hdpGLM/vignettes/hdpGLM.html">vignette</a> for examples.</p>
<p><img src="hdpGLM.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=latrend">latrend</a> v1.0.1: Implements a framework for clustering longitudinal datasets in a standardized way. There is a <a href="https://cran.r-project.org/web/packages/latrend/vignettes/demo.html">Demo</a> vignette and vignettes on implementing <a href="https://cran.r-project.org/web/packages/latrend/vignettes/custom.html">new models</a> and <a href="https://cran.r-project.org/web/packages/latrend/vignettes/validation.html">validating</a> cluster models.</p>
<p><img src="latrend.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=mixComp">mixComp</a> v0.1-1: Implements methods to estimate the order of mixture distributions. See the <a href="https://cran.r-project.org/web/packages/mixComp/vignettes/mixComp.html">vignette</a> for an introduction to mixture models and an extended list of references.</p>
<p><img src="mixComp.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=monoClust">monoClust</a> v1.2.0:
Implements the monothetic clustering algorithm for continuous data described in <a href="https://www.sciencedirect.com/science/article/abs/pii/S0167865598000877?via%3Dihub">Chavent (1998)</a>. See the <a href="https://cran.r-project.org/web/packages/monoClust/vignettes/monoclust.html">vignette</a>.</p>
<p><img src="monoClust.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=potential">potential</a> v0.1.0: Implements the potential model for measuring social influences described in <a href="https://science.sciencemag.org/content/93/2404/89">Stewart (1941)</a>. See the <a href="https://cran.r-project.org/web/packages/potential/vignettes/potential.html">vignette</a> for an introduction.</p>
<p><img src="potential.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=sftrack">sftrack</a> v0.5.2: Implements classes for tracking and movement data, building on <code>sf</code> spatial infrastructure, and early theoretical work from <a href="https://www.amazon.com/Quantitative-Analysis-Movement-Population-Redistribution/dp/0996139508">Turchin (1998)</a>, and <a href="https://www.sciencedirect.com/science/article/abs/pii/S1574954108000654?via%3Dihub">Calenge et al. (2009)</a>. There is an <a href="https://cran.r-project.org/web/packages/sftrack/vignettes/sftrack1_overview.html">Overview</a> along with the vignettes <a href="https://cran.r-project.org/web/packages/sftrack/index.html">Reading in an sftrack</a>, <a href="https://cran.r-project.org/web/packages/sftrack/vignettes/sftrack3_workingwith.html">Structure</a>, <a href="https://cran.r-project.org/web/packages/sftrack/vignettes/sftrack4_groups.html">Fantastic Groups</a>, and <a href="https://cran.r-project.org/web/packages/sftrack/vignettes/sftrack5_spatial.html">Getting Spatial</a>.</p>
<p><img src="sftrack.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=simrec">simrec</a> v1.0.0: Provides functions to simulate recurrent event data with a non-constant baseline hazard and possibly risk-free intervals and competing events. See <a href="https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-015-0005-2">Jahn-Eimermacher et al. (2015)</a> for background and the <a href="https://cran.r-project.org/web/packages/simrec/vignettes/simrec-vignette.html">vignette</a> for an introduction.</p>
<p><img src="simrec.png" height = "400" width="600"></p>
<h3 id="time-series">Time Series</h3>
<p><a href="https://cran.r-project.org/package=modeltime.resample">modeltime.resample</a> v0.1.0: A <code>modeltime</code> extension which implements forecast resampling tools to assess time-based model performance and stability for time series, panel data, and cross-sectional time series. There is a <a href="https://cran.r-project.org/web/packages/modeltime.resample/vignettes/getting-started.html">Getting Started</a> guide and a vignette on <a href="https://cran.r-project.org/web/packages/modeltime.resample/vignettes/panel-data.html">Resampling</a>.</p>
<p><img src="resample.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=tfarima">tfarima</a> v0.1.1: Provides tools to build customized transfer functions and ARIMA models with multiple operators and parameter restrictions. See <a href="https://www.tandfonline.com/doi/abs/10.1080/01621459.1983.10478005">Bell & Hillmer (1983)</a> and <a href="https://www.tandfonline.com/doi/abs/10.1080/01621459.1975.10480264">Box & Tiao (1975)</a> for background and the <a href="https://cran.r-project.org/web/packages/tfarima/vignettes/tfarima.pdf">vignette</a> for some theory and examples.</p>
<p><img src="tfarima.png" height = "300" width="500"></p>
<h3 id="utilities">Utilities</h3>
<p><a href="https://cran.r-project.org/package=getDTeval">getDTeval</a> v0.0.1: Provides functions to translate statements that use <code>get()</code> or <code>eval()</code> to improve run-time efficiency. See the <a href="https://cran.r-project.org/web/packages/getDTeval/vignettes/Introduction_to_getDTeval.html">vignette</a>.</p>
<p><a href="https://cran.r-project.org/package=lineup2">lineup2</a> v0.2-5: Provides tools for detecting and correcting sample mix-ups between two sets of measurements, such as between gene expression data on two tissues. There is a <a href="https://cran.r-project.org/web/packages/lineup2/vignettes/lineup2.html">vignette</a>.</p>
<p><img src="lineup2.png" height = "300" width="500"></p>
<p><a href="https://cran.r-project.org/package=sdcLog">sdcLog</a> v0.1.0: Tools for researchers to explicitly show that their results comply with rules for statistical disclosure control imposed by research data centers. The methods used are described in <a href="https://ec.europa.eu/eurostat/cros/system/files/dwb_standalone-document_output-checking-guidelines.pdf">Bond et al. (2015)</a>. There is an <a href="https://cran.r-project.org/web/packages/sdcLog/vignettes/sdcLog.html">Introduction</a> and a vignette on <a href="https://cran.r-project.org/web/packages/sdcLog/vignettes/options.html">options</a>.</p>
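<p>A minimal sketch, assuming the <code>sdc_descriptives()</code> interface described in the Introduction vignette; the dataset <code>my_data</code> and the column names are hypothetical, and the argument names are taken from that vignette without verification:</p>
<pre class="r"><code>library(sdcLog)

# Check whether descriptive statistics of a confidential variable
# satisfy the research data center's minimum-count and dominance rules
sdc_descriptives(data = my_data, id_var = "firm_id", val_var = "turnover")</code></pre>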
<h3 id="visualization">Visualization</h3>
<p><a href="https://cran.r-project.org/package=leaflet.multiopacity">leaflet.multiopacity</a> v0.1.1: Extends <code>leaflet</code> by adding a widget to control the opacity of multiple layers. There are vignettes for using the package with <a href="https://cran.r-project.org/web/packages/leaflet.multiopacity/vignettes/usage-leaflet.html">leaflet</a> and <a href="https://cran.r-project.org/web/packages/leaflet.multiopacity/vignettes/usage-leafletProxy.html">leafletProxy</a>.</p>
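<p>Usage is a single extra step in a <code>leaflet</code> pipeline; a sketch assuming the <code>addOpacityControls()</code> function documented in the package vignette (untested here, with illustrative coordinates):</p>
<pre class="r"><code>library(leaflet)
library(leaflet.multiopacity)

leaflet() %>%
  addTiles() %>%
  addCircleMarkers(lng = -71.06, lat = 42.36, layerId = "marker") %>%
  addOpacityControls(layerId = "marker")  # slider controlling this layer's opacity</code></pre>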
<p><a href="https://cran.r-project.org/package=mapboxer">mapboxer</a> v0.4.0: Provides access to <a href="https://docs.mapbox.com/mapbox-gl-js/api/">Mapbox GL JS</a>, an open source JavaScript library that uses <a href="https://get.webgl.org/">WebGL</a> to render interactive maps via the <code>htmlwidgets</code> package. Visualizations can be used from the R console, in R Markdown documents and in Shiny apps. See the <a href="https://cran.r-project.org/web/packages/mapboxer/vignettes/mapboxer.html">vignette</a> to get started.</p>
<p><img src="mapboxer.png" height = "400" width="600"></p>
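<p>A hedged sketch of a minimal interactive map, following the pattern in the Get Started vignette; the function names are taken from that vignette and the data and coordinates are illustrative:</p>
<pre class="r"><code>library(mapboxer)

# A one-point source rendered as a circle layer
data.frame(lng = -73.99, lat = 40.73) %>%
  as_mapbox_source() %>%
  mapboxer(center = c(-73.99, 40.73), zoom = 11) %>%
  add_circle_layer(circle_color = "red")</code></pre>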