R Views
https://rviews.rstudio.com/
Recent content on R Views
Hugo  gohugo.io
en-us
Tue, 13 Oct 2020 00:00:00 +0000

Help Delphi's COVIDcast Project fight the pandemic!
https://rviews.rstudio.com/2020/10/13/delphiscovidcastproject/
Tue, 13 Oct 2020 00:00:00 +0000
https://rviews.rstudio.com/2020/10/13/delphiscovidcastproject/
<p>The <a href="https://delphi.cmu.edu/">Delphi</a> epidemiological forecasting group at Carnegie Mellon University is undertaking a massive effort to develop leading indicators for COVID-19 outbreaks, and if you are an R or Python developer you can help. Delphi is working with both Facebook and Google to analyze the data from daily surveys that ask respondents if they (or people they know) are experiencing COVID-like symptoms. The responses permit Delphi to construct a <em>% CLI-in-community</em> signal at the county level across the United States that is being used to improve forecasts and inform public health officials. The <a href="https://delphi.cmu.edu/blog/2020/08/26/covid19symptomsurveysthroughfacebook/">Facebook Survey</a> reaches approximately 74,000 people each day, and at its peak, over 1.2 million people responded to the <a href="https://delphi.cmu.edu/blog/2020/09/18/covid19symptomsurveysthroughgoogle/">Google Survey</a> in a single day.</p>
<p>The aggregated data is publicly available daily through Delphi’s <a href="https://cmudelphi.github.io/delphiepidata/api/covidcast.html">COVIDcast API</a>, and fully de-identified individual survey responses are available to researchers who agree to the <a href="https://dataforgood.fb.com/docs/covid19symptomsurveyrequestfordataaccess/">data use terms</a>. Also note that the data from the Facebook survey is never seen by Facebook. The survey is advertised through Facebook but hosted on a Delphi platform.</p>
<p>You can view the <a href="https://covidcast.cmu.edu/?sensor=doctorvisitssmoothed_adj_cli&level=county&date=20201008&signalType=value&encoding=color&mode=overview&region=42003">dashboard</a> for the COVIDcast real-time COVID-19 indicators.</p>
<p><img src="delphi.png" height = "600" width="100%"></p>
<p>This is open data science at its best. It is a sophisticated project for the public good: conceived and managed by experts, informed by big data, careful about data privacy, and committed to making its work publicly available. And you can become a part of it. The COVIDcast API is easily accessible through both the R <a href="https://cmudelphi.github.io/covidcast/covidcastR/">covidcast</a> and Python <a href="https://pypi.org/project/covidcast/">covidcast</a> packages, and the Delphi team would welcome your help working through the <a href="https://github.com/cmudelphi/covidcast/issues">issues</a> on their GitHub repo. Issues are categorized as being relevant to either R or Python, and several are flagged as <em>good first issues</em>.</p>
<p>One outstanding aspect of the Delphi Group is that this team really makes an effort to communicate. They not only make their data and results available, they also try to help people understand the data science. They share their ideas, delve into the underlying statistics, present the challenges, and describe what’s working and what’s not. Delphi recently started a <a href="https://delphi.cmu.edu/blog/2020/08/26/covid19symptomsurveysthroughfacebook/">blog</a>, and I don’t think you will find a better chronicle of data science in action. My favorite post so far, <em>Can Symptom Surveys Improve COVID-19 Forecasts?</em>, can be found <a href="https://delphi.cmu.edu/blog/2020/09/21/cansymptomssurveysimprovecovid19forecasts/">here</a>.</p>
<script>window.location.href='https://rviews.rstudio.com/2020/10/13/delphiscovidcastproject/';</script>

Fake Survival Data for the Disease Progression Model
https://rviews.rstudio.com/2020/10/08/fakedatafortheillnessdeathmodel/
Thu, 08 Oct 2020 00:00:00 +0000
https://rviews.rstudio.com/2020/10/08/fakedatafortheillnessdeathmodel/
<p>In a <a href="https://rviews.rstudio.com/2020/09/09/fakedatawithr/">previous post</a>, I showed some examples of simulating fake data with a few packages that are useful for common simulation tasks, and indicated that I would be following up with a look at simulating survival data. A tremendous amount of work in survival analysis has been done in R<sup>1</sup> and it will take some time to explore what’s out there. In this first post, I am just going to jump into the ocean of ideas and see if I can fish out an interesting example.</p>
<p><a href="https://link.springer.com/article/10.2165/0001905319981304000003">Markov models</a> are commonly used in Health Care Economics to model the progression of a disease, and the efficacy and potential benefits of various treatments. One popular approach is to consider cohorts of patients who move through the three states of being <em>healthy</em> (no disease progression), <em>diseased</em> (some level of disease progression) and <em>dead</em>.</p>
<p>The following figure illustrates the process. (I will explain the labeling on the arrows below).</p>
<p><img src="/post/20201002fakedatafortheillnessdeathmodel/index_files/figurehtml/unnamedchunk21.png" width="672" /></p>
<p>These kinds of models are commonly called multi-state models in the survival literature. In the simplest case, disease progression might be modeled as a discrete-time <a href="https://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/Chapter11.pdf">Markov chain</a> where patients move from state to state according to a matrix of transition probabilities that governs how the process develops at discrete time intervals. However, for many studies, limiting transitions to discrete, uniform intervals is a little too simplistic. For example, in most cases, the exact time when a patient “progresses” from healthy to diseased is not observed. To account for this, modelers frequently consider <a href="http://u.math.biu.ac.il/~amirgi/CTMCnotes.pdf">Continuous Time Markov Chains</a>, which allow modeling the distribution of time spent in each state as well as the state-to-state transitions.</p>
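<p>To make the discrete-time case concrete, here is a minimal sketch of simulating a single patient’s path through the three states. The transition probabilities below are made up for illustration; the only structural requirements are that each row sums to 1 and that state 3 (<em>dead</em>) is absorbing:</p>
<pre class="r"><code># Illustrative transition matrix (assumed values): rows sum to 1,
# state 3 is absorbing, and there is no recovery from state 2.
P <- matrix(c(0.90, 0.08, 0.02,
              0.00, 0.85, 0.15,
              0.00, 0.00, 1.00), nrow = 3, byrow = TRUE)

set.seed(42)
state <- 1                         # start in state 1 (healthy)
path <- state
while (state != 3) {               # step until absorbed in state 3 (dead)
  state <- sample(1:3, 1, prob = P[state, ])
  path <- c(path, state)
}
path</code></pre>
<p>Because the chain can only move forward, every simulated path is a non-decreasing sequence ending in state 3.</p>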
<p>One way to define a continuous-time Markov chain is as a continuous-time process that takes values in a discrete state space and obeys the Markov property: the transition to a future state depends only on the present and not on the past.</p>
<p>A continuous-time stochastic process <span class="math inline">\(X_{t}, t \geq 0\)</span> with discrete state space S is a continuous-time Markov chain if:
<span class="math display">\[P(X_{t+s}=j \mid X_{s}=i, X_u = x_u, 0 \leq u < s) = P(X_{t+s}=j \mid X_{s}=i)\]</span> <span class="math display">\[ \forall s,t \geq 0, \: i, j, x_{u} \in S, \: 0 \leq u < s \]</span>
If the process does not depend on the particular value of <em>s</em> (the time when the process is in state <em>i</em>) then it is said to be <em>time homogeneous</em>. For a very readable account of how the definition above, along with the assumption of time homogeneity, ensures both the Markov property and that the time the process spends in the various states will be exponentially distributed, see Chapter 7 of <a href="https://www.amazon.com/IntroductionStochasticProcessesRobertDobrow/dp/1118740653/ref=sr_1_1?dchild=1&keywords=stochastic+processes+in+r&qid=1601771046&s=books&sr=11">Dobrow (2016)</a>.</p>
<p>One more bit of theory before we get to the example: unlike a discrete-time Markov chain, the development of a continuous-time process is not driven directly by a transition matrix. Instead, state transition probabilities are generated by a matrix, <em>Q</em>, that gives the instantaneous rates of going from one state to another. Transition probabilities for any time, <em>t</em>, are then calculated from <em>Q</em> using <a href="https://cran.rproject.org/web/packages/expm/index.html">matrix exponentiation</a>.</p>
<p><span class="math display">\[P(t)=e^{Qt}\]</span>
The following is the <em>Q</em> matrix for our three-state disease progression model. Notice that this is not a stochastic matrix: the rows sum to 0, not to 1. The basic idea is that the rate of flow into a state <em>i</em> is equal to the flow out of <em>i</em>. The second row has a zero in its first position because there are no transitions back to <em>healthy</em> from <em>diseased</em>, and the final row is all zeros because death is an <em>absorbing state</em>.</p>
<p><span class="math display">\[Q = \begin{pmatrix}
-(q_{12} + q_{13}) & q_{12} & q_{13} \\
0 & -q_{23} & q_{23} \\
0 & 0 & 0
\end{pmatrix} \]</span></p>
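<p>To see the relationship between <em>Q</em> and <em>P(t)</em> in code, the <a href="https://cran.rproject.org/web/packages/expm/index.html">expm</a> package linked above computes the matrix exponential directly. The rates below are assumed values chosen only for illustration:</p>
<pre class="r"><code>library(expm)  # provides expm() for the matrix exponential

# Assumed instantaneous rates, for illustration only
q12 <- 0.10; q13 <- 0.05; q23 <- 0.08
Q <- matrix(c(-(q12 + q13), q12,  q13,
               0,          -q23,  q23,
               0,            0,    0), nrow = 3, byrow = TRUE)

P5 <- expm(Q * 5)  # transition probabilities over a span of t = 5
round(P5, 3)
rowSums(P5)        # each row of P(t) sums to 1</code></pre>
<p>Note that while <em>Q</em> has rows summing to 0, each row of the resulting <em>P(t)</em> is a proper probability distribution, and the <em>dead</em> state remains absorbing for every <em>t</em>.</p>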
<p>Armed with a little bit of theory, let’s see how continuous time Markov chains can be used both to simulate survival data and also to fit a model to the fake data.</p>
<div id="generatingsimulatedsurvivaldata" class="section level3">
<h3>Generating Simulated Survival Data</h3>
<p>The following is essentially the example on page 12 of the pdf for the <a href="https://CRAN.Rproject.org/package=genSurv">genSurv</a> package<sup>2</sup> listed in the CRAN Survival Task View. This shows how to use the <code>genTHMM()</code> function to simulate data from a time-homogeneous, continuous-time Markov chain. In the code below, the <code>model.cens</code> parameter indicates that censoring is accomplished via a uniform distribution over the interval [0, <code>cens.par</code>]. A covariate is generated by a uniform distribution over the interval [0, <code>covar</code>] and enters the model through the equation:</p>
<p><span class="math display">\[q_{i,j} = \lambda_{i,j} \exp(\beta_{i,j} \cdot v)\]</span>
where <span class="math inline">\(\lambda_{i,j}\)</span> is the base rate (the <code>rate</code> parameter of the <code>genTHMM()</code> function) and <span class="math inline">\(\beta_{i,j}\)</span> are the regression coefficients (<code>beta</code> in the function). In the code below, we use the <code>covariate</code> output to create a <code>sex</code> covariate.</p>
<pre class="r"><code>library(genSurv)  # genTHMM()
library(dplyr)    # mutate(), if_else(), mutate_if()

set.seed(1234)
thmmdata <- genTHMM(n = 100, model.cens = "uniform", # censoring model
                    cens.par = 20,
                    beta = c(0.01, 0.08, 0.05),
                    covar = 1,
                    rate = c(0.1, 0.05, 0.08))
df <- thmmdata %>% mutate(sex = if_else(covariate <= 0.5, 0, 1))
df <- df %>% mutate_if(is.numeric, round, 3)
head(df, 11)</code></pre>
<pre><code>##    PTNUM   time state covariate sex
## 1      1  0.000     1     0.114   0
## 2      1  2.183     2     0.114   0
## 3      1  2.265     3     0.114   0
## 4      2  0.000     1     0.233   0
## 5      2  0.284     2     0.233   0
## 6      2  1.396     3     0.233   0
## 7      3  0.000     1     0.283   0
## 8      3  8.600     2     0.283   0
## 9      3 18.469     2     0.283   0
## 10     4  0.000     1     0.267   0
## 11     4  3.734     1     0.267   0</code></pre>
<p>For more on the theory underlying the <code>genSurv</code> package, have a look at the paper <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2692556/">Meira-Machado et al. (2009)</a>.</p>
</div>
<div id="fittingthesurvivalmodel" class="section level3">
<h3>Fitting the Survival Model</h3>
<p>The code in this section fits a continuous time, Markov chain survival model to the data generated above using the <a href="https://cran.rproject.org/package=msm"><code>msm</code></a> package<sup>3</sup> and indicates how one might go about examining the output.</p>
<p>First, let’s look at the transitions between states that occurred for the simulated patients.</p>
<pre class="r"><code>library(msm)

st <- statetable.msm(state, PTNUM, data = thmmdata)
st</code></pre>
<pre><code>##     to
## from  1  2  3
##    1 28 50 22
##    2  0 25 25</code></pre>
<p>We see, for example, that 50 patients progressed to the diseased state and 22 patients went directly from being <em>healthy</em> to <em>dead</em>. Of the patients who progressed to disease, 25 subsequently died.</p>
<p>Next, we set up the Q matrix of instantaneous transition rates described above,</p>
<pre class="r"><code>Q <- matrix(c(0, 1, 1, 0, 0, 1, 0, 0, 0), nrow = 3, byrow = TRUE)
rownames(Q) <- c("S1", "S2", "S3")
colnames(Q) <- c("S1", "S2", "S3")
Q</code></pre>
<pre><code>##    S1 S2 S3
## S1  0  1  1
## S2  0  0  1
## S3  0  0  0</code></pre>
<p>fit the model, and plot the survival curves for states <em>S1</em> and <em>S2</em> using the “old school” pre-built plot method.</p>
<pre class="r"><code>fit <- msm(state ~ time, subject = PTNUM, data = df,
           qmatrix = Q, gen.inits = TRUE, covariates = ~ sex)
plot(fit)</code></pre>
<p><img src="/post/20201002fakedatafortheillnessdeathmodel/index_files/figurehtml/unnamedchunk61.png" width="672" /></p>
<p>The default print method for the model fit shows the transition intensities along with the hazard ratios for the covariate.</p>
<pre class="r"><code>fit</code></pre>
<pre><code>## 
## Call:
## msm(formula = state ~ time, subject = PTNUM, data = df, qmatrix = Q, gen.inits = TRUE, covariates = ~sex)
## 
## Maximum likelihood estimates
## Baselines are with covariates set to their means
## 
## Transition intensities with hazard ratios for each covariate
##         Baseline                    sex
## S1 - S1 -0.28724 (-0.37477,-0.2202)
## S1 - S2  0.24020 ( 0.17891, 0.3225) 0.8992 (0.49769,1.625)
## S1 - S3  0.04703 ( 0.02057, 0.1076) 0.2106 (0.04132,1.073)
## S2 - S2 -0.09756 (-0.14312,-0.0665)
## S2 - S3  0.09756 ( 0.06650, 0.1431) 0.6558 (0.30473,1.411)
## 
## -2 * log-likelihood: 413.3
## [Note, to obtain old print format, use "printold.msm"]</code></pre>
<p>We can get the transition rates for sex = 0,</p>
<pre class="r"><code>qmatrix.msm(fit, covariates = list(sex = 0))</code></pre>
<pre><code>##    S1                          S2
## S1 -0.3583 (-0.51092,-0.25130)  0.2537 ( 0.16167, 0.39804)
## S2  0                          -0.1211 (-0.20856,-0.07037)
## S3  0                           0
##    S3
## S1 0.1046 ( 0.04986, 0.21962)
## S2 0.1211 ( 0.07037, 0.20856)
## S3 0</code></pre>
<p>and for sex = 1.</p>
<pre class="r"><code>qmatrix.msm(fit, covariates = list(sex = 1))</code></pre>
<pre><code>##    S1                            S2
## S1 -0.25013 (-0.357720,-0.17490)  0.22810 ( 0.155472, 0.33464)
## S2  0                            -0.07945 (-0.136411,-0.04627)
## S3  0                             0
##    S3
## S1 0.02204 ( 0.005169, 0.09396)
## S2 0.07945 ( 0.046269, 0.13641)
## S3 0</code></pre>
<p>and <code>msm</code> also allows us to calculate the transition function <span class="math inline">\(P(t)\)</span> for arbitrary times.</p>
<pre class="r"><code>pmatrix.msm(fit, t = 13.3)</code></pre>
<pre><code>##         S1     S2     S3
## S1 0.02192 0.3182 0.6599
## S2 0.00000 0.2732 0.7268
## S3 0.00000 0.0000 1.0000</code></pre>
<p>Finally, we look at the mean sojourn times for patients in the <em>healthy</em> and <em>diseased</em> states. Normally, for a process that can transition in and out of states, this is the average time spent in the state each time it is visited. In our model, patients only go forward through the chain (there is no getting better), so the sojourn time for S2 is essentially the average amount of time patients spent in the <em>diseased</em> state.</p>
<pre class="r"><code>sojourn.msm(fit)</code></pre>
<pre><code>##    estimates     SE     L      U
## S1     3.481 0.4725 2.668  4.542
## S2    10.251 2.0045 6.987 15.038</code></pre>
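<p>As a sanity check, for a time-homogeneous continuous-time Markov chain the mean sojourn time in a transient state <em>i</em> is <span class="math inline">\(-1/q_{ii}\)</span>. Plugging in the baseline transition intensities from the model fit printed earlier recovers the estimates reported by <code>sojourn.msm()</code>:</p>
<pre class="r"><code># Mean sojourn time in state i of a CTMC is -1 / q_ii.
# Baseline intensities copied from the printed model fit above:
q11 <- -0.28724  # S1 - S1 entry
q22 <- -0.09756  # S2 - S2 entry
c(S1 = -1 / q11, S2 = -1 / q22)  # approximately 3.48 and 10.25</code></pre>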
</div>
<div id="afewremarks" class="section level3">
<h3>A Few Remarks</h3>
<p><sup>1</sup>The work done in R on survival analysis, partially embodied in the two hundred thirty-three packages listed in the CRAN <a href="https://cran.rproject.org/web/views/Survival.html">Survival Analysis Task View</a>, constitutes a fundamental contribution to statistics. There is enough material here for a lifetime of study. Even confining oneself to a tour of the eleven packages listed in the simulation section would be a significant undertaking.</p>
<p><sup>2</sup> <code>genSurv</code> is a pretty bare-bones package, having just seven functions and little explanatory text. If it were not listed on the CRAN Task View, it would have been easy to pass it by. Nevertheless, I have only shown a small portion of what it can do.</p>
<p><sup>3</sup><code>msm</code> is an example of why R is a treasury of statistical knowledge. Not only does the package offer an impressive array of capabilities for analyzing multi-state Markov models in continuous time, the basic documentation, the package’s pdf, includes references to quite a few of the fundamental papers.</p>
</div>
<script>window.location.href='https://rviews.rstudio.com/2020/10/08/fakedatafortheillnessdeathmodel/';</script>

R Package Integration with Modern Reusable C++ Code Using Rcpp  Part 6
https://rviews.rstudio.com/2020/09/28/rpackageintegrationwithmodernreusableccodeusingrcpppart6/
Mon, 28 Sep 2020 00:00:00 +0000
https://rviews.rstudio.com/2020/09/28/rpackageintegrationwithmodernreusableccodeusingrcpppart6/
<p>In this final post of the six-part series on R package integration with modern reusable C++ code using Rcpp, we will look at providing documentation for an R package. To review, we previously covered the following topics:</p>
<ul>
<li><a href="https://rviews.rstudio.com/2020/07/08/rpackageintegrationwithmodernreusableccodeusingrcpp/">Installation and configuration</a> of an <code>Rcpp</code> package project in RStudio</li>
<li><a href="https://rviews.rstudio.com/2020/07/14/rpackageintegrationwithmodernreusableccodeusingrcpppart2/">Design considerations</a> for integrating R with reusable C++ code</li>
<li><a href="https://rviews.rstudio.com/2020/07/31/rpackageintegrationwithmodernreusableccodeusingrcpppart3/">Rcpp interface code examples</a> that are exported to R</li>
<li><a href="https://rviews.rstudio.com/2020/08/18/rpackageintegrationwithmodernreusableccodeusingrcpppart4/">Creating an Rcpp Package Project</a> in RStudio, and building and distributing an R package</li>
<li><a href="https://rviews.rstudio.com/2020/08/24/rpackageintegrationwithmodernreusableccodeusingrcpppart5/">An introductory example</a> of integrating independent and standard C++ code in an R package, with sample code provided <a href="https://github.com/QuantDevHacks/RcppBlogCode/tree/master/CodePart05/src">on GitHub</a></li>
</ul>
<p>Let’s now look at how to provide html documentation that conforms to the familiar form one finds in R packages.</p>
<div id="documentation" class="section level2">
<h2>Documentation</h2>
<p>In the discussion that follows, we will look at a quick and relatively easy way to provide documentation for each of the C++ functions that is exported as an R function in an <code>Rcpp</code>-based package. It should be noted that there are alternatives that are often preferred for packages that are submitted to CRAN for publication; for example, using the <a href="https://cran.rproject.org/web/packages/roxygen2/vignettes/roxygen2.html"><code>roxygen2</code> package</a> to generate the documentation from inline comments in the interface code. This, however, involves sophistication and complexity beyond the scope of this “getting started” series.</p>
<p>The approach we will examine here involves filling a file template for each exported function, as well as a file that provides information common to the entire package. These file templates carry an <code>Rd</code> extension, as we shall see. The corresponding html files are then generated when you build your package. The good news is that <code>RStudio</code> makes this a relatively painless process.</p>
<p>If you create an <code>Rcpp</code> package project as discussed in the previous posts in this series, you will find the following two files provided by default under the <code>man</code> subdirectory:</p>
<div style="page-break-after: always;"></div>
<div class="figure">
<img src="man_subdirectory.png" alt="" />
<p class="caption">Default documentation files</p>
</div>
<p>These files are provided by default. You ultimately won’t need the <em>hello world</em> help file, but the <code>RcppProject-package.Rd</code> will contain common information for the entire package; this latter file will be introduced shortly.</p>
<div id="documentationforexportedfunctions" class="section level3">
<h3>Documentation for Exported Functions</h3>
<p>Let’s first cover how to create a help file for an exported <code>Rcpp</code> interface function. To show this, let’s go back to the sample code provided on GitHub for <a href="https://github.com/QuantDevHacks/RcppBlogCode">Part 5</a>. Suppose you wanted to provide documentation for the <code>rAdd(.)</code> function that is exported from the <code>CppInterface.cpp</code> file. How would you do this?</p>
<p>With RStudio, the solution is quite easy. First, you would select <code>New File/Documentation...</code> from the <code>File</code> menu at the top:</p>
<p><img src="RDocFileSelect.png" alt="Select a new documentation file" /><br />
You will then see an input panel as follows. Under <em>Topic name</em>, type in the function name <code>rAdd</code>, and be sure the selection under <em>Rd template</em> is set to <code>function</code> (the default):</p>
<div class="figure">
<img src="NewRDocumentationFile.png" alt="" />
<p class="caption">New R documentation file selection</p>
</div>
<p>After clicking <code>OK</code>, you will see the <code>rAdd.Rd</code> file appear in the <code>man</code> subdirectory in the RStudio files pane:</p>
<div class="figure">
<img src="NewDocumentationFileInMan.png" alt="" />
<p class="caption">New documentation file in <code>man</code> folder</p>
</div>
<p>Next, double click on this file to open it in RStudio. Note that RStudio is smart enough to glean the proper number of arguments from the function and place this information in the file:</p>
<div class="figure">
<img src="rAddRdFile_01.png" alt="" />
<p class="caption">Top of the created <code>rAdd.Rd</code> file</p>
</div>
<p>Work down the file section by section, replacing the green comments (lines beginning with a <code>%</code> character) with the relevant information as described; the commented line <code>% Also NEED an '\alias' for EACH other topic documented here.</code> can simply be removed. The contents of each section go inside the curly brackets <code>{ }</code> following each of the items below, as demonstrated in the example that follows these descriptions.</p>
<p><code>name</code> and <code>alias</code>: Just leave these as they are, with the function name.</p>
<p><code>title</code>: Provide a brief description of what the function does.</p>
<p><code>description</code>: Provide a more detailed description of what the function does.</p>
<p><code>usage</code>: The easiest thing to do here is to leave it as is.</p>
<p><code>arguments</code>: You will notice the description of each argument <code>x</code> and <code>y</code> is nested. Just enter the type of each variable, and a short description of each if desired.</p>
<p><code>details</code>: This can be removed if no further details are required; otherwise, place additional details about the functionality here.</p>
<p><code>value</code>: Indicate the return type here.</p>
<p><code>references</code>: Place references you wish to cite here; otherwise, this can be removed if not needed.</p>
<p><code>author</code>: Provide the names of authors/contributors for this package function.</p>
<p><code>note</code> and <code>seealso</code>: Enter any additional information about the function; these may be removed if not needed.</p>
<p><code>examples</code>: Place working example(s) of your function here. This should be completely self-contained, including any input data (if needed).</p>
<p>After the examples section, remove the trailing comments at the bottom of the file.</p>
<p>A complete sample file could then be as follows. You could also use this as a blueprint for your own function documentation if you wish.</p>
<pre><code>\name{rAdd}
\alias{rAdd}
\title{Add two numbers}
\description{Adds two real numbers}
\usage{
rAdd(x, y)
}
\arguments{
\item{x}{A real number}
\item{y}{A real number}
}
\details{This is a trivial example}
\value{numeric}
\references{Courant and Hilbert, Methods of Mathematical Physics, Volumes 1 & 2}
\author{Fred Sanford}
\note{Nothing to say}
\seealso{https://rviews.rstudio.com/}
\examples{
x <- 586.3
y <- 922.6
rAdd(x, y)
}</code></pre>
<p>To generate the html help file, rebuild your project. When complete, and with the package reloaded into your current R session, look up <code>rAdd</code> in the usual help section in RStudio. The result should look like this:</p>
<div class="figure">
<img src="rAddGenDoc.png" width="400" alt="" />
<p class="caption">The generated help file for the <code>rAdd(.)</code> function</p>
</div>
<p>As a matter of best practice, you will want to follow the same instructions per the above for all of the remaining exported interface functions in your package. The remaining <code>.Rd</code> files have been uploaded to the <code>man</code> directory in the <a href="https://github.com/QuantDevHacks/RcppBlogCode/tree/master/CodePart06/man">GitHub repository for this post</a>. These help files accompany the source files that were <a href="https://github.com/QuantDevHacks/RcppBlogCode/tree/master/CodePart05/src">provided for Part 5</a>.</p>
</div>
<div id="documentationofcommonpackageinformation" class="section level3">
<h3>Documentation of Common Package Information</h3>
<p>A default file is also provided with an RStudio <code>Rcpp</code> project that is meant to contain an overall help file for the entire package; it is of the form <code>PackageName-package.Rd</code>. However, modifying this file for your particular package can sometimes be a source of build errors and package check errors, so I recommend just keeping it simple, at least until you become more accustomed to writing your own R packages. In particular, you will probably save yourself some trouble by removing the code examples section from this file and relegating examples to the individual function help files, at least to get started.</p>
<p>As an example, the <code>RcppBlogCode-package.Rd</code> file is also included in the GitHub folder for this post, as noted above. As you can see, this is short and sweet, but if applied to a realistic situation, it should be reasonably informative. The generated result is shown here:</p>
<div class="figure">
<img src="CommonPackageHelpFileHtml.png" alt="" />
<p class="caption">Generated help file containing common package information</p>
</div>
</div>
</div>
<div id="thecheckpackageutility" class="section level2">
<h2>The <code>Check Package</code> Utility</h2>
<p>After building and testing your package as discussed earlier on, there is one more task that one should run before deploying an R package, and this is the <em>Check Package</em> procedure, which can be found under the <code>Build</code> menu in RStudio:</p>
<div class="figure">
<img src="CheckPackageMenuChoice.png" alt="" />
<p class="caption">Starting the Check Package process</p>
</div>
<p>As described in the <a href="https://www.rdocumentation.org/packages/devtools/versions/1.3/topics/check">R documentation (devtools)</a>, <em>Check Package</em> “automatically builds and checks a source package, using all known best practices”. While this is more of a critical step for submitting a package to CRAN, even for internal deployment it’s a good idea to run it to be sure that your documentation is complete and the examples in your documentation run properly. If you are missing an <code>.Rd</code> file for a function exported from your package, or if an example in your documentation does not run, the check procedure will flag an error.</p>
<p>The heavy details related to checking a package for CRAN submission are located <a href="https://cran.rproject.org/doc/manuals/rrelease/Rexts.html#Checkingandbuildingpackages">on the CRAN website</a>, and of course many other sources are available on the internet. These are beyond the scope of this series, but if at some point you are contemplating a CRAN submission, it will be necessary to work through the details.</p>
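<p>For reference, RStudio’s build and check menu items are wrappers around the standard command-line tools, so the same cycle can be run outside the IDE. A sketch (the package name and version below are hypothetical):</p>
<pre><code># Run from the directory containing the package folder
R CMD build RcppProject                # produces a source tarball, e.g. RcppProject_1.0.tar.gz
R CMD check RcppProject_1.0.tar.gz     # runs the checks behind RStudio's Check Package</code></pre>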
</div>
<div id="summary" class="section level2">
<h2>Summary</h2>
<p>This concludes our series on integrating independent, standard, and reusable C++ code in an R package using RStudio and the <code>Rcpp</code> package. Hopefully, this will help you avoid some of the frustrations one can encounter from the proverbial fire hose of ubiquitous information that often obscures the essentials at the outset. Happy packaging!</p>
</div>
<script>window.location.href='https://rviews.rstudio.com/2020/09/28/rpackageintegrationwithmodernreusableccodeusingrcpppart6/';</script>

August 2020: "Top 40" New CRAN Packages
https://rviews.rstudio.com/2020/09/22/august2020top40newcranpackages/
Tue, 22 Sep 2020 00:00:00 +0000
https://rviews.rstudio.com/2020/09/22/august2020top40newcranpackages/
<p>One hundred forty-six new packages stuck to CRAN in August. Below are my “Top 40” picks in eleven categories: Computational Methods, Data, Genomics, Insurance, Machine Learning, Mathematics, Medicine, Statistics, Time Series, Utilities and Visualization.</p>
<h3 id="computationalmethods">Computational Methods</h3>
<p><a href="https://CRAN.Rproject.org/package=dpseg">dpseg</a> v0.1.1: Implements an algorithm for piecewise linear segmentation of ordered data by a dynamic programming algorithm. See the <a href="https://cran.rproject.org/web/packages/dpseg/vignettes/dpseg.html">vignette</a>.</p>
<p><img src="dpseg.gif" height = "300" width="500"></p>
<p><a href="https://cran.rproject.org/package=qpmadr">qpmadr</a> v0.1.0: Implements the method outlined in <a href="https://link.springer.com/article/10.1007%2FBF02591962">Goldfarb & Idnani (1983)</a> for solving quadratic programming problems with linear inequality, equality, and box constraints.</p>
<p><a href="https://cran.rproject.org/package=WoodburyMatrix">WoodburyMatrix</a> v0.0.1: Implements a hierarchy of classes and methods for manipulating matrices formed implicitly from the sums of the inverses of other matrices, a situation commonly encountered in spatial statistics and related fields. See the <a href="https://cran.rproject.org/package=WoodburyMatrix">vignette</a> for details.</p>
<h3 id="data">Data</h3>
<p><a href="https://cran.rproject.org/package=neonstore">neonstore</a> v0.2.2: Provides access to numerous National Ecological Observatory Network (<a href="https://data.neonscience.org/">NEON</a>) data sets through its <a href="https://data.neonscience.org/dataapi/">API</a>.</p>
<p><a href="https://cran.rproject.org/package=pdxTrees">pdxTrees</a> v0.4.0: A collection of datasets from Portland Parks and Recreation, which inventoried every tree in over one hundred seventy parks and along the streets in ninety-six neighborhoods. See the <a href="https://cran.rproject.org/web/packages/pdxTrees/vignettes/pdxTreesvignette.html">vignette</a>.</p>
<p><img src="pdxTrees.gif" height = "300" width="500"></p>
<h3 id="genomics">Genomics</h3>
<p><a href="https://cran.rproject.org/package=hiphop">hiphop</a> v0.0.1: Implements a method to compare the genotypes of offspring with any combination of potential parents, and scores the number of mismatches of these individuals at biallelic genetic markers that can be used for paternity and maternity assignment. See <a href="https://onlinelibrary.wiley.com/doi/full/10.1111/17550998.12665">Huisman (2017)</a> for background, and the <a href="https://cran.rproject.org/web/packages/hiphop/vignettes/intro_hiphop.html">vignette</a> for an introduction.</p>
<p><a href="https://cran.rproject.org/package=RapidoPGS">RapidoPGS</a> v1.0.2: Provides functions to quickly compute polygenic scores from GWAS summary statistics of either casecontrol or quantitative traits without LD matrix computation or parameter tuning. See <a href="https://www.biorxiv.org/content/10.1101/2020.07.24.220392v1">Reales et al. (2020)</a> for details and the <a href="https://cran.rproject.org/web/packages/RapidoPGS/vignettes/Computing_RapidoPGS.html">vignette</a> for examples.</p>
<h3 id="insurance">Insurance</h3>
<p><a href="https://cran.rproject.org/package=SynthETIC">SynthETIC</a> v0.1.0: Implements an individual claims simulator which generates synthetic data that emulates various features of nonlife insurance claims. Refer to <a href="https://arxiv.org/abs/2008.05693">Avanzi et al. (2020)</a> for background and see the <a href="https://cran.rproject.org/web/packages/SynthETIC/vignettes/SynthETICdemo.html">vignette</a> for examples.</p>
<p><img src="SynthETIC.png" height = "300" width="500"></p>
<h3 id="machinelearning">Machine Learning</h3>
<p><a href="https://cran.rproject.org/package=sparklyr.flint">sparklyr.flint</a> v0.1.1: Extends <code>sparklyr</code> to include <a href="https://github.com/twosigma/flint">Flint</a> time series functionality. Vignettes include <a href="https://cran.rproject.org/web/packages/sparklyr.flint/vignettes/importingtimeseriesdata.html">Importing Data</a> and <a href="https://cran.rproject.org/web/packages/sparklyr.flint/vignettes/workingwithtimeseriesrdd.html">Time Series RDD</a>.</p>
<p><a href="https://cran.rproject.org/package=torch">torch</a> v0.0.3: Provides functionality to define and train neural networks similar to <code>PyTorch</code> by <a href="https://arxiv.org/abs/1912.01703">Paszke et al (2019)</a> but written entirely in R. There are vignettes on <a href="https://cran.rproject.org/web/packages/torch/vignettes/extendingautograd.html">Extending Autograd</a>, <a href="https://cran.rproject.org/web/packages/torch/vignettes/indexing.html">Indexing tensors</a>, <a href="https://cran.rproject.org/web/packages/torch/vignettes/loadingdata.html">Loading data</a>, <a href="https://cran.rproject.org/web/packages/torch/vignettes/tensorcreation.html">Creating tensors</a>, and <a href="https://cran.rproject.org/web/packages/torch/vignettes/usingautograd.html">Using autograd</a>.</p>
<h3 id="mathematics">Mathematics</h3>
<p><a href="https://cran.rproject.org/package=gasper">gasper</a> v1.0.1: Provides the standard operations for signal processing on graphs including graph Fourier transform, spectral graph wavelet transform, visualization tools. See <a href="https://arxiv.org/abs/1906.01882">De Loynes et al. (2019)</a> for background and the package <a href="https://cran.rproject.org/web/packages/gasper/vignettes/gasper_vignette.pdf">vignette</a>.</p>
<p><img src="gasper.png" height = "300" width="500"></p>
<p><a href="https://cran.rproject.org/package=GeodRegr">GeodRegr</a> v0.1.0: Provides a gradient descent algorithm to find a geodesic relationship between realvalued independent variables and a manifoldvalued dependent variable (i.e. geodesic regression). Available manifolds are Euclidean space, the sphere, and Kendall’s 2dimensional shape space. See <a href="https://arxiv.org/abs/2007.04518">Shin & Oh (2020)</a>, <a href="https://link.springer.com/article/10.1007%2Fs112630120591y">Fletcher (2013)</a>, <a href="https://ieeexplore.ieee.org/document/6909742">Kim et al. (2104)</a>] for background.</p>
<p><a href="https://CRAN.Rproject.org/package=geos">geos</a> v0.0.1: Provides an R API to the Open Source Geometry Engine <a href="https://trac.osgeo.org/geos/">GEOS</a> library and a vector format with which to efficiently store <code>GEOS</code> geometries. See <a href="https://cran.rproject.org/web/packages/geos/readme/README.html">README</a> for an example.</p>
<p><a href="https://cran.rproject.org/package=pcSteiner">pcSteiner</a> v1.0.0: Provides functions for obtaining an approximate solution to the prize winning <em>Steiner Tree problem</em> that seeks a subgraph connecting a given set of vertices with the most expensive nodes and least expensive edges. This implementation uses a loopy belief propagation algorithm. There is a <a href="https://cran.rproject.org/web/packages/pcSteiner/vignettes/tutorial.pdf">Tutorial</a>.</p>
<p><a href="https://cran.rproject.org/package=TCIU">TCIU</a> v1.1.0: Provides the core functionality to transform longitudinal data to complextime (kime) data using analytic and numerical techniques, visualize the original timeseries and reconstructed kimesurfaces, perform model based (e.g., tensorlinear regression) and modelfree classification and clustering methods. See <a href="https://www.degruyter.com/view/title/576646">Dinov & Velev (2021)</a> for background. There are vignettes on <a href="https://cran.rproject.org/web/packages/TCIU/vignettes/tciuLTkimesurface.html">Laplace Transform and Kime Surface Transforms</a> and <a href="https://cran.rproject.org/web/packages/TCIU/vignettes/tciufMRIanalytics.html">Workflows of TCIU Analytics</a>.</p>
<p><img src="TCIU.png" height = "300" width="500"></p>
<h3 id="medicine">Medicine</h3>
<p><a href="https://cran.rproject.org/package=epigraphdb">epigraphdb</a> v0.2.1: Provides access to the <a href="https://epigraphdb.org">EpiGraphDB</a> platform. There is an <a href="https://cran.rproject.org/web/packages/epigraphdb/vignettes/usingepigraphdbrpackage.html">overview</a>, vignettes on the <a href="https://cran.rproject.org/web/packages/epigraphdb/vignettes/usingepigraphdbapi.html">API</a>, <a href="https://cran.rproject.org/web/packages/epigraphdb/vignettes/metafunctionalities.html">Platform Functionality</a>, <a href="https://cran.rproject.org/web/packages/epigraphdb/vignettes/metafunctionalities.html">Meta Functions</a> and three case studies on <a href="https://cran.rproject.org/web/packages/epigraphdb/vignettes/case1pleiotropy.html">SNP protein associations</a>, <a href="https://cran.rproject.org/web/packages/epigraphdb/vignettes/case2altdrugtarget.html">Drug Targets</a> and <a href="https://cran.rproject.org/web/packages/epigraphdb/vignettes/case3literaturetriangulation.html">Causal Evidence</a>.</p>
<p><a href="https://cran.rproject.org/package=raveio">raveio</a> v0.0.3: implements an interface to the <a href="https://openwetware.org/wiki/RAVE">RAVE</a> (R analysis and visualization of human intracranial electroencephalography data) project which aims at analyzing brain recordings from patients with electrodes placed on the cortical surface or inserted into the brain. See <a href="https://www.biorxiv.org/content/10.1101/2020.06.02.129676v1">Mafnotti et al. (2020)</a> for background.</p>
<p><a href="https://cran.rproject.org/package=tboot">tboot</a> v0.2.0: Provides functions to simulate clinical trial data with realistic correlation structures and assumed efficacy levels by using a tilted bootstrap resampling approach. There is a tutorial on <a href="https://cran.rproject.org/web/packages/tboot/vignettes/tboot.html">The Tilted Bootstrap</a> and another on <a href="https://cran.rproject.org/web/packages/tboot/vignettes/tboot_bmr.html">Bayesian Marginal Reconstruction</a>.</p>
<p><img src="tboot.png" height = "300" width="400"></p>
<h3 id="statistics">Statistics</h3>
<p><a href="https://cran.rproject.org/web/packages/BayesMRA/index.html">BayesMRA</a> v1.0.0: Fits sparse Bayesian multiresolution spatial models using Markov Chain Monte Carlo. See the <a href="https://cran.rproject.org/web/packages/BayesMRA/vignettes/mrasimulation.html">vignette</a>.</p>
<p><img src="BayesMRA.png" height = "400" width="600"></p>
<p><a href="https://cran.rproject.org/package=bsem">bsem</a> v1.0.0: Implements functions to allow structural equation modeling for particular cases using <code>rstan</code> that includes Bayesian semiconfirmatory factor analysis, confirmatory factor analysis, and structural equation models. See <a href="https://projecteuclid.org/euclid.aoas/1372338468">Mayrink (2013)</a> for background and the vignettes: <a href="https://cran.rproject.org/web/packages/bsem/vignettes/bsem.html">Get Started</a> and <a href="https://cran.rproject.org/web/packages/bsem/vignettes/exploringbsemclass.html">Exploring bsem class</a>.</p>
<p><img src="bsem.png" height = "400" width="600"></p>
<p><a href="https://cran.rproject.org/package=cyclomort">cyclomort</a> v1.0.2: Provides functions to do survival modeling with a periodic hazard function. See <a href="https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041210X.13305"> Gurarie et al. (2020)</a> and the <a href="https://cran.rproject.org/web/packages/cyclomort/vignettes/cyclomort.html">vignette</a> for details.</p>
<p><img src="cyclemort.png" height = "400" width="600"></p>
<p><a href="https://cran.rproject.org/package=ebmstate">ebmstate</a> v0.1.1: Implements an empirical Bayes, multistate Cox model for survival analysis. See <a href="https://academic.oup.com/biomet/articleabstract/78/4/719/288160?redirectedFrom=fulltext">Schall (1991)</a> for details.</p>
<p><a href="https://cran.rproject.org/package=fairmodels">fairmodels</a> v0.2.2: Provides functions to measure fairness for multiple models including measuring a model’s bias towards different races, sex, nationalities etc. There are <a href="https://cran.rproject.org/web/packages/fairmodels/vignettes/Basic_tutorial.html">Basic</a> and <a href="https://cran.rproject.org/web/packages/fairmodels/vignettes/Advanced_tutorial.html">Advanced</a> tutorials.</p>
<p><img src="fairmodels.png" height = "400" width="600"></p>
<p><a href="https://cran.rproject.org/package=MGMM">MGMM</a> v0.3.1: Implements clustering of multivariate normal random vectors with missing elements. Clustering is achieved by fitting a Gaussian Mixture Model (GMM). See <a href="https://www.biorxiv.org/content/10.1101/2019.12.20.884551v1">McCaw et al. (2019)</a> for details, and the <a href="https://cran.rproject.org/web/packages/MGMM/vignettes/MGMM.pdf">vignette</a> for examples.</p>
<p><a href="https://cran.rproject.org/package=rmsb">rmsb</a> v0.0.1: Is a Bayesian companion to the <code>rms</code> package which provides Bayesian model fitting, postfit estimation, and graphics, and implements Bayesian regression models whose fit objects can be processed by <code>rms</code> functions. Look <a href="https://hbiostat.org/R/rmsb/">here</a> for more information.</p>
<p><img src="rmsb.png" height = "300" width="500"></p>
<p><a href="https://cran.rproject.org/package=RoBMA">RoBMA</a> v1.0.4: Implements a framework for estimating ensembles of metaanalytic models (assuming either presence or absence of the effect, heterogeneity, and publication bias) and uses Bayesian model averaging to combine them. See <a href="https://psyarxiv.com/u4cns/">Maier et al. (2020)</a> for background and the vignettes: <a href="https://cran.rproject.org/web/packages/RoBMA/vignettes/CustomEnsembles.html">Fitting custom metaanalytic ensembles</a>,
<a href="https://cran.rproject.org/web/packages/RoBMA/vignettes/ReproducingBMA.html">Reproducing BMA</a>, and <a href="https://cran.rproject.org/web/packages/RoBMA/vignettes/WarningsAndErrors.html">Common warnings and errors</a>.</p>
<p><img src="RoBMA.png" height = "300" width="500"></p>
<p><a href="https://cran.rproject.org/package=tTOlr">tTOlr</a> v0.2: Implements likelihood ratio statistics for one and two sample ttests. There are two vignettes: Likelihood <a href="https://cran.rproject.org/web/packages/tTOlr/vignettes/fprVSp.html">Ratio and False Positive Risk</a> and <a href="https://cran.rproject.org/web/packages/tTOlr/vignettes/lookDeeper.html">Pvalues – Uses, abuses, and alternatives</a>.</p>
<p><img src="tTOlr.png" height = "400" width="600"></p>
<h3 id="timeseries">Time Series</h3>
<p><a href="https://cran.rproject.org/package=fable.prophet">fable.prophet</a> v0.1.0: Enables <a href="https://CRAN.Rproject.org/package=prophet">prophet</a> models to be used in tidyworkflows created with <a href="https://cran.rproject.org/package=fabletools">fabletools</a>. See the <a href="https://cran.rproject.org/web/packages/fable.prophet/vignettes/intro.html">vignette</a> for an introduction.</p>
<p><img src="fable.png" height = "400" width="600"></p>
<p><a href="https://cran.rproject.org/package=garma">garma</a> v0.9.3: Provides methods for estimating long memoryseasonal/cyclical Gegenbauer univariate time series processes. See <a href="https://projecteuclid.org/euclid.ss/1534147230">Dissanayake et al. (2018)</a> for background and the <a href="https://cran.rproject.org/web/packages/garma/vignettes/introduction.pdf">vignette</a> for the details of model fitting.</p>
<p><img src="garma.png" height = "300" width="500"></p>
<p><a href="https://cran.rproject.org/package=gratis">gratis</a> v0.2.0: Generates time series based on mixture autoregressive models. See <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/sam.11461">Kang et al. (2020)</a> for background and the <a href="https://cran.rproject.org/web/packages/gratis/vignettes/QuickStart.html">vignette</a> for an introduction to the package.</p>
<p><a href="https://cran.rproject.org/package=rhosa">rhosa</a> v0.1.0: Implements higherorder spectra or polyspectra analysis for time series. <a href="https://www.sciencedirect.com/science/article/abs/pii/S016516849700217X?via%3Dihub">Brillinger & Irizarry (1998)</a> and <a href="https://dl.acm.org/doi/10.1145/355958.355961">Lii & Helland (1981)</a> for background and the <a href="https://cran.rproject.org/web/packages/rhosa/vignettes/quadratic_phase_coupling.html">vignette</a> for examples.</p>
<p><img src="rhosa.png" height = "400" width="600"></p>
<h3 id="utilities">Utilities</h3>
<p><a href="https://cran.rproject.org/package=DataEditR">DataEditR</a> v0.0.5: Implements an interactive editor to allow the interactive viewing, entering and editing of data in R. See the <a href="https://cran.rproject.org/web/packages/DataEditR/vignettes/DataEditR.html">vignette</a> for details.</p>
<p><img src="DataEditR.gif" height = "400" width="600"></p>
<p><a href="https://cran.rproject.org/package=equatiomatic">equatiomatic</a> v0.1.0: Simplifies writing <code>LaReX</code> formulas by providing a function that takes a fitted model object as its input and returns the corresponding <code>LaTeX</code> code for the model. There is an <a href="https://cran.rproject.org/web/packages/equatiomatic/vignettes/introequatiomatic.html">Introduction</a> and a vignette on <a href="https://cran.rproject.org/web/packages/equatiomatic/vignettes/tests_and_coverage.html">Tests and Coverage</a></p>
<p><img src="equatiomatic.png" height = "400" width="600"></p>
<p><a href="https://cran.rproject.org/package=starschemar">starschemar</a> v1.1.0: Provides functions to obtain star schema from flat tables. The <a href="https://cran.rproject.org/web/packages/starschemar/vignettes/starschemar.html">vignette</a> shows multiple examples.</p>
<h3 id="visualization">Visualization</h3>
<p><a href="https://cran.rproject.org/package=glow">glow</a> v0.10.1: Provides a framework for creating plots with glowing points. See the <a href="https://cran.rproject.org/web/packages/glow/vignettes/vignette.html">vignette</a> for examples.</p>
<p><img src="glow.png" height = "400" width="600"></p>
<p><a href="https://cran.rproject.org/package=graph3d">graph3d</a> v0.1.0 Implements a wrapper for the <code>JavaScript</code> library <code>visgraph</code> that enables users to create three dimensional interactive visualizations. Look <a href="https://github.com/stla/graph3d">here</a> for an example.</p>
<p><img src="graph3d.gif" height = "400" width="600"></p>
<p><a href="https://cran.rproject.org/package=jsTreeR">jsTreeR</a> v0.1.0: Provides functions to implement interactive trees for representing hierarchical data that can be included in <code>Shiny</code> apps and R markdown documents. Look <a href="https://github.com/stla/jsTreeR">here</a> for examples.</p>
<p><img src="jsTreeR.gif" height = "400" width="600"></p>
<p><a href="https://cran.rproject.org/package=KMunicate">KMunicate</a> v0.1.0: Provides functions to produce Kaplan–Meier plots in the style recommended following the KMunicate study by <a href="https://bmjopen.bmj.com/content/9/9/e030215">Morris et al. (2019)</a>. See the <a href="https://cran.rproject.org/web/packages/KMunicate/vignettes/KMunicate.html">vignette</a> for examples.</p>
<p><img src="KMunicate.png" height = "400" width="600"></p>
<p><a href="https://cran.rproject.org/package=rAmCharts4">rAmCharts4</a> v0.1.0: Provides functions to create <code>JavaScript</code> charts that can be included in <code>Shiny</code> apps and R Markdown documents, or viewed from the R console and RStudio viewer. Look <a href="https://github.com/stla/rAmCharts4">here</a> for examples.</p>
<p><img src="rAmCharts4.gif" height = "400" width="600"></p>
<p><a href="https://cran.rproject.org/package=tabularmaps">tabularmaps</a> v0.1.0: Provides functions for creating <em>tabular maps</em>, a visualization method for efficiently displaying data consisting of multiple elements by tiling them. When dealing with geospatial data, they corrects for differences in visibility between areas. Look <a href="https://github.com/uribo/tabularmaps">here</a> and at the <a href="https://cran.rproject.org/web/packages/tabularmaps/vignettes/tabularmaps01.html">vignette</a> for examples.</p>
<p><img src="tabularmaps.png" height = "400" width="600"></p>

Some Thoughts on R / Medicine 2020
https://rviews.rstudio.com/2020/09/16/somethoughtsonrmedicine2020/
Wed, 16 Sep 2020 00:00:00 +0000
https://rviews.rstudio.com/2020/09/16/somethoughtsonrmedicine2020/
<script src="/rmarkdownlibs/htmlwidgets/htmlwidgets.js"></script>
<script src="/rmarkdownlibs/jquery/jquery.min.js"></script>
<link href="/rmarkdownlibs/leaflet/leaflet.css" rel="stylesheet" />
<script src="/rmarkdownlibs/leaflet/leaflet.js"></script>
<link href="/rmarkdownlibs/leafletfix/leafletfix.css" rel="stylesheet" />
<script src="/rmarkdownlibs/Proj4Leaflet/proj4compressed.js"></script>
<script src="/rmarkdownlibs/Proj4Leaflet/proj4leaflet.js"></script>
<link href="/rmarkdownlibs/rstudio_leaflet/rstudio_leaflet.css" rel="stylesheet" />
<script src="/rmarkdownlibs/leafletbinding/leaflet.js"></script>
<p>The third annual <a href="https://events.linuxfoundation.org/rmedicine/">R / Medicine Conference</a> was held online this year from August 27th to August 29th and was an unqualified success. The last-minute pivot from a small, in-person conference, which was to be held on-site at the Children’s Hospital of Philadelphia, <a href="https://www.chop.edu/">CHOP</a>, to a virtual event turned out to be a catalyst for positive change. Under the imaginative and tireless leadership of conference chair <a href="https://www.chop.edu/doctors/kadaukestephan">Dr. Stephan Kadauke</a>, R / Medicine grew from a small R conference to become a medical conference with international reach. The map below shows that the conference attracted attendees from forty-three countries. (Click on the markers to see country and number of attendees.)</p>
<div id="htmlwidget1" style="width:672px;height:480px;" class="leaflet htmlwidget"></div>
<script type="application/json" datafor="htmlwidget1">{"x":{"options":{"crs":{"crsClass":"L.CRS.EPSG3857","code":null,"proj4def":null,"projectedBounds":null,"options":{}}},"calls":[{"method":"addTiles","args":["//{s}.tile.openstreetmap.org/{z}/{x}/{y}.png",null,null,{"minZoom":0,"maxZoom":18,"tileSize":256,"subdomains":"abc","errorTileUrl":"","tms":false,"noWrap":false,"zoomOffset":0,"zoomReverse":false,"opacity":1,"zIndex":1,"detectRetina":false,"attribution":"© <a href=\"http://openstreetmap.org\">OpenStreetMap<\/a> contributors, <a href=\"http://creativecommons.org/licenses/bysa/2.0/\">CCBYSA<\/a>"}]},{"method":"addMarkers","args":[[28.0000272,24.7761086,47.2000338,50.6402809,10.3333333,12.0753083,61.0666922,31.7613365,35.000074,2.8894434,49.8167003,55.670249,63.2467777,46.603354,51.0834196,null,47.1817585,22.3511148,52.865196,42.6384261,36.5748441,1.4419683,56.8406494,4.5693754,22.5000485,52.5001698,9.6000359,60.5000209,52.215933,40.0332629,64.6863136,25.6242618,null,1.357107,36.638392,7.8699431,39.3262345,59.6749712,46.7985624,10.8677845,38.9597594,54.7023545,39.7837304],[2.9999825,134.755,13.199959,4.6667145,53.2,1.6880314,107.9917071,71.3187697,104.999927,73.783892,15.4749544,10.3333283,25.9209164,1.8883335,10.4234469,null,19.5060937,78.6677428,7.9794599,12.674297,139.2394179,38.4313975,24.7537645,102.2656823,100.0000375,5.7480821,7.9999721,9.0999715,19.134422,7.8896263,97.7453061,42.3528328,null,103.8194992,127.6961188,29.6667897,4.8380649,14.5208584,8.2319736,60.9821067,34.9249653,3.2765753,100.4458825],null,null,null,{"interactive":true,"draggable":false,"keyboard":true,"title":"","alt":"","zIndexOffset":0,"opacity":1,"riseOnHover":false,"riseOffset":250},null,null,null,null,["Algeria: 1","Australia: 12","Austria: 2","Belgium: 1","Brazil: 4","Burkina Faso: 1","Canada: 27","Chile: 4","China: 2","Colombia: 5","Czech Republic: 1","Denmark: 1","Finland: 3","France: 7","Germany: 4","Hong Kong (China): 2","Hungary: 2","India: 7","Ireland: 4","Italy: 10","Japan: 
2","Kenya: 1","Latvia: 1","Malaysia: 1","Mexico: 5","Netherlands: 2","Nigeria: 1","Norway: 6","Poland: 7","Portugal: 3","Russian Federation: 1","Saudi Arabia: 1","Scotland: 1","Singapore: 3","South Korea: 1","South Sudan: 1","Spain: 7","Sweden: 4","Switzerland: 6","Trinidad and Tobago: 2","Turkey: 1","United Kingdom: 30","USA: 399"],{"interactive":false,"permanent":false,"direction":"auto","opacity":1,"offset":[0,0],"textsize":"10px","textOnly":false,"className":"","sticky":true},null]}],"limits":{"lat":[31.7613365,64.6863136],"lng":[107.9917071,139.2394179]}},"evals":[],"jsHooks":[]}</script>
<p>Four hundred fifty-two people attended the live event, and four hundred thirty-one watched the replay. Equally impressive is that approximately sixteen percent of conference registrants supplied titles suggesting that they are medical doctors. This figure does not include medical students, nurses, or ancillary clinical staff.</p>
<p>Much of the success in attracting the target audience this year was very likely attributable to the effort that the organizers made to reach out to sponsors familiar to clinicians. In addition to the R Consortium, R Consortium member companies ProCogia and RStudio, and the Yale School of Public Health (the conference host for its first two years), this year R / Medicine attracted the Children’s Hospital of Philadelphia; the American Association for Clinical Chemistry, <a href="https://www.aacc.org/">AACC</a>; and the Association for Mass Spectrometry & Advances in the Clinical Lab, <a href="https://www.msacl.org/">MSACL</a>. The panel discussion <a href="https://vimeo.com/435365958">Integrating R into Clinical Practice</a>, held during the virtual US MSACL 2020 Conference in July, was particularly effective in spreading the word among clinicians.</p>
<p>Another key element of the success of R / Medicine 2020 was that the <a href="https://events.linuxfoundation.org/rmedicine/program/committeemembers/">Program Committee</a> chaired by <a href="https://www.mayo.edu/research/faculty/atkinsonelizabethbethjms/bio00083520">Beth Atkinson</a> looked beyond the usual cadre of R luminaries and assembled a roster of speakers with deep experience in medically related applications. Keynote talks from <a href="https://www.youtube.com/watch?v=vEoMxWYgIo&list=PL4IzsxWztPdljYo7uE5G_R2PtYw3fUReo">Daniela Witten</a>, <a href="https://www.youtube.com/watch?v=MWDxEnlZFws&list=PL4IzsxWztPdljYo7uE5G_R2PtYw3fUReo&index=5">Robert Gentleman</a>, <a href="https://www.youtube.com/watch?v=D0rWe8bW5ss&list=PL4IzsxWztPdljYo7uE5G_R2PtYw3fUReo&index=18">Ewen Harrison</a>, and <a href="https://www.youtube.com/watch?v=kymTD3BsQpg&list=PL4IzsxWztPdljYo7uE5G_R2PtYw3fUReo&index=29">Patrick Mathias</a> ranged from a sophisticated analysis of a common mistake in predictive modeling and a glimpse into the future of computational genetics to recent work related to the COVID-19 pandemic. Pre-conference workshops were delivered by Stephan Kadauke (<em>Intro to R for Clinicians</em>) and Alison Hill (<em>Intro to Machine Learning with Tidymodels</em>). Other talks covered R education, reproducible research, operational clinical workflows, clinical reporting, and statistical analyses. My favorite talk title, <em>The MD in .rmd: Teaching Clinical Data Analytics with R</em> by Ted Laderas, perfectly captured the spirit of R / Medicine 2020.</p>
<p>All of the keynote and regular talks are available on the <a href="https://www.youtube.com/channel/UC_R5smHVXRYGhZYDJsnXTwg/playlists">R Consortium’s Youtube Channel</a>. Click on “PLAYLISTS” and select the first on the left.</p>
<p>The big takeaways from R / Medicine 2020 are that <strong>R is an established tool in clinical applications. Doctors are teaching doctors about R. And, as knowledge about R propagates, R use in clinical workflows is increasing.</strong></p>
<p>Be sure to mark your 2021 calendar with a reminder about R / Medicine sometime in late August. Also note that sixty-eight percent of the folks who filled out the post-conference survey indicated that they would be interested in a hybrid event next year, even if progress in mitigating the risks of COVID-19 permits an in-person event. So the R / Medicine team is going to be thinking hard about how to keep its international following. Moreover, the team hopes to be able to offer R / Medicine branded events throughout the year. Please watch the <a href="https://www.rconsortium.org/news/blog">R Consortium Blog</a> for news and updates.</p>

Fake Data with R
https://rviews.rstudio.com/2020/09/09/fakedatawithr/
Wed, 09 Sep 2020 00:00:00 +0000
https://rviews.rstudio.com/2020/09/09/fakedatawithr/
<script src="/rmarkdownlibs/htmlwidgets/htmlwidgets.js"></script>
<script src="/rmarkdownlibs/jquery/jquery.min.js"></script>
<link href="/rmarkdownlibs/leaflet/leaflet.css" rel="stylesheet" />
<script src="/rmarkdownlibs/leaflet/leaflet.js"></script>
<link href="/rmarkdownlibs/leafletfix/leafletfix.css" rel="stylesheet" />
<script src="/rmarkdownlibs/Proj4Leaflet/proj4compressed.js"></script>
<script src="/rmarkdownlibs/Proj4Leaflet/proj4leaflet.js"></script>
<link href="/rmarkdownlibs/rstudio_leaflet/rstudio_leaflet.css" rel="stylesheet" />
<script src="/rmarkdownlibs/leafletbinding/leaflet.js"></script>
<script src="/rmarkdownlibs/leafletproviders/leafletproviders_1.9.0.js"></script>
<script src="/rmarkdownlibs/leafletprovidersplugin/leafletprovidersplugin.js"></script>
<p>Simulation is the foundation of computational statistics and a fundamental organizing principle of the R language. For example, few complex tasks are more compactly expressed in any programming language than <code>rnorm(100)</code>. But while many simulation tasks are trivial in R, simulating adequate and convincing synthetic or “fake” data is a task whose cognitive demands increase quickly as complexity moves beyond independent random draws from named probability distributions. In this post, I would like to highlight a few of the many R packages that are useful for simulating data.</p>
<p>Before we begin, let’s just list a few reasons why you may want to make fake data. Having some of these in mind should be useful for exploring the tools provided in the R packages listed below. Bear in mind that no one package is going to cover everything. But before simulating data on your own, it may be helpful to see where others thought it was worthwhile to make the effort to write a package.</p>
<div id="whysimulatedata" class="section level3">
<h3>Why simulate data?</h3>
<ul>
<li>Simulation is part of exploratory data analysis. For example, if you have data that comes from some kind of arrival process, it may be helpful to simulate Poisson arrivals to see if counts and arrival-time distributions match your data set. Having a generative model might really simplify your analysis, and knowing that you don’t have one may be critical too.</li>
<li>Carefully constructed fake data may be helpful in testing the limits of an algorithm or analysis. Everybody wants to publish an algorithm that really seems to explain some data set. Not too many people show you the edge conditions where their algorithm breaks down.</li>
<li>Making fake data may be a good way to begin a project before you even have any data. There is no better way to expose your assumptions than to write them down in code.</li>
<li>Fake data can be helpful when there are privacy concerns. Building a data set that reproduces some of the statistical properties of the real data while passing the eyeball test for being convincing may make all the difference in communicating your ideas.</li>
</ul>
</div>
<div id="rpackagesforsimulatingdata" class="section level3">
<h3>R Packages for Simulating Data</h3>
<p>Here are a few R packages that ought to be helpful in nearly every project where you need to manufacture fake data.</p>
<p><a href="https://cran.rproject.org/web/packages/bindata/index.html">bindata</a> is an “oldie but goodie” from R Core members Friedrich Leisch. Andreas Weingessel, and Kurt Hornik that goes back to 1999 and still gets about twentyfive downloads a day. The package does one really nice trick: it provides multiple ways to create columns of binary random variables that have a prespecified correlation structure.</p>
<p>Suppose you want to create a matrix whose columns are draws from binary random variables. You can begin by setting up a matrix, M, of “common probabilities”. In the simple example below <code>M[1,1]</code> specifies the probability that the random variable <code>V1</code> will be a 1, while <code>M[2,2]</code> does the same for variable <code>V2</code>. The off-diagonal elements specify the joint probabilities that <code>V1</code> and <code>V2</code> are both one.</p>
<pre class="r"><code>M < matrix(c(.5, .2, .2, .5), nrow = 2, byrow = TRUE)
colnames(M) < c("V1","V2")
M</code></pre>
<pre><code>##       V1  V2
## [1,] 0.5 0.2
## [2,] 0.2 0.5</code></pre>
<pre class="r"><code>res < rmvbin(100,commonprob = M)
head(res) </code></pre>
<pre><code>##      [,1] [,2]
## [1,]    0    0
## [2,]    0    1
## [3,]    0    1
## [4,]    0    0
## [5,]    1    1
## [6,]    1    1</code></pre>
<p>For this example, it’s easy to check the results by plotting a couple of histograms. The tricky part is deciding on the correlation structure. There are some limitations on what the common probabilities can be, but the authors provide a helper function: <code>check.commonprob(M)</code> so you can check ahead of time.</p>
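<p>The feasibility check mentioned above looks like this (a minimal sketch; <code>check.commonprob()</code> reports whether a joint distribution exists for the given common-probability matrix):</p>
<pre class="r"><code>library(bindata)
M <- matrix(c(.5, .2, .2, .5), nrow = 2, byrow = TRUE)
check.commonprob(M)  # TRUE when M is an admissible common-probability matrix</code></pre>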
<p><a href="https://cran.rproject.org/package=charlatan">charlatan</a> from Scott Chamberlain and the rOpenSci team allows you to make convincing fake addresses, person names, dates, times, coordinates, currencies, DOIs’, jobs, phone numbers, DNA sequences and more.</p>
<p>Here is a small example of simulating gene sequences.</p>
<pre class="r"><code>ch_gene_sequence(n = 10)</code></pre>
<pre><code>## [1] "CTNNTTCTNCACNTCCNNATCGGNGACNCG" "CAGGACCAGNCCNNNTGAAAGTCANCNTCT"
## [3] "CGNTTNAGTGATGCANGNCGCAAACCGTGC" "TGCNAACTATTCGGANNTCAGNCTTCTTNC"
## [5] "NNCGGAGTTGNGTNAGNCNCGCGTTCGGNT" "NTCCCCTACACNNTTAANTTGNTATNTCGG"
## [7] "GNGACNAAAGCANNNGGTAAGGTACTNNNA" "AGCGGNTNCNGGCGGCATNAGNCCNCNCTC"
## [9] "AGAANNCNTGTGGACCGNNCNGNNAGCNTA" "TANCNCTCTTNNGNAANGNNTNNANACAGC"</code></pre>
<p>And here is another small example showing how to simulate random longitude and latitude coordinates. (These very likely point to some places you’ve never been that might make good vacation destinations when you can actually go somewhere again.)</p>
<pre class="r"><code>set.seed(1234)
locations < as.data.frame(cbind(ch_lon(n=10),ch_lat(n=10)))
names(locations) < c("lon", "lat")
leaflet(locations) %>% addProviderTiles("Stamen.Watercolor") %>%
addMarkers(~lon, ~lat)</code></pre>
<div id="htmlwidget1" style="width:672px;height:480px;" class="leaflet htmlwidget"></div>
<script type="application/json" datafor="htmlwidget1">{"x":{"options":{"crs":{"crsClass":"L.CRS.EPSG3857","code":null,"proj4def":null,"projectedBounds":null,"options":{}}},"calls":[{"method":"addProviderTiles","args":["Stamen.Watercolor",null,null,{"errorTileUrl":"","noWrap":false,"detectRetina":false}]},{"method":"addMarkers","args":[[34.846432521008,8.09547040611506,39.1079549537972,76.2180271698162,37.3831487540156,60.7132130675018,38.4798087598756,41.9722595997155,56.389897861518,48.1993361050263],[139.066771930084,44.0277857333422,39.338903836906,44.4165990035981,129.929538080469,50.5118179041892,176.581527711824,96.2818178348243,59.7901529632509,5.13041088357568],null,null,null,{"interactive":true,"draggable":false,"keyboard":true,"title":"","alt":"","zIndexOffset":0,"opacity":1,"riseOnHover":false,"riseOffset":250},null,null,null,null,null,{"interactive":false,"permanent":false,"direction":"auto","opacity":1,"offset":[0,0],"textsize":"10px","textOnly":false,"className":"","sticky":true},null]}],"limits":{"lat":[56.389897861518,76.2180271698162],"lng":[176.581527711824,129.929538080469]}},"evals":[],"jsHooks":[]}</script>
<p><a href="https://CRAN.Rproject.org/package=fabricatr">fabricatr</a> is part of the <a href="https://declaredesign.org/declare.pdf">DeclareDesign</a> suite of packages from Graeme Blair and his colleagues for “formally ‘declaring’ the analytically relevant features of a research design”. As the package authors put it: “(the package) helps you imagine your data before you collect it”, and provides functions for building hierarchical data structures and correlated data. To see some of what it can do, please have a look at this <a href="https://rviews.rstudio.com/2019/07/01/imagineyourdatabeforeyoucollectit/">R Views post</a> by the package authors that works through examples of hierarchical data, longitudinal data and intraclass correlation. There is also a <a href="https://declaredesign.org/r/fabricatr/articles/">tutorial</a>.</p>
<p><a href="https://cran.r-project.org/package=fakeR">fakeR</a> from Lily Zhang and Dustin Tingley helps solve the problem of making fake data that matches your real data. It simulates data from a data set containing various data types: it randomly samples character and factor data from contingency tables, and numeric and ordered data from multivariate distributions that account for the between-column correlations among the variables. There are also functions to simulate stationary time series.</p>
<p>In this example, we simulate data from the <code>USArrests</code> data set.</p>
<pre class="r"><code>library(fakeR)   # simulate_dataset()
library(tibble)
set.seed(1234)
state_names <- rownames(USArrests)
df <- tibble(state_names)
data <- USArrests
rownames(data) <- NULL
sim_data <- simulate_dataset(data)</code></pre>
<pre><code>## [1] "Numeric variables. No ordered factors..."</code></pre>
<pre class="r"><code>fake_arrests <- cbind(df, sim_data)
head(fake_arrests)</code></pre>
<pre><code>##   state_names Murder Assault UrbanPop  Rape
## 1     Alabama   3.96   179.8    77.89  7.28
## 2      Alaska  10.64   209.3    57.91 19.32
## 3     Arizona   3.01    87.8    54.51  7.86
## 4    Arkansas   5.48   175.9    79.31 22.17
## 5  California   4.81   104.5    55.52 31.75
## 6    Colorado   6.88   131.7    58.60 21.03</code></pre>
<p><a href="https://cran.r-project.org/package=GenOrd">GenOrd</a> by Barbiero and Ferrari implements a Gaussian copula-based procedure for generating samples from discrete random variables with a prescribed correlation matrix and marginal distributions.</p>
<p>The following is a slightly annotated version of Example 2 given on page nine of the package pdf. The problem is to draw samples from four different discrete random variables with different numbers of categories that have different specified uniform marginal distributions but conform to a specified correlation structure.</p>
<p>The example begins by specifying the marginal distributions and then running the function <code>corrcheck()</code> to get the upper and lower ranges for a feasible correlation matrix. Note that the random variables are set to have respectively 2, 3, 4, and 5 categories and that it is not necessary to specify the final 1 in each vector of cumulative marginal probabilities.</p>
<pre class="r"><code>library(GenOrd)  # corrcheck(), ordsample()
k <- 4 # number of random variables
marginal <- list(0.5, c(1/3, 2/3), c(1/4, 2/4, 3/4), c(1/5, 2/5, 3/5, 4/5))
corrcheck(marginal)</code></pre>
<pre><code>## [[1]]
## 4 x 4 Matrix of class "dsyMatrix"
##         [,1]    [,2]    [,3]    [,4]
## [1,]  1.0000 -0.8165 -0.8944 -0.8485
## [2,] -0.8165  1.0000 -0.9129 -0.9238
## [3,] -0.8944 -0.9129  1.0000 -0.9487
## [4,] -0.8485 -0.9238 -0.9487  1.0000
## 
## [[2]]
## 4 x 4 Matrix of class "dsyMatrix"
##        [,1]   [,2]   [,3]   [,4]
## [1,] 1.0000 0.8165 0.8944 0.8485
## [2,] 0.8165 1.0000 0.9129 0.9238
## [3,] 0.8944 0.9129 1.0000 0.9487
## [4,] 0.8485 0.9238 0.9487 1.0000</code></pre>
<p>Given the bounds, we select a feasible correlation matrix, Sigma, and generate n samples of the four random variables.</p>
<pre class="r"><code>Sigma <- matrix(c(1,0.5,0.4,0.3,0.5,1,0.5,0.4,0.4,0.5,1,0.5,0.3,0.4,0.5,1),
                k, k, byrow = TRUE)
Sigma</code></pre>
<pre><code>##      [,1] [,2] [,3] [,4]
## [1,]  1.0  0.5  0.4  0.3
## [2,]  0.5  1.0  0.5  0.4
## [3,]  0.4  0.5  1.0  0.5
## [4,]  0.3  0.4  0.5  1.0</code></pre>
<pre class="r"><code>set.seed(1)
n <- 1000 # sample size
m <- ordsample(n, marginal, Sigma)
head(m, 10)</code></pre>
<pre><code>##       [,1] [,2] [,3] [,4]
##  [1,]    2    3    4    2
##  [2,]    1    2    3    1
##  [3,]    2    3    2    5
##  [4,]    1    1    1    1
##  [5,]    2    1    2    2
##  [6,]    1    3    3    5
##  [7,]    1    3    1    1
##  [8,]    1    1    3    3
##  [9,]    1    2    3    2
## [10,]    2    3    3    2</code></pre>
<p>Finally, check how well the simulated correlation structure agrees with the specified correlation matrix, and compare the empirical cumulative marginal probabilities with the specified probabilities. Things look pretty good.</p>
<pre class="r"><code>cor(m) # compare it with Sigma</code></pre>
<pre><code>##        [,1]   [,2]   [,3]   [,4]
## [1,] 1.0000 0.5199 0.4352 0.3097
## [2,] 0.5199 1.0000 0.4907 0.3812
## [3,] 0.4352 0.4907 1.0000 0.4690
## [4,] 0.3097 0.3812 0.4690 1.0000</code></pre>
<pre class="r"><code>cumsum(table(m[,4]))/n # compare it with the fourth marginal specified above</code></pre>
<pre><code>##     1     2     3     4     5 
## 0.189 0.392 0.592 0.793 1.000</code></pre>
<p><a href="https://cran.r-project.org/package=MultiOrd">MultiOrd</a> by Amatya, Demirtas, and Gao generates multivariate ordinal data given marginal distributions and a correlation matrix, using the method proposed in <a href="https://www.tandfonline.com/doi/abs/10.1080/10629360600569246">Demirtas (2006)</a>.</p>
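<p>The thresholding idea behind methods like this is easy to sketch in base R: draw correlated normals, then cut each one at the quantiles implied by its target marginal. Note that this is only an approximation of what <code>MultiOrd</code> actually does; the package additionally adjusts the intermediate normal correlation so that the <em>ordinal</em> correlations hit the target, a correction the sketch below omits.</p>
<pre class="r"><code># Approximate threshold-based ordinal generation (base R only;
# MultiOrd corrects the intermediate correlation, which we skip here)
set.seed(42)
n <- 5000
rho <- matrix(c(1, 0.5, 0.5, 1), 2, 2)         # target latent correlation
z <- matrix(rnorm(n * 2), n, 2) %*% chol(rho)  # correlated standard normals
# Cut each normal at the quantiles of its marginal distribution:
# variable 1: 3 categories (0.2, 0.5, 0.3); variable 2: 2 categories (0.4, 0.6)
cuts1 <- qnorm(cumsum(c(0.2, 0.5)))
cuts2 <- qnorm(0.4)
m <- cbind(findInterval(z[, 1], cuts1) + 1,
           findInterval(z[, 2], cuts2) + 1)
cor(m)[1, 2]              # below the latent 0.5 because thresholding attenuates
prop.table(table(m[, 1])) # approximately 0.2, 0.5, 0.3</code></pre>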
<p><a href="https://cran.r-project.org/package=PoisBinOrdNonNor">PoisBinOrdNonNor</a> by Demirtas et al. uses the power polynomial method described in <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.5362">Demirtas (2012)</a> to generate count, binary, ordinal, and continuous random variables, with specified correlations and marginal properties.</p>
<p><a href="https://cran.r-project.org/package=SimMultiCorrData">SimMultiCorrData</a> by Allison Cynthia Fialkowski is a <em>tour de force</em> of a package that builds on ideas pioneered in some of the packages listed above, and was part of her <a href="https://www.uab.edu/soph/home/newsevents/news/drallisonfialkowskireceivesthe11thannualcharlesrkatholidistinguisheddissertationaward">award-winning</a> PhD thesis. A motivating goal of the package is to enable users to make synthetic data useful in clinical applications. This includes generating data to match theoretical probability distributions, and also to mimic empirical data sets. The package is notable not only for the sophisticated capabilities it provides, but also for its technical documentation and references to source materials.</p>
<p><code>SimMultiCorrData</code> provides tools to simulate random variables from continuous, ordinal, categorical, Poisson and negative binomial distributions with precision and error estimates. It exhibits attention to numerical detail that is rare in the machine learning literature, and impressive even by R’s standards. For example, there are several functions to evaluate and visualize the quality of the synthetic data.</p>
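<p>As a rough base-R analogue of those quality checks, you can at least compare the empirical moments of a simulated sample against their theoretical targets (the <code>1.814</code> in the table below is the standard deviation of a standard logistic, pi/sqrt(3)):</p>
<pre class="r"><code># Compare sample moments of a logistic rv with theory (base R only;
# SimMultiCorrData's own evaluation functions are far more thorough)
set.seed(1234)
x <- rlogis(10000, location = 0, scale = 1)
data.frame(stat   = c("mean", "sd"),
           target = c(0, pi / sqrt(3)),
           sample = c(mean(x), sd(x)))</code></pre>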
<p>The package provides three main simulation functions: <code>nonnormvar1()</code> is preferred for simulating a single continuous random variable, while <code>rcorrvar()</code> and <code>rcorrvar2()</code> are two different methods for generating correlated ordinal, continuous, Poisson, and negative binomial random variables that match a specified correlation structure.</p>
<p>The following example is a stripped-down, annotated version of the example provided in the package pdf for the <code>rcorrvar()</code> function. The example simulates one ordinal random variable (<code>k_cat</code>), two continuous random variables (<code>k_cont</code>; one is logistic and the other Weibull), one Poisson random variable (<code>k_pois</code>), and one negative binomial random variable (<code>k_nb</code>). The portion replicated below concludes with generating the data. The version in the package pdf also works through the process of evaluating the fake data.</p>
<pre class="r"><code># Binary, Ordinal, Continuous, Poisson, and Negative Binomial Variables
library(SimMultiCorrData)  # calc_theory(), valid_corr(), rcorrvar()
options(scipen = 999) # decides when to switch between fixed and exponential notation
seed <- 1234
n <- 10000 # number of random draws to simulate
Dist <- c("Logistic", "Weibull") # the 2 continuous rvs to be simulated
# Params list: first element location and scale for the logistic rv,
#              second element shape and scale for the Weibull rv
Params <- list(c(0, 1), c(3, 5))
# Calculate theoretical parameters for the logistic rv including 4th, 5th and 6th moments
Stcum1 <- calc_theory(Dist[1], Params[[1]])
# Calculate theoretical parameters for the Weibull rv
Stcum2 <- calc_theory(Dist[2], Params[[2]])
Stcum <- rbind(Stcum1, Stcum2)
rownames(Stcum) <- Dist
colnames(Stcum) <- c("mean", "sd", "skew", "skurtosis", "fifth", "sixth")
Stcum # matrix of parameters for continuous random variables</code></pre>
<pre><code>##           mean    sd   skew skurtosis fifth sixth
## Logistic 0.000 1.814 0.0000    1.2000 0.000 6.857
## Weibull  4.465 1.623 0.1681   -0.2705 0.105 0.595</code></pre>
<pre class="r"><code># Six is a list of vectors of correction values to add to the sixth cumulants
# if no valid pdf constants are found
Six <- list(seq(1.7, 1.8, 0.01), seq(0.10, 0.25, 0.01))
marginal <- list(0.3) # cumulative prob for the single 2-category ordinal rv
lam <- 0.5   # constant for the Poisson rv
size <- 2    # size parameter for the negative binomial rv
prob <- 0.75 # prob of success
Rey <- matrix(0.4, 5, 5) # target correlation matrix: the order is important
diag(Rey) <- 1
# Make sure Rey is within upper and lower correlation limits
# and that a valid power method exists
valid <- valid_corr(k_cat = 1, k_cont = 2, k_pois = 1, k_nb = 1,
                    method = "Polynomial", means = Stcum[, 1],
                    vars = Stcum[, 2]^2, skews = Stcum[, 3],
                    skurts = Stcum[, 4], fifths = Stcum[, 5],
                    sixths = Stcum[, 6], Six = Six, marginal = marginal,
                    lam = lam, size = size, prob = prob, rho = Rey,
                    seed = seed)</code></pre>
<pre><code>##
## Constants: Distribution 1
##
## Constants: Distribution 2
## All correlations are in feasible range!</code></pre>
<p>Simulate the data</p>
<pre class="r"><code>Sim1 <- rcorrvar(n = n, k_cat = 1, k_cont = 2, k_pois = 1, k_nb = 1,
                 method = "Polynomial", means = Stcum[, 1],
                 vars = Stcum[, 2]^2, skews = Stcum[, 3],
                 skurts = Stcum[, 4], fifths = Stcum[, 5],
                 sixths = Stcum[, 6], Six = Six, marginal = marginal,
                 lam = lam, size = size, prob = prob, rho = Rey,
                 seed = seed)</code></pre>
<pre><code>##
## Constants: Distribution 1
##
## Constants: Distribution 2
##
## Constants calculation time: 0.141 minutes
## Intercorrelation calculation time: 0.008 minutes
## Error loop calculation time: 0 minutes
## Total Simulation time: 0.149 minutes</code></pre>
<p>Unpack the simulated data</p>
<pre class="r"><code>ord_rv <- Sim1$ordinal_variables;      names(ord_rv) <- "cat_rv"
cont_rv <- Sim1$continuous_variables;  names(cont_rv) <- c("logistic_rv", "Weibull_rv")
pois_rv <- Sim1$Poisson_variables;     names(pois_rv) <- "Poisson_rv"
neg_bin_rv <- Sim1$Neg_Bin_variables;  names(neg_bin_rv) <- "Neg_Bin_rv"
fake_data <- tibble(ord_rv, cont_rv, pois_rv, neg_bin_rv)
fake_data</code></pre>
<pre><code>## # A tibble: 10,000 x 5
##    cat_rv logistic_rv Weibull_rv Poisson_rv Neg_Bin_rv
##     &lt;dbl&gt;       &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;
##  1      1       0.496       2.29          0          3
##  2      2       1.67        4.39          0          0
##  3      2       2.48        5.43          0          1
##  4      2       1.02        3.86          2          2
##  5      2       4.26        5.98          1          2
##  6      2       0.945       2.33          0          1
##  7      1       4.02        3.45          0          0
##  8      2       1.31        7.78          1          1
##  9      1       3.87        1.85          0          0
## 10      1       1.01        6.28          0          1
## # … with 9,990 more rows</code></pre>
<p>Well, that’s it for now. My short list of R packages for simulating data is far from exhaustive and does not include some really good stuff, but it covers the basics. Hopefully, it will motivate you to acquire the good habit of using simulation to learn more about your data, or make your life easier if you have already acquired this habit. I hope to return to this topic in the near future and explore simulating survival data.</p>
<p>If you have any favorite R packages for simulating data, please let me know.</p>
</div>

Crowd Counting Consortium Crowd Data and Shiny Dashboard
https://rviews.rstudio.com/2020/08/31/crowdcountingconsortiumcrowddataandshinydashboard/
Mon, 31 Aug 2020 00:00:00 +0000
https://rviews.rstudio.com/2020/08/31/crowdcountingconsortiumcrowddataandshinydashboard/
<p><em>Jay Ulfelder, PhD, serves as Program Manager for the Nonviolent Action Lab, part of the Carr Center for Human Rights Policy at the Harvard Kennedy School. He has used R to work at the intersection of social science and data science for nearly two decades.</em></p>
<p>Where are people in the United States protesting in 2020, and what are they protesting about? How large have those crowds been? How many protesters have been arrested or injured? And how does this year’s groundswell of protest activity compare to the past several years, which had already produced some of the <a href="https://www.washingtonpost.com/news/monkeycage/wp/2017/02/07/thisiswhatwelearnedbycountingthewomensmarches/">largest single-day gatherings in U.S. history</a>?</p>
<p>These are the kinds of questions the <a href="https://sites.google.com/view/crowdcountingconsortium/home">Crowd Counting Consortium</a> (CCC) Crowd Dataset helps answer. Begun after the 2017 Women’s March by Professors <a href="https://www.ericachenoweth.com/">Erica Chenoweth</a> (Harvard University) and <a href="https://jeremypressman.uconn.edu/">Jeremy Pressman</a> (University of Connecticut), the CCC’s database on political crowds has grown into one of the most comprehensive open sources of near-real-time information on protests, marches, demonstrations, strikes, and similar political gatherings in the contemporary United States. At the start of August 2020, the database included nearly 50,000 events. These data have been used in numerous academic and press pieces, including a recent <a href="https://www.nytimes.com/interactive/2020/07/03/us/georgefloydprotestscrowdsize.html"><em>New York Times</em></a> story on the historic scale of this year’s Black Lives Matter uprising.</p>
<p>As rich as the data are, they have been a challenge to use. The CCC shares its data on political crowds via a <a href="https://sites.google.com/view/crowdcountingconsortium/viewdownloadthedata">stack of monthly Google Sheets</a> with formats that can vary from sheet to sheet in small but confounding ways. Column names don’t always match, and certain columns have been added or dropped over time. Some sheets include separate tabs for specific macro-events or campaigns (e.g., a coordinated climate strike), while others group everything in a single sheet. And, of course, typos happen.</p>
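<p>Stacking sheets like these is mostly a matter of padding each one out to a common set of columns before binding the rows together. A minimal base-R sketch, with hypothetical column names standing in for the real sheets:</p>
<pre class="r"><code># Combine data frames whose columns only partially overlap
# (hypothetical monthly sheets; the Lab's real pipeline is more involved)
pad_to <- function(df, cols) {
  for (missing_col in setdiff(cols, names(df))) df[[missing_col]] <- NA
  df[cols]  # enforce a common column order
}
jan <- data.frame(date = "2020-01-15", city = "Boston", size = 500)
feb <- data.frame(date = "2020-02-10", city = "Denver", claims = "climate")
all_cols <- union(names(jan), names(feb))
combined <- rbind(pad_to(jan, all_cols), pad_to(feb, all_cols))
combined</code></pre>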
<p>To make this tremendous resource more accessible to researchers, activists, journalists, and data scientists, the <a href="https://carrcenter.hks.harvard.edu/nonviolentsocialmovements">Nonviolent Action Lab</a> at Harvard’s <a href="https://carrcenter.hks.harvard.edu/">Carr Center for Human Rights Policy</a>—a new venture started by CCC cofounder Chenoweth—has created a <a href="https://github.com/nonviolentactionlab/crowdcountingconsortium">GitHub repository</a> to host a compiled, cleaned, and augmented version of the CCC’s database.</p>
<p>In addition to all the information contained in the monthly sheets, the compiled version adds two big feature sets.</p>
<ul>
<li><p><strong>Geolocation.</strong> After compiling the data, the Lab uses the <a href="https://cran.r-project.org/web/packages/googleway/index.html">googleway</a> package to run the cities and towns in which the events took place through the Google Maps <a href="https://developers.google.com/maps/documentation/geolocation/overview">Geolocation API</a> and extracts geocoordinates from the results, along with clean versions of the locality, county, and state names associated with them.</p></li>
<li><p><strong>Issue Tags.</strong> In the original data, protesters’ claims—in other words, what the protest is about—are recorded by human coders in a loosely structured way. To allow researchers to group or filter the data by theme, the Nonviolent Action Lab maintains a dictionary of a few dozen major political issues in the U.S. (e.g., “guns”, “migration”, “racism”) and keyword- and key-phrase-based regular expressions associated with them. By mapping this dictionary over the claim strings, we generate a set of binary issue tags that gets added to the compiled database as a vector of semicolon-separated strings.</p></li>
</ul>
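<p>A toy version of that tagging step, using a hypothetical two-entry dictionary, looks like this:</p>
<pre class="r"><code># Map issue regexes over free-text claims and collapse the matches into
# semicolon-separated tag strings (toy dictionary for illustration only;
# the Lab's real dictionary covers a few dozen issues)
dictionary <- c(guns   = "gun|firearm|nra",
                racism = "racism|racial justice|black lives matter")
claims <- c("march for racial justice",
            "rally against gun violence",
            "teachers strike for pay")
tag_claims <- function(x, dict) {
  vapply(x, function(claim) {
    hits <- names(dict)[sapply(dict, grepl, x = tolower(claim))]
    paste(hits, collapse = "; ")
  }, character(1), USE.NAMES = FALSE)
}
tag_claims(claims, dictionary)
# "racism" "guns" ""</code></pre>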
<p>To make the CCC data more accessible to a wider audience, the Lab has also built a <a href="https://nonviolentactionlab.shinyapps.io/cccdatadashboard/">Shiny dashboard</a> that lets users filter events in various ways and then map and plot the results. Users can filter by date range, year, or campaign as well as issue and political valence (pro-Trump, anti-Trump, or neither).</p>
<p>The dashboard has two main tabs. The first uses the <a href="https://cran.r-project.org/web/packages/leaflet/index.html">leaflet</a> package to map the events with markers that contain summary details and links to the source(s) the human coders used to research them.</p>
<p><img src="map_view.png" /></p>
<p>The second tab uses <a href="https://cran.r-project.org/web/packages/plotly/index.html">plotly</a> and the <a href="https://hrbrmstr.github.io/streamgraph/">streamgraph</a> htmlwidget package to render interactive plots of trends over time in the occurrence of the selected events, the number of participants in them, and the political issues associated with them.</p>
<p><img src="plot_view.png" /></p>
<p>The point of the Nonviolent Action Lab’s <a href="https://github.com/nonviolentactionlab/crowdcountingconsortium">repository</a> and <a href="https://nonviolentactionlab.shinyapps.io/cccdatadashboard/">dashboard</a> is to make the Crowd Counting Consortium’s data more accessible and more useful to as wide an audience as possible. If you use either of these resources and find bugs or errors or have suggestions on how to improve them, please let us know.</p>
<ul>
<li><p>To provide feedback on the compiled data set or the Shiny dashboard, please open an issue on the GitHub repository, <a href="https://github.com/nonviolentactionlab/crowdcountingconsortium/issues">here</a>.</p></li>
<li><p>If you think you see an error in the CCC’s data or know about an event that it doesn’t cover, please use <a href="https://sites.google.com/view/crowdcountingconsortium/submitarecord">this form</a> to submit a correction or record via the CCC website.</p></li>
</ul>

July 2020: "Top 40" New CRAN Packages
https://rviews.rstudio.com/2020/08/27/july2020top40newcranpackages/
Thu, 27 Aug 2020 00:00:00 +0000
https://rviews.rstudio.com/2020/08/27/july2020top40newcranpackages/
<p>One hundred sixty-one new packages made it to CRAN in July. Here are my “Top 40” picks in eight categories: Computational Methods, Data, Genomics, Machine Learning, Medicine, Science, Statistics, and Utilities.</p>
<h3 id="computationalmethods">Computational Methods</h3>
<p><a href="https://cran.r-project.org/package=libgeos">libgeos</a> v3.813: Provides API access to the open source geometry engine <a href="https://trac.osgeo.org/geos/">GEOS</a>, which can be used to write high-performance C and C++ geometry operations. Look <a href="https://paleolimbot.github.io/libgeos/">here</a> for help.</p>
<p><a href="https://cran.r-project.org/package=gms">gms</a> v0.4.0: Implements a collection of tools to create and maintain modularized models written in the <a href="https://www.gams.com/">GAMS</a> modeling language.</p>
<p><a href="https://cran.r-project.org/package=LoopDetectR">LoopDetectR</a> v0.1.2: Provides functions to detect feedback loops (cycles, circuits) between species (nodes) in ordinary differential equation (ODE) models. See <a href="https://www.sciencedirect.com/science/article/pii/S163106910201452X?via%3Dihub">Thomas & Kaufman (2002)</a> for background and the <a href="https://cran.r-project.org/web/packages/LoopDetectR/vignettes/workflow_LoopDetectR.html">vignette</a> for information on how to use the package.</p>
<p><a href="https://cran.r-project.org/package=mrgsim.parallel">mrgsim.parallel</a> v0.1.1: Provides a parallel backend for the <a href="https://cran.r-project.org/package=mrgsolve">mrgsolve</a> ODE solver. Look <a href="https://github.com/kylebaron/mrgsim.parallel">here</a> for an example.</p>
<p><a href="https://cran.r-project.org/package=paropt">paropt</a> v0.1: Uses the <a href="https://computing.llnl.gov/projects/sundials">SUNDIALS</a> suite of nonlinear differential/algebraic equation solvers to optimize the parameters of ordinary differential equations. There is a <a href="https://cran.r-project.org/web/packages/paropt/vignettes/paropt.html">vignette</a>.</p>
<h3 id="data">Data</h3>
<p><a href="https://cran.r-project.org/package=chirps">chirps</a> v0.1.2: Implements an API client for the Climate Hazards Group InfraRed Precipitation with Station <a href="https://www.chc.ucsb.edu/data/chirps">CHIRPS</a> data: 35+ years of satellite imagery and in situ station data used to create gridded rainfall time series for trend analysis and seasonal drought monitoring. See the <a href="https://cran.r-project.org/web/packages/chirps/vignettes/Overview.html">vignette</a> for an example.</p>
<p><img src="chirps.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=covid19mobility">covid19mobility</a> v0.1.1: Provides COVID-19 mobility data scraped from different sources including <a href="https://www.google.com/covid19/mobility/">Google</a>, <a href="https://www.apple.com/covid19/mobility">Apple</a> and others. There are vignettes on <a href="https://cran.r-project.org/web/packages/covid19mobility/vignettes/animating_covid19_mobility.html">Animating Covid19 Mobility Data</a>, <a href="https://cran.r-project.org/web/packages/covid19mobility/vignettes/apple_cities_across_space_and_change.html">How mobility data has changed in cities</a>, <a href="https://cran.r-project.org/web/packages/covid19mobility/vignettes/google_work_v_play.html">Work versus Home</a>, and <a href="https://cran.r-project.org/web/packages/covid19mobility/vignettes/plot_us_mobility.html">US mobility trends</a>.</p>
<p><img src="covid19mobility.gif" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=covidregionaldata">covidregionaldata</a> v0.5.0: Provides access to daily COVID-19 time series data including cases, deaths, hospitalizations, and tests for several countries and subnational regions. See the <a href="https://cran.r-project.org/web/packages/covidregionaldata/readme/README.html">README</a> to get started.</p>
<p><a href="https://cran.r-project.org/package=fec16">fec16</a> v0.1.1: Provides access to relational data from the United States 2016 federal election cycle as reported by the <a href="https://www.fec.gov/data/browsedata/?tab=bulkdata">Federal Election Commission</a> including information about candidates, committees, and a variety of different financial expenditures. See the <a href="https://cran.r-project.org/web/packages/fec16/vignettes/fec_vignette.html">vignette</a> for details.</p>
<p><img src="fec16.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=oxcovid19">oxcovid19</a> v0.1.1: Provides an interface to the <a href="https://covid19.eng.ox.ac.uk/">OxCOVID19</a> Database. There are vignettes on <a href="https://cran.r-project.org/web/packages/oxcovid19/vignettes/database_access.html">Database Access</a>, <a href="https://cran.r-project.org/web/packages/oxcovid19/vignettes/oxcovid19.html">The R API</a>, and <a href="https://cran.r-project.org/web/packages/oxcovid19/vignettes/visualisation_china.html">Visualization for China</a>.</p>
<p><a href="https://cran.r-project.org/package=palmerpenguins">palmerpenguins</a> v0.1.0: Provides size measurements, clutch observations, and blood isotope ratios for adult foraging Adélie, Chinstrap, and Gentoo penguins observed on islands in the Palmer Archipelago near Palmer Station, Antarctica. Look <a href="https://github.com/allisonhorst/palmerpenguins">here</a> to get started.</p>
<p><img src="penguins.png" height = "400" width="600"></p>
<h3 id="genomics">Genomics</h3>
<p><a href="https://cran.r-project.org/package=bioseq">bioseq</a> v0.1.1: Provides a toolbox for manipulating DNA, RNA and amino acid sequences including functions for detection, selection, replacement, transcription and translation. There is an <a href="https://cran.r-project.org/web/packages/bioseq/vignettes/introbioseq.html">Introduction</a> and a vignette on <a href="https://cran.r-project.org/web/packages/bioseq/vignettes/ref_database.html">Database Preparation</a>.</p>
<p><img src="bioseq.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=singleCellHaystack">singleCellHaystack</a> v0.3.2: Implements the singleCellHaystack algorithm as described in <a href="https://www.biorxiv.org/content/10.1101/557967v4">Vandenbon & Diez (2019)</a> for finding differentially expressed genes in single-cell transcriptome data. The <a href="https://cran.r-project.org/web/packages/singleCellHaystack/vignettes/a01_toy_example.html">vignette</a> offers an example.</p>
<p><img src="singleCellHaystack.png" height = "200" width="400"></p>
<h3 id="machinelearning">Machine Learning</h3>
<p><a href="https://CRAN.R-project.org/package=image.binarization">image.binarization</a> v0.1.1: Implements algorithms to improve optical character recognition by <em>binarizing</em> images (turning a color or grayscale image into a black-and-white image). Look <a href="https://github.com/DIGIVUB/image.binarization">here</a> for an example.</p>
<p><img src="ib.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=image.ContourDetector">image.ContourDetector</a> v0.1.0: Implements the unsupervised smooth contour detection algorithm described in <a href="http://www.ipol.im/pub/art/2016/175/?utm_source=doi">von Gioi & Randall (2016)</a>. Look <a href="https://cran.r-project.org/web/packages/image.ContourDetector/readme/README.html">here</a> for an example.</p>
<p><img src="contour.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=image.CornerDetectionF9">image.CornerDetectionF9</a> v0.1.0: Implements the FAST-9 corner detection algorithm explained in <a href="https://arxiv.org/abs/0810.2434">Rosten et al. (2008)</a>. Look <a href="https://cran.r-project.org/web/packages/image.CornerDetectionF9/readme/README.html">here</a> for an example.</p>
<p><a href="https://cran.r-project.org/package=image.CornerDetectionHarris">image.CornerDetectionHarris</a> v0.1.1: Implements the Harris Corner Detection algorithm described in <a href="http://www.ipol.im/pub/art/2018/229/?utm_source=doi">Sánchez et al. (2018)</a>. Look <a href="https://github.com/bnosac/image/blob/master/presentationuser2017.pdf">here</a> for background, and <a href="https://github.com/bnosac/image">here</a> for an example.</p>
<p><img src="HarrisCorner.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=image.LineSegmentDetector">image.LineSegmentDetector</a> v0.1.0: Implements the line segment detector algorithm described in <a href="http://www.ipol.im/pub/art/2012/gjmrlsd/?utm_source=doi">von Gioi et al. (2012)</a>. Look <a href="https://cran.r-project.org/web/packages/image.LineSegmentDetector/readme/README.html">here</a> for an example.</p>
<p><img src="LineSegmentDetector.png" height = "400" width="600"></p>
<h3 id="medicine">Medicine</h3>
<p><a href="https://cran.r-project.org/package=freqtables">freqtables</a> v0.1.0: Provides functions to make tables of descriptive statistics (e.g., counts, percentages, confidence intervals) for categorical variables. Designed for the Tidyverse pipeline, it also provides functions to write results into Microsoft Word® documents. There is a vignette on the <a href="https://cran.r-project.org/web/packages/freqtables/vignettes/descriptive_analysis.html">Tidyverse pipeline</a> and another on using the <a href="https://cran.r-project.org/web/packages/freqtables/vignettes/using_freq_test.html"><code>freq_test()</code> function</a>.</p>
<p><img src="freqtables.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=nbTransmission">nbTransmission</a> v1.1.1: Provides functions to estimate the relative transmission probabilities between cases in an infectious disease outbreak. See <a href="https://academic.oup.com/ije/articleabstract/49/3/764/5811379?redirectedFrom=fulltext">Leavitt et al. (2020)</a> for details, and the vignette for an <a href="https://cran.r-project.org/web/packages/nbTransmission/vignettes/nbTransmissionvignette.html">Introduction</a>.</p>
<p><img src="nbTransmission.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=precautionary">precautionary</a> v0.1-2: Provides functions that enhance the design and simulation of phase 1 dose-escalation trials by adding diagnostics to examine the safety characteristics of these designs in light of expected inter-individual variation in pharmacokinetics and pharmacodynamics. See <a href="https://arxiv.org/abs/2004.12755">Norris (2020)</a> for background and the vignette for an <a href="https://cran.r-project.org/web/packages/precautionary/vignettes/Intro.html">Introduction</a>.</p>
<p><img src="precautionary.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=pspline.inference">pspline.inference</a> v1.0.2: Provides tools for making inferences about infectious disease outcomes using generalized additive (mixed) models with penalized basis splines (P-splines). See <a href="https://medrxiv.org/cgi/content/short/2020.07.14.20138180v1">Weinberger et al. (2020)</a> for background and the <a href="https://cran.r-project.org/web/packages/pspline.inference/vignettes/seasonal.html">vignette</a> to get started.</p>
<p><img src="pspline.svg" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=SITH">SITH</a> v1.0.1: Implements a three-dimensional stochastic model of cancer growth and mutation similar to the one described in <a href="https://www.nature.com/articles/nature14971">Waclaw et al. (2015)</a> and allows for interactive 3D visualizations of the simulated tumor. See the vignette for an <a href="https://cran.r-project.org/web/packages/SITH/vignettes/SITH.html">Introduction</a>.</p>
<p><img src="SITH.png" height = "400" width="600"></p>
<h3 id="science">Science</h3>
<p><a href="https://cran.r-project.org/package=apsimx">apsimx</a> v1.946: Implements an interface to the <a href="https://www.apsim.info/">APSIM</a> framework for agricultural systems modeling and simulation. There is an <a href="https://cran.r-project.org/web/packages/apsimx/vignettes/apsimx.html">Introduction</a> and a vignette on <a href="https://cran.r-project.org/web/packages/apsimx/vignettes/apsimxscripts.html">Writing Scripts</a>.</p>
<p><a href="https://cran.r-project.org/package=cmstatr">cmstatr</a> v0.7.0: Implements the statistical methods commonly used for advanced composite materials in aerospace applications, focusing on calculating basis values (lower tolerance bounds) for material strength properties, as well as performing the associated diagnostic tests. See <a href="https://joss.theoj.org/papers/10.21105/joss.02265">Kloppenborg (2020)</a> for details and the vignettes for a <a href="https://cran.r-project.org/web/packages/cmstatr/vignettes/cmstatr_Tutorial.html">Tutorial</a>, examples of <a href="https://cran.r-project.org/web/packages/cmstatr/vignettes/cmstatr_Graphing.html">Plotting Composite Material Data</a>, and the <a href="https://cran.r-project.org/web/packages/cmstatr/vignettes/adktest.html">Anderson-Darling Test</a>.</p>
<p><img src="cmstatr.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=hadron">hadron</a> v3.1.0: Provides a tool kit to perform statistical analyses of correlation functions generated from Lattice Monte Carlo simulations including functions to extract hadronic quantities from Lattice Quantum Chromodynamics simulations (<a href="https://www.sciencedirect.com/science/article/pii/S0010465508002270?via%3Dihub">Boucaud et al. (2008)</a>), and to determine energy eigenvalues of hadronic states (<a href="https://iopscience.iop.org/article/10.1088/11266708/2009/04/094">Blossier et al. (2009)</a>, <a href="https://inspirehep.net/literature/1792113">Fischer et al. (2020)</a>). There are vignettes on the <a href="https://cran.r-project.org/web/packages/hadron/vignettes/Two_Amplitudes_Model.pdf">Two Amplitudes Model</a>, <a href="https://cran.r-project.org/web/packages/hadron/vignettes/gevp.html">GEVP energy level extraction</a>, <a href="https://cran.r-project.org/web/packages/hadron/vignettes/hankel.html">The Hankel Method</a>, <a href="https://cran.r-project.org/web/packages/hadron/vignettes/jackknife_cov_and_missing_values.html">Jackknife Covariance and Missing Values</a>, and <a href="https://cran.r-project.org/web/packages/hadron/vignettes/jackknife_error_normalization.html">Jackknife Error Normalization</a>.</p>
<p><a href="https://cran.r-project.org/package=sarp.snowprofile">sarp.snowprofile</a> v1.0.0: Provides analysis and plotting tools for snow profile data produced from manual snowpack observations and physical snowpack models. The functions read multiple data formats, manipulate data, and produce stratigraphy and time series profiles. See the <a href="https://cran.r-project.org/web/packages/sarp.snowprofile/vignettes/sarp.snowprofile.html">vignette</a> for more information.</p>
<h3 id="statistics">Statistics</h3>
<p><a href="https://cran.r-project.org/package=fddm">fddm</a> v0.11: Implements the Diffusion Decision Model of <a href="https://www.mitpressjournals.org/doi/abs/10.1162/neco.2008.1206420">Ratcliff & McKoon (2008)</a> with across-trial variable drift rate. It includes <code>C++</code> implementations of the approximations of <a href="https://www.sciencedirect.com/science/article/abs/pii/S0022249609000200?via%3Dihub">Navarro & Fuss (2009)</a> and <a href="https://www.sciencedirect.com/science/article/abs/pii/S0022249614000388?via%3Dihub">Gondan et al. (2014)</a>. There are vignettes on <a href="https://cran.r-project.org/web/packages/fddm/vignettes/benchmark.html">Benchmark Testing</a>, <a href="https://cran.r-project.org/web/packages/fddm/vignettes/example.html">Model Fitting</a>, <a href="https://cran.r-project.org/web/packages/fddm/vignettes/math.html">Mathematical Methods</a>, and <a href="https://cran.r-project.org/web/packages/fddm/vignettes/validity.html">Validation</a>.</p>
<p><a href="https://cran.r-project.org/package=GGMncv">GGMncv</a> v1.1.0: Provides functions to estimate Gaussian graphical models with nonconvex penalties, including atan, seamless L0, exponential, smooth integration of counting and absolute deviation, logarithm, Lq, smoothly clipped absolute deviation, minimax concave penalty, lasso, and adaptive lasso. See the <a href="https://cran.r-project.org/web/packages/GGMncv/readme/README.html">README</a> for examples.</p>
<p><img src="GGMncv.png" height = "300" width="500"></p>
<p><a href="https://cran.r-project.org/package=LTRCforests">LTRCforests</a> v0.5.0: Implements the conditional inference forest and random survival forest algorithms for modeling left-truncated right-censored data with time-invariant covariates, and (left-truncated) right-censored survival data with time-varying covariates. See <a href="https://arxiv.org/abs/2006.00567">Yao et al. (2020)</a>.</p>
<p><a href="https://cran.r-project.org/package=MoMPCA">MoMPCA</a> v1.0.0: Implements a method to cluster any count data matrix with a fixed number of variables, such as document/term matrices. Inference is done by means of a greedy Classification Variational Expectation Maximisation (C-VEM) algorithm. See <a href="https://arxiv.org/abs/1909.00721">Jouvin et al. (2020)</a> for more details and the <a href="https://cran.r-project.org/web/packages/MoMPCA/vignettes/MoMPCA.html">vignette</a> for an example.</p>
<p><img src="MoMPCA.png" height = "300" width="500"></p>
<p><a href="https://cran.r-project.org/package=nortsTest">nortsTest</a> v1.0.0: Implements four tests for assessing the normality of stationary processes: Lobato and Velasco's test, the Epps test, the Psaradakis and Vavra test, and the random projections test. See the <a href="https://cran.r-project.org/web/packages/nortsTest/readme/README.html">README</a> for details.</p>
<p><img src="nortsTest.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=sptotal">sptotal</a> v0.1.0: Provides functions for predicting totals and weighted sums, or finite population block kriging, on spatial data using the methods in <a href="https://link.springer.com/article/10.1007/s106510070035y">Ver Hoef (2008)</a>. See the <a href="https://cran.r-project.org/web/packages/sptotal/vignettes/sptotalvignette.html">vignette</a> to get started.</p>
<p><img src="sptotal.png" height = "300" width="300"></p>
<p><a href="https://cran.r-project.org/package=ztpln">ztpln</a> v0.1.0: Provides functions for obtaining the density, random variates, and maximum likelihood estimates of the zero-truncated Poisson lognormal distribution and its mixture distributions. See the <a href="https://cran.r-project.org/web/packages/ztpln/vignettes/ztpln.html">vignette</a> for details.</p>
<p><img src="ztpln.png" height = "400" width="600"></p>
<h3 id="utilities">Utilities</h3>
<p><a href="https://cran.r-project.org/package=cpp11">cpp11</a> v0.2.1: Provides a header-only <code>C++11</code> interface to R’s C interface. It strives to be safe against long jumps from the C API as well as C++ exceptions, to conform to normal R function semantics, and to support interactions with <code>ALTREP</code> vectors. There are vignettes on <a href="https://cran.r-project.org/web/packages/cpp11/vignettes/motivations.html">Motivations</a>, <a href="https://cran.r-project.org/web/packages/cpp11/vignettes/cpp11.html">Getting Started</a>, <a href="https://cran.r-project.org/web/packages/cpp11/vignettes/internals.html">cpp11 Internals</a>, and <a href="https://cran.r-project.org/web/packages/cpp11/vignettes/converting.html">Converting from Rcpp</a>.</p>
<p><a href="https://cran.r-project.org/package=listdown">listdown</a> v0.2.21: Provides functions to programmatically create R Markdown documents from lists. There is a <a href="https://cran.r-project.org/web/packages/listdown/vignettes/listdown.html">vignette</a>.</p>
<p><a href="https://cran.r-project.org/package=oysteR">oysteR</a> v0.0.3: Provides functions to discover third-party packages used in an R package and scan them for vulnerabilities using the <a href="https://ossindex.sonatype.org/">Sonatype OSS Index</a>. Look <a href="https://github.com/sonatype-nexus-community/oysteR">here</a> for information to get started.</p>
<p><a href="https://cran.r-project.org/package=rbibutils">rbibutils</a> v1.0.3: Provides functions to convert between a number of bibliography formats, including <code>BibTeX</code>, <code>BibLaTeX</code> and <code>Bibentry</code>, and includes a port of the <a href="https://sourceforge.net/projects/bibutils/">bibutils</a> utilities by Chris Putnam. Look <a href="https://geobosh.github.io/rbibutils/">here</a> for examples.</p>
<p><a href="https://cran.r-project.org/package=stickr">stickr</a> v0.3.1: Lets users download and use R hex stickers available in the <a href="https://github.com/rstudio/hex-stickers">hex-stickers</a> repository.</p>
<p><img src="stickr.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=supreme">supreme</a> v1.1.0: Implements a tool to help developers visualize and understand the structure of <code>Shiny</code> applications. Look <a href="https://strboul.github.io/supreme/">here</a> for examples.</p>
<p><img src="supreme.png" height = "400" width="600"></p>

R Package Integration with Modern Reusable C++ Code Using Rcpp - Part 5
https://rviews.rstudio.com/2020/08/24/rpackageintegrationwithmodernreusableccodeusingrcpppart5/
Mon, 24 Aug 2020 00:00:00 +0000
https://rviews.rstudio.com/2020/08/24/rpackageintegrationwithmodernreusableccodeusingrcpppart5/
<p><em>Daniel Hanson is a full-time lecturer in the Computational Finance & Risk Management program within the Department of Applied Mathematics at the University of Washington.</em></p>
<p>In the <a href="https://rviews.rstudio.com/...part4/">previous post</a>, we went through the build processes for a simple package containing a single C++ <code>.cpp</code> file, using the <code>rcpp_hello_world.cpp</code> example that is included by default when one creates a new R package using <code>Rcpp</code> in RStudio.</p>
<p>What we’ll do now is examine a more realistic case which involves multiple reusable C++ files as well as the C++ interface files that export functions to R. For this example, you should download the C++ files from <a href="https://github.com/QuantDevHacks/RcppBlogCode/tree/master/CodePart05/src">this GitHub repository</a>. In addition, you should download a set of test functions in R, <a href="https://github.com/QuantDevHacks/RcppBlogCode/tree/master/TestsPart05">here (RStudioBlogTests.R)</a>. Please note that this file is <em>not</em> part of the package code; we will discuss this shortly.</p>
<div id="buildinganrcpppackagewithreusablec" class="section level2">
<h2>Building an <code>Rcpp</code> Package with Reusable C++</h2>
<p>The code you have downloaded from the first link is the same as that referenced in <a href="https://rviews.rstudio.com/2020/07/31/rpackageintegrationwithmodernreusableccodeusingrcpppart3/">Part 3 of this series</a>. You will now create an <code>Rcpp</code> package project in RStudio and import this code.</p>
<div id="createanrstudioproject" class="section level3">
<h3>Create an RStudio Project</h3>
<p>This procedure is the same as what we covered in the previous post. For convenience later, I suggest you name your project <code>RcppBlogCode</code>. Make sure to modify the <code>NAMESPACE</code> file as we discussed before. Also, while not mandatory, you may find it simpler to delete the <code>rcpp_hello_world.cpp</code> file; we will not use it in this example.</p>
</div>
<div id="importtheccodeintotheproject" class="section level3">
<h3>Import the C++ Code into the Project</h3>
<p>Next, copy or move the code you downloaded from the <code>/src</code> subdirectory on GitHub into the <code>/src</code> subdirectory of your <code>RStudio</code> project. When this is done, you should see the files present in this location in the files pane in RStudio:<br />
<img src="FilesPaneSrc.png" width="500" alt="C++ Source Code Files" /></p>
<p>Note that this source code contains:</p>
<ul>
<li><p>Declaration (header files) for the reusable C++ code</p>
<ul>
<li><code>NonmemberCppFcns.h</code>: Declarations of nonmember functions</li>
<li><code>ConcreteShapes.h</code>: Declarations of the <code>Square</code> and <code>Circle</code> classes</li>
</ul></li>
<li><p>Files containing implementations of functions and classes in the reusable C++ code</p>
<ul>
<li><code>NonmemberCppFcns.cpp</code>: Nonmember function implementations</li>
<li><code>ConcreteShapes.cpp</code>: <code>Square</code> and <code>Circle</code> class implementations</li>
</ul></li>
<li><p>C++ files that call reusable code and export functions to R (<code>.cpp</code> files only); each exported function is indicated by the <code>// [[Rcpp::export]]</code> tag</p>
<ul>
<li><code>CppInterface.cpp</code>: Calls nonmember functions in the code base</li>
<li><code>CppInterface2.cpp</code>: Creates and uses instances of classes in the code base</li>
</ul></li>
</ul>
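<p>To make the wiring concrete, here is a minimal sketch, in plain C++, of how a reusable nonmember function and its exported wrapper plausibly fit together. The function body below is an assumption based on the function's name (<code>rProdLcmGcd</code> appears later in the post); the actual repository code may differ:</p>

```cpp
#include <numeric>  // std::gcd and std::lcm (C++17)

// NonmemberCppFcns.h -- declaration of a reusable, R-independent function:
int prodLcmGcd(int m, int n);

// NonmemberCppFcns.cpp -- implementation (assumed from the name):
int prodLcmGcd(int m, int n) {
    return std::lcm(m, n) * std::gcd(m, n);
}

// CppInterface.cpp -- the Rcpp interface layer would then wrap it, roughly:
//
//   #include <Rcpp.h>
//   #include "NonmemberCppFcns.h"
//
//   // [[Rcpp::export]]
//   int rProdLcmGcd(int m, int n) {
//       return prodLcmGcd(m, n);   // exposed to R as rProdLcmGcd()
//   }
```

<p>The point of the layering is that <code>NonmemberCppFcns.cpp</code> never includes <code>Rcpp.h</code>, so the same file can be compiled into any other C++ project unchanged.</p>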
</div>
<div id="buildandrunfunctionsintherpackage" class="section level3">
<h3>Build and Run Functions in the R Package</h3>
<p>Next, let’s build this package, just as we did for our simpler example in the previous post. Select <code>Clean and Rebuild</code> from the <code>Build</code> menu at the top of the RStudio IDE. When complete, R will restart, and the <code>RcppBlogCode</code> package will be loaded into your R session. You can now call functions from this package in the console; e.g.,</p>
<ul>
<li>Calculate the product of the LCM and GCD of 10 and 20: <code>rProdLcmGcd(10, 20)</code></li>
<li>Calculate the area of a square with length 4: <code>squareArea(4)</code></li>
</ul>
<div class="figure">
<img src="RunInConsole.png" width="500" alt="" />
<p class="caption">Try package functions in console</p>
</div>
<p>Congratulations again! You have now successfully imported reusable C++ code, independent of R and <code>Rcpp</code>, and used it via an interface as an R function in your generated package. However, calling package functions from the session that built the package is of course not a realistic scenario. What we really want is to use this package in a regular and independent R session.</p>
</div>
</div>
<div id="usethepackageindependentlyofthercppproject" class="section level2">
<h2>Use the Package Independently of the <code>Rcpp</code> Project</h2>
<p>So far, you have built an R package containing C++ code, and you have called a couple of functions from the console, but that of course is not what a real R user does. What you’ll want to do is open a new (and empty) RStudio instance. As the <code>RcppBlogCode</code> package is installed on your machine, all you need to do is load it as you would any other R package:</p>
<p><code>library(RcppBlogCode)</code></p>
<p>Next, open up the <code>RStudioBlogTests.R</code> test file that you downloaded at the outset. Again, make sure this file is located outside of the <code>Rcpp</code> package directory structure. You can now use the package functions that call the C++ nonmember functions internally in the package:</p>
<pre><code>rAdd(531, 9226)
(x <- c(5:1))
rSortVec(x)
rProdLcmGcd(10, 20)
</code></pre>
<p>You can also run the R functions that create instances of the <code>Square</code> and <code>Circle</code> classes in the C++ code:</p>
<pre><code>squareArea(4)
circleArea(1)</code></pre>
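<p>For orientation, the underlying classes plausibly look like the following plain-C++ sketch. The member names here are assumptions; only the class names <code>Square</code> and <code>Circle</code> and the exported wrappers <code>squareArea()</code>/<code>circleArea()</code> appear in the post, so treat this as an illustration rather than the repository code:</p>

```cpp
#include <cmath>

// ConcreteShapes.h / ConcreteShapes.cpp -- a plausible shape of the
// reusable classes, with no dependency on R or Rcpp:
class Square {
public:
    explicit Square(double side) : side_(side) {}
    double area() const { return side_ * side_; }
private:
    double side_;
};

class Circle {
public:
    explicit Circle(double radius) : radius_(radius) {}
    double area() const { return std::acos(-1.0) * radius_ * radius_; }  // pi * r^2
private:
    double radius_;
};

// CppInterface2.cpp would then export thin wrappers, roughly:
//   // [[Rcpp::export]]
//   double squareArea(double side) { return Square(side).area(); }
```

<p>So <code>squareArea(4)</code> returns 16, and <code>circleArea(1)</code> returns the area of a unit circle.</p>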
<p>The actual mathematics here are obviously nothing exciting, but you should now see that you have a blueprint that takes in independent and reusable C++ source code, wraps its functionality in C++ interface functions that get exported by <code>Rcpp</code>, and allows you to use these exported functions just as you would any other R package functions.</p>
<p>Perhaps more interesting is to see how you can take the results of these package functions and use them in other R functions and packages. For this next example, you will need to install and load the <code>plotly</code> package. We will define vectors of square sides and circle radii, and compute the corresponding areas using our package functions. This is again nothing earth-shattering, but we can then use them in <code>plotly</code> to generate some reasonably professional-quality graphs:</p>
<pre><code>library(plotly)
squareSides <- c(1:10)
circleRadii <- c(1:10)
squareAreas <- sapply(squareSides, squareArea)
circleAreas <- sapply(circleRadii, circleArea)
### Use plotly to visualize some results:
dfSq <- matrix(data=c(squareSides, squareAreas), nrow=length(squareSides), ncol=2)
dfSq <- as.data.frame(dfSq)
colnames(dfSq) = c("Side", "Area")
rownames(dfSq) = NULL
plot_ly(dfSq, x = ~Side, y = ~ Area, type = "bar") %>%
layout(title="Areas of Squares")
dfCirc <- matrix(data=c(circleRadii, circleAreas), nrow=length(circleRadii), ncol=2)
dfCirc <- as.data.frame(dfCirc)
colnames(dfCirc) = c("Radius", "Area")
rownames(dfCirc) = NULL
plot_ly(dfCirc, x = ~Radius, y = ~ Area, type = 'scatter',
mode = 'lines', color = 'red') %>%
layout(title="Areas of Circles")
</code></pre>
<div class="figure">
<img src="Areas.png" width="500" alt="" />
<p class="caption">Using results from package in Plotly graphs</p>
</div>
<p><strong>Remark 1:</strong> If you wish to go back and modify the package code, you will need to close your R session that is using the package; otherwise, the code will not build.</p>
<p><strong>Remark 2:</strong> As discussed last time, you can also export the package to a binary and deploy it on another machine. In general, this allows you to design and build your own <code>Rcpp</code> R package, and deploy it for your colleagues to use, if you wish.</p>
</div>
<div id="summary" class="section level2">
<h2>Summary</h2>
<p>You have now seen how to import standard and reusable C++ code into an R package using <code>Rcpp</code>, use the package in R, and deploy it as a binary. The one remaining topic to cover is documentation, which we will take up next time.</p>
</div>

R Package Integration with Modern Reusable C++ Code Using Rcpp - Part 4
https://rviews.rstudio.com/2020/08/18/rpackageintegrationwithmodernreusableccodeusingrcpppart4/
Tue, 18 Aug 2020 00:00:00 +0000
https://rviews.rstudio.com/2020/08/18/rpackageintegrationwithmodernreusableccodeusingrcpppart4/
<p><em>Daniel Hanson is a full-time lecturer in the Computational Finance & Risk Management program within the Department of Applied Mathematics at the University of Washington.</em></p>
<p>In the <a href="https://rviews.rstudio.com/2020/07/31/rpackageintegrationwithmodernreusableccodeusingrcpppart3/">previous post</a> in this series, we looked at how to write interface files using <code>Rcpp</code> to call functions and instantiate classes in standard and reusable C++, with a code interface and reusable code examples shown in the discussion. My original plan for this week was to show how to import that code into an RStudio <code>Rcpp</code> project and build it into an R package, but as there are a number of steps in the setup and build process, we’ll first look at a very simple example to demonstrate these, and then we’ll turn our attention to importing the reusable C++ code next time.</p>
<p>The following discussion is a step-by-step guide to the project configuration and build process, using the single example <code>.cpp</code> file that is included by default when creating an <code>Rcpp</code> project in the RStudio IDE.</p>
<h2 id="creatinganrcpppackageprojectintherstudioide">Creating an Rcpp Package Project in the RStudio IDE</h2>
<p>Open the <a href="https://rstudio.com/products/rstudio/">RStudio IDE</a>, select <code>New Project...</code> from the <code>File</code> menu at the top, and then select <code>New Directory</code> as shown here:</p>
<p><img src="NewProjectWizardFirst.png" alt="New Project Wizard" /></p>
<p>Next, you will see the following selections, from which you should choose <code>R Package using Rcpp</code>. Be sure to make this selection, and not <code>R package</code> alone as shown above:</p>
<p><img src="NewProjectWizard.png" alt="Select R Package using Rcpp" /></p>
<p>Next, enter the desired directory path and new subdirectory name, and create the project; the subdirectory will be the name of your R package, e.g. <code>RcppProject</code>:</p>
<p><img src="PackageNameChoice.png" alt="Type in your package name" /></p>
<p>When finished, your RStudio session should look something like this:</p>
<p><img src="NewRcppProject.png" alt="RStudio Rcpp project" /></p>
<p>There is one more step to complete in order to ensure your interface functions will be exposed as R functions to your package users. In the <code>Files</code> pane at lower right, you should see a file called <code>NAMESPACE</code>:</p>
<p><img src="Files_NAMESPACE_Red.png" alt="The NAMESPACE file inside the package file structure" /></p>
<p>Double click on this file to open it in RStudio; you will see the following:</p>
<p><img src="Orig_NAMESPACE.png" alt="Original NAMESPACE file" /></p>
<p>Now, delete line 2, and then append a new line 3, as shown below. This will allow your tagged C++ interface files to be exportable to R. Leave line 4 blank, just as it is in the original. Then, save the file:</p>
<p><img src="Modified_NAMESPACE.png" alt="Updated NAMESPACE file" /></p>
<p><strong>Remark:</strong> There are more advanced ways to configure the <code>NAMESPACE</code> file when building an R package, which would require indepth explanation, distracting us from the main task of getting up and running with <code>Rcpp</code>. As such, we’ll just use this simple fix for our discussion.</p>
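<p>Since the screenshots are not reproduced here: for a package named <code>RcppProject</code>, the modified <code>NAMESPACE</code> commonly ends up along the following lines (treat this as an assumption about what the screenshots show, not a transcript of them):</p>

```
useDynLib(RcppProject, .registration = TRUE)
importFrom(Rcpp, sourceCpp)
exportPattern("^[[:alpha:]]+")
```

<p>The <code>exportPattern("^[[:alpha:]]+")</code> directive exports every object whose name starts with a letter, which includes the functions tagged with <code>// [[Rcpp::export]]</code>.</p>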
<h2 id="buildinganrpackage">Building an R Package</h2>
<p>Returning to the <code>Files</code> pane in the RStudio IDE (see <em>Figure 5</em> above), note the following subdirectories:</p>
<ul>
<li><code>man</code>: For documentation files (we will return to this in a later installment).</li>
<li><code>R</code>: For R code to be included in a package; for now, we will only be concerned with C++ code.</li>
<li><p><code>src</code>: This is where C++ code is located in the package:</p>
<ul>
<li>Both header (<code>.h</code>) and implementation (<code>.cpp</code>) files</li>
<li>Both interface and reusable C++ code files<br />
<br /></li>
</ul></li>
</ul>
<p>By clicking on the <code>src</code> subdirectory, you will see that there are two <code>C++</code> files that are present by default in a blank <code>RStudio Rcpp</code> project:</p>
<p><img src="DefaultCppFiles.png" alt="Default C++ files" /></p>
<ul>
<li><code>rcpp_hello_world.cpp</code>: Simple interface function included as an example, by default.</li>
<li><code>RcppExports.cpp</code>: This is a C++ file that is generated each time the <code>Rcpp</code> project is built in the RStudio environment. You need not be concerned about its contents, but it is crucial that you <em>never modify this file</em> on your own.<br />
<br /></li>
</ul>
<h3 id="thercpphelloworldcppfile">The <code>rcpp_hello_world.cpp</code> file</h3>
<p>Let’s look at this simple example first. It should look somewhat similar to the C++ interface files presented last time. The main difference is it does not call any functions in an external file; it simply returns an R <code>List</code> object to the function user in R. Note the <code>// [[Rcpp::export]]</code> tag; this will export the <code>rcpp_hello_world()</code> function to R:</p>
<pre><code>#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List rcpp_hello_world() {
    CharacterVector x = CharacterVector::create("foo", "bar");
    NumericVector y = NumericVector::create(0.0, 1.0);
    List z = List::create(x, y);
    return z;
}
</code></pre>
<h3 id="yourfirstrpackagewithccode">Your First R Package with C++ Code</h3>
<p>Now, let’s build the package with this single C++ function. To do this, go to the <code>Build</code> menu at the top of the RStudio IDE and select <code>Clean and Rebuild</code>. In the upper right hand pane in the IDE, you will then see the C++ being compiled, and the package being built.</p>
<p><img src="CleanAndRebuild.png" alt="Select 'Clean and Rebuild' to build the R package" /></p>
<p>When the build is complete, your R session will restart, and your package will be loaded into your current R session, as shown in the console at the bottom of the RStudio IDE:</p>
<p><img src="RestartingRSession.png" alt="R restarts and loads the package after the build is complete" /></p>
<p>Now, type in <code>rcpp_hello_world()</code> at the console prompt, and check your results. You should see the following:</p>
<p><img src="rcpp_hello_world().png" alt="Run 'rcpp_hello_world()' in R" /></p>
<p>Congratulations! You have just built your first R package with integrated C++, and you called the exported function from an R session. You can also check that the package contents have been placed in the usual <code>.../R4.0.x/library</code> directory, in a subdirectory with the package name, just like any other R package you load from CRAN.</p>
<h3 id="distributethepackageasabinary">Distribute the Package as a Binary</h3>
<p>You can also export the package in binary form to a <code>.zip</code> file on Windows, or a <code>.tar.gz</code> file on Mac or Linux. To do this, again from the <code>Build</code> menu, select <code>Build Binary Package</code>.</p>
<p><img src="BuildBinaryPackage.png" alt="Build the package binary for distribution" /></p>
<p>You will again see the compile and build process in the upper right hand corner of the RStudio IDE. When complete, you can find your distributable file, e.g. <code>RcppProject.zip</code>, in the directory one level up from your project directory. To deploy it, either copy it to another machine with the same OS and R setup, or delete the package subdirectory, e.g. <code>.../R4.0.x/library/RcppProject</code>. Then, open a new RStudio session, and install the package just as you would any other package locally:</p>
<p><img src="InstallPackage.png" alt="Install package locally" /></p>
<p>Next, load the package in your session, e.g. <code>library(RcppProject)</code>, and then call the exported function again to verify it works.</p>
<h2 id="summary">Summary</h2>
<p>We have now covered the process of building an R package containing C++ code in the RStudio IDE, by integrating the code into an <code>Rcpp</code> project. The C++ code in this case consisted of a single <code>.cpp</code> file, with a single interface function tagged for export to R, to keep the discussion focused more on the process itself. Next time, we will revisit the C++ code file examples in the previous post, and show how to integrate them into an R package. The process is essentially the same as above, but with multiple source code files and multiple interface files, it will involve some additional management and other details.</p>

Monitor COVID-19 at the COVID-19 Forecast Hub
https://rviews.rstudio.com/2020/08/10/uscovid19forecasts/
Mon, 10 Aug 2020 00:00:00 +0000
https://rviews.rstudio.com/2020/08/10/uscovid19forecasts/
<p>If you are looking for a place to monitor expert forecasts for United States weekly and cumulative COVID-19 deaths, you can’t do any better than the <a href="https://reichlab.io/">Reich Lab</a> (University of Massachusetts) <a href="https://viz.covid19forecasthub.org/">COVID-19 Forecast Hub</a>. The same is true for data scientists who may be seeking to publish their own forecasts and compare them with the work of their peers. Every Tuesday morning, the Hub publishes four-week national and state-level forecasts from over thirty-five different groups along with its own ensemble forecast. These forecasts can be examined via an effective <code>D3</code> interactive visualization, and are also available in a public <a href="https://github.com/reichlab/covid19-forecast-hub/blob/master/README.md">GitHub repository</a>. The ensemble forecast and all of the submitted forecasts are also passed on to the <a href="https://www.cdc.gov/coronavirus/2019ncov/coviddata/forecastingus.html">CDC</a> and the FiveThirtyEight <a href="https://projects.fivethirtyeight.com/covidforecasts/">forecast tracker</a>.</p>
<p>The visualization lets users select from a menu of forecasts to compare. By clicking on points marking the actual observed values on past dates, users can compare the accuracy of the various forecasts.</p>
<p>  </p>
<iframe src="https://viz.covid19forecasthub.org/" width="100%" height="600"></iframe>
<p>  </p>
<p>In addition to the visualization, the COVID-19 Forecast Hub has several notable features:</p>
<ul>
<li>The COVID-19 Forecast Hub <a href="https://covid19forecasthub.org/doc/team/">team</a> encourages <a href="https://covid19forecasthub.org/doc/participate/">participation</a> from any group that meets its standards, and provides a showcase for the <a href="https://covid19forecasthub.org/community/">Community</a>.</li>
<li>It conducts several automated tests, as well as some “human in the loop” tests, to screen forecasts for <a href="https://covid19forecasthub.org/doc/ensemble/">inclusion</a> in the ensemble. For this reason, you are unlikely to see forecasts that appear to float untethered from the actual data, or take off on precipitous increases or drastic decreases that appear to be “dramatically out of line with the historical data”.<br /></li>
<li>In addition to being available in the GitHub repository mentioned above, the details of the forecasts can also be downloaded programmatically from the <a href="https://zoltardata.com/project/44">Zoltar API</a>.</li>
<li>The ensemble is created by algorithms that score each model and average twenty-three quantiles of the predictive distribution produced by each included forecast.</li>
<li>The submitted forecasts included in the ensemble cover a wide range of modeling methodologies and model types. There are epidemiological models conditioned on various assumptions about disease transmission and public behavior, as well as unconditional time series and “curve fitting” models. There are SEIR models, deep learning models, agent-based simulations, and unique hybrid models.</li>
<li>In addition to the ensemble, the COVID-19 Forecast Hub team also produces a simple, but surprisingly accurate <a href="https://zoltardata.com/model/302">baseline forecast</a>. (Its median prediction of incidence at future time points is the most recently observed incidence.)</li>
</ul>
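<p>For intuition, equal-weight quantile averaging can be sketched in a few lines. This is a simplification: the Hub's actual procedure also screens and scores the member models, so take the sketch below as an illustration of the idea, not the Hub's exact algorithm:</p>

```cpp
#include <cstddef>
#include <vector>

// For each of the K quantile levels, the ensemble value is the mean of
// the member forecasts' values at that level.
std::vector<double> ensembleQuantiles(const std::vector<std::vector<double>>& members) {
    const std::size_t k = members.front().size();
    std::vector<double> out(k, 0.0);
    for (const auto& m : members)
        for (std::size_t i = 0; i < k; ++i)
            out[i] += m[i];
    for (double& v : out)
        v /= static_cast<double>(members.size());
    return out;
}
```

<p>With two member forecasts reporting quantiles {10, 20, 30} and {20, 40, 60}, the ensemble quantiles are {15, 30, 45}.</p>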
<p><img src="algo.png" height = "400" width="600"></p>
<p>For a detailed overview of the statistics and data science underlying the ensemble forecast and the COVID-19 Forecast Hub, have a look at the <a href="https://statds.org/events/webinar_dsa2020/schedule.html#reich">video</a> of the presentation Nicholas Reich recently gave in the ASA JDS <a href="https://statds.org/events/webinar_dsa2020/schedule.html#reich">Webinar Series</a>.</p>
<p>Finally, when you visit the COVID-19 Forecast Hub, please plan to spend some time examining the individual forecasts. The statistical imagination and technique on display is astounding, and there is quite a bit of data science to be learned. This effort is representative of the thousands of statisticians, data scientists, programmers and researchers worldwide who are giving their best to help control this pandemic.</p>

Party with R: How the Community Enabled Us to Write a Book
https://rviews.rstudio.com/2020/08/03/partywithrhowthecommunityenabledustowriteabook/
Mon, 03 Aug 2020 00:00:00 +0000
https://rviews.rstudio.com/2020/08/03/partywithrhowthecommunityenabledustowriteabook/
<p><em>Isabella C. Velásquez, MS, is a data analyst committed to nonprofit work with the aim of reducing racial and socioeconomic inequities.</em></p>
<p>“<em>It is like a party all the time; nobody has to worry about giving one or being invited; it is going on every day in the street and you can go down or be part of it from your window.</em>” – Eleanor Clark. Without the R community, <a href="https://www.datascienceineducation.com/"><em>Data Science in Education Using R</em></a> would never have happened. Most evidently, we wouldn’t have met each other without the strong R presence on Twitter that sparked a conversation about data use in education. More important, though, are the aspects of the R community that inspired that initial discussion and enabled us to complete a book for a broad and complex field.</p>
<p>In the <a href="https://datascienceineducation.com/c01.html">first chapter of our book</a>, we invite data practitioners in education to the party, but the R community invited us to the party first. Like any successful party, certain elements had to exist for us to join in and end up having an amazing time.</p>
<p><strong>The Invitation</strong> 📩</p>
<p>As someone who started following the R community on Twitter after it was already well established and popular, I never felt apprehensive about having to ask how to get involved or interact with others. First, there are so many avenues. The R community offers many ways to let you in, whether it be replying to tweets, posting a question on community.rstudio.com, or sharing a blog post on R Weekly. Second, the R community welcomes users no matter their level. Whether it is a code snippet with a function you found cool, a blog post, or a personal side project, there are ways to engage that appeal to everybody. Members of the R community can interact how they want and as often as they want.</p>
<p>For us, it was exciting to meet on Twitter, talk about collaborating on an education data project, and then just get started on it. We felt welcome and encouraged to do so. Because people in the R community meet other users virtually and begin side projects all the time, we didn’t have to worry about whether something like this was possible: The invitation was already there.</p>
<p><strong>An Open Door</strong> 🚪</p>
<p>In our first blog post, we described what it’s like to learn by seeing someone do the thing you want to learn. One of the best things about the R community is you constantly get to see this in action. The R community not only holds open principles but actually exhibits them whenever possible. Users post their code, projects, and drafts constantly. Just by scanning the #rstats hashtag, one can discover something new.</p>
<p>Because others were open, we knew that we wanted to be open as well. We wrote our book on <a href="https://github.com/dataedu/datascienceineducation">GitHub</a>, showing all the many changes it went through on the way to completion. By having it freely available on a <a href="https://www.datascienceineducation.com/">website</a>, we hope that it opens the door to others who’d like to learn what we did and work on their own open project as well.</p>
<p><strong>Food (for Thought)</strong> 🍕</p>
<p>The R community offers so many opportunities to get feedback, advice, and information from a wide variety of users. Early on in the book development, we had a lot of questions to nail down: who is the audience? “Data is” or “data are”? How do we describe “people who work in the education field and use data and want to get more effective at it”? Finding a common language was difficult, but we were able to do this by engaging others in the wider R community. We listened to the stories of many data scientists who work in education, then found common experiences we could describe in our writing.</p>
<p>As an example, we learned we weren’t the only ones challenged by learning a programming language while attending to full-time jobs and personal lives. Knowing this, we made sure to <a href="https://datascienceineducation.com/c02.html#differentstrokesfordifferentdatascientistsineducation">discuss these challenges in our book</a> and offer various ways of engaging with the material based on the reader’s needs.</p>
<p><strong>Socializing</strong> 💬</p>
<p>Throughout the process of writing DSIEUR, we asked the R community several times for other types of feedback and suggestions. We also ran into some technical issues, especially when it came to preparing our manuscript with {bookdown} to meet our publisher’s specifications, and were able to get ideas on how to resolve them. We know that the R community is a safe and encouraging place to ask questions, and this enabled us to write a stronger book.</p>
<p><strong>The Next Day</strong> ☀️</p>
<p>The community participation throughout the DSIEUR process helped us define our goals and get feedback. Another wonderful aspect is that the R community engaged us back. Writing this R Views series is an example: Someone reached out to us and wanted us to reflect on what we discovered and share it with all of you. This type of engagement reminds us of what an inclusive and encouraging place the R community is and helps us come up with new ways of making sure others see the invitation as well.</p>
<p>Thank you for reading! 🎉 We’ll be back with the fourth post on “One Writer, Five Authors” in about two weeks. Until then, we’d love to know how else the R paRty has encouraged your work, both personally and professionally. You can reach us on Twitter: Emily <a href="https://twitter.com/ebovee09">@ebovee09</a>, Jesse <a href="https://twitter.com/kierisi">@kierisi</a>, Joshua <a href="https://twitter.com/jrosenberg6432">@jrosenberg6432</a>, Ryan <a href="https://twitter.com/RyanEs">@RyanEs</a>, and me <a href="https://twitter.com/ivelasq3">@ivelasq3</a>.</p>

R Package Integration with Modern Reusable C++ Code Using Rcpp  Part 3
https://rviews.rstudio.com/2020/07/31/rpackageintegrationwithmodernreusableccodeusingrcpppart3/
Fri, 31 Jul 2020 00:00:00 +0000
https://rviews.rstudio.com/2020/07/31/rpackageintegrationwithmodernreusableccodeusingrcpppart3/
<p><em>Daniel Hanson is a full-time lecturer in the Computational Finance & Risk Management program within the Department of Applied Mathematics at the University of Washington.</em></p>
<p>In the <a href="https://rviews.rstudio.com/2020/07/14/rpackageintegrationwithmodernreusableccodeusingrcpppart2/">previous post</a> in this series, we looked at some design considerations when integrating standard and reusable C++ code into an R package. Specific emphasis was on Rcpp’s role in facilitating a means of communication between R and the C++ code, particularly highlighting a few of the C++ functions in the <code>Rcpp</code> namespace that conveniently and efficiently pass data between an R <code>numeric</code> vector and a C++ <code>std::vector<double></code> object.</p>
<p>Today, we will look at a specific example of implementing the interface. We will see how to configure code that allows the use of standard, reusable C++ in an R package, without having to modify it with any R- or Rcpp-specific syntax. It is admittedly a toy example, but the goal is to provide a starting point that can easily be extended to more real-world examples.</p>
<div id="thecode" class="section level2">
<h2>The Code</h2>
<p>To get started, let’s have a look at the code we will use for our demonstration. It is broken into three categories, consistent with the design considerations from the previous post:</p>
<ul>
<li>Standard and reusable C++: No dependence on R or Rcpp</li>
<li>Interface-level C++: Uses functions in the <code>Rcpp</code> namespace</li>
<li>R functions exported by the interface level: Same names as in the interface level</li>
</ul>
<div id="standardandreusableccode" class="section level3">
<h3>Standard and Reusable C++ Code</h3>
<p>In this example, we will use a small set of C++ non-member functions and two classes. There is a declaration (header) file for the non-member functions, say <code>NonmemberCppFcns.h</code>, and another with class declarations for two shape classes, <code>Square</code> and <code>Circle</code>, called <code>ConcreteShapes.h</code>. Each of these is accompanied by a corresponding implementation file, with file extension <code>.cpp</code>, as one might expect in a more realistic C++ code base.</p>
<div id="nonmembercfunctions" class="section level4">
<h4>Non-Member C++ Functions</h4>
<p>These functions are shown and described here, as declared in the following C++ header file:</p>
<pre><code>#include <vector>
// Adds two real numbers
double add(double x, double y);
// Sorts a vector of real numbers and returns it
std::vector<double> sortVec(std::vector<double> v);
// Computes the product of the LCM and GCD of two integers,
// using C++17 functions std::lcm(.) and std::gcd(.)
int prodLcmGcd(int m, int n);</code></pre>
<p>The last function uses recently added features in the C++ Standard Library, to show that we can use C++17.</p>
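<p>The corresponding implementation file is not shown here. Purely as an illustrative sketch (the function bodies below are my assumptions, not the actual source), <code>NonmemberCppFcns.cpp</code> might look like this, noting that <code>std::lcm</code> and <code>std::gcd</code> come from the C++17 <code><numeric></code> header:</p>

```cpp
// Hypothetical NonmemberCppFcns.cpp -- an illustrative sketch only
#include <vector>
#include <algorithm>  // std::sort
#include <numeric>    // std::lcm, std::gcd (C++17)

// Adds two real numbers
double add(double x, double y)
{
    return x + y;
}

// Sorts a vector of real numbers and returns it
std::vector<double> sortVec(std::vector<double> v)
{
    std::sort(v.begin(), v.end());
    return v;
}

// Computes the product of the LCM and GCD of two integers,
// using the C++17 functions std::lcm(.) and std::gcd(.)
int prodLcmGcd(int m, int n)
{
    return std::lcm(m, n) * std::gcd(m, n);
}
```

<p>A convenient sanity check: for positive integers, the product of the LCM and GCD equals the product of the two integers themselves, so <code>prodLcmGcd(4, 6)</code> returns <code>24</code>.</p>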
</div>
<div id="cclasses" class="section level4">
<h4>C++ Classes</h4>
<p>The two classes in our reusable code base are declared in the <code>ConcreteShapes.h</code> file, as shown and described here. Much like textbook C++ examples, we’ll write classes for two geometric shapes, each with a member function to compute the area of its corresponding object.</p>
<pre><code>#include <cmath>
class Circle
{
public:
Circle(double radius);
// Computes the area of a circle with given radius
double area() const;
private:
double radius_;
};
class Square
{
public:
Square(double side);
// Computes the area of a square with given side length
double area() const;
private:
double side_;
};</code></pre>
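<p>Again, the implementation file is not shown; the following self-contained sketch is my assumption of what <code>ConcreteShapes.cpp</code> might contain (the class declarations are repeated inline so the example compiles on its own, and the pi constant is my own choice rather than anything from the original code, which would instead include <code>ConcreteShapes.h</code>):</p>

```cpp
// Hypothetical ConcreteShapes.cpp -- an illustrative sketch only.
// The class declarations from ConcreteShapes.h are repeated here
// so that the example is self-contained.
class Circle
{
public:
    Circle(double radius);
    double area() const;
private:
    double radius_;
};

class Square
{
public:
    Square(double side);
    double area() const;
private:
    double side_;
};

Circle::Circle(double radius) : radius_(radius) {}

// Area of a circle: pi * radius^2
double Circle::area() const
{
    constexpr double pi = 3.141592653589793;
    return pi * radius_ * radius_;
}

Square::Square(double side) : side_(side) {}

// Area of a square: side^2
double Square::area() const
{
    return side_ * side_;
}
```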
</div>
</div>
<div id="interfacelevelc" class="section level3">
<h3>Interface Level C++</h3>
<p>Now, the next step is to employ Rcpp, namely for the following essential tasks:</p>
<ul>
<li>Export the interface functions to R</li>
<li>Facilitate data exchange between R and C++ container objects</li>
</ul>
<p>An interface file containing functions designated for export to R does not require a header file with declarations; one can think of it as being analogous to a <code>.cpp</code> file that contains the <code>main()</code> function in a C++ executable project. In addition, the interface can be contained in one file, or split into multiple files. For demonstration, I have written two such files: <code>CppInterface.cpp</code>, which provides the interface to the nonmember functions above, and <code>CppInterface2.cpp</code>, which does the same for the two C++ classes.</p>
<div id="interfacetononmembercfunctions" class="section level4">
<h4>Interface to Non-Member C++ Functions:</h4>
<p>Let’s first have a look at the <code>CppInterface.cpp</code> interface file, which connects R with the non-member functions in our C++ code base:</p>
<pre><code>#include "NonmemberCppFcns.h"
#include <vector>
#include <Rcpp.h>
// Nonmember Function Interfaces:
// [[Rcpp::export]]
int rAdd(double x, double y)
{
// Call the add(.) function in the reusable C++ code base:
return add(x, y);
}
// [[Rcpp::export]]
Rcpp::NumericVector rSortVec(Rcpp::NumericVector v)
{
// Transfer data from NumericVector to std::vector<double>
auto stlVec = Rcpp::as<std::vector<double>>(v);
// Call the reusable sortVec(.) function, with the expected
// std::vector<double> argument:
stlVec = sortVec(stlVec);
// Reassign the results from the vector<double> return object
// to the same NumericVector v, using Rcpp::wrap(.):
v = Rcpp::wrap(stlVec);
// Return as an Rcpp::NumericVector:
return v;
}
// C++17 example:
// [[Rcpp::export]]
int rProdLcmGcd(int m, int n)
{
return prodLcmGcd(m, n);
}</code></pre>
<div id="includeddeclarations" class="section level5">
<h5>Included Declarations:</h5>
<p>The <code>NonmemberCppFcns.h</code> declaration file is included at the top with <code>#include</code>, just as it would be in a standalone C++ application, so that the interface will recognize the functions that reside in the reusable code base. The STL <code>vector</code> declaration is required, as we shall soon see. And the key to making the interface work resides in the <code>Rcpp.h</code> file, which provides access to very useful C++ functions in the <code>Rcpp</code> namespace.</p>
</div>
<div id="functionimplementations" class="section level5">
<h5>Function implementations:</h5>
<p>Each of these functions is designated for export to R when the package is built by placing the
<code>//</code> <code>[[Rcpp::export]]</code> tag just above each function signature, as shown above. In this particular example, each interface function simply calls a function in the reusable code base. For example, the <code>rAdd(.)</code> function simply calls the <code>add(.)</code> function in the reusable C++ code. In the absence of a user-defined namespace, the interface function name must be different from the function it calls to prevent name-clash errors during the build, so I have simply chosen to prefix the name of each interface function with an <code>r</code>.</p>
<p>Note that the <code>rSortVec(.)</code> function takes in an <code>Rcpp::NumericVector</code> object, <code>v</code>. This type will accept data passed in from R as a <code>numeric</code> vector and present it as a C++ object. Then, so that we can call a function in our code base, such as <code>sortVec(.)</code>, which expects a <code>std::vector<double></code> type as its input, <code>Rcpp</code> provides the <code>Rcpp::as<.>(.)</code> function, which facilitates the transfer of data from an <code>Rcpp::NumericVector</code> object to the STL container:</p>
<p><code>auto stlVec = Rcpp::as<std::vector<double>>(v);</code></p>
<p><code>Rcpp</code> also gives us a function that will transfer data from a <code>std::vector<double></code> type being returned from our reusable C++ code back into an <code>Rcpp::NumericVector</code>, so that the results can be passed back to R as a familiar <code>numeric</code> vector type:</p>
<p><code>v = Rcpp::wrap(stlVec);</code></p>
<p>As the <code>std::vector<double></code> object is the workhorse C++ STL container in quantitative work, these two <code>Rcpp</code> functions are a godsend.</p>
<p><strong>Remark 1:</strong> There is no rule that says an interface function can only call a single function in the reusable code; one can use whichever functions or classes that are needed to get the job done, just like with any other C++ function. I’ve merely kept it simple here for demonstration purposes.</p>
<p><strong>Remark 2:</strong> The tag <code>// [[Rcpp::plugins(cpp17)]]</code> is sometimes placed at the top of a C++ source file in online examples related to <code>Rcpp</code> and C++17. I have not found this necessary in my own code, however, as long as the <code>Makeconf</code> file has been updated for C++17, as described in the <a href="https://rviews.rstudio.com/2020/07/08/rpackageintegrationwithmodernreusableccodeusingrcpp/">first post</a> in this series.</p>
</div>
</div>
<div id="interfacetocclasses" class="section level4">
<h4>Interface to C++ Classes:</h4>
<p>We now turn our attention to the second interface file, <code>CppInterface2.cpp</code>, which connects R with the C++ classes in our reusable code. It is shown here:</p>
<pre><code>#include "ConcreteShapes.h"
// Class Member Function Interfaces:
// Interface to Square member
// function area(.):
// [[Rcpp::export]]
double squareArea(double side)
{
Square sq(side);
return sq.area();
}
// Interface to Circle member
// function area(.):
// [[Rcpp::export]]
double circleArea(double radius)
{
Circle circ(radius);
return circ.area();
}</code></pre>
<p>This is again nothing terribly sophisticated, but the good news is that it shows the process of creating instances of classes from the code base is not difficult at all. We first <code>#include</code> only the header file containing these class declarations; <code>Rcpp.h</code> is not required here, as we are not using any functions in the <code>Rcpp</code> namespace.</p>
<p>To compute the area of a square, the <code>side</code> length is input in R as a simple <code>numeric</code> type and passed to the interface function as a C++ <code>double</code>. The <code>Square</code> object, <code>sq</code>, is constructed with the <code>side</code> argument, and its <code>area()</code> member function performs said calculation and returns the result. The process is trivially similar for the <code>circleArea(.)</code> function.</p>
</div>
</div>
<div id="rfunctionsexportedbytheinterfacelevel" class="section level3">
<h3>R Functions Exported by the Interface Level</h3>
<p>To wrap up this discussion, let’s look at the functions an R user will have available after we build the package in RStudio (coming next in this series). Each of these functions will be exported from their respective C++ interface functions as regular R functions, namely:</p>
<ul>
<li><code>rAdd(.)</code></li>
<li><code>rSortVec(.)</code></li>
<li><code>rProdLcmGcd(.)</code></li>
<li><code>squareArea(.)</code></li>
<li><code>circleArea(.)</code></li>
</ul>
<p>The package user will not need to know or care that the core calculations are being performed in C++. Visually, we can represent the associations as shown in the following diagram:</p>
<p><img src="CompositeCodeDiagram.png" alt = "Mapping R Package Functions to Reusable C++" height = "400" width="600"></p>
<p>The solid red line represents a “Chinese Wall” that separates our code base from the interface and allows us to maintain it as standard and reusable C++.</p>
</div>
</div>
<div id="summary" class="section level2">
<h2>Summary</h2>
<p>This concludes our example of configuring code that allows the use of standard, reusable C++ in an R package, without having to modify it with any R- or Rcpp-specific syntax. In the next post, we will examine how to actually build this code into an R package by leveraging the convenience of Rcpp and RStudio, and deploy it for any number of R users. The source code will also be made available so that you can try it out for yourself.</p>
</div>

June 2020: "Top 40" New CRAN Packages
https://rviews.rstudio.com/2020/07/27/june2020top40newcranpackages/
Mon, 27 Jul 2020 00:00:00 +0000
https://rviews.rstudio.com/2020/07/27/june2020top40newcranpackages/
<p>Two hundred ninety new packages made it to CRAN in June. Here are my “Top 40” picks in ten categories: Computational Methods, Data, Genomics, Machine Learning, Medicine, Science, Statistics, Time Series, Utilization, and Visualization.</p>
<h3 id="computationalmethods">Computational Methods</h3>
<p><a href="https://CRAN.Rproject.org/package=Rfractran">Rfractran</a> v1.0 Implements the esoteric, Turing complete <a href="https://en.wikipedia.org/wiki/FRACTRAN">FRACTRAN</a> programming language invented by <a href="https://en.wikipedia.org/wiki/John_Horton_Conway">John Horton Conway</a>.</p>
<p><a href="https://cran.rproject.org/package=QGameTheory">QGameTheory</a> v0.1.2: Provides a general purpose toolbox for simulating quantum versions of game theoretic models See <a href="arXiv:quantph/0208069">Flitney and Abbott (2002)</a> for background. Models include the Penny Flip Game <a href="arXiv:quantph/98040100">Meyer (1998)</a>, the Prisoner’s Dilemma <a href="arXiv:quantph/0506219">Grabbe (2005)</a>, Two Person Duel <a href="arXiv:quantph/0305058">Flitney and Abbott (2004)</a>, Battle of the Sexes <a href="arXiv:quantph/0110096">Nawaz and Toor (2004)</a>, Hawk and Dove Game <a href="arXiv:quantph/0108075">Nawaz and Toor (2010)</a>, Newcomb’s Paradox <a href="arXiv:quantph/0202074">Piotrowski and Sladkowski (2002)</a> and the Monty Hall Problem <a href="arXiv:quantph/0109035">Flitney and Abbott (2002)</a>. Look <a href="https://github.com/indrag49/QGameTheory">here</a> for and introduction to the package.</p>
<h3 id="data">Data</h3>
<p><a href="https://cran.rproject.org/package=covid19dbcand">covid19dbcand</a> v0.1.0: Provides access seventyfive <a href="http://drugbank.ca/covid19">Drugbank</a> data sets containing information about possible treatment for COVID19.</p>
<p><a href="https://CRAN.Rproject.org/package=tidytuesdayR">tidytuesdayR</a> v1.0.1: Provides functions for downloading the <a href="https://www.tidytuesday.com/">Tidy Tuesday</a> data sets from R for Data Science Online Learning Community <a href="https://github.com/rfordatascience/tidytuesday">repository</a></p>
<p><a href="https://cran.rproject.org/package=us.census.geoheader">us.census.geoheader</a> v1.0.2: Implements a simple interface to the Geographic Header information from the <a href="https://catalog.data.gov/dataset/census2000summaryfile2sf2">2010 US Census Summary File 2</a>. See the <a href="https://cran.rproject.org/web/packages/us.census.geoheader/vignettes/atour.html">vignette</a> for details.</p>
<h3 id="genomics">Genomics</h3>
<p><a href="https://CRAN.Rproject.org/package=dnapath">dnapath</a> v0.6.4: Provides functions to integrate pathway information into the differential network analysis of two gene expression datasets as described in <a href="https://www.nature.com/articles/s41598019419183">Grimes et al. (2019)</a>. There is an <a href="https://cran.rproject.org/web/packages/dnapath/vignettes/introduction_to_dnapath.html">Introduction</a> and a vignette on <a href="https://cran.rproject.org/web/packages/dnapath/vignettes/package_data.html">Datasets</a>.</p>
<p><img src="dnapath.png" height = "300" width="300"></p>
<p><a href="https://cran.rproject.org/package=TreeDist">TreeDist</a> v1.1.1: Implements measures of tree similarity, including the informationbased generalized RobinsonFoulds distances <a href="https://academic.oup.com/bioinformatics/articleabstract/doi/10.1093/bioinformatics/btaa614/5866976?redirectedFrom=fulltext">Smith, (2020)</a>, the <a href="https://academic.oup.com/bioinformatics/article/22/1/117/217975">Nye et al. (2006)</a> metric and other additional metrics. There are several vignettes: <a href="https://cran.rproject.org/web/packages/TreeDist/vignettes/GeneralizedRF.html">Generalized RobinsonFoulds distances</a>, <a href="https://cran.rproject.org/web/packages/TreeDist/vignettes/RobinsonFoulds.html">Extending the RobinsonFoulds metric</a>, <a href="https://cran.rproject.org/web/packages/TreeDist/vignettes/UsingTreeDist.html">Calculate tree similarity</a>, <a href="https://cran.rproject.org/web/packages/TreeDist/vignettes/information.html">Comparing splits using information theory</a>, and <a href="https://cran.rproject.org/web/packages/TreeDist/vignettes/usingdistances.html">Contextualizing tree distances</a>.</p>
<p><img src="TreeDist.png" height = "400" width="600"></p>
<p><a href="https://cran.rproject.org/package=volcano3D">volcano3D</a> v1.0.1: Implements interactive plotting for threeway differential expression analysis which is useful for discovering quantitative changes in expression levels between experimental groups. See <a href="https://www.cell.com/cellreports/fulltext/S22111247(19)310071?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS2211124719310071%3Fshowall%3Dtrue">Lewis et al. (2019)</a> for background and the <a href="https://cran.rproject.org/web/packages/volcano3D/vignettes/Vignette.html">vignette</a>.</p>
<p><img src="volcano3D.png" height = "600" width="400"></p>
<h3 id="machinelearning">Machine Learning</h3>
<p><a href="https://CRAN.Rproject.org/package=boundingbox">boundingbox</a> v1.0.1: Provides functions to generate bounding boxes for image classification. See <a href="https://www.sciencedirect.com/science/article/pii/S1877050912007260?via%3Dihub">Ibrahim et al. (2012)</a> for background and the <a href="https://cran.rproject.org/web/packages/boundingbox/vignettes/boundingboxvignette.html">vignette</a> for and introduction to the package.</p>
<p><img src="boundingbox.jpeg" height = "400" width="400"></p>
<p><a href="https://CRAN.Rproject.org/package=corels">corels</a> v0.0.2: Implements the Certifiably Optimal RulE ListS (Corels)’ learner described in <a href="arXiv:1704.01701">Angelino et al. (2017)</a> which provides interpretable decision rules with an optimality guarantee. <a href="https://cran.rproject.org/web/packages/corels/readme/README.html">README</a> contains an example.</p>
<p><img src="corels.png" height = "400" width="400"></p>
<p><a href="https://cran.rproject.org/package=nntrf">nntrf</a> v0.1.0: Implements nonlinear dimension reduction by means of a neural network with hidden layers which can be useful as data preprocessing for machine learning methods that do not work well with many irrelevant or redundant features. See <a href="https://www.nature.com/articles/323533a0">Rumelhart et al. (1986)</a> for background and the <a href="https://cran.rproject.org/web/packages/nntrf/vignettes/nntrf.html">vignette</a>.</p>
<p><img src="nntrf.png" height = "200" width="400"></p>
<p><a href="https://cran.rproject.org/package=permimp">permimp</a> v1.00: Implements an addon to the <code>party</code> package, with a faster implementation of the partialconditional permutation importance for random forests. See <a href="https://bmcbioinformatics.biomedcentral.com/articles/10.1186/14712105825">Strobl et al. (2007)</a> for background and the <a href="https://cran.rproject.org/web/packages/permimp/vignettes/permimppackage.html">vignette</a> for an introduction.</p>
<p><img src="permimp.png" height = "300" width="500"></p>
<p><a href="https://cran.rproject.org/package=pluralize">pluralize</a> v0.2.0: Provides tools based on a <a href="https://github.com/blakeembrey/pluralize">JavaScript library</a> to create plural, singular, and regular forms of English words along with tools to augment the builtin rules to fit specialized needs. See the <a href="https://cran.rproject.org/web/packages/pluralize/vignettes/Whypluralize.html">vignette</a> for examples.</p>
<p><a href="https://CRAN.Rproject.org/package=triplot">triplot</a> v1.3.0: Provides model agnostic tools for exploring effects of correlated features in predictive models and calculating the importance of the groups of explanatory variables. <a href="arXiv:1806.08915">Biecek (2018)</a> for details and look <a href="https://github.com/ModelOriented/triplot">here</a> for an example.</p>
<p><img src="triplot.png" height = "300" width="500"></p>
<p><a href="https://cran.rproject.org/package=tfaddons">tfaddons</a> v0.10.0: Provides and interface to <a href="https://www.tensorflow.org/addons">TensorFlow Addons</a>. See the <a href="https://cran.rproject.org/web/packages/tfaddons/vignettes/NMT.html">vignette</a> for an example.</p>
<h3 id="medicine">Medicine</h3>
<p><a href="https://CRAN.Rproject.org/package=BayesianReasoning">BayesianReasoning</a> v0.3.2: Provides functions to plot and help understand positive and negative predictive values (PPV and NPV), and their relationship with sensitivity, specificity, and prevalence. See <a href="https://onlinelibrary.wiley.com/doi/full/10.1111/j.16512227.2006.00180.x">Akobeng (2007)</a> for a theoretical overview and <a href="https://www.frontiersin.org/articles/10.3389/fpsyg.2015.01327/full">Navarrete et al. (2015)</a> for a practical explanation. There is an <a href="https://cran.rproject.org/web/packages/BayesianReasoning/vignettes/introduction.html">Introduction</a> and a vignette on <a href="https://cran.rproject.org/web/packages/BayesianReasoning/vignettes/PPV_NPV.html">Screening Tests</a>.</p>
<p><img src="BayesianReasoning.png" height = "400" width="600"></p>
<p><a href="https://cran.rproject.org/package=riskCommunicator">riskCommunicator</a> v0.1.0: Provides functions to estimate flexible epidemiological effect measures including both differences and ratios using the parametric Gformula. See <a href="https://www.sciencedirect.com/science/article/pii/0270025586900886?via%3Dihub">Robbins (1986)</a> and <a href="https://academic.oup.com/aje/article/169/9/1140/125286">Ahern et al. (2009)</a> for background. There is an <a href="https://cran.rproject.org/web/packages/riskCommunicator/vignettes/Vignette_manuscript.html">Introduction</a> and a <a href="https://cran.rproject.org/web/packages/riskCommunicator/vignettes/Vignette_newbieRusers.html">vignette</a> for newbie R users.</p>
<p><img src="RiskCommunicator.png" height = "400" width="600"></p>
<h3 id="science">Science</h3>
<p><a href="https://CRAN.Rproject.org/package=actel">actel</a> v1.0.0: Designed for studies where fish tagged with acoustic tags are expected to move through receiver arrays, this package combines the advantages of automatic sorting and checking of fish movements with the possibility for user intervention on tags that deviate from expected behavior. Calculations are based on <a href="https://www.researchgate.net/publication/256443823_Using_markrecapture_models_to_estimate_survival_from_telemetry_data">Perry et al. (2012)</a>. There are an astounding seventeen vignettes including: <a href="https://cran.rproject.org/web/packages/actel/vignettes/a0_workspace_requirements.html">Preparing your data</a>, <a href="https://cran.rproject.org/web/packages/actel/vignettes/a1_study_area.html">Structruing the Study Area</a>, and <a href="https://cran.rproject.org/web/packages/actel/vignettes/b0_explore.html">Explore</a>.</p>
<p><img src="actel.SVG" height = "400" width="600"></p>
<p><a href="https://CRAN.Rproject.org/package=safedata">safedata</a> v1.0.5: Provides access to data from the <a href="https://www.safeproject.net/">SAFE Project</a>, a large scale ecological experiment in Malaysian Borneo that explores the impact of habitat fragmentation and conversion on ecosystem function and services. There is an <a href="https://cran.rproject.org/web/packages/safedata/vignettes/overview.html">Overview</a> and an <a href="https://cran.rproject.org/web/packages/safedata/vignettes/using_safe_data.html">Introduction</a>.</p>
<p><img src="safedata.png" height = "400" width="500"></p>
<h3 id="statistics">Statistics</h3>
<p><a href="https://cran.rproject.org/package=causact">causact</a> v0.3.2: Built on <code>greta</code> and <code>TensorFlow</code>, this package enables users to define probabilistic models using directed acyclic graphs. See <a href="https://cran.rproject.org/web/packages/causact/readme/README.html">README</a> for examples.</p>
<p><img src="causact.png" height = "400" width="600"></p>
<p><a href="https://CRAN.Rproject.org/package=frechet">frechet</a> v0.1.0: Provides implementation of statistical methods for random objects lying in various metric spaces, which are not necessarily linear spaces including Fréchet regression for random objects with Euclidean predictors. See <a href="https://projecteuclid.org/euclid.aos/1547197235">Petersen and Müller (2019)</a> for the theory.</p>
<p><a href="https://cran.rproject.org/package=hmclearn">hmclearn</a> v0.0.3: Provide a framework for learning the intricacies of the Hamiltonian Monte Carlo. See <a href="arXiv:1701.02434">Michael (2017)</a> and <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/9781118445112.stat08243">Thomas and Tu (2020)</a> for background. There are vignettes on <a href="https://cran.rproject.org/web/packages/hmclearn/vignettes/linear_mixed_effects_hmclearn.html">Linear mixed effects</a>, <a href="https://cran.rproject.org/web/packages/hmclearn/vignettes/linear_regression_hmclearn.html">linear regression</a>, <a href="https://cran.rproject.org/web/packages/hmclearn/vignettes/logistic_mixed_effects_hmclearn.html">logistic mixed effects</a>, <a href="https://cran.rproject.org/web/packages/hmclearn/vignettes/logistic_regression_hmclearn.html">logistic regression</a>, and <a href="https://cran.rproject.org/web/packages/hmclearn/vignettes/poisson_regression_hmclearn.html">poisson regrression</a>.</p>
<p><img src="hmclearn.png" height = "400" width="600"></p>
<p><a href="https://cran.rproject.org/package=mashr">mashr</a> v0.2.38: Implements the multivariate adaptive shrinkage (mash) method of <a href="https://www.nature.com/articles/s4158801802688">Urbut et al. (2019)</a> for estimating and testing large numbers of effects in many conditions (or many outcomes) There is an <a href="https://cran.rproject.org/web/packages/mashr/vignettes/intro_mash.html">Introduction</a> and vignettes on <a href="https://cran.rproject.org/web/packages/mashr/vignettes/eQTL_outline.html">eQTL studies</a>, <a href="https://cran.rproject.org/web/packages/mashr/vignettes/intro_correlations.html">Correlations</a>, <a href="https://cran.rproject.org/web/packages/mashr/vignettes/intro_mash_dd.html">Covariances</a>, <a href="https://cran.rproject.org/web/packages/mashr/vignettes/intro_mashcommonbaseline.html">Common Baseline</a>, <a href="https://cran.rproject.org/web/packages/mashr/vignettes/intro_mashnobaseline.html">No Common Baseline</a>, <a href="https://cran.rproject.org/web/packages/mashr/vignettes/mash_sampling.html">Sampling from Posteriors</a>, and <a href="https://cran.rproject.org/web/packages/mashr/vignettes/simulate_noncanon.html">Simulation</a>.</p>
<p><a href="https://CRAN.Rproject.org/package=molic">molic</a> v2.0.1: Implements the method of <a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/sjos.12407">Lindskou et al. (2019)</a> to detect outliers in high dimensional, categorical data. There are vignettes on the <a href="https://cran.rproject.org/web/packages/molic/vignettes/outlier_intro.html">Outlier Model</a>, <a href="https://cran.rproject.org/web/packages/molic/vignettes/dermatitis.html">Detecting Skin Diseases</a>, and <a href="https://cran.rproject.org/web/packages/molic/vignettes/genetic_example.html">Genetic Data</a>.</p>
<p><img src="molic.png" height = "300" width="500"></p>
<p><a href="https://cran.rproject.org/package=multinma">multinma</a> v0.1.3: Uses <code>Stan</code> to fit network metaanalysis and network metaregression models for aggregate data, individual patient data, and mixtures of both. See <a href="https://rss.onlinelibrary.wiley.com/doi/full/10.1111/rssa.12579">Phillippo et al. (2020)</a> for background and the vignettes for examples:
<a href="https://cran.rproject.org/web/packages/multinma/vignettes/example_atrial_fibrillation.html">Stroke prevention</a>, <a href="https://cran.rproject.org/web/packages/multinma/vignettes/example_bcg_vaccine.html">BCG Vaccine for Tuberculosis</a>, <a href="https://cran.rproject.org/web/packages/multinma/vignettes/example_blocker.html">Beta Blockers</a>, <a href="https://cran.rproject.org/web/packages/multinma/vignettes/example_diabetes.html">Diabetes</a>, <a href="https://cran.rproject.org/web/packages/multinma/vignettes/example_dietary_fat.html">Dietary Fat</a>, <a href="https://cran.rproject.org/web/packages/multinma/vignettes/example_parkinsons.html">Parkinson’s disease</a>, <a href="https://cran.rproject.org/web/packages/multinma/vignettes/example_plaque_psoriasis.html">Plaque Psoriasis</a>, <a href="https://cran.rproject.org/web/packages/multinma/vignettes/example_smoking.html">Smoking Cessation</a>, <a href="https://cran.rproject.org/web/packages/multinma/vignettes/example_statins.html">Statins</a>, <a href="https://cran.rproject.org/web/packages/multinma/vignettes/example_thrombolytics.html">Thrombolytic Treatments</a>, and <a href="https://cran.rproject.org/web/packages/multinma/vignettes/example_transfusion.html">neutropenia or neutrophil dysfunction</a>.</p>
<p><img src="multinma.png" height = "300" width="500"></p>
<p><a href="https://CRAN.Rproject.org/package=SCOUTer">SCOUTer</a> v1.0.0: Offers a new approach to simulating outliers by generating new observations defined by the statistics: Squared Prediction Error (SPE) and Hotelling’s $T^{2}$ statistic. See the <a href="https://cran.rproject.org/web/packages/SCOUTer/vignettes/demoscouter.html">vignette</a>.</p>
<p><img src="SCOUTer.png" height = "400" width="600"></p>
<h3 id="timeseries">Time Series</h3>
<p><a href="https://cran.rproject.org/package=bootUR">bootUR</a> v0.1.0: Provides functions to perform various bootstrap unit root tests for individual time series (including the augmented Dickey-Fuller test and union tests), multiple time series, and panel data. See <a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/j.14679892.2007.00565.x">Palm et al. (2008)</a> for background, and the <a href="https://cran.rproject.org/web/packages/bootUR/index.html">vignette</a> for an introduction and extensive references.</p>
<p><a href="https://cran.rproject.org/package=ChangePointTaylor">ChangePointTaylor</a> v0.1.0: Implements the change in mean detection method described in <a href="https://variation.com/wpcontent/uploads/changepointanalyzer/changepointanalysisapowerfulnewtoolfordetectingchanges.pdf">Taylor (2000)</a>. See the <a href="https://cran.rproject.org/web/packages/ChangePointTaylor/vignettes/ChangePointTaylorvignette.html">vignette</a>.</p>
<p><img src="cpt.png" height = "400" width="600"></p>
<p><a href="https://CRAN.Rproject.org/package=LOPART">LOPART</a>: Implements the change point detection algorithm described in <a href="https://arxiv.org/abs/2006.13967">Hocking and Srivastava (2020)</a>.</p>
<p><img src="LOPART.png" height = "400" width="600"></p>
<p><a href="https://cran.rproject.org/package=modeltime">modeltime</a> v0.0.2: Implements a time series forecasting framework for use with the <code>tidymodels</code> ecosystem, including ARIMA, exponential smoothing, and time series models from the <code>forecast</code> and <code>prophet</code> packages. See <a href="https://otexts.com/fpp2/"><em>Forecasting Principles & Practice</em></a> and <a href="https://research.fb.com/blog/2017/02/prophetforecastingatscale/"><em>Prophet: forecasting at scale</em></a> for background. There is a <a href="https://cran.rproject.org/web/packages/modeltime/vignettes/gettingstartedwithmodeltime.html">Getting Started Guide</a> and vignettes describing <a href="https://cran.rproject.org/web/packages/modeltime/vignettes/extendingmodeltime.html">Extension</a> and the <a href="https://cran.rproject.org/web/packages/modeltime/vignettes/modeltimemodellist.html">Model List</a>.</p>
<p><img src="modeltime.jpeg" height = "400" width="600"></p>
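<p>To give a flavor of the workflow <code>modeltime</code> enables, here is a minimal sketch in the spirit of the Getting Started Guide linked above. It assumes the <code>m4_monthly</code> example data from <code>timetk</code>; treat it as an illustration rather than tested code.</p>
<pre class="r"><code>library(tidymodels)
library(modeltime)
library(timetk)

# Example series: the M750 id from timetk's m4_monthly data (an assumption here)
m750 <- dplyr::filter(timetk::m4_monthly, id == "M750")

# Hold out the last 3 years for assessment
splits <- time_series_split(m750, assess = "3 years", cumulative = TRUE)

# Fit an auto-ARIMA model via the parsnip-style interface
model_fit_arima <- arima_reg() %>%
  set_engine("auto_arima") %>%
  fit(value ~ date, data = training(splits))

# Calibrate on the hold-out period, then forecast
modeltime_table(model_fit_arima) %>%
  modeltime_calibrate(new_data = testing(splits)) %>%
  modeltime_forecast(new_data = testing(splits), actual_data = m750)</code></pre>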
<h3 id="utilities">Utilities</h3>
<p><a href="https://cran.rproject.org/package=knitrdata">knitrdata</a> v0.5.0: Implements a data language engine for incorporating data directly in ‘rmarkdown’ documents so that they can be made completely standalone. See the <a href="https://cran.rproject.org/web/packages/knitrdata/vignettes/data_language_engine_vignette.html">vignette</a> for details.</p>
<p><a href="https://cran.rproject.org/package=lazyarray">lazyarray</a> v1.1.0: Implements multi-threaded, serialized, compressed arrays that fully utilize modern solid state drives, allowing users to quickly store large data sets while using limited memory. A lazyarray can be shared across multiple R sessions, and multiple sessions can write to the same array simultaneously. For more information, look <a href="https://github.com/dipterix/lazyarray">here</a>.</p>
<div align="center"><iframe width="684" height="385" src="https://www.youtube.com/embed/xX4YRAXYFxE" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></div>
<p><a href="https://cran.rproject.org/package=officedown">officedown</a> v0.2.0: Provides functions to produce Microsoft Word documents from R Markdown. There are vignettes on <a href="https://cran.rproject.org/web/packages/officedown/vignettes/captions.html">Captions and References</a>, <a href="https://cran.rproject.org/web/packages/officedown/vignettes/lists.html">Lists</a>, <a href="https://cran.rproject.org/web/packages/officedown/vignettes/officer.html"><code>officer</code> Support</a>, <a href="https://cran.rproject.org/web/packages/officedown/vignettes/tables.html">Data Frame Printing</a>, and <a href="https://cran.rproject.org/web/packages/officedown/vignettes/yaml.html">YAML Headers</a>.</p>
<p><img src="officedown.png" height = "400" width="600"></p>
<p><a href="https://cran.rproject.org/package=r2dictionary">r2dictionary</a> v0.1: Allows users to directly search for definitions of terms from within the R environment. The source dictionary is an original work of The Online Plain Text English Dictionary (<a href="https://www.mso.anu.edu.au/~ralph/OPTED/">OPTED</a>). See the <a href="https://cran.rproject.org/web/packages/r2dictionary/vignettes/simple_samples.html">vignette</a>.</p>
<p><a href="https://CRAN.Rproject.org/package=rmdpartials">rmdpartials</a> v0.5.8: Enables the use of <code>rmarkdown</code> <em>partials</em> (<code>knitr</code> <em>child</em> documents) for making components of HTML, PDF and Word documents. See the <a href="https://cran.rproject.org/web/packages/rmdpartials/vignettes/rmdpartials.html">vignette</a> to get started.</p>
<p><a href="https://cran.rproject.org/package=tidycat">tidycat</a> v0.1.1: Provides functions to create additional rows and columns on <code>broom::tidy()</code> output to allow for easier control of categorical parameter estimates. The <a href="https://cran.rproject.org/web/packages/tidycat/vignettes/intro.html">vignette</a> contains examples.</p>
<p><img src="tidycat.png" height = "400" width="600"></p>
<h3 id="visualization">Visualization</h3>
<p><a href="https://CRAN.Rproject.org/package=ggdist">ggdist</a> v2.2.0: Provides primitives for visualizing distributions using <code>ggplot2</code> that are tuned for visualizing uncertainty in either a frequentist or Bayesian mode. Primitives include points with multiple uncertainty intervals, eye plots (<a href="https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/1467985X.00120">Spiegelhalter (1999)</a>), density plots, gradient plots, dot plots (<a href="https://www.tandfonline.com/doi/abs/10.1080/00031305.1999.10474474">Wilkinson (1999)</a>), quantile dot plots (<a href="https://dl.acm.org/doi/10.1145/2858036.2858558">Kay et al. (2016)</a>), complementary cumulative distribution function barplots (<a href="https://dl.acm.org/doi/10.1145/3173574.3173718">Fernandes et al. (2018)</a>), and fit curves with multiple uncertainty ribbons.</p>
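<p>A small, hypothetical example sketched from the description above (the simulated data frame is our own, not from the package docs): an eye plot combines a density with a point and multiple uncertainty intervals.</p>
<pre class="r"><code>library(ggplot2)
library(ggdist)

# Two simulated groups with different locations
df <- data.frame(group = rep(c("a", "b"), each = 500),
                 value = c(rnorm(500, 0), rnorm(500, 1)))

# Half-eye plot: density plus point estimate with 66% and 95% intervals
ggplot(df, aes(x = value, y = group)) +
  stat_halfeye(.width = c(0.66, 0.95))</code></pre>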
<p><a href="https://CRAN.Rproject.org/package=loon.ggplot">loon.ggplot</a> v1.0.1: Provides a bridge between the <code>loon</code> and <code>ggplot2</code> packages. Users can turn static <code>ggplot2</code> plots into interactive <code>loon</code> plots and vice versa. There are vignettes on <a href="https://cran.rproject.org/web/packages/loon.ggplot/vignettes/ggplots2loon.html">ggplots to loon</a>, <a href="https://cran.rproject.org/web/packages/loon.ggplot/vignettes/loon2ggplots.html">loon to ggplots</a>, and on using <a href="Pipes">pipes</a>.</p>
<p><img src="loon.png" height = "300" width="300"></p>
<p><a href="https://cran.rproject.org/package=treeheatr">treeheatr</a> v0.1.0: Provides interpretable decision tree visualizations with the data represented as a heatmap at the tree’s leaf nodes. There is a <a href="https://cran.rproject.org/web/packages/treeheatr/vignettes/explore.html">vignette</a>.</p>
<p><img src="treeheatr.png" height = "300" width="500"></p>
<p><a href="https://CRAN.Rproject.org/package=tilemaps">tilemaps</a> v0.2.0: Implements the algorithm of <a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.13200">McNeill and Hale (2017)</a> for generating tilemaps. See the <a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.13200">vignette</a>.</p>
<p><img src="tilemaps.png" height = "400" width="600"></p>
<p><a href="https://CRAN.Rproject.org/package=wrGraph">wrGraph</a> v1.0.2: Provides enhancements to base R graphics for plotting high-throughput data, including automatic segmenting of the current device (e.g., window) to accommodate multiple new plots, automatic checking for the optimal location of legends, small histograms inserted as legends, the generation of mouse-over interactive HTML pages, and more. See the <a href="https://cran.rproject.org/web/packages/wrGraph/vignettes/wrGraphVignette2.html">vignette</a>.</p>
<p><img src="wrGraph.png" height = "400" width="600"></p>

Building A Neural Net from Scratch Using R  Part 2
https://rviews.rstudio.com/2020/07/24/buildinganeuralnetfromscratchusingrpart2/
Fri, 24 Jul 2020 00:00:00 +0000
https://rviews.rstudio.com/2020/07/24/buildinganeuralnetfromscratchusingrpart2/
<p><em>Akshaj is a budding deep learning researcher who loves to work with R. He has worked as a Research Associate at the Indian Institute of Science and as a Data Scientist at KPMG India.</em></p>
<p>In the previous post, we went through the dataset, the preprocessing involved, traintest split, and talked in detail about the architecture of the model. We started build our neural net chunkbychunk and wrote functions for initializing parameters and running forward propagation.</p>
<p>In the this post, we’ll implement backpropagation by writing functions to calculate gradients and update the weights. Finally, we’ll make predictions on the test data and see how accurate our model is using metrics such as <code>Accuracy</code>, <code>Recall</code>, <code>Precision</code>, and <code>F1score</code>. We’ll compare our neural net with a logistic regression model and visualize the difference in the decision boundaries produced by these models.</p>
<p>Let’s continue by implementing our cost function.</p>
<div id="computecost" class="section level3">
<h3>Compute Cost</h3>
<p>We will use Binary Cross Entropy loss function (aka log loss). Here, <span class="math inline">\(y\)</span> is the true label and <span class="math inline">\(\hat{y}\)</span> is the predicted output.</p>
<p><span class="math display">\[ cost =  1/N\sum_{i=1}^{N} y_{i}\log(\hat{y}_{i}) + (1  y_{i})(\log(1  \hat{y}_{i})) \]</span></p>
<p>The <code>computeCost()</code> function takes as arguments the input matrix <span class="math inline">\(X\)</span>, the true labels <span class="math inline">\(y\)</span> and a <code>cache</code>. <code>cache</code> is the output of the forward pass that we calculated above. To calculate the error, we will only use the final output <code>A2</code> from the <code>cache</code>.</p>
<pre class="r"><code>computeCost < function(X, y, cache) {
m < dim(X)[2]
A2 < cache$A2
logprobs < (log(A2) * y) + (log(1A2) * (1y))
cost < sum(logprobs/m)
return (cost)
}</code></pre>
<pre class="r"><code>cost < computeCost(X_train, y_train, fwd_prop)
cost</code></pre>
<pre><code>## [1] 0.693</code></pre>
</div>
<div id="backpropagation" class="section level3">
<h3>Backpropagation</h3>
<p>Now comes the best part of this all: backpropagation!</p>
<p>We’ll write a function that will calculate the gradient of the loss function with respect to the parameters. Generally, in a deep network, we have something like the following.</p>
<p><img src="backprop_deep.png" alt = "Figure 3: Backpropagation with cache. Credits: deep learning.ai" height = "400" width="600"></p>
<p>The above figure has two hidden layers. During backpropagation (red boxes), we use the output cached during forward propagation (purple boxes). Our neural net has only one hidden layer. More specifically, we have the following:</p>
<p><img src="linear_backward.png" alt = "Figure 4: Backpropagation for a single layer. Credits: deep learning.ai" height = "200" width="400"></p>
<p>To compute backpropagation, we write a function that takes as arguments an input matrix <code>X</code>, the train labels <code>y</code>, the output activations from the forward pass as <code>cache</code>, and a list of <code>layer_sizes</code>. The three outputs <span class="math inline">\((dW^{[l]}, db^{[l]}, dA^{[l1]})\)</span> are computed using the input <span class="math inline">\(dZ^{[l]}\)</span> where <span class="math inline">\(l\)</span> is the layer number.</p>
<p>We first differentiate the loss function with respect to the weight <span class="math inline">\(W\)</span> of the current layer.</p>
<p><span class="math display">\[ dW^{[l]} = \frac{\partial \mathcal{L} }{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l1] T} \tag{8}\]</span></p>
<p>Then we differentiate the loss function with respect to the bias <span class="math inline">\(b\)</span> of the current layer.
<span class="math display">\[ db^{[l]} = \frac{\partial \mathcal{L} }{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} dZ^{[l](i)}\tag{9}\]</span></p>
<p>Once we have these, we calculate the derivative of the previous layer with respect to <span class="math inline">\(A\)</span>, the output + activation from the previous layer.</p>
<p><span class="math display">\[ dA^{[l1]} = \frac{\partial \mathcal{L} }{\partial A^{[l1]}} = W^{[l] T} dZ^{[l]} \tag{10}\]</span></p>
<p>Because we only have a single hidden layer, we first calculate the gradients for the final (output) layer and then the middle (hidden) layer. In other words, the gradients for the weights that lie between the output and hidden layer are calculated first. Using this (and chain rule), gradients for the weights that lie between the hidden and input layer are calculated next.</p>
<p>Finally, we return a list of gradient matrices. These gradients tell us the the small value by which we should increase/decrease our weights such that the loss decreases. Here are the equations for the gradients. I’ve calculated them for you so you don’t differentiate anything. We’ll directly use these values </p>
<ul>
<li><span class="math inline">\(dZ^{[2]} = A^{[2]}  Y\)</span></li>
<li><span class="math inline">\(dW^{[2]} = \frac{1}{m} dZ^{[2]}A^{[1]^T}\)</span></li>
<li><span class="math inline">\(db^{[2]} = \frac{1}{m}\sum dZ^{[2]}\)</span></li>
<li><span class="math inline">\(dZ^{[1]} = W^{[2]^T} * g^{[1]'} Z^{[1]}\)</span> where <span class="math inline">\(g\)</span> is the activation function.</li>
<li><span class="math inline">\(dW^{[1]} = \frac{1}{m}dZ^{[1]}X^{T}\)</span></li>
<li><span class="math inline">\(db^{[1]} = \frac{1}{m}\sum dZ^{[1]}\)</span></li>
</ul>
<p>If you would like to know more about the math involved in constructing these equations, please see the references below.</p>
<pre class="r"><code>backwardPropagation <- function(X, y, cache, params, list_layer_size){
  m <- dim(X)[2]
  n_x <- list_layer_size$n_x
  n_h <- list_layer_size$n_h
  n_y <- list_layer_size$n_y
  A2 <- cache$A2
  A1 <- cache$A1
  W2 <- params$W2
  dZ2 <- A2 - y
  dW2 <- 1/m * (dZ2 %*% t(A1))
  db2 <- matrix(1/m * sum(dZ2), nrow = n_y)
  db2_new <- matrix(rep(db2, m), nrow = n_y)
  # (1 - A1^2) is the derivative of the tanh activation
  dZ1 <- (t(W2) %*% dZ2) * (1 - A1^2)
  dW1 <- 1/m * (dZ1 %*% t(X))
  db1 <- matrix(1/m * sum(dZ1), nrow = n_h)
  db1_new <- matrix(rep(db1, m), nrow = n_h)
  grads <- list("dW1" = dW1,
                "db1" = db1,
                "dW2" = dW2,
                "db2" = db2)
  return(grads)
}</code></pre>
<p>As you can see below, the shapes of the gradients are the same as their corresponding weights i.e. <code>W1</code> has the same shape as <code>dW1</code> and so on. This is important because we are going to use these gradients to update our actual weights.</p>
<pre class="r"><code>back_prop < backwardPropagation(X_train, y_train, fwd_prop, init_params, layer_size)
lapply(back_prop, function(x) dim(x))</code></pre>
<pre><code>## $dW1
## [1] 4 2
##
## $db1
## [1] 4 1
##
## $dW2
## [1] 1 4
##
## $db2
## [1] 1 1</code></pre>
</div>
<div id="updateparameters" class="section level3">
<h3>Update Parameters</h3>
<p>From the gradients calculated by the <code>backwardPropagation()</code>, we update our weights using the <code>updateParameters()</code> function. The <code>updateParameters()</code> function takes as arguments the gradients, network parameters, and a learning rate.</p>
<p>Why a learning rate? Because sometimes the weight updates (gradients) are too large, and because of that we miss the minima completely. The learning rate is a hyperparameter, set by us, the user, to control the impact of the weight updates. Its value lies between <span class="math inline">\(0\)</span> and <span class="math inline">\(1\)</span>, and it is multiplied with the gradients before they are subtracted from the weights. The weights are updated as follows, where the learning rate is denoted by <span class="math inline">\(\alpha\)</span>.</p>
<ul>
<li><span class="math inline">\(W^{[2]} = W^{[2]} - \alpha * dW^{[2]}\)</span></li>
<li><span class="math inline">\(b^{[2]} = b^{[2]} - \alpha * db^{[2]}\)</span></li>
<li><span class="math inline">\(W^{[1]} = W^{[1]} - \alpha * dW^{[1]}\)</span></li>
<li><span class="math inline">\(b^{[1]} = b^{[1]} - \alpha * db^{[1]}\)</span></li>
</ul>
<p>The updated parameters are returned by the <code>updateParameters()</code> function, which takes the gradients, the current parameters, and a learning rate as input. <code>grads</code> and <code>params</code> were calculated above, while we choose the <code>learning_rate</code> ourselves.</p>
<pre class="r"><code>updateParameters <- function(grads, params, learning_rate){
  W1 <- params$W1
  b1 <- params$b1
  W2 <- params$W2
  b2 <- params$b2
  dW1 <- grads$dW1
  db1 <- grads$db1
  dW2 <- grads$dW2
  db2 <- grads$db2
  W1 <- W1 - learning_rate * dW1
  b1 <- b1 - learning_rate * db1
  W2 <- W2 - learning_rate * dW2
  b2 <- b2 - learning_rate * db2
  updated_params <- list("W1" = W1,
                         "b1" = b1,
                         "W2" = W2,
                         "b2" = b2)
  return (updated_params)
}</code></pre>
<p>As we can see, the weights still maintain their original shapes, which means we’ve done things correctly up to this point.</p>
<pre class="r"><code>update_params <- updateParameters(back_prop, init_params, learning_rate = 0.01)
lapply(update_params, function(x) dim(x))</code></pre>
<pre><code>## $W1
## [1] 4 2
##
## $b1
## [1] 4 1
##
## $W2
## [1] 1 4
##
## $b2
## [1] 1 1</code></pre>
</div>
<div id="trainthemodel" class="section level2">
<h2>Train the Model</h2>
<p>Now that we have all our components, let’s go ahead and write a function that will train our model.</p>
<p>We will use all the functions we have written above in the following order.</p>
<ol style="liststyletype: decimal">
<li>Run forward propagation</li>
<li>Calculate loss</li>
<li>Calculate gradients</li>
<li>Update parameters</li>
<li>Repeat</li>
</ol>
<p>This <code>trainModel()</code> function takes as arguments the input matrix <code>X</code>, the true labels <code>y</code>, and the number of epochs.</p>
<ol style="liststyletype: decimal">
<li>Get the sizes for layers and initialize random parameters.</li>
<li>Initialize a vector called <code>cost_history</code> which we’ll use to store the calculated loss value per epoch.</li>
<li>Run a for-loop:
<ul>
<li>Run forward prop.</li>
<li>Calculate loss.</li>
<li>Calculate gradients via backpropagation.</li>
<li>Update parameters.</li>
<li>Replace the current parameters with the updated parameters.</li>
</ul></li>
</ol>
<p>This function returns the updated parameters which we’ll use to run our model inference. It also returns the <code>cost_history</code> vector.</p>
<pre class="r"><code>trainModel <- function(X, y, num_iteration, hidden_neurons, lr){
  layer_size <- getLayerSize(X, y, hidden_neurons)
  init_params <- initializeParameters(X, layer_size)
  cost_history <- c()
  for (i in 1:num_iteration) {
    fwd_prop <- forwardPropagation(X, init_params, layer_size)
    cost <- computeCost(X, y, fwd_prop)
    back_prop <- backwardPropagation(X, y, fwd_prop, init_params, layer_size)
    update_params <- updateParameters(back_prop, init_params, learning_rate = lr)
    init_params <- update_params
    cost_history <- c(cost_history, cost)
    if (i %% 10000 == 0) cat("Iteration", i, " Cost: ", cost, "\n")
  }
  model_out <- list("updated_params" = update_params,
                    "cost_hist" = cost_history)
  return (model_out)
}</code></pre>
<p>Now that we’ve defined our function to train, let’s run it! We’re going to train our model, with 40 hidden neurons, for 60000 epochs with a learning rate of 0.9. We will print out the loss after every 10000 epochs.</p>
<pre class="r"><code>EPOCHS = 60000
HIDDEN_NEURONS = 40
LEARNING_RATE = 0.9
train_model <- trainModel(X_train, y_train, hidden_neurons = HIDDEN_NEURONS, num_iteration = EPOCHS, lr = LEARNING_RATE)</code></pre>
<pre><code>## Iteration 10000  Cost: 0.3724
## Iteration 20000  Cost: 0.4081
## Iteration 30000  Cost: 0.3273
## Iteration 40000  Cost: 0.4671
## Iteration 50000  Cost: 0.4479
## Iteration 60000  Cost: 0.3074</code></pre>
<p><img src="/post/20200721buildinganeuralnetfromscratchusingrpart2/index_files/figurehtml/unnamedchunk101.png" width="672" /></p>
</div>
<div id="logisticregression" class="section level2">
<h2>Logistic Regression</h2>
<p>Before we go ahead and test our neural net, let’s quickly train a simple logistic regression model so that we can compare its performance with our neural net. Since a logistic regression model can only learn linear boundaries, it will not fit the data well. A neural network, on the other hand, will.</p>
<p>We’ll use the <code>glm()</code> function in R to build this model.</p>
<pre class="r"><code>lr_model <- glm(y ~ x1 + x2, data = train)
lr_model</code></pre>
<pre><code>##
## Call: glm(formula = y ~ x1 + x2, data = train)
##
## Coefficients:
## (Intercept) x1 x2
## 0.51697 0.00889 0.05207
##
## Degrees of Freedom: 319 Total (i.e. Null); 317 Residual
## Null Deviance: 80
## Residual Deviance: 76.4 AIC: 458</code></pre>
<p>Let’s now generate predictions from the logistic regression model on the test set.</p>
<pre class="r"><code>lr_pred <- round(as.vector(predict(lr_model, test[, 1:2])))
lr_pred</code></pre>
<pre><code>## [1] 1 1 0 1 0 1 1 1 0 1 0 1 0 1 0 0 0 1 1 1 1 1 0 0 0 0 1 1 1 0 1 0 1 1 1 1 1 1
## [39] 1 1 1 1 0 0 1 0 0 0 1 1 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 0 1 0 1 1 0 1 1 1 0 0
## [77] 1 1 1 0</code></pre>
</div>
<div id="testthemodel" class="section level2">
<h2>Test the Model</h2>
<p>Finally, it’s time to make predictions. To do that:</p>
<ol style="liststyletype: decimal">
<li>First get the layer sizes.</li>
<li>Run forward propagation.</li>
<li>Return the prediction.</li>
</ol>
<p>During inference, we do not need to perform backpropagation, as you can see below. We only perform forward propagation and return the final output of our neural network. (Note that instead of randomly initialized parameters, we’re using the trained parameters here.)</p>
<pre class="r"><code>makePrediction <- function(X, y, hidden_neurons){
  layer_size <- getLayerSize(X, y, hidden_neurons)
  params <- train_model$updated_params
  fwd_prop <- forwardPropagation(X, params, layer_size)
  pred <- fwd_prop$A2
  return (pred)
}</code></pre>
<p>After obtaining the output probabilities from the sigmoid, we round them off to obtain the output labels.</p>
<pre class="r"><code>y_pred <- makePrediction(X_test, y_test, HIDDEN_NEURONS)
y_pred <- round(y_pred)</code></pre>
<p>Here are the true labels and the predicted labels.</p>
<pre><code>## Neural Net:
## 1 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 1 0 1 1 0 1 0 0 0 0 1 0 1 0 1 1 0 0 1 0 1 1 1 0 1 1 0 0 1 1 0 1 1 0 1 0 1 0 0 0 1 0 1 0 0 0 1 1 0 0 1 1 1 1 1 0 1 1 0 0 1 1 0 1</code></pre>
<pre><code>## Ground Truth:
## 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 1 1 0 1 0 0 1 0 1 0 0 0 0 1 1 1 1 0 1 0 0 1 1 1 0 1 0 0 1 1 1 0 1 1 0 1 0 1 0 0 0 1 0 0 1 0 0 1 1 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 1</code></pre>
<pre><code>## Logistic Reg:
## 1 1 0 1 0 1 1 1 0 1 0 1 0 1 0 0 0 1 1 1 1 1 0 0 0 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1 1 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 0 1 0 1 1 0 1 1 1 0 0 1 1 1 0</code></pre>
<div id="decisionboundaries" class="section level3">
<h3>Decision Boundaries</h3>
<p>In the following visualization, we’ve plotted our testset predictions on top of the decision boundaries.</p>
<div id="neuralnet" class="section level4">
<h4>Neural Net</h4>
<p>As we can see, our neural net was able to learn the nonlinear decision boundary and has produced accurate results.</p>
<p><img src="/post/20200721buildinganeuralnetfromscratchusingrpart2/index_files/figurehtml/unnamedchunk171.png" width="672" /></p>
</div>
<div id="logisticregression1" class="section level4">
<h4>Logistic Regression</h4>
<p>On the other hand, logistic regression, with its linear decision boundary, could not fit the data very well.</p>
<p><img src="/post/20200721buildinganeuralnetfromscratchusingrpart2/index_files/figurehtml/unnamedchunk191.png" width="672" /></p>
</div>
</div>
<div id="confusionmatrix" class="section level3">
<h3>Confusion Matrix</h3>
<p>A confusion matrix is often used to describe the performance of a classifier.
It is defined as:</p>
<p><span class="math display">\[\mathbf{Confusion\ Matrix} = \left[\begin{array}
{rr}
\text{True Negative} & \text{False Positive} \\
\text{False Negative} & \text{True Positive}
\end{array}\right]
\]</span></p>
<p>Let’s go over the basic terms used in a confusion matrix through an example. Consider the case where we were trying to predict if an email was spam or not.</p>
<ul>
<li><strong>True Positive</strong>: Email was predicted to be spam and it actually was spam.</li>
<li><strong>True Negative</strong>: Email was predicted as notspam and it actually was notspam.</li>
<li><strong>False Positive</strong>: Email was predicted to be spam but it actually was notspam.</li>
<li><strong>False Negative</strong>: Email was predicted to be notspam but it actually was spam.</li>
</ul>
<pre class="r"><code>tb_nn <- table(y_test, y_pred)
tb_lr <- table(y_test, lr_pred)
cat("NN Confusion Matrix: \n")</code></pre>
<pre><code>## NN Confusion Matrix:</code></pre>
<pre class="r"><code>tb_nn</code></pre>
<pre><code>## y_pred
## y_test 0 1
## 0 34 10
## 1 7 29</code></pre>
<pre class="r"><code>cat("\nLR Confusion Matrix: \n")</code></pre>
<pre><code>##
## LR Confusion Matrix:</code></pre>
<pre class="r"><code>tb_lr</code></pre>
<pre><code>## lr_pred
## y_test 0 1
## 0 14 30
## 1 18 18</code></pre>
</div>
<div id="accuracymetrics" class="section level3">
<h3>Accuracy Metrics</h3>
<p>We’ll calculate Precision, Recall, F1 Score, and Accuracy. These metrics, derived from the confusion matrix, are defined as follows.</p>
<p><strong>Precision</strong> is defined as the number of true positives over the number of true positives plus the number of false positives.</p>
<p><span class="math display">\[Precision = \frac {True Positive}{True Positive + False Positive} \]</span></p>
<p><strong>Recall</strong> is defined as the number of true positives over the number of true positives plus the number of false negatives.</p>
<p><span class="math display">\[Recall = \frac {True Positive}{True Positive + False Negative} \]</span></p>
<p><strong>F1score</strong> is the harmonic mean of precision and recall.</p>
<p><span class="math display">\[F1 Score = 2 \times \frac {Precision \times Recall}{Precision + Recall} \]</span></p>
<p><strong>Accuracy</strong> gives us the percentage of correct predictions out of the total number of predictions made.</p>
<p><span class="math display">\[Accuracy = \frac {True Positive + True Negative} {True Positive + False Positive + True Negative + False Negative} \]</span></p>
<p>To better understand these terms, let’s continue the example of “emailspam” we used above.</p>
<ul>
<li><p>If our model had a precision of 0.6, that would mean that when it predicts an email as spam, it is correct 60% of the time.</p></li>
<li><p>If our model had a recall of 0.8, then it would mean our model correctly classifies 80% of all spam.</p></li>
<li><p>The F1 score is the way we combine precision and recall. A perfect F1 score is represented by a value of 1, and the worst by 0.</p></li>
</ul>
<p>Now that we have an understanding of the accuracy metrics, let’s actually calculate them. We’ll define a function that takes as input the confusion matrix. Then based on the above formulas, we’ll calculate the metrics.</p>
<pre class="r"><code>calculate_stats <- function(tb, model_name) {
  # With table(y_test, y_pred): tb[1] = TN, tb[2] = FN, tb[3] = FP, tb[4] = TP
  acc <- (tb[1] + tb[4])/(tb[1] + tb[2] + tb[3] + tb[4])
  recall <- tb[4]/(tb[4] + tb[2])      # TP / (TP + FN)
  precision <- tb[4]/(tb[4] + tb[3])   # TP / (TP + FP)
  f1 <- 2 * ((precision * recall) / (precision + recall))
  cat(model_name, ": \n")
  cat("\tAccuracy = ", acc*100, "%.")
  cat("\n\tPrecision = ", precision*100, "%.")
  cat("\n\tRecall = ", recall*100, "%.")
  cat("\n\tF1 Score = ", f1*100, "%.\n\n")
}</code></pre>
<p>Here are the metrics for our neuralnet and logistic regression.</p>
<pre><code>## Neural Network :
## Accuracy = 78.75 %.
## Precision = 74.36 %.
## Recall = 80.56 %.
## F1 Score = 77.33 %.</code></pre>
<pre><code>## Logistic Regression :
## Accuracy = 40 %.
## Precision = 37.5 %.
## Recall = 50 %.
## F1 Score = 42.86 %.</code></pre>
<p>As we can see, the logistic regression performed poorly because it cannot learn nonlinear boundaries. Neural nets, on the other hand, are able to learn nonlinear boundaries and, as a result, fit our complex data very well.</p>
</div>
</div>
<div id="conclusion" class="section level2">
<h2>Conclusion</h2>
<p>In this two-part series, we’ve built a neural net from scratch with a vectorized implementation of backpropagation. We went through the entire life cycle of training a model, right from data preprocessing to model evaluation. Along the way, we learned about the mathematics behind a neural network. We went over basic concepts of linear algebra and calculus and implemented them as functions. We saw how to initialize weights and perform forward propagation, gradient descent, and backpropagation.</p>
<p>We learned about the ability of a neural net to fit nonlinear data and understood the important role that activation functions play in it. We trained a neural net and compared its performance to a logistic regression model. We visualized the decision boundaries of both models and saw how a neural net was able to fit the data better than logistic regression. We learned about metrics like Precision, Recall, F1-Score, and Accuracy by evaluating our models against them.</p>
<p>You should now have a pretty solid understanding of how neural networks are built.</p>
<p>I hope you had as much fun reading as I had while writing this! If I’ve made a mistake somewhere, I’d love to hear about it so I can correct it. Suggestions and constructive criticism are welcome. :)</p>
</div>
<div id="references" class="section level2">
<h2>References</h2>
<p>Here is a short list of two intermediate-level and two beginner-level references for the mathematics underlying neural networks.</p>
<p><strong>Intermediate</strong></p>
<ul>
<li><em>The Matrix Calculus You Need for Deep Learning</em>  <a href="https://arxiv.org/abs/1802.01528">Parr and Howard (2018)</a></li>
<li><em>Deep Learning: An Introduction for Applied Mathematicians</em>  <a href="https://arxiv.org/abs/1801.05894">Higham and Higham (2018)</a></li>
</ul>
<p><strong>Beginner</strong></p>
<ul>
<li><a href="https://www.coursera.org/learn/neuralnetworksdeeplearning?specialization=deeplearning">Deep Learning</a> course by Andrew Ng on Coursera. It can be audited for free.<br />
</li>
<li>Grant Sanderson’s YouTube channel. Here are the 4 relevant playlists. <a href="https://www.youtube.com/playlist?list=PLZHQObOWTQDNPOjrT6KVlfJuKtYTftqH6">diff eq</a>, <a href="https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab">linear algebra</a>, <a href="https://www.youtube.com/playlist?list=PLZHQObOWTQDMsr9Krj53DwVRMYO3t5Yr">calculus</a>, <a href="https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB3pi">neural nets</a>.</li>
</ul>
</div>