R Views
https://rviews.rstudio.com/
Recent content on R Views
Party with R: How the Community Enabled Us to Write a Book
https://rviews.rstudio.com/2020/08/03/party-with-r-how-the-community-enabled-us-to-write-a-book/
Mon, 03 Aug 2020 00:00:00 +0000
<p><em>Isabella C. Velásquez, MS, is a data analyst committed to nonprofit work with the aim of reducing racial and socioeconomic inequities.</em></p>
<p>“<em>It is like a party all the time; nobody has to worry about giving one or being invited; it is going on every day in the street and you can go down or be part of it from your window.</em>” - Eleanor Clark. Without the R community, <a href="https://www.datascienceineducation.com/"><em>Data Science in Education Using R</em></a> would never have happened. Most evidently, we wouldn’t have met each other without the strong R presence on Twitter that sparked a conversation about data use in education. More important, though, are the aspects of the R community that inspired that initial discussion and enabled us to complete a book for a broad and complex field.</p>
<p>In the <a href="https://datascienceineducation.com/c01.html">first chapter of our book</a>, we invite data practitioners in education to the party, but the R community invited us to the party first. Like any successful party, certain elements had to exist for us to join in and end up having an amazing time.</p>
<p><strong>The Invitation</strong> 📩</p>
<p>As someone who started following the R community on Twitter after it was already well established and popular, I never felt apprehensive about asking how to get involved or how to interact with others. First, there are so many avenues. The R community offers many ways to let you in, whether it be replying to tweets, posting a question on community.rstudio.com, or sharing a blog post on R Weekly. Second, the R community welcomes users no matter their level. Whether it is a code snippet with a function you found cool, a blog post, or a personal side project, there are ways to engage that appeal to everybody. Members of the R community can interact how they want and as often as they want.</p>
<p>For us, it was exciting to meet on Twitter, talk about collaborating on an education data project, and then just get started on it. We felt welcome and encouraged to do so. Because people in the R community meet other users virtually and begin side projects all the time, we didn’t have to worry about whether something like this was possible: The invitation was already there.</p>
<p><strong>An Open Door</strong> 🚪</p>
<p>In our first blog post, we described what it’s like to learn by seeing someone do the thing you want to learn. One of the best things about the R community is you constantly get to see this in action. The R community not only holds open principles but actually exhibits them whenever possible. Users post their code, projects, and drafts constantly. Just by scanning the #rstats hashtag, one can discover something new.</p>
<p>Because others were open, we knew that we wanted to be open as well. We wrote our book on <a href="https://github.com/data-edu/data-science-in-education">GitHub</a>, showing the many changes it went through on the way to completion. By having it freely available on a <a href="https://www.datascienceineducation.com/">website</a>, we hope that it opens the door to others who’d like to learn what we did and work on their own open projects as well.</p>
<p><strong>Food (for Thought)</strong> 🍕</p>
<p>The R community offers so many opportunities to get feedback, advice, and information from a wide variety of users. Early on in the book development, we had a lot of questions to nail down: who is the audience? “Data is” or “data are”? How do we describe “people who work in the education field and use data and want to get more effective at it”? Finding a common language was difficult, but we were able to do this by engaging others in the wider R community. We listened to the stories of many data scientists who work in education, then found common experiences we could describe in our writing.</p>
<p>As an example, we learned we weren’t the only ones challenged by learning a programming language while attending to full-time jobs and personal lives. Knowing this, we made sure to <a href="https://datascienceineducation.com/c02.html#different-strokes-for-different-data-scientists-in-education">discuss these challenges in our book</a> and offer various ways of engaging with the material based on the reader’s needs.</p>
<p><strong>Socializing</strong> 💬</p>
<p>Throughout the process of writing DSIEUR, we asked the R community several times for other types of feedback and suggestions. We also ran into some technical issues, especially when it came to preparing our manuscript with {bookdown} to meet our publisher’s specifications, and were able to get ideas on how to resolve them. We know that the R community is a safe and encouraging place to ask questions, and this enabled us to write a stronger book.</p>
<p><strong>The Next Day</strong> ☀️</p>
<p>The community participation throughout the DSIEUR process helped us define our goals and get feedback. Another wonderful aspect is that the R community engaged us back. Writing this R Views series is an example: Someone reached out to us and wanted us to reflect on what we discovered and share it with all of you. This type of engagement reminds us of what an inclusive and encouraging place the R community is and helps us come up with new ways of making sure others see the invitation as well.</p>
<p>Thank you for reading! 🎉 We’ll be back with the fourth post on “One Writer, Five Authors” in about two weeks. Until then, we’d love to know how else the R paRty has encouraged your work, both personally and professionally. You can reach us on Twitter: Emily <a href="https://twitter.com/ebovee09">@ebovee09</a>, Jesse <a href="https://twitter.com/kierisi">@kierisi</a>, Joshua <a href="https://twitter.com/jrosenberg6432">@jrosenberg6432</a>, Ryan <a href="https://twitter.com/RyanEs">@RyanEs</a>, and me <a href="https://twitter.com/ivelasq3">@ivelasq3</a>.</p>
R Package Integration with Modern Reusable C++ Code Using Rcpp - Part 3
https://rviews.rstudio.com/2020/07/31/r-package-integration-with-modern-reusable-c-code-using-rcpp-part-3/
Fri, 31 Jul 2020 00:00:00 +0000
<p><em>Daniel Hanson is a full-time lecturer in the Computational Finance & Risk Management program within the Department of Applied Mathematics at the University of Washington.</em></p>
<p>In the <a href="https://rviews.rstudio.com/2020/07/14/r-package-integration-with-modern-reusable-c-code-using-rcpp-part-2/">previous post</a> in this series, we looked at some design considerations when integrating standard and reusable C++ code into an R package. Specific emphasis was on Rcpp’s role in facilitating a means of communication between R and the C++ code, particularly highlighting a few of the C++ functions in the <code>Rcpp</code> namespace that conveniently and efficiently pass data between an R <code>numeric</code> vector and a C++ <code>std::vector<double></code> object.</p>
<p>Today, we will look at a specific example of implementing the interface. We will see how to configure code that allows the use of standard reusable C++ in an R package, without having to modify it with any R or Rcpp-specific syntax. It is admittedly a toy example, but the goal is to provide a starting point that can easily be extended to more realistic use cases.</p>
<div id="the-code" class="section level2">
<h2>The Code</h2>
<p>To get started, let’s have a look at the code we will use for our demonstration. It is broken into three categories, consistent with the design considerations from the previous post:</p>
<ul>
<li>Standard and reusable C++: No dependence on R or Rcpp</li>
<li>Interface level C++: Uses functions in the <code>Rcpp</code> namespace</li>
<li>R functions exported by the interface level: Same names as in the interface level</li>
</ul>
<div id="standard-and-reusable-c-code" class="section level3">
<h3>Standard and Reusable C++ Code</h3>
<p>In this example, we will use a small set of C++ non-member functions, and two classes. There is a declaration (header) file for the non-member functions, say <code>NonmemberCppFcns.h</code>, and another with class declarations for two shape classes, <code>Square</code> and <code>Circle</code>, called <code>ConcreteShapes.h</code>. Each of these is accompanied by a corresponding implementation file, with file extension <code>.cpp</code>, as one might expect in a more realistic C++ code base.</p>
<div id="nonmember-c-functions" class="section level4">
<h4>Nonmember C++ Functions</h4>
<p>These functions are shown and described here, as declared in the following C++ header file:</p>
<pre><code>#include <vector>
// Adds two real numbers
double add(double x, double y);
// Sorts a vector of real numbers and returns it
std::vector<double> sortVec(std::vector<double> v);
// Computes the product of the LCM and GCD of two integers,
// using C++17 functions std::lcm(.) and std::gcd(.)
int prodLcmGcd(int m, int n);</code></pre>
<p>The last function uses recently added features of the C++ Standard Library to show that we can use C++17.</p>
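<p>For concreteness, a possible implementation file, <code>NonmemberCppFcns.cpp</code>, is sketched below. This is an assumption rather than the article’s actual source (which will be released later in the series), and the real file would also <code>#include "NonmemberCppFcns.h"</code>; that include is omitted here so the snippet compiles on its own.</p>

```cpp
// Sketch of a possible NonmemberCppFcns.cpp (assumed implementations,
// consistent with the declarations in NonmemberCppFcns.h above)
#include <algorithm>  // std::sort
#include <numeric>    // std::lcm, std::gcd (C++17)
#include <vector>

// Adds two real numbers
double add(double x, double y)
{
    return x + y;
}

// Sorts a vector of real numbers and returns it
std::vector<double> sortVec(std::vector<double> v)
{
    std::sort(v.begin(), v.end());
    return v;
}

// Computes the product of the LCM and GCD of two integers,
// using the C++17 functions std::lcm(.) and std::gcd(.)
int prodLcmGcd(int m, int n)
{
    return std::lcm(m, n) * std::gcd(m, n);
}
```

<p>Note that <code>sortVec(.)</code> takes its argument by value, so the caller’s vector is left unmodified and the sorted copy is returned.</p>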
</div>
<div id="c-classes" class="section level4">
<h4>C++ Classes</h4>
<p>The two classes in our reusable code base are declared in the <code>ConcreteShapes.h</code> file, as shown and described here. Much like textbook C++ examples, we’ll write classes for two geometric shapes, each with a member function to compute the area of its corresponding object.</p>
<pre><code>#include <cmath>
class Circle
{
public:
Circle(double radius);
// Computes the area of a circle with given radius
double area() const;
private:
double radius_;
};
class Square
{
public:
Square(double side);
// Computes the area of a square with given side length
double area() const;
private:
double side_;
};</code></pre>
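<p>A plausible <code>ConcreteShapes.cpp</code> is sketched below; treat the member-function bodies as assumptions, since the actual source will accompany the package code later in the series. The class declarations are repeated so the snippet stands alone.</p>

```cpp
#include <cmath>

// Declarations repeated from ConcreteShapes.h so this sketch is self-contained:
class Circle
{
public:
    Circle(double radius);
    double area() const;
private:
    double radius_;
};

class Square
{
public:
    Square(double side);
    double area() const;
private:
    double side_;
};

Circle::Circle(double radius) : radius_(radius) {}

// Area of a circle: pi * r^2
// (pi computed portably, since M_PI is not part of standard C++)
double Circle::area() const
{
    const double pi = 4.0 * std::atan(1.0);
    return pi * radius_ * radius_;
}

Square::Square(double side) : side_(side) {}

// Area of a square: side^2
double Square::area() const
{
    return side_ * side_;
}
```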
</div>
</div>
<div id="interface-level-c" class="section level3">
<h3>Interface Level C++</h3>
<p>Now, the next step is to employ Rcpp, namely for the following essential tasks:</p>
<ul>
<li>Export the interface functions to R</li>
<li>Facilitate data exchange between R and C++ container objects</li>
</ul>
<p>An interface file containing functions designated for export to R does not require a header file with declarations; one can think of it as being analogous to a <code>.cpp</code> file that contains the <code>main()</code> function in a C++ executable project. In addition, the interface can be contained in one file, or split into multiple files. For demonstration, I have written two such files: <code>CppInterface.cpp</code>, which provides the interface to the non-member functions above, and <code>CppInterface2.cpp</code>, which does the same for the two C++ classes.</p>
<div id="interface-to-non-member-c-functions" class="section level4">
<h4>Interface to Non-Member C++ Functions:</h4>
<p>Let’s first have a look at the <code>CppInterface.cpp</code> interface file, which connects R with the nonmember functions in our C++ code base:</p>
<pre><code>#include "NonmemberCppFcns.h"
#include <vector>
#include <Rcpp.h>
// Nonmember Function Interfaces:
// [[Rcpp::export]]
int rAdd(double x, double y)
{
// Call the add(.) function in the reusable C++ code base:
return add(x, y);
}
// [[Rcpp::export]]
Rcpp::NumericVector rSortVec(Rcpp::NumericVector v)
{
// Transfer data from NumericVector to std::vector<double>
auto stlVec = Rcpp::as<std::vector<double>>(v);
// Call the reusable sortVec(.) function, with the expected
// std::vector<double> argument:
stlVec = sortVec(stlVec);
// Reassign the results from the vector<double> return object
// to the same NumericVector v, using Rcpp::wrap(.):
v = Rcpp::wrap(stlVec);
// Return as an Rcpp::NumericVector:
return v;
}
// C++17 example:
// [[Rcpp::export]]
int rProdLcmGcd(int m, int n)
{
return prodLcmGcd(m, n);
}</code></pre>
<div id="included-declarations" class="section level5">
<h5>Included Declarations:</h5>
<p>The <code>NonmemberCppFcns.h</code> declaration file is included at the top with <code>#include</code>, just as it would be in a standalone C++ application, so that the interface will recognize the functions that reside in the reusable code base. The STL <code>vector</code> declaration is required, as we shall soon see. The key to making the interface work resides in the <code>Rcpp.h</code> file, which provides access to very useful C++ functions in the <code>Rcpp</code> namespace.</p>
</div>
<div id="function-implementations" class="section level5">
<h5>Function implementations:</h5>
<p>Each of these functions is designated for export to R when the package is built, by placing the <code>//</code> <code>[[Rcpp::export]]</code> tag just above each function signature, as shown above. In this particular example, each interface function simply calls a function in the reusable code base. For example, the <code>rAdd(.)</code> function simply calls the <code>add(.)</code> function in the reusable C++ code. In the absence of a user-defined namespace, the interface function name must be different from the function it calls to prevent name-clash errors during the build, so I have simply chosen to prefix an <code>r</code> to the name of each interface function.</p>
<p>Note that the <code>rSortVec(.)</code> function takes in an <code>Rcpp::NumericVector</code> object, <code>v</code>. This type will accept data passed in from R as a <code>numeric</code> vector and present it as a C++ object. Then, so that we can call a function in our code base, such as <code>sortVec(.)</code>, which expects a <code>std::vector<double></code> type as its input, <code>Rcpp</code> provides the <code>Rcpp::as<.>(.)</code> function that facilitates the transfer of data from an <code>Rcpp::NumericVector</code> object to the STL container:</p>
<p><code>auto stlVec = Rcpp::as<std::vector<double>>(v);</code></p>
<p><code>Rcpp</code> also gives us a function that will transfer data from a <code>std::vector<double></code> type being returned from our reusable C++ code back into an <code>Rcpp::NumericVector</code>, so that the results can be passed back to R as a familiar <code>numeric</code> vector type:</p>
<p><code>v = Rcpp::wrap(stlVec);</code></p>
<p>As the <code>std::vector<double></code> object is the workhorse C++ STL container in quantitative work, these two <code>Rcpp</code> functions are a godsend.</p>
<p><strong>Remark 1:</strong> There is no rule that says an interface function can only call a single function in the reusable code; one can use whichever functions or classes that are needed to get the job done, just like with any other C++ function. I’ve merely kept it simple here for demonstration purposes.</p>
<p><strong>Remark 2:</strong> The tag <code>// [[Rcpp::plugins(cpp17)]]</code> is sometimes placed at the top of a C++ source file in online examples related to <code>Rcpp</code> and C++17. I have not found this necessary in my own code, however, as long as the <code>Makeconf</code> file has been updated for C++17, as described in the <a href="https://rviews.rstudio.com/2020/07/08/r-package-integration-with-modern-reusable-c-code-using-rcpp/">first post</a> in this series.</p>
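<p>For reference, the C++17-related variables in R’s <code>etc/Makeconf</code> look roughly like the following on a Linux build of R 4.0; the exact compiler and flags vary by platform and installation, so check your own file rather than copying these illustrative values.</p>

```make
# Illustrative excerpt from R's etc/Makeconf with C++17 enabled
# (values are assumptions; consult your own installation):
CXX17 = g++
CXX17FLAGS = -O2 -Wall
CXX17PICFLAGS = -fpic
CXX17STD = -std=gnu++17
```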
</div>
</div>
<div id="interface-to-c-classes" class="section level4">
<h4>Interface to C++ Classes:</h4>
<p>We now turn our attention to the second interface file, <code>CppInterface2.cpp</code>, which connects R with the C++ classes in our reusable code. It is shown here:</p>
<pre><code>#include "ConcreteShapes.h"
// Class Member Function Interfaces:
// Interface to Square member
// function area(.):
// [[Rcpp::export]]
double squareArea(double side)
{
Square sq(side);
return sq.area();
}
// Interface to Circle member
// function area(.):
// [[Rcpp::export]]
double circleArea(double radius)
{
Circle circ(radius);
return circ.area();
}</code></pre>
<p>This is again nothing terribly sophisticated, but the good news is that it shows creating instances of classes from the code base is not difficult at all. We first <code>#include</code> only the header file containing these class declarations; <code>Rcpp.h</code> is not required here, as we are not using any functions in the <code>Rcpp</code> namespace.</p>
<p>To compute the area of a square, the <code>side</code> length is input in R as a simple <code>numeric</code> type and passed to the interface function as a C++ <code>double</code>. The <code>Square</code> object, <code>sq</code>, is constructed with the <code>side</code> argument, and its <code>area()</code> member function performs said calculation and returns the result. The process is trivially similar for the <code>circleArea(.)</code> function.</p>
</div>
</div>
<div id="r-functions-exported-by-the-interface-level" class="section level3">
<h3>R Functions Exported by the Interface Level</h3>
<p>To wrap up this discussion, let’s look at the functions an R user will have available after we build the package in RStudio (coming next in this series). Each of these functions will be exported from their respective C++ interface functions as regular R functions, namely:</p>
<ul>
<li><code>rAdd(.)</code></li>
<li><code>rSortVec(.)</code></li>
<li><code>rProdLcmGcd(.)</code></li>
<li><code>squareArea(.)</code></li>
<li><code>circleArea(.)</code></li>
</ul>
<p>The package user will not need to know or care that the core calculations are being performed in C++. Visually, we can represent the associations as shown in the following diagram:</p>
<p><img src="CompositeCodeDiagram.png" alt = "Mapping R Package Functions to Reusable C++" height = "400" width="600"></p>
<p>The solid red line represents a “Chinese Wall” that separates our code base from the interface and allows us to maintain it as standard and reusable C++.</p>
</div>
</div>
<div id="summary" class="section level2">
<h2>Summary</h2>
<p>This concludes our example of configuring code that allows the use of standard reusable C++ in an R package, without having to modify it with any R or Rcpp-specific syntax. In the next post, we will examine how to actually build this code into an R package by leveraging the convenience of Rcpp and RStudio, and deploy it for any number of R users. The source code will also be made available so that you can try it out for yourself.</p>
</div>
June 2020: "Top 40" New CRAN Packages
https://rviews.rstudio.com/2020/07/27/june-2020-top-40-new-cran-packages/
Mon, 27 Jul 2020 00:00:00 +0000
<p>Two hundred ninety new packages made it to CRAN in June. Here are my “Top 40” picks in ten categories: Computational Methods, Data, Genomics, Machine Learning, Medicine, Science, Statistics, Time Series, Utilization, and Visualization.</p>
<h3 id="computational-methods">Computational Methods</h3>
<p><a href="https://CRAN.R-project.org/package=Rfractran">Rfractran</a> v1.0: Implements the esoteric, Turing complete <a href="https://en.wikipedia.org/wiki/FRACTRAN">FRACTRAN</a> programming language invented by <a href="https://en.wikipedia.org/wiki/John_Horton_Conway">John Horton Conway</a>.</p>
<p><a href="https://cran.r-project.org/package=QGameTheory">QGameTheory</a> v0.1.2: Provides a general purpose toolbox for simulating quantum versions of game theoretic models. See <a href="arXiv:quant-ph/0208069">Flitney and Abbott (2002)</a> for background. Models include the Penny Flip Game <a href="arXiv:quant-ph/98040100">Meyer (1998)</a>, the Prisoner’s Dilemma <a href="arXiv:quant-ph/0506219">Grabbe (2005)</a>, Two Person Duel <a href="arXiv:quant-ph/0305058">Flitney and Abbott (2004)</a>, Battle of the Sexes <a href="arXiv:quant-ph/0110096">Nawaz and Toor (2004)</a>, Hawk and Dove Game <a href="arXiv:quant-ph/0108075">Nawaz and Toor (2010)</a>, Newcomb’s Paradox <a href="arXiv:quant-ph/0202074">Piotrowski and Sladkowski (2002)</a>, and the Monty Hall Problem <a href="arXiv:quant-ph/0109035">Flitney and Abbott (2002)</a>. Look <a href="https://github.com/indrag49/QGameTheory">here</a> for an introduction to the package.</p>
<h3 id="data">Data</h3>
<p><a href="https://cran.r-project.org/package=covid19dbcand">covid19dbcand</a> v0.1.0: Provides access to seventy-five <a href="http://drugbank.ca/covid-19">Drugbank</a> data sets containing information about possible treatments for COVID-19.</p>
<p><a href="https://CRAN.R-project.org/package=tidytuesdayR">tidytuesdayR</a> v1.0.1: Provides functions for downloading the <a href="https://www.tidytuesday.com/">Tidy Tuesday</a> data sets from the R for Data Science Online Learning Community <a href="https://github.com/rfordatascience/tidytuesday">repository</a>.</p>
<p><a href="https://cran.r-project.org/package=us.census.geoheader">us.census.geoheader</a> v1.0.2: Implements a simple interface to the Geographic Header information from the <a href="https://catalog.data.gov/dataset/census-2000-summary-file-2-sf2">2010 US Census Summary File 2</a>. See the <a href="https://cran.r-project.org/web/packages/us.census.geoheader/vignettes/a-tour.html">vignette</a> for details.</p>
<h3 id="genomics">Genomics</h3>
<p><a href="https://CRAN.R-project.org/package=dnapath">dnapath</a> v0.6.4: Provides functions to integrate pathway information into the differential network analysis of two gene expression datasets as described in <a href="https://www.nature.com/articles/s41598-019-41918-3">Grimes et al. (2019)</a>. There is an <a href="https://cran.r-project.org/web/packages/dnapath/vignettes/introduction_to_dnapath.html">Introduction</a> and a vignette on <a href="https://cran.r-project.org/web/packages/dnapath/vignettes/package_data.html">Datasets</a>.</p>
<p><img src="dnapath.png" height = "300" width="300"></p>
<p><a href="https://cran.r-project.org/package=TreeDist">TreeDist</a> v1.1.1: Implements measures of tree similarity, including the information-based generalized Robinson-Foulds distances <a href="https://academic.oup.com/bioinformatics/article-abstract/doi/10.1093/bioinformatics/btaa614/5866976?redirectedFrom=fulltext">Smith (2020)</a>, the <a href="https://academic.oup.com/bioinformatics/article/22/1/117/217975">Nye et al. (2006)</a> metric, and other metrics. There are several vignettes: <a href="https://cran.r-project.org/web/packages/TreeDist/vignettes/Generalized-RF.html">Generalized Robinson-Foulds distances</a>, <a href="https://cran.r-project.org/web/packages/TreeDist/vignettes/Robinson-Foulds.html">Extending the Robinson-Foulds metric</a>, <a href="https://cran.r-project.org/web/packages/TreeDist/vignettes/Using-TreeDist.html">Calculate tree similarity</a>, <a href="https://cran.r-project.org/web/packages/TreeDist/vignettes/information.html">Comparing splits using information theory</a>, and <a href="https://cran.r-project.org/web/packages/TreeDist/vignettes/using-distances.html">Contextualizing tree distances</a>.</p>
<p><img src="TreeDist.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=volcano3D">volcano3D</a> v1.0.1: Implements interactive plotting for three-way differential expression analysis which is useful for discovering quantitative changes in expression levels between experimental groups. See <a href="https://www.cell.com/cell-reports/fulltext/S2211-1247(19)31007-1?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS2211124719310071%3Fshowall%3Dtrue">Lewis et al. (2019)</a> for background and the <a href="https://cran.r-project.org/web/packages/volcano3D/vignettes/Vignette.html">vignette</a>.</p>
<p><img src="volcano3D.png" height = "600" width="400"></p>
<h3 id="machine-learning">Machine Learning</h3>
<p><a href="https://CRAN.R-project.org/package=boundingbox">boundingbox</a> v1.0.1: Provides functions to generate bounding boxes for image classification. See <a href="https://www.sciencedirect.com/science/article/pii/S1877050912007260?via%3Dihub">Ibrahim et al. (2012)</a> for background and the <a href="https://cran.r-project.org/web/packages/boundingbox/vignettes/boundingbox-vignette.html">vignette</a> for an introduction to the package.</p>
<p><img src="boundingbox.jpeg" height = "400" width="400"></p>
<p><a href="https://CRAN.R-project.org/package=corels">corels</a> v0.0.2: Implements the Certifiably Optimal RulE ListS (Corels) learner described in <a href="arXiv:1704.01701">Angelino et al. (2017)</a>, which provides interpretable decision rules with an optimality guarantee. The <a href="https://cran.r-project.org/web/packages/corels/readme/README.html">README</a> contains an example.</p>
<p><img src="corels.png" height = "400" width="400"></p>
<p><a href="https://cran.r-project.org/package=nntrf">nntrf</a> v0.1.0: Implements non-linear dimension reduction by means of a neural network with hidden layers which can be useful as data pre-processing for machine learning methods that do not work well with many irrelevant or redundant features. See <a href="https://www.nature.com/articles/323533a0">Rumelhart et al. (1986)</a> for background and the <a href="https://cran.r-project.org/web/packages/nntrf/vignettes/nntrf.html">vignette</a>.</p>
<p><img src="nntrf.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=permimp">permimp</a> v1.0-0: Implements an add-on to the <code>party</code> package, with a faster implementation of the partial-conditional permutation importance for random forests. See <a href="https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-25">Strobl et al. (2007)</a> for background and the <a href="https://cran.r-project.org/web/packages/permimp/vignettes/permimp-package.html">vignette</a> for an introduction.</p>
<p><img src="permimp.png" height = "300" width="500"></p>
<p><a href="https://cran.r-project.org/package=pluralize">pluralize</a> v0.2.0: Provides tools based on a <a href="https://github.com/blakeembrey/pluralize">JavaScript library</a> to create plural, singular, and regular forms of English words along with tools to augment the built-in rules to fit specialized needs. See the <a href="https://cran.r-project.org/web/packages/pluralize/vignettes/Why-pluralize.html">vignette</a> for examples.</p>
<p><a href="https://CRAN.R-project.org/package=triplot">triplot</a> v1.3.0: Provides model agnostic tools for exploring effects of correlated features in predictive models and calculating the importance of groups of explanatory variables. See <a href="arXiv:1806.08915">Biecek (2018)</a> for details, and look <a href="https://github.com/ModelOriented/triplot">here</a> for an example.</p>
<p><img src="triplot.png" height = "300" width="500"></p>
<p><a href="https://cran.r-project.org/package=tfaddons">tfaddons</a> v0.10.0: Provides an interface to <a href="https://www.tensorflow.org/addons">TensorFlow Addons</a>. See the <a href="https://cran.r-project.org/web/packages/tfaddons/vignettes/NMT.html">vignette</a> for an example.</p>
<h3 id="medicine">Medicine</h3>
<p><a href="https://CRAN.R-project.org/package=BayesianReasoning">BayesianReasoning</a> v0.3.2: Provides functions to plot and help understand positive and negative predictive values (PPV and NPV), and their relationship with sensitivity, specificity, and prevalence. See <a href="https://onlinelibrary.wiley.com/doi/full/10.1111/j.1651-2227.2006.00180.x">Akobeng (2007)</a> for a theoretical overview and <a href="https://www.frontiersin.org/articles/10.3389/fpsyg.2015.01327/full">Navarrete et al. (2015)</a> for a practical explanation. There is an <a href="https://cran.r-project.org/web/packages/BayesianReasoning/vignettes/introduction.html">Introduction</a> and a vignette on <a href="https://cran.r-project.org/web/packages/BayesianReasoning/vignettes/PPV_NPV.html">Screening Tests</a>.</p>
<p><img src="BayesianReasoning.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=riskCommunicator">riskCommunicator</a> v0.1.0: Provides functions to estimate flexible epidemiological effect measures including both differences and ratios using the parametric G-formula. See <a href="https://www.sciencedirect.com/science/article/pii/0270025586900886?via%3Dihub">Robins (1986)</a> and <a href="https://academic.oup.com/aje/article/169/9/1140/125286">Ahern et al. (2009)</a> for background. There is an <a href="https://cran.r-project.org/web/packages/riskCommunicator/vignettes/Vignette_manuscript.html">Introduction</a> and a <a href="https://cran.r-project.org/web/packages/riskCommunicator/vignettes/Vignette_newbieRusers.html">vignette</a> for newbie R users.</p>
<p><img src="RiskCommunicator.png" height = "400" width="600"></p>
<h3 id="science">Science</h3>
<p><a href="https://CRAN.R-project.org/package=actel">actel</a> v1.0.0: Designed for studies where fish tagged with acoustic tags are expected to move through receiver arrays, this package combines the advantages of automatic sorting and checking of fish movements with the possibility for user intervention on tags that deviate from expected behavior. Calculations are based on <a href="https://www.researchgate.net/publication/256443823_Using_mark-recapture_models_to_estimate_survival_from_telemetry_data">Perry et al. (2012)</a>. There are an astounding seventeen vignettes, including: <a href="https://cran.r-project.org/web/packages/actel/vignettes/a-0_workspace_requirements.html">Preparing your data</a>, <a href="https://cran.r-project.org/web/packages/actel/vignettes/a-1_study_area.html">Structuring the Study Area</a>, and <a href="https://cran.r-project.org/web/packages/actel/vignettes/b-0_explore.html">Explore</a>.</p>
<p><img src="actel.SVG" height = "400" width="600"></p>
<p><a href="https://CRAN.R-project.org/package=safedata">safedata</a> v1.0.5: Provides access to data from the <a href="https://www.safeproject.net/">SAFE Project</a>, a large scale ecological experiment in Malaysian Borneo that explores the impact of habitat fragmentation and conversion on ecosystem function and services. There is an <a href="https://cran.r-project.org/web/packages/safedata/vignettes/overview.html">Overview</a> and an <a href="https://cran.r-project.org/web/packages/safedata/vignettes/using_safe_data.html">Introduction</a>.</p>
<p><img src="safedata.png" height = "400" width="500"></p>
<h3 id="statistics">Statistics</h3>
<p><a href="https://cran.r-project.org/package=causact">causact</a> v0.3.2: Built on <code>greta</code> and <code>TensorFlow</code>, this package enables users to define probabilistic models using directed acyclic graphs. See <a href="https://cran.r-project.org/web/packages/causact/readme/README.html">README</a> for examples.</p>
<p><img src="causact.png" height = "400" width="600"></p>
<p><a href="https://CRAN.R-project.org/package=frechet">frechet</a> v0.1.0: Provides implementations of statistical methods for random objects lying in various metric spaces that are not necessarily linear, including Fréchet regression for random objects with Euclidean predictors. See <a href="https://projecteuclid.org/euclid.aos/1547197235">Petersen and Müller (2019)</a> for the theory.</p>
<p><a href="https://cran.r-project.org/package=hmclearn">hmclearn</a> v0.0.3: Provides a framework for learning the intricacies of Hamiltonian Monte Carlo. See <a href="arXiv:1701.02434">Betancourt (2017)</a> and <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/9781118445112.stat08243">Thomas and Tu (2020)</a> for background. There are vignettes on <a href="https://cran.r-project.org/web/packages/hmclearn/vignettes/linear_mixed_effects_hmclearn.html">linear mixed effects</a>, <a href="https://cran.r-project.org/web/packages/hmclearn/vignettes/linear_regression_hmclearn.html">linear regression</a>, <a href="https://cran.r-project.org/web/packages/hmclearn/vignettes/logistic_mixed_effects_hmclearn.html">logistic mixed effects</a>, <a href="https://cran.r-project.org/web/packages/hmclearn/vignettes/logistic_regression_hmclearn.html">logistic regression</a>, and <a href="https://cran.r-project.org/web/packages/hmclearn/vignettes/poisson_regression_hmclearn.html">Poisson regression</a>.</p>
<p><img src="hmclearn.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=mashr">mashr</a> v0.2.38: Implements the multivariate adaptive shrinkage (mash) method of <a href="https://www.nature.com/articles/s41588-018-0268-8">Urbut et al. (2019)</a> for estimating and testing large numbers of effects in many conditions (or many outcomes). There is an <a href="https://cran.r-project.org/web/packages/mashr/vignettes/intro_mash.html">Introduction</a> and vignettes on <a href="https://cran.r-project.org/web/packages/mashr/vignettes/eQTL_outline.html">eQTL studies</a>, <a href="https://cran.r-project.org/web/packages/mashr/vignettes/intro_correlations.html">Correlations</a>, <a href="https://cran.r-project.org/web/packages/mashr/vignettes/intro_mash_dd.html">Covariances</a>, <a href="https://cran.r-project.org/web/packages/mashr/vignettes/intro_mashcommonbaseline.html">Common Baseline</a>, <a href="https://cran.r-project.org/web/packages/mashr/vignettes/intro_mashnobaseline.html">No Common Baseline</a>, <a href="https://cran.r-project.org/web/packages/mashr/vignettes/mash_sampling.html">Sampling from Posteriors</a>, and <a href="https://cran.r-project.org/web/packages/mashr/vignettes/simulate_noncanon.html">Simulation</a>.</p>
<p><a href="https://CRAN.R-project.org/package=molic">molic</a> v2.0.1: Implements the method of <a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/sjos.12407">Lindskou et al. (2019)</a> to detect outliers in high dimensional, categorical data. There are vignettes on the <a href="https://cran.r-project.org/web/packages/molic/vignettes/outlier_intro.html">Outlier Model</a>, <a href="https://cran.r-project.org/web/packages/molic/vignettes/dermatitis.html">Detecting Skin Diseases</a>, and <a href="https://cran.r-project.org/web/packages/molic/vignettes/genetic_example.html">Genetic Data</a>.</p>
<p><img src="molic.png" height = "300" width="500"></p>
<p><a href="https://cran.r-project.org/package=multinma">multinma</a> v0.1.3: Uses <code>Stan</code> to fit network meta-analysis and network meta-regression models for aggregate data, individual patient data, and mixtures of both. See <a href="https://rss.onlinelibrary.wiley.com/doi/full/10.1111/rssa.12579">Phillippo et al. (2020)</a> for background and the vignettes for examples:
<a href="https://cran.r-project.org/web/packages/multinma/vignettes/example_atrial_fibrillation.html">Stroke prevention</a>, <a href="https://cran.r-project.org/web/packages/multinma/vignettes/example_bcg_vaccine.html">BCG Vaccine for Tuberculosis</a>, <a href="https://cran.r-project.org/web/packages/multinma/vignettes/example_blocker.html">Beta Blockers</a>, <a href="https://cran.r-project.org/web/packages/multinma/vignettes/example_diabetes.html">Diabetes</a>, <a href="https://cran.r-project.org/web/packages/multinma/vignettes/example_dietary_fat.html">Dietary Fat</a>, <a href="https://cran.r-project.org/web/packages/multinma/vignettes/example_parkinsons.html">Parkinson’s disease</a>, <a href="https://cran.r-project.org/web/packages/multinma/vignettes/example_plaque_psoriasis.html">Plaque Psoriasis</a>, <a href="https://cran.r-project.org/web/packages/multinma/vignettes/example_smoking.html">Smoking Cessation</a>, <a href="https://cran.r-project.org/web/packages/multinma/vignettes/example_statins.html">Statins</a>, <a href="https://cran.r-project.org/web/packages/multinma/vignettes/example_thrombolytics.html">Thrombolytic Treatments</a>, and <a href="https://cran.r-project.org/web/packages/multinma/vignettes/example_transfusion.html">neutropenia or neutrophil dysfunction</a>.</p>
<p><img src="multinma.png" height = "300" width="500"></p>
<p><a href="https://CRAN.R-project.org/package=SCOUTer">SCOUTer</a> v1.0.0: Offers a new approach to simulating outliers by generating new observations defined by two statistics: the Squared Prediction Error (SPE) and Hotelling’s <span class="math inline">\(T^{2}\)</span> statistic. See the <a href="https://cran.r-project.org/web/packages/SCOUTer/vignettes/demoscouter.html">vignette</a>.</p>
<p><img src="SCOUTer.png" height = "400" width="600"></p>
<h3 id="time-series">Time Series</h3>
<p><a href="https://cran.r-project.org/package=bootUR">bootUR</a> v0.1.0: Provides functions to perform various bootstrap unit root tests for individual time series (including the augmented Dickey-Fuller test and union tests), multiple time series, and panel data. See <a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9892.2007.00565.x">Palm et al. (2008)</a> for background, and the <a href="https://cran.r-project.org/web/packages/bootUR/index.html">vignette</a> for an introduction and extensive references.</p>
<p><a href="https://cran.r-project.org/package=ChangePointTaylor">ChangePointTaylor</a> v0.1.0: Implements the change in mean detection method described in <a href="https://variation.com/wp-content/uploads/change-point-analyzer/change-point-analysis-a-powerful-new-tool-for-detecting-changes.pdf">Taylor (2000)</a>. See the <a href="https://cran.r-project.org/web/packages/ChangePointTaylor/vignettes/ChangePointTaylor-vignette.html">vignette</a>.</p>
<p><img src="cpt.png" height = "400" width="600"></p>
<p><a href="https://CRAN.R-project.org/package=LOPART">LOPART</a>: Implements the change point detection algorithm described in <a href="https://arxiv.org/abs/2006.13967">Hocking and Srivastava (2020)</a>.</p>
<p><img src="LOPART.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=modeltime">modeltime</a> v0.0.2: Implements a time series forecasting framework for use with the <code>tidymodels</code> ecosystem, including ARIMA, Exponential Smoothing, and time series models from the <code>forecast</code> and <code>prophet</code> packages. See <a href="https://otexts.com/fpp2/"><em>Forecasting Principles & Practice</em></a> and <a href="https://research.fb.com/blog/2017/02/prophet-forecasting-at-scale/"><em>Prophet: forecasting at scale</em></a> for background. There is a <a href="https://cran.r-project.org/web/packages/modeltime/vignettes/getting-started-with-modeltime.html">Getting Started Guide</a> and vignettes describing <a href="https://cran.r-project.org/web/packages/modeltime/vignettes/extending-modeltime.html">Extension</a> and the <a href="https://cran.r-project.org/web/packages/modeltime/vignettes/modeltime-model-list.html">Model List</a>.</p>
<p><img src="modeltime.jpeg" height = "400" width="600"></p>
<h3 id="utilities">Utilities</h3>
<p><a href="https://cran.r-project.org/package=knitrdata">knitrdata</a> v0.5.0: Implements a data language engine for incorporating data directly into <code>rmarkdown</code> documents so that they can be made completely standalone. See the <a href="https://cran.r-project.org/web/packages/knitrdata/vignettes/data_language_engine_vignette.html">vignette</a> for details.</p>
<p><a href="https://cran.r-project.org/package=lazyarray">lazyarray</a> v1.1.0: Implements multi-threaded, serialized, compressed arrays that fully utilize modern solid state drives, allowing users to quickly store large data while using limited memory. A lazy-array can be shared across multiple R sessions, and multiple sessions can write to the same array simultaneously. For more information, look <a href="https://github.com/dipterix/lazyarray">here</a>.</p>
<div align="center"><iframe width="684" height="385" src="https://www.youtube.com/embed/xX4YRAXYFxE" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></div>
<p><a href="https://cran.r-project.org/package=officedown">officedown</a> v0.2.0: Provides functions to produce Microsoft Word documents from R Markdown. There are vignettes on <a href="https://cran.r-project.org/web/packages/officedown/vignettes/captions.html">Captions and References</a>, <a href="https://cran.r-project.org/web/packages/officedown/vignettes/lists.html">Lists</a>, <a href="https://cran.r-project.org/web/packages/officedown/vignettes/officer.html"><code>officer</code> Support</a>, <a href="https://cran.r-project.org/web/packages/officedown/vignettes/tables.html">Data Frame Printing</a>, and <a href="https://cran.r-project.org/web/packages/officedown/vignettes/yaml.html">YAML Headers</a>.</p>
<p><img src="officedown.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=r2dictionary">r2dictionary</a> v0.1: Allows users to search for definitions of terms directly from within the R environment. The source dictionary is the Online Plain Text English Dictionary (<a href="https://www.mso.anu.edu.au/~ralph/OPTED/">OPTED</a>). See the <a href="https://cran.r-project.org/web/packages/r2dictionary/vignettes/simple_samples.html">vignette</a>.</p>
<p><a href="https://CRAN.R-project.org/package=rmdpartials">rmdpartials</a> v0.5.8: Enables the use of <code>rmarkdown</code> <em>partials</em> (<code>knitr</code> <em>child</em> documents) for making components of HTML, PDF and Word documents. See the <a href="https://cran.r-project.org/web/packages/rmdpartials/vignettes/rmdpartials.html">vignette</a> to get started.</p>
<p><a href="https://cran.r-project.org/package=tidycat">tidycat</a> v0.1.1: Provides functions to create additional rows and columns on <code>broom::tidy()</code> output to allow for easier control of categorical parameter estimates. The <a href="https://cran.r-project.org/web/packages/tidycat/vignettes/intro.html">vignette</a> contains examples.</p>
<p><img src="tidycat.png" height = "400" width="600"></p>
<h3 id="visualization">Visualization</h3>
<p><a href="https://CRAN.R-project.org/package=ggdist">ggdist</a> v2.2.0: Provides primitives for visualizing distributions using <code>ggplot2</code> that are tuned for visualizing uncertainty in either a frequentist or Bayesian mode. Primitives include points with multiple uncertainty intervals, eye plots <a href="https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/1467-985X.00120">Spiegelhalter (1999)</a>, density plots, gradient plots, dot plots <a href="https://www.tandfonline.com/doi/abs/10.1080/00031305.1999.10474474">Wilkinson (1999)</a>, quantile dot plots <a href="https://dl.acm.org/doi/10.1145/2858036.2858558">Kay et al. (2016)</a>, complementary cumulative distribution function barplots <a href="https://dl.acm.org/doi/10.1145/3173574.3173718">Fernandes et al. (2018)</a>, and fit curves with multiple uncertainty ribbons.</p>
<p><a href="https://CRAN.R-project.org/package=loon.ggplot">loon.ggplot</a> v1.0.1: Provides a bridge between the <code>loon</code> and <code>ggplot2</code> packages. Users can turn static <code>ggplot2</code> plots into interactive <code>loon</code> plots and vice versa. There are vignettes on <a href="https://cran.r-project.org/web/packages/loon.ggplot/vignettes/ggplots2loon.html">ggplots -> loon</a>, <a href="https://cran.r-project.org/web/packages/loon.ggplot/vignettes/loon2ggplots.html">loon -> ggplots</a>, and on using pipes.</p>
<p><img src="loon.png" height = "300" width="300"></p>
<p><a href="https://cran.r-project.org/package=treeheatr">treeheatr</a> v0.1.0: Provides interpretable decision tree visualizations with the data represented as a heatmap at the tree’s leaf nodes. There is a <a href="https://cran.r-project.org/web/packages/treeheatr/vignettes/explore.html">vignette</a>.</p>
<p><img src="treeheatr.png" height = "300" width="500"></p>
<p><a href="https://CRAN.R-project.org/package=tilemaps">tilemaps</a> v0.2.0: Implements the algorithm of <a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.13200">McNeill and Hale (2017)</a> for generating tilemaps. See the <a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.13200">vignette</a>.</p>
<p><img src="tilemaps.png" height = "400" width="600"></p>
<p><a href="https://CRAN.R-project.org/package=wrGraph">wrGraph</a> v1.0.2: Provides enhancements to base R graphics for plotting high-throughput data, including automatic segmenting of the current device (e.g., window) to accommodate multiple new plots, automatic checking for the optimal location of legends, small histograms inserted as legends, the generation of mouse-over interactive HTML pages, and more. See the <a href="https://cran.r-project.org/web/packages/wrGraph/vignettes/wrGraphVignette2.html">vignette</a>.</p>
<p><img src="wrGraph.png" height = "400" width="600"></p>
Building A Neural Net from Scratch Using R - Part 2
https://rviews.rstudio.com/2020/07/24/building-a-neural-net-from-scratch-using-r-part-2/
Fri, 24 Jul 2020 00:00:00 +0000https://rviews.rstudio.com/2020/07/24/building-a-neural-net-from-scratch-using-r-part-2/
<p><em>Akshaj is a budding deep learning researcher who loves to work with R. He has worked as a Research Associate at the Indian Institute of Science and as a Data Scientist at KPMG India.</em></p>
<p>In the previous post, we went through the dataset, the pre-processing involved, the train-test split, and talked in detail about the architecture of the model. We started building our neural net chunk-by-chunk and wrote functions for initializing parameters and running forward propagation.</p>
<p>In this post, we’ll implement backpropagation by writing functions to calculate gradients and update the weights. Finally, we’ll make predictions on the test data and see how accurate our model is using metrics such as <code>Accuracy</code>, <code>Recall</code>, <code>Precision</code>, and <code>F1-score</code>. We’ll compare our neural net with a logistic regression model and visualize the difference in the decision boundaries produced by these models.</p>
<p>Let’s continue by implementing our cost function.</p>
<div id="compute-cost" class="section level3">
<h3>Compute Cost</h3>
<p>We will use the Binary Cross-Entropy loss function (aka log loss). Here, <span class="math inline">\(y\)</span> is the true label and <span class="math inline">\(\hat{y}\)</span> is the predicted output.</p>
<p><span class="math display">\[ cost = - \frac{1}{N}\sum_{i=1}^{N} \left[ y_{i}\log(\hat{y}_{i}) + (1 - y_{i})\log(1 - \hat{y}_{i}) \right] \]</span></p>
<p>The <code>computeCost()</code> function takes as arguments the input matrix <span class="math inline">\(X\)</span>, the true labels <span class="math inline">\(y\)</span> and a <code>cache</code>. <code>cache</code> is the output of the forward pass that we calculated above. To calculate the error, we will only use the final output <code>A2</code> from the <code>cache</code>.</p>
<pre class="r"><code>computeCost <- function(X, y, cache) {
m <- dim(X)[2]
A2 <- cache$A2
logprobs <- (log(A2) * y) + (log(1-A2) * (1-y))
cost <- -sum(logprobs/m)
return (cost)
}</code></pre>
<pre class="r"><code>cost <- computeCost(X_train, y_train, fwd_prop)
cost</code></pre>
<pre><code>## [1] 0.693</code></pre>
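<p>The initial cost of roughly 0.693 is what we expect from random initialization: when the network outputs probabilities near 0.5 for every sample, the cross-entropy reduces to <span class="math inline">\(-\log(0.5) = \log 2 \approx 0.693\)</span>. The formula itself can be hand-checked in base R on a pair of toy samples (the values below are illustrative, not from our dataset):</p>
<pre class="r"><code># Hand-check of binary cross-entropy on two toy samples
y_hat <- c(0.9, 0.2)  # predicted probabilities
y     <- c(1,   0)    # true labels
cost  <- -mean(y * log(y_hat) + (1 - y) * log(1 - y_hat))
round(cost, 4)  # 0.1643</code></pre>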
</div>
<div id="backpropagation" class="section level3">
<h3>Backpropagation</h3>
<p>Now comes the best part of this all: backpropagation!</p>
<p>We’ll write a function that will calculate the gradient of the loss function with respect to the parameters. Generally, in a deep network, we have something like the following.</p>
<p><img src="backprop_deep.png" alt = "Figure 3: Backpropagation with cache. Credits: deep learning.ai" height = "400" width="600"></p>
<p>The above figure has two hidden layers. During backpropagation (red boxes), we use the output cached during forward propagation (purple boxes). Our neural net has only one hidden layer. More specifically, we have the following:</p>
<p><img src="linear_backward.png" alt = "Figure 4: Backpropagation for a single layer. Credits: deep learning.ai" height = "200" width="400"></p>
<p>To compute backpropagation, we write a function that takes as arguments an input matrix <code>X</code>, the train labels <code>y</code>, the output activations from the forward pass as <code>cache</code>, and a list of <code>layer_sizes</code>. The three outputs <span class="math inline">\((dW^{[l]}, db^{[l]}, dA^{[l-1]})\)</span> are computed using the input <span class="math inline">\(dZ^{[l]}\)</span> where <span class="math inline">\(l\)</span> is the layer number.</p>
<p>We first differentiate the loss function with respect to the weight <span class="math inline">\(W\)</span> of the current layer.</p>
<p><span class="math display">\[ dW^{[l]} = \frac{\partial \mathcal{L} }{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1] T} \tag{8}\]</span></p>
<p>Then we differentiate the loss function with respect to the bias <span class="math inline">\(b\)</span> of the current layer.
<span class="math display">\[ db^{[l]} = \frac{\partial \mathcal{L} }{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} dZ^{[l](i)}\tag{9}\]</span></p>
<p>Once we have these, we calculate the derivative of the loss with respect to <span class="math inline">\(A^{[l-1]}\)</span>, the activated output of the previous layer.</p>
<p><span class="math display">\[ dA^{[l-1]} = \frac{\partial \mathcal{L} }{\partial A^{[l-1]}} = W^{[l] T} dZ^{[l]} \tag{10}\]</span></p>
<p>Because we only have a single hidden layer, we first calculate the gradients for the final (output) layer and then the middle (hidden) layer. In other words, the gradients for the weights that lie between the output and hidden layer are calculated first. Using this (and chain rule), gradients for the weights that lie between the hidden and input layer are calculated next.</p>
<p>Finally, we return a list of gradient matrices. These gradients tell us the small amount by which we should increase or decrease our weights so that the loss decreases. Here are the equations for the gradients. I’ve calculated them for you so you don’t have to differentiate anything. We’ll use these values directly:</p>
<ul>
<li><span class="math inline">\(dZ^{[2]} = A^{[2]} - Y\)</span></li>
<li><span class="math inline">\(dW^{[2]} = \frac{1}{m} dZ^{[2]}A^{[1]^T}\)</span></li>
<li><span class="math inline">\(db^{[2]} = \frac{1}{m}\sum dZ^{[2]}\)</span></li>
<li><span class="math inline">\(dZ^{[1]} = W^{[2] T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]})\)</span> where <span class="math inline">\(g\)</span> is the activation function.</li>
<li><span class="math inline">\(dW^{[1]} = \frac{1}{m}dZ^{[1]}X^{T}\)</span></li>
<li><span class="math inline">\(db^{[1]} = \frac{1}{m}\sum dZ^{[1]}\)</span></li>
</ul>
<p>If you would like to know more about the math involved in constructing these equations, please see the references below.</p>
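<p>As one example of where these come from, the first identity, <span class="math inline">\(dZ^{[2]} = A^{[2]} - Y\)</span>, follows from pairing the sigmoid output with the cross-entropy loss. A sketch of the chain-rule step for a single sample:</p>
<p><span class="math display">\[ \frac{\partial \mathcal{L}}{\partial Z^{[2]}} = \frac{\partial \mathcal{L}}{\partial A^{[2]}} \cdot \frac{\partial A^{[2]}}{\partial Z^{[2]}} = \frac{A^{[2]} - y}{A^{[2]}(1 - A^{[2]})} \cdot A^{[2]}(1 - A^{[2]}) = A^{[2]} - y \]</span></p>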
<pre class="r"><code>backwardPropagation <- function(X, y, cache, params, list_layer_size){
  m   <- dim(X)[2]                    # number of training samples
  n_h <- list_layer_size$n_h
  n_y <- list_layer_size$n_y
  A2  <- cache$A2
  A1  <- cache$A1
  W2  <- params$W2
  # Output layer: dZ2 = A2 - y for sigmoid + cross-entropy
  dZ2 <- A2 - y
  dW2 <- 1/m * (dZ2 %*% t(A1))
  db2 <- matrix(1/m * sum(dZ2), nrow = n_y)
  # Hidden layer: (1 - A1^2) is the derivative of tanh
  dZ1 <- (t(W2) %*% dZ2) * (1 - A1^2)
  dW1 <- 1/m * (dZ1 %*% t(X))
  db1 <- matrix(1/m * sum(dZ1), nrow = n_h)
  grads <- list("dW1" = dW1,
                "db1" = db1,
                "dW2" = dW2,
                "db2" = db2)
  return(grads)
}</code></pre>
<p>As you can see below, the shapes of the gradients are the same as their corresponding weights i.e. <code>W1</code> has the same shape as <code>dW1</code> and so on. This is important because we are going to use these gradients to update our actual weights.</p>
<pre class="r"><code>back_prop <- backwardPropagation(X_train, y_train, fwd_prop, init_params, layer_size)
lapply(back_prop, function(x) dim(x))</code></pre>
<pre><code>## $dW1
## [1] 4 2
##
## $db1
## [1] 4 1
##
## $dW2
## [1] 1 4
##
## $db2
## [1] 1 1</code></pre>
</div>
<div id="update-parameters" class="section level3">
<h3>Update Parameters</h3>
<p>From the gradients calculated by the <code>backwardPropagation()</code>, we update our weights using the <code>updateParameters()</code> function. The <code>updateParameters()</code> function takes as arguments the gradients, network parameters, and a learning rate.</p>
<p>Why a learning rate? Because sometimes the weight updates (gradients) are too large, and because of that we overshoot the minima completely. The learning rate is a hyper-parameter set by us, the user, to control the impact of weight updates; its value lies between <span class="math inline">\(0\)</span> and <span class="math inline">\(1\)</span>. The learning rate is multiplied with the gradients before they are subtracted from the weights. The weights are updated as follows, where the learning rate is denoted by <span class="math inline">\(\alpha\)</span>.</p>
<ul>
<li><span class="math inline">\(W^{[2]} = W^{[2]} - \alpha * dW^{[2]}\)</span></li>
<li><span class="math inline">\(b^{[2]} = b^{[2]} - \alpha * db^{[2]}\)</span></li>
<li><span class="math inline">\(W^{[1]} = W^{[1]} - \alpha * dW^{[1]}\)</span></li>
<li><span class="math inline">\(b^{[1]} = b^{[1]} - \alpha * db^{[1]}\)</span></li>
</ul>
<p>The updated parameters are returned by the <code>updateParameters()</code> function, which takes the gradients, the current parameters, and a learning rate as input. <code>grads</code> and <code>params</code> were calculated above, while we choose the <code>learning_rate</code> ourselves.</p>
<pre class="r"><code>updateParameters <- function(grads, params, learning_rate){
W1 <- params$W1
b1 <- params$b1
W2 <- params$W2
b2 <- params$b2
dW1 <- grads$dW1
db1 <- grads$db1
dW2 <- grads$dW2
db2 <- grads$db2
W1 <- W1 - learning_rate * dW1
b1 <- b1 - learning_rate * db1
W2 <- W2 - learning_rate * dW2
b2 <- b2 - learning_rate * db2
updated_params <- list("W1" = W1,
"b1" = b1,
"W2" = W2,
"b2" = b2)
return (updated_params)
}</code></pre>
<p>As we can see, the weights still maintain their original shapes. This means we’ve done things correctly up to this point.</p>
<pre class="r"><code>update_params <- updateParameters(back_prop, init_params, learning_rate = 0.01)
lapply(update_params, function(x) dim(x))</code></pre>
<pre><code>## $W1
## [1] 4 2
##
## $b1
## [1] 4 1
##
## $W2
## [1] 1 4
##
## $b2
## [1] 1 1</code></pre>
</div>
<div id="train-the-model" class="section level2">
<h2>Train the Model</h2>
<p>Now that we have all our components, let’s go ahead and write a function that will train our model.</p>
<p>We will use all the functions we have written above in the following order.</p>
<ol style="list-style-type: decimal">
<li>Run forward propagation</li>
<li>Calculate loss</li>
<li>Calculate gradients</li>
<li>Update parameters</li>
<li>Repeat</li>
</ol>
<p>The <code>trainModel()</code> function takes as arguments the input matrix <code>X</code>, the true labels <code>y</code>, the number of epochs, the number of hidden neurons, and the learning rate. It then does the following:</p>
<ol style="list-style-type: decimal">
<li>Get the sizes for layers and initialize random parameters.</li>
<li>Initialize a vector called <code>cost_history</code> which we’ll use to store the calculated loss value per epoch.</li>
<li>Run a for-loop:
<ul>
<li>Run forward prop.</li>
<li>Calculate loss.</li>
<li>Update parameters.</li>
<li>Replace the current parameters with updated parameters.</li>
</ul></li>
</ol>
<p>This function returns the updated parameters which we’ll use to run our model inference. It also returns the <code>cost_history</code> vector.</p>
<pre class="r"><code>trainModel <- function(X, y, num_iteration, hidden_neurons, lr){
layer_size <- getLayerSize(X, y, hidden_neurons)
init_params <- initializeParameters(X, layer_size)
cost_history <- c()
for (i in 1:num_iteration) {
fwd_prop <- forwardPropagation(X, init_params, layer_size)
cost <- computeCost(X, y, fwd_prop)
back_prop <- backwardPropagation(X, y, fwd_prop, init_params, layer_size)
update_params <- updateParameters(back_prop, init_params, learning_rate = lr)
init_params <- update_params
cost_history <- c(cost_history, cost)
if (i %% 10000 == 0) cat("Iteration", i, " | Cost: ", cost, "\n")
}
model_out <- list("updated_params" = update_params,
"cost_hist" = cost_history)
return (model_out)
}</code></pre>
<p>Now that we’ve defined our function to train, let’s run it! We’re going to train our model, with 40 hidden neurons, for 60000 epochs with a learning rate of 0.9. We will print out the loss after every 10000 epochs.</p>
<pre class="r"><code>EPOCHS = 60000
HIDDEN_NEURONS = 40
LEARNING_RATE = 0.9
train_model <- trainModel(X_train, y_train, hidden_neurons = HIDDEN_NEURONS, num_iteration = EPOCHS, lr = LEARNING_RATE)</code></pre>
<pre><code>## Iteration 10000 | Cost: 0.3724
## Iteration 20000 | Cost: 0.4081
## Iteration 30000 | Cost: 0.3273
## Iteration 40000 | Cost: 0.4671
## Iteration 50000 | Cost: 0.4479
## Iteration 60000 | Cost: 0.3074</code></pre>
<p><img src="/post/2020-07-21-building-a-neural-net-from-scratch-using-r-part-2/index_files/figure-html/unnamed-chunk-10-1.png" width="672" /></p>
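<p>The loss curve above can be reproduced from the history returned by <code>trainModel()</code>. A minimal sketch using the <code>train_model</code> object fitted above (plotting details are a matter of taste):</p>
<pre class="r"><code># Plot the per-epoch loss stored during training
cost_hist <- train_model$cost_hist
plot(seq_along(cost_hist), cost_hist,
     type = "l", col = "blue",
     xlab = "Epoch", ylab = "Loss",
     main = "Loss per epoch")</code></pre>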
</div>
<div id="logistic-regression" class="section level2">
<h2>Logistic Regression</h2>
<p>Before we go ahead and test our neural net, let’s quickly train a simple logistic regression model so that we can compare its performance with our neural net. Since a logistic regression model can learn only linear boundaries, it will not fit the data well. A neural network, on the other hand, will.</p>
<p>We’ll use the <code>glm()</code> function in R to build this model.</p>
<pre class="r"><code>lr_model <- glm(y ~ x1 + x2, data = train)
lr_model</code></pre>
<pre><code>##
## Call: glm(formula = y ~ x1 + x2, data = train)
##
## Coefficients:
## (Intercept) x1 x2
## 0.51697 0.00889 -0.05207
##
## Degrees of Freedom: 319 Total (i.e. Null); 317 Residual
## Null Deviance: 80
## Residual Deviance: 76.4 AIC: 458</code></pre>
<p>Let’s now generate predictions from the logistic regression model on the test set.</p>
<pre class="r"><code>lr_pred <- round(as.vector(predict(lr_model, test[, 1:2])))
lr_pred</code></pre>
<pre><code>## [1] 1 1 0 1 0 1 1 1 0 1 0 1 0 1 0 0 0 1 1 1 1 1 0 0 0 0 1 1 1 0 1 0 1 1 1 1 1 1
## [39] 1 1 1 1 0 0 1 0 0 0 1 1 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 0 1 0 1 1 0 1 1 1 0 0
## [77] 1 1 1 0</code></pre>
</div>
<div id="test-the-model" class="section level2">
<h2>Test the Model</h2>
<p>Finally, it’s time to make predictions. To do that:</p>
<ol style="list-style-type: decimal">
<li>First get the layer sizes.</li>
<li>Run forward propagation.</li>
<li>Return the prediction.</li>
</ol>
<p>During inference, we do not need to perform backpropagation, as you can see below. We only perform forward propagation and return the final output from our neural network. (Note that instead of randomly initializing parameters, we’re using the trained parameters here.)</p>
<pre class="r"><code>makePrediction <- function(X, y, hidden_neurons){
layer_size <- getLayerSize(X, y, hidden_neurons)
params <- train_model$updated_params
fwd_prop <- forwardPropagation(X, params, layer_size)
pred <- fwd_prop$A2
return (pred)
}</code></pre>
<p>After obtaining our output probabilities from the sigmoid, we round them off to obtain the output labels.</p>
<pre class="r"><code>y_pred <- makePrediction(X_test, y_test, HIDDEN_NEURONS)
y_pred <- round(y_pred)</code></pre>
<p>Here are the true labels and the predicted labels.</p>
<pre><code>## Neural Net:
## 1 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 1 0 1 1 0 1 0 0 0 0 1 0 1 0 1 1 0 0 1 0 1 1 1 0 1 1 0 0 1 1 0 1 1 0 1 0 1 0 0 0 1 0 1 0 0 0 1 1 0 0 1 1 1 1 1 0 1 1 0 0 1 1 0 1</code></pre>
<pre><code>## Ground Truth:
## 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 1 1 0 1 0 0 1 0 1 0 0 0 0 1 1 1 1 0 1 0 0 1 1 1 0 1 0 0 1 1 1 0 1 1 0 1 0 1 0 0 0 1 0 0 1 0 0 1 1 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 1</code></pre>
<pre><code>## Logistic Reg:
## 1 1 0 1 0 1 1 1 0 1 0 1 0 1 0 0 0 1 1 1 1 1 0 0 0 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1 1 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 0 1 0 1 1 0 1 1 1 0 0 1 1 1 0</code></pre>
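<p>Eyeballing label vectors is error-prone; with the objects above in hand, agreement with the ground truth can be computed directly (a quick sketch using this session’s <code>y_pred</code>, <code>lr_pred</code>, and <code>y_test</code>):</p>
<pre class="r"><code># Fraction of test labels each model predicted correctly
c(neural_net = mean(as.vector(y_pred) == as.vector(y_test)),
  logistic   = mean(lr_pred == as.vector(y_test)))</code></pre>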
<div id="decision-boundaries" class="section level3">
<h3>Decision Boundaries</h3>
<p>In the following visualization, we’ve plotted our test-set predictions on top of the decision boundaries.</p>
<div id="neural-net" class="section level4">
<h4>Neural Net</h4>
<p>As we can see, our neural net was able to learn the non-linear decision boundary and has produced accurate results.</p>
<p><img src="/post/2020-07-21-building-a-neural-net-from-scratch-using-r-part-2/index_files/figure-html/unnamed-chunk-17-1.png" width="672" /></p>
</div>
<div id="logistic-regression-1" class="section level4">
<h4>Logistic Regression</h4>
<p>On the other hand, logistic regression, with its linear decision boundary, could not fit the data very well.</p>
<p><img src="/post/2020-07-21-building-a-neural-net-from-scratch-using-r-part-2/index_files/figure-html/unnamed-chunk-19-1.png" width="672" /></p>
</div>
</div>
<div id="confusion-matrix" class="section level3">
<h3>Confusion Matrix</h3>
<p>A confusion matrix is often used to describe the performance of a classifier.
It is defined as:</p>
<p><span class="math display">\[\mathbf{Confusion\ Matrix} = \left[\begin{array}{cc}
\text{True Negative} & \text{False Positive} \\
\text{False Negative} & \text{True Positive}
\end{array}\right]
\]</span></p>
<p>Let’s go over the basic terms used in a confusion matrix through an example. Consider the case where we were trying to predict if an email was spam or not.</p>
<ul>
<li><strong>True Positive</strong>: Email was predicted to be spam and it actually was spam.</li>
<li><strong>True Negative</strong>: Email was predicted as not-spam and it actually was not-spam.</li>
<li><strong>False Positive</strong>: Email was predicted to be spam but it actually was not-spam.</li>
<li><strong>False Negative</strong>: Email was predicted to be not-spam but it actually was spam.</li>
</ul>
<pre class="r"><code>tb_nn <- table(y_test, y_pred)
tb_lr <- table(y_test, lr_pred)
cat("NN Confusion Matrix: \n")</code></pre>
<pre><code>## NN Confusion Matrix:</code></pre>
<pre class="r"><code>tb_nn</code></pre>
<pre><code>## y_pred
## y_test 0 1
## 0 34 10
## 1 7 29</code></pre>
<pre class="r"><code>cat("\nLR Confusion Matrix: \n")</code></pre>
<pre><code>##
## LR Confusion Matrix:</code></pre>
<pre class="r"><code>tb_lr</code></pre>
<pre><code>## lr_pred
## y_test 0 1
## 0 14 30
## 1 18 18</code></pre>
</div>
<div id="accuracy-metrics" class="section level3">
<h3>Accuracy Metrics</h3>
<p>We’ll calculate Precision, Recall, F1-score, and Accuracy. These metrics, derived from the confusion matrix, are defined as follows.</p>
<p><strong>Precision</strong> is defined as the number of true positives over the number of true positives plus the number of false positives.</p>
<p><span class="math display">\[Precision = \frac {True Positive}{True Positive + False Positive} \]</span></p>
<p><strong>Recall</strong> is defined as the number of true positives over the number of true positives plus the number of false negatives.</p>
<p><span class="math display">\[Recall = \frac {True Positive}{True Positive + False Negative} \]</span></p>
<p><strong>F1-score</strong> is the harmonic mean of precision and recall.</p>
<p><span class="math display">\[F1 Score = 2 \times \frac {Precision \times Recall}{Precision + Recall} \]</span></p>
<p><strong>Accuracy</strong> gives us the percentage of correct predictions out of the total predictions made.</p>
<p><span class="math display">\[Accuracy = \frac {True Positive + True Negative} {True Positive + False Positive + True Negative + False Negative} \]</span></p>
<p>To better understand these terms, let’s continue the example of “email-spam” we used above.</p>
<ul>
<li><p>If our model had a precision of 0.6, that would mean when it predicts an email as spam, then it is correct 60% of the time.</p></li>
<li><p>If our model had a recall of 0.8, then it would mean our model correctly classifies 80% of all spam.</p></li>
<li><p>The F1-score is the way we combine precision and recall. A perfect F1-score is represented by a value of 1, and the worst by 0.</p></li>
</ul>
<p>Now that we have an understanding of the accuracy metrics, let’s actually calculate them. We’ll define a function that takes as input the confusion matrix. Then based on the above formulas, we’ll calculate the metrics.</p>
<pre class="r"><code>calculate_stats <- function(tb, model_name) {
  # tb is a 2x2 confusion matrix; in R's column-major indexing the formulas
  # below imply tb[1] = TN, tb[2] = FP, tb[3] = FN, tb[4] = TP
  acc       <- (tb[1] + tb[4]) / (tb[1] + tb[2] + tb[3] + tb[4])
  recall    <- tb[4] / (tb[4] + tb[3])
  precision <- tb[4] / (tb[4] + tb[2])
  f1        <- 2 * ((precision * recall) / (precision + recall))

  cat(model_name, ": \n")
  cat("\tAccuracy = ", acc * 100, "%.")
  cat("\n\tPrecision = ", precision * 100, "%.")
  cat("\n\tRecall = ", recall * 100, "%.")
  cat("\n\tF1 Score = ", f1 * 100, "%.\n\n")
}</code></pre>
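<p>As a quick sanity check, here is a hypothetical confusion matrix run through the same formulas. The counts, and the assumption that rows hold predictions and columns hold actual labels, are made up for illustration:</p>
<pre class="r"><code># Hypothetical 2x2 confusion matrix (rows = predicted, columns = actual),
# so in column-major order: tb[1] = TN, tb[2] = FP, tb[3] = FN, tb[4] = TP
tb <- matrix(c(30, 5, 10, 35), nrow = 2)

precision <- tb[4] / (tb[4] + tb[2])   # TP / (TP + FP)
recall    <- tb[4] / (tb[4] + tb[3])   # TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)
acc       <- (tb[1] + tb[4]) / sum(tb)

round(c(precision = precision, recall = recall, f1 = f1, accuracy = acc), 4)</code></pre>
<p>With these counts, precision is 35/40 = 0.875 and accuracy is 65/80 = 0.8125.</p>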
<p>Here are the metrics for our neural-net and logistic regression.</p>
<pre><code>## Neural Network :
## Accuracy = 78.75 %.
## Precision = 80.56 %.
## Recall = 74.36 %.
## F1 Score = 77.33 %.</code></pre>
<pre><code>## Logistic Regression :
## Accuracy = 40 %.
## Precision = 50 %.
## Recall = 37.5 %.
## F1 Score = 42.86 %.</code></pre>
<p>As we can see, the logistic regression performed poorly because it cannot learn non-linear boundaries. Neural nets, on the other hand, are able to learn non-linear boundaries and, as a result, have fit our complex data very well.</p>
</div>
</div>
<div id="conclusion" class="section level2">
<h2>Conclusion</h2>
<p>In this two-part series, we’ve built a neural net from scratch with a vectorized implementation of backpropagation. We went through the entire life cycle of training a model, from data pre-processing to model evaluation. Along the way, we learned about the mathematics that makes a neural network work. We went over basic concepts of linear algebra and calculus and implemented them as functions. We saw how to initialize weights, perform forward propagation, gradient descent, and back-propagation.</p>
<p>We learned about the ability of a neural net to fit non-linear data and understood the important role activation functions play in it. We trained a neural net and compared its performance to a logistic regression model. We visualized the decision boundaries of both these models and saw how the neural net was able to fit the data better than logistic regression. We learned about metrics like precision, recall, F1 score, and accuracy by evaluating our models against them.</p>
<p>You should now have a pretty solid understanding of how neural-networks are built.</p>
<p>I hope you had as much fun reading as I had while writing this! If I’ve made a mistake somewhere, I’d love to hear about it so I can correct it. Suggestions and constructive criticism are welcome. :)</p>
</div>
<div id="references" class="section level2">
<h2>References</h2>
<p>Here is a short list of two intermediate level and two beginner level references for the mathematics underlying neural networks.</p>
<p><strong>Intermediate</strong></p>
<ul>
<li><em>The Matrix Calculus You Need for Deep Learning</em> - <a href="https://arxiv.org/abs/1802.01528">Parr and Howard (2018)</a></li>
<li><em>Deep Learning: An Introduction for Applied Mathematicians</em> - <a href="https://arxiv.org/abs/1801.05894">Higham and Higham (2018)</a></li>
</ul>
<p><strong>Beginner</strong></p>
<ul>
<li><a href="https://www.coursera.org/learn/neural-networks-deep-learning?specialization=deep-learning">Deep Learning</a> course by Andrew Ng on Coursera. It can be audited for free.<br />
</li>
<li>Grant Sanderson’s YouTube channel. Here are the 4 relevant playlists. <a href="https://www.youtube.com/playlist?list=PLZHQObOWTQDNPOjrT6KVlfJuKtYTftqH6">diff eq</a>, <a href="https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab">linear algebra</a>, <a href="https://www.youtube.com/playlist?list=PLZHQObOWTQDMsr9K-rj53DwVRMYO3t5Yr">calculus</a>, <a href="https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi">neural nets</a>.</li>
</ul>
</div>
Building A Neural Net from Scratch Using R - Part 1
https://rviews.rstudio.com/2020/07/20/shallow-neural-net-from-scratch-using-r-part-1/
Mon, 20 Jul 2020 00:00:00 +0000https://rviews.rstudio.com/2020/07/20/shallow-neural-net-from-scratch-using-r-part-1/
<p><em>Akshaj is a budding deep learning researcher who loves to work with R. He has worked as a Research Associate at the Indian Institute of Science and as a Data Scientist at KPMG India.</em></p>
<p>A lot of deep learning frameworks abstract away the mechanics behind training a neural network. While this has the advantage of quickly building deep learning models, it has the disadvantage of hiding the details. It is equally important to slow down and understand how neural nets work. In this two-part series, we’ll dig deep and build our own neural net from scratch. This will help us understand, at a basic level, how those big frameworks work. The network we’ll build will contain a single hidden layer and perform binary classification using a vectorized implementation of backpropagation, all written in base-R. We will describe in detail what a single-layer neural network is, how it works, and the equations used to describe it. We will see what kind of data preparation is required to be able to use it with a neural network. Then, we will implement a neural net step-by-step from scratch and examine the output at each step. Finally, to see how our neural net fares, we will describe a few metrics used for classification problems and use them.</p>
<p>In this first part, we’ll present the dataset we are going to use, the pre-processing involved, the train-test split, and describe in detail the architecture of the model. Then we’ll build our neural net chunk-by-chunk. It will involve writing functions for initializing parameters and running forward propagation.</p>
<p>In the second part, we’ll implement backpropagation by writing functions to calculate gradients and update the weights. Finally, we’ll make predictions on the test data and see how accurate our model is using metrics such as <code>Accuracy</code>, <code>Recall</code>, <code>Precision</code>, and <code>F1-score</code>. We’ll compare our neural net with a logistic regression model and visualize the difference in the decision boundaries produced by these models.</p>
<p>By the end of this series, you should have a deeper understanding of the math behind neural-networks and the ability to implement it yourself from scratch!</p>
<div id="set-seed" class="section level3">
<h3>Set Seed</h3>
<p>Before we start, let’s set a seed value to ensure reproducibility of the results.</p>
<pre class="r"><code>set.seed(69)</code></pre>
</div>
<div id="architecture-definition" class="section level3">
<h3>Architecture Definition</h3>
<p>To understand the matrix multiplications better and keep the numbers digestible, we will describe a very simple 3-layer neural net i.e. a neural net with a single hidden layer. The <span class="math inline">\(1^{st}\)</span> layer will take in the inputs and the <span class="math inline">\(3^{rd}\)</span> layer will spit out an output.</p>
<p>The input layer will have two (input) neurons, the hidden layer four (hidden) neurons, and the output layer one (output) neuron.</p>
<p>Our input layer has two neurons because we’ll be passing two features (columns of a dataframe) as the input, and our output layer has a single neuron because we’re performing binary classification. This means two output classes - 0 and 1. Our output will actually be a probability (a number that lies between 0 and 1). We’ll define a threshold for rounding off this probability to 0 or 1. For instance, this threshold can be 0.5.</p>
<p>In a deep neural net, multiple hidden layers are stacked together (hence the name “deep”). Each hidden layer can contain any number of neurons you want.</p>
<p>In this series, we’re implementing a single-layer neural net which, as the name suggests, contains a single hidden layer.</p>
<ul>
<li><code>n_x</code>: the size of the input layer (set this to 2).</li>
<li><code>n_h</code>: the size of the hidden layer (set this to 4).</li>
<li><code>n_y</code>: the size of the output layer (set this to 1).</li>
</ul>
<p><img src="single_layer_nn.png" alt = "Figure 1: Single layer NNet Architecture. Credits: deeplearning.ai" height = "400" width="600"></p>
<p>Neural networks flow from left to right, i.e. input to output. In the above example, we have two features (two columns from the input dataframe) that arrive at the input neurons from the first-row of the input dataframe. These two numbers are then multiplied by a set of weights (randomly initialized at first and later optimized).</p>
<p>An activation function is then applied on the result of this multiplication. This new set of numbers becomes the neurons in our hidden layer. These neurons are again multiplied by another set of weights (randomly initialized) with an activation function applied to this result. The final result we obtain is a single number. This is the prediction of our neural-net. It’s a number that lies between 0 and 1.</p>
<p>Once we have a prediction, we then compare it to the true output. To optimize the weights in order to make our predictions more accurate (because right now our input is being multiplied by random weights to give a random prediction), we need to first calculate how far off our prediction is from the actual value. Once we have this <em>loss</em>, we calculate the gradients with respect to each weight.</p>
<p>The gradients tell us the amount by which we need to increase or decrease each weight parameter in order to minimize the loss. All the weights in the network are updated as we repeat the entire process with the second input sample (second row).</p>
<p>After all the input samples have been used to optimize weights, we say that one epoch has passed. We repeat this process for multiple epochs until our loss stops decreasing.</p>
<p>At this point, you might be wondering what an activation function is. An activation function adds non-linearity to our network and enables it to learn complex features. If you look closely, a neural network consists of a bunch of multiplications and additions. It’s linear, and we know that a linear classification model will not be able to learn complex features in high dimensions.</p>
<p>Here are a few popular activation functions:</p>
<p><img src="activation_functions.png" alt = "Figure 2: Popular activation functions. Credits - analyticsindiamag" height = "400" width="600"></p>
<p>We will use <code>tanh()</code> and <code>sigmoid()</code> activation functions in our neural net. Because <code>tanh()</code> is already available in base-R, we will implement the <code>sigmoid()</code> function ourselves later on.</p>
</div>
<div id="dry-run" class="section level3">
<h3>Dry Run</h3>
<p>For now, let’s see how the numbers flow through the above described neural-net by writing out the equations for a single sample (one input row).</p>
<p>For one input sample <span class="math inline">\(x^{(i)}\)</span> where <span class="math inline">\(i\)</span> is the row-number:</p>
<p>First, we calculate the output <span class="math inline">\(Z\)</span> from the input <span class="math inline">\(x\)</span>. We will tune the parameters <span class="math inline">\(W\)</span> and <span class="math inline">\(b\)</span>. Here, the superscript in square brackets tells us the layer number, and the one in parentheses tells us the sample number. For instance, <span class="math inline">\(z^{[1] (i)}\)</span> is the output of the <span class="math inline">\(1^{{st}}\)</span> layer for the <span class="math inline">\(i^{{th}}\)</span> training sample.</p>
<p><span class="math display">\[z^{[1] (i)} = W^{[1]} x^{(i)} + b^{[1] (i)}\tag{1}\]</span></p>
<p>Then we’ll pass this value through the <code>tanh()</code> activation function to get <span class="math inline">\(a\)</span>.</p>
<p><span class="math display">\[a^{[1] (i)} = \tanh(z^{[1] (i)})\tag{2}\]</span></p>
<p>After that, we’ll calculate the value for the final output layer using the hidden layer values.</p>
<p><span class="math display">\[z^{[2] (i)} = W^{[2]} a^{[1] (i)} + b^{[2] (i)}\tag{3}\]</span></p>
<p>Finally, we’ll pass this value through the <code>sigmoid()</code> activation function and obtain our output probability.
<span class="math display">\[\hat{y}^{(i)} = a^{[2] (i)} = \sigma(z^{ [2] (i)})\tag{4}\]</span></p>
<p>To obtain our prediction class from output probabilities, we round off the values as follows.
<span class="math display">\[y^{(i)}_{prediction} = \begin{cases} 1 & \mbox{if } a^{[2](i)} > 0.5 \\ 0 & \mbox{otherwise } \end{cases}\tag{5}\]</span></p>
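<p>Equation (5) translates to a single vectorized line in R. This is only a sketch; the variable names are assumptions:</p>
<pre class="r"><code># Hypothetical output probabilities from the sigmoid output layer (shape: 1 x m)
A2 <- matrix(c(0.92, 0.13, 0.51, 0.48), nrow = 1)

# Equation (5): threshold each probability at 0.5 to get a class label
y_pred <- ifelse(A2 > 0.5, 1, 0)
y_pred  # 1 0 1 0</code></pre>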
<p>Once we have the prediction probabilities, we’ll compute the loss in order to tune our parameters (<span class="math inline">\(w\)</span> and <span class="math inline">\(b\)</span> can be adjusted using gradient-descent).</p>
<p>Given the predictions on all the examples, we will compute the cost <span class="math inline">\(J\)</span>, the cross-entropy loss, as follows:<br />
<span class="math display">\[J = - \frac{1}{m} \sum\limits_{i = 1}^{m} \large\left(\small y^{(i)}\log\left(\hat{y}^{(i)}\right) + (1-y^{(i)})\log\left(1- \hat{y}^{ (i)}\right) \large \right) \small \tag{6}\]</span></p>
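<p>The cost can be implemented as a short vectorized function. This is a sketch; the function name <code>computeCost</code> and the argument shapes (both <code>1 x m</code> matrices) are assumptions for illustration:</p>
<pre class="r"><code># Cross-entropy cost, averaged over all m samples (Equation 6)
computeCost <- function(A2, Y) {
  m <- dim(Y)[2]
  logprobs <- Y * log(A2) + (1 - Y) * log(1 - A2)
  -sum(logprobs) / m
}

# Hypothetical predicted probabilities against true labels
Y  <- matrix(c(1, 0, 1, 0), nrow = 1)
A2 <- matrix(c(0.9, 0.1, 0.8, 0.3), nrow = 1)
computeCost(A2, Y)</code></pre>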
<p>Once we have our loss, we need to calculate the gradients. I’ve calculated them for you so you don’t have to differentiate anything. We’ll directly use these values:</p>
<ul>
<li><span class="math inline">\(dZ^{[2]} = A^{[2]} - Y\)</span></li>
<li><span class="math inline">\(dW^{[2]} = \frac{1}{m} dZ^{[2]}A^{[1]^T}\)</span></li>
<li><span class="math inline">\(db^{[2]} = \frac{1}{m}\sum dZ^{[2]}\)</span></li>
<li><span class="math inline">\(dZ^{[1]} = W^{[2]T} dZ^{[2]} \ast g^{[1]'}(Z^{[1]})\)</span> where <span class="math inline">\(g\)</span> is the activation function and <span class="math inline">\(\ast\)</span> denotes element-wise multiplication.</li>
<li><span class="math inline">\(dW^{[1]} = \frac{1}{m}dZ^{[1]}X^{T}\)</span></li>
<li><span class="math inline">\(db^{[1]} = \frac{1}{m}\sum dZ^{[1]}\)</span></li>
</ul>
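<p>To make these gradient formulas concrete, here is a sketch of how they map to R on toy data, using <code>tanh</code> as <span class="math inline">\(g^{[1]}\)</span> (so its derivative is <span class="math inline">\(1 - \tanh^2(Z)\)</span>). The shapes and variable names are assumptions consistent with the architecture described above:</p>
<pre class="r"><code># Toy shapes: n_x = 2, n_h = 4, n_y = 1, m = 3 samples
set.seed(1)
m  <- 3
X  <- matrix(rnorm(2 * m), nrow = 2)
Y  <- matrix(c(1, 0, 1), nrow = 1)
W1 <- matrix(rnorm(4 * 2), nrow = 4); b1 <- matrix(0, 4, 1)
W2 <- matrix(rnorm(1 * 4), nrow = 1); b2 <- matrix(0, 1, 1)

sigmoid <- function(x) 1 / (1 + exp(-x))

# Forward pass (Equations 1-4), with tanh in the hidden layer
Z1 <- W1 %*% X + matrix(rep(b1, m), nrow = 4)
A1 <- tanh(Z1)
Z2 <- W2 %*% A1 + matrix(rep(b2, m), nrow = 1)
A2 <- sigmoid(Z2)

# Backward pass: the gradient formulas listed above
dZ2 <- A2 - Y                         # (1, m)
dW2 <- (1 / m) * dZ2 %*% t(A1)        # (1, n_h)
db2 <- (1 / m) * sum(dZ2)
dZ1 <- (t(W2) %*% dZ2) * (1 - A1^2)   # (n_h, m); 1 - tanh^2 is g'
dW1 <- (1 / m) * dZ1 %*% t(X)         # (n_h, n_x)
db1 <- (1 / m) * rowSums(dZ1)

dim(dW1); dim(dW2)</code></pre>
<p>Note that each gradient has the same shape as the parameter it updates, which is what makes the update step a simple element-wise subtraction.</p>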
<p>Now that we have the gradients, we will update the weights. We’ll multiply these gradients with a number known as the <code>learning rate</code>. The learning rate is represented by <span class="math inline">\(\alpha\)</span>.</p>
<ul>
<li><span class="math inline">\(W^{[2]} = W^{[2]} - \alpha * dW^{[2]}\)</span></li>
<li><span class="math inline">\(b^{[2]} = b^{[2]} - \alpha * db^{[2]}\)</span></li>
<li><span class="math inline">\(W^{[1]} = W^{[1]} - \alpha * dW^{[1]}\)</span></li>
<li><span class="math inline">\(b^{[1]} = b^{[1]} - \alpha * db^{[1]}\)</span></li>
</ul>
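<p>The four update rules above can be wrapped in a small helper. This is a sketch; the list structure and the name <code>updateParameters</code> are assumptions:</p>
<pre class="r"><code># Gradient-descent update for one iteration; assumes the weights and
# gradients are matrices of matching shapes
updateParameters <- function(params, grads, learning_rate) {
  params$W1 <- params$W1 - learning_rate * grads$dW1
  params$b1 <- params$b1 - learning_rate * grads$db1
  params$W2 <- params$W2 - learning_rate * grads$dW2
  params$b2 <- params$b2 - learning_rate * grads$db2
  params
}

params <- list(W1 = matrix(0.5, 4, 2), b1 = matrix(0, 4, 1),
               W2 = matrix(0.5, 1, 4), b2 = matrix(0, 1, 1))
grads  <- list(dW1 = matrix(1, 4, 2), db1 = matrix(1, 4, 1),
               dW2 = matrix(1, 1, 4), db2 = matrix(1, 1, 1))

updated <- updateParameters(params, grads, learning_rate = 0.1)
updated$W1[1, 1]  # 0.5 - 0.1 * 1 = 0.4</code></pre>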
<p>This process is repeated multiple times until our model converges i.e. we have learned a good set of weights that fit our data well.</p>
</div>
<div id="load-and-visualize-the-data" class="section level2">
<h2>Load and Visualize the Data</h2>
<p>Since the goal of the series is to understand how neural networks work behind the scenes, we’ll use a small dataset so that our focus stays on building our neural net.</p>
<p>We’ll use a planar dataset that looks like a flower. The output classes cannot be separated accurately using a straight line.</p>
<div id="construct-dataset" class="section level3">
<h3>Construct Dataset</h3>
<pre class="r"><code>df <- read.csv(file = "planar_flower.csv")</code></pre>
<p>Let’s shuffle our dataset so that our model is invariant to the order of samples. This is good for generalization and will help increase performance on unseen (test) data.</p>
<pre class="r"><code>df <- df[sample(nrow(df)), ]
head(df)</code></pre>
<pre><code>## x1 x2 y
## 209 1.53856 3.242555 0
## 347 -0.05617 -0.808464 0
## 386 -3.85811 1.423514 1
## 112 0.82630 0.044276 1
## 104 0.31350 0.004274 1
## 111 2.28420 0.352476 1</code></pre>
</div>
<div id="visualize-data" class="section level3">
<h3>Visualize Data</h3>
<p>We have four hundred samples, where two hundred belong to each class.</p>
<p>Here’s a scatter plot between our input variables. As you can see, the output classes are <strong>not</strong> easily separable.
<img src="/post/2020-07-11-shallow-neural-net-from-scratch-using-r-part-1/index_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
</div>
<div id="train-test-split" class="section level3">
<h3>Train-Test Split</h3>
<p>Now that we have our dataset prepared, let’s go ahead and split it into train and test sets. We’ll put 80% of our data into our train set and the remaining 20% into our test set. (To keep the focus on the neural net, we will not be using a validation set here.)</p>
<pre class="r"><code>train_test_split_index <- 0.8 * nrow(df)</code></pre>
<div id="train-and-test-dataset" class="section level4">
<h4>Train and Test Dataset</h4>
<p>Because we’ve already shuffled the dataset above, we can go ahead and extract the first 80% of rows into the train set.</p>
<pre class="r"><code>train <- df[1:train_test_split_index,]
head(train)</code></pre>
<pre><code>## x1 x2 y
## 209 1.53856 3.242555 0
## 347 -0.05617 -0.808464 0
## 386 -3.85811 1.423514 1
## 112 0.82630 0.044276 1
## 104 0.31350 0.004274 1
## 111 2.28420 0.352476 1</code></pre>
<p>Next, we select the last 20% of rows of the shuffled dataset to be our test set.</p>
<pre class="r"><code>test <- df[(train_test_split_index+1): nrow(df),]
head(test)</code></pre>
<pre><code>## x1 x2 y
## 210 -0.0352 -0.03489 0
## 348 2.7257 -0.54170 0
## 19 -2.2235 0.42137 1
## 362 2.3366 -0.40412 0
## 143 -1.4984 3.55267 0
## 4 -3.2264 -0.81648 0</code></pre>
<p>Here, we visualize the number of samples per class in our train and test data sets to ensure that there isn’t a major class imbalance.
<img src="/post/2020-07-11-shallow-neural-net-from-scratch-using-r-part-1/index_files/figure-html/unnamed-chunk-11-1.png" width="672" /></p>
</div>
</div>
</div>
<div id="preprocess" class="section level2">
<h2>Preprocess</h2>
<p>Neural networks work best when the input values are standardized. So, we’ll scale all the values to have <code>mean=0</code> and <code>standard-deviation=1</code>.</p>
<p>Standardizing input values speeds up the training and ensures faster convergence.</p>
<p>To standardize the input values, we’ll use the <code>scale()</code> function in R. Note that we’re standardizing the input values (X) only and not the output values (y).</p>
<pre class="r"><code>X_train <- scale(train[, c(1:2)])
y_train <- train$y
dim(y_train) <- c(length(y_train), 1) # add extra dimension to vector
X_test <- scale(test[, c(1:2)])
y_test <- test$y
dim(y_test) <- c(length(y_test), 1) # add extra dimension to vector</code></pre>
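<p>If you want to convince yourself of what <code>scale()</code> does, here is a quick check on toy data (the column names and values are made up for illustration):</p>
<pre class="r"><code>set.seed(42)

# Toy feature matrix with two columns on very different scales
X_raw    <- cbind(x1 = rnorm(100, mean = 50, sd = 10),
                  x2 = rnorm(100, mean = -3, sd = 0.1))
X_scaled <- scale(X_raw)

colMeans(X_scaled)       # both approximately 0
apply(X_scaled, 2, sd)   # both 1</code></pre>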
<p>The output below tells us the shape and size of our input data.</p>
<pre><code>## Shape of X_train (row, column):
## 320 2
## Shape of y_train (row, column) :
## 320 1
## Number of training samples:
## 320
## Shape of X_test (row, column):
## 80 2
## Shape of y_test (row, column) :
## 80 1
## Number of testing samples:
## 80</code></pre>
<p>Because neural nets are made up of a bunch of matrix multiplications, let’s convert our input and output from dataframes to matrices. While dataframes are a good way to represent data in a tabular form, we choose to convert to a matrix type because matrices are smaller than an equivalent dataframe and often speed up the computations.</p>
<p>We will also change the shape of <code>X</code> and <code>y</code> by taking the transpose. This will make the matrix calculations slightly more intuitive, as we’ll see in the second part. There’s really no difference though. Some of you might find this way better, while others might prefer the non-transposed way. I feel this way makes more sense.</p>
<p>We’re going to use the <code>as.matrix()</code> method to construct out matrix. We’ll fill out matrix row-by-row.</p>
<pre class="r"><code>X_train <- as.matrix(X_train, byrow=TRUE)
X_train <- t(X_train)
y_train <- as.matrix(y_train, byrow=TRUE)
y_train <- t(y_train)
X_test <- as.matrix(X_test, byrow=TRUE)
X_test <- t(X_test)
y_test <- as.matrix(y_test, byrow=TRUE)
y_test <- t(y_test)</code></pre>
<p>Here are the shapes of our matrices after taking the transpose.</p>
<pre><code>## Shape of X_train:
## 2 320
## Shape of y_train:
## 1 320
## Shape of X_test:
## 2 80
## Shape of y_test:
## 1 80</code></pre>
</div>
<div id="build-a-neural-net" class="section level2">
<h2>Build a neural-net</h2>
<p>Now that we’re done processing our data, let’s move on to building our neural net. As discussed above, we will broadly follow the steps outlined below.</p>
<ol style="list-style-type: decimal">
<li>Define the neural net architecture.</li>
<li>Initialize the model’s parameters from a random-uniform distribution.</li>
<li>Loop:
<ul>
<li>Implement forward propagation.</li>
<li>Compute loss.</li>
<li>Implement backward propagation to get the gradients.</li>
<li>Update parameters.</li>
</ul></li>
</ol>
<div id="get-layer-sizes" class="section level3">
<h3>Get layer sizes</h3>
<p>A neural network optimizes certain parameters to get to the right output. These parameters are initialized randomly. However, the size of these matrices depends upon the number of neurons in the different layers of the neural net.</p>
<p>To generate matrices with random parameters, we need to first obtain the size (number of neurons) of all the layers in our neural net. We’ll write a function to do that. Let’s denote <code>n_x</code>, <code>n_h</code>, and <code>n_y</code> as the number of neurons in the input layer, hidden layer, and output layer respectively.</p>
<p>We will obtain these shapes from our input and output data matrices created above.</p>
<p><code>dim(X)[1]</code> gives us <span class="math inline">\(2\)</span> because the shape of <code>X</code> is <code>(2, 320)</code>. We do the same for <code>dim(y)[1]</code>.</p>
<pre class="r"><code>getLayerSize <- function(X, y, hidden_neurons, train=TRUE) {
  n_x <- dim(X)[1]
  n_h <- hidden_neurons
  n_y <- dim(y)[1]

  size <- list("n_x" = n_x,
               "n_h" = n_h,
               "n_y" = n_y)

  return(size)
}</code></pre>
<p>As we can see below, the number of neurons is decided based on the shapes of the input and output matrices.</p>
<pre class="r"><code>layer_size <- getLayerSize(X_train, y_train, hidden_neurons = 4)
layer_size</code></pre>
<pre><code>## $n_x
## [1] 2
##
## $n_h
## [1] 4
##
## $n_y
## [1] 1</code></pre>
</div>
<div id="initialise-parameters" class="section level3">
<h3>Initialise parameters</h3>
<p>Before we start training our parameters, we need to initialize them. Let’s initialize the parameters from a random uniform distribution.</p>
<p>The function <code>initializeParameters()</code> takes as argument an input matrix and a list which contains the layer sizes i.e. number of neurons. The function returns the trainable parameters <code>W1, b1, W2, b2</code>.</p>
<p>Our neural net has 3 layers, which gives us 2 sets of parameters. The first set is <code>W1</code> and <code>b1</code>. The second set is <code>W2</code> and <code>b2</code>. Note that these parameters exist as matrices.</p>
<p>These random weights matrices <code>W1, b1, W2, b2</code> are created based on the layer sizes of the different layers (<code>n_x</code>, <code>n_h</code>, and <code>n_y</code>).</p>
<p>The sizes of these weights matrices are -</p>
<p><code>W1</code> = <code>(n_h, n_x)</code><br />
<code>b1</code> = <code>(n_h, 1)</code><br />
<code>W2</code> = <code>(n_y, n_h)</code><br />
<code>b2</code> = <code>(n_y, 1)</code></p>
<pre class="r"><code>initializeParameters <- function(X, list_layer_size){
  m <- dim(data.matrix(X))[2]

  n_x <- list_layer_size$n_x
  n_h <- list_layer_size$n_h
  n_y <- list_layer_size$n_y

  W1 <- matrix(runif(n_h * n_x), nrow = n_h, ncol = n_x, byrow = TRUE) * 0.01
  b1 <- matrix(rep(0, n_h), nrow = n_h)
  W2 <- matrix(runif(n_y * n_h), nrow = n_y, ncol = n_h, byrow = TRUE) * 0.01
  b2 <- matrix(rep(0, n_y), nrow = n_y)

  params <- list("W1" = W1,
                 "b1" = b1,
                 "W2" = W2,
                 "b2" = b2)

  return (params)
}</code></pre>
<p>For our network, the sizes of our weight matrices are as follows. Remember that the number of input neurons is <code>n_x = 2</code>, hidden neurons <code>n_h = 4</code>, and output neurons <code>n_y = 1</code>. <code>layer_size</code> is calculated above.</p>
<pre class="r"><code>init_params <- initializeParameters(X_train, layer_size)
lapply(init_params, function(x) dim(x))</code></pre>
<pre><code>## $W1
## [1] 4 2
##
## $b1
## [1] 4 1
##
## $W2
## [1] 1 4
##
## $b2
## [1] 1 1</code></pre>
</div>
<div id="define-the-activation-functions." class="section level3">
<h3>Define the Activation Functions.</h3>
<p>We implement the <code>sigmoid()</code> activation function for the output layer.</p>
<pre class="r"><code>sigmoid <- function(x){
return(1 / (1 + exp(-x)))
}</code></pre>
<p><span class="math display">\[S(x) = \frac {1} {1 + e^{-x}}\]</span></p>
<p>The <code>tanh()</code> function is already present in R.</p>
<p><span class="math display">\[T(x) = \frac {e^x - e^{-x}} {e^x + e^{-x}}\]</span></p>
<p>Here, we plot both activation functions side-by-side for comparison.
<img src="/post/2020-07-11-shallow-neural-net-from-scratch-using-r-part-1/index_files/figure-html/unnamed-chunk-23-1.png" width="672" /></p>
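<p>A few spot checks on the two activation functions make their output ranges concrete (using the <code>sigmoid()</code> implemented above and base R’s <code>tanh()</code>):</p>
<pre class="r"><code>sigmoid <- function(x){
  return(1 / (1 + exp(-x)))
}

sigmoid(0)            # 0.5: sigmoid is centred at 0.5
tanh(0)               # 0: tanh is centred at 0
sigmoid(c(-10, 10))   # very close to 0 and 1: outputs squashed into (0, 1)
tanh(c(-10, 10))      # very close to -1 and 1: outputs squashed into (-1, 1)</code></pre>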
</div>
<div id="forward-propagation" class="section level3">
<h3>Forward Propagation</h3>
<p>Now, onto defining the forward propagation. The function <code>forwardPropagation()</code> takes as arguments the input matrix <code>X</code>, the parameters list <code>params</code>, and the list of <code>layer_sizes</code>. We extract the layers sizes and weights from the respective functions defined above. To perform matrix multiplication, we use the <code>%*%</code> operator.</p>
<p>Before we perform the matrix multiplications, we need to reshape the parameters <code>b1</code> and <code>b2</code>. Why do we do this? Let’s find out. Note that, the parameter shapes are:</p>
<ul>
<li><code>W1</code>: <code>(4, 2)</code><br />
</li>
<li><code>b1</code>: <code>(4, 1)</code><br />
</li>
<li><code>W2</code>: <code>(1, 4)</code><br />
</li>
<li><code>b2</code> : <code>(1, 1)</code></li>
</ul>
<p>And the layers sizes are:</p>
<ul>
<li><code>n_x</code> = <code>2</code><br />
</li>
<li><code>n_h</code> = <code>4</code><br />
</li>
<li><code>n_y</code> = <code>1</code></li>
</ul>
<p>Finally, shape of input matrix <span class="math inline">\(X\)</span> (input layer):</p>
<ul>
<li><code>X</code>: <code>(2, 320)</code></li>
</ul>
<p>If we talk about the <strong>input => hidden</strong> step, the hidden layer obtained by the equation <code>A1 = activation(Z1)</code>, where <code>Z1 = W1 %*% X + b1</code>, would be as follows:</p>
<ul>
<li><p>For the matrix multiplication of <code>W1</code> and <code>X</code>, their shapes are correct by default: <code>(4, 2) x (2, 320)</code>. The shape of the output matrix <code>W1 %*% X</code> is <code>(4, 320)</code>.</p></li>
<li><p>Now, <code>b1</code> is of shape <code>(4, 1)</code>. Since <code>W1 %*% X</code> is of the shape <code>(4, 320)</code>, we need to repeat <code>b1</code> 320 times, once for each input sample. We do that using the command <code>rep(b1, m)</code>, where <code>m</code> is calculated as <code>dim(X)[2]</code>, which selects the second dimension of the shape of <code>X</code>.</p></li>
<li><p>The shape of <code>A1</code> is <code>(4, 320)</code>.</p></li>
</ul>
<p>In the case of <strong>hidden => output</strong>, the output obtained by the equation <code>Z2 = W2 %*% A1 + b2</code> would be as follows:</p>
<ul>
<li><p>The shapes of <code>W2</code> and <code>A1</code> are correct for us to perform matrix multiplication on them. <code>W2</code> is <code>(1, 4)</code> and <code>A1</code> is <code>(4, 320)</code>. The output <code>W2 %*% A1</code> has the shape <code>(1, 320)</code>. <code>b2</code> has a shape of <code>(1, 1)</code>. We will again repeat <code>b2</code> like we did above, so <code>b2</code> now becomes <code>(1, 320)</code>.</p></li>
<li><p>The shape of <code>A2</code> is now <code>(1, 320)</code>.</p></li>
</ul>
<p>We use the <code>tanh()</code> activation for the hidden layer and <code>sigmoid()</code> activation for the output layer.</p>
<pre class="r"><code>forwardPropagation <- function(X, params, list_layer_size){
  m <- dim(X)[2]
  n_h <- list_layer_size$n_h
  n_y <- list_layer_size$n_y

  W1 <- params$W1
  b1 <- params$b1
  W2 <- params$W2
  b2 <- params$b2

  # Repeat the bias vectors m times so they can be added element-wise
  b1_new <- matrix(rep(b1, m), nrow = n_h)
  b2_new <- matrix(rep(b2, m), nrow = n_y)

  Z1 <- W1 %*% X + b1_new
  A1 <- tanh(Z1)       # tanh activation for the hidden layer
  Z2 <- W2 %*% A1 + b2_new
  A2 <- sigmoid(Z2)    # sigmoid activation for the output layer

  cache <- list("Z1" = Z1,
                "A1" = A1,
                "Z2" = Z2,
                "A2" = A2)

  return (cache)
}</code></pre>
<p>Even though we only need the value <code>A2</code> for forward propagation, you’ll notice we return all the other calculated values as well. We do this because these values will be needed during backpropagation. Saving them here reduces the time backpropagation takes because we don’t have to calculate them again.</p>
<p>Another thing to notice is the <code>Z</code> and <code>A</code> of a particular layer will always have the same shape. This is because <code>A = activation(Z)</code> which does not change the shape of <code>Z</code>. An activation function only introduces non-linearity in a network.</p>
<pre class="r"><code>fwd_prop <- forwardPropagation(X_train, init_params, layer_size)
lapply(fwd_prop, function(x) dim(x))</code></pre>
<pre><code>## $Z1
## [1] 4 320
##
## $A1
## [1] 4 320
##
## $Z2
## [1] 1 320
##
## $A2
## [1] 1 320</code></pre>
</div>
</div>
<div id="end-of-part-1" class="section level2">
<h2>End of Part 1</h2>
<p>We have reached the end of Part 1. In the next and final part, we will implement backpropagation and evaluate our model. Stay tuned!</p>
</div>
R Package Integration with Modern Reusable C++ Code Using Rcpp - Part 2
https://rviews.rstudio.com/2020/07/14/r-package-integration-with-modern-reusable-c-code-using-rcpp-part-2/
Tue, 14 Jul 2020 00:00:00 +0000https://rviews.rstudio.com/2020/07/14/r-package-integration-with-modern-reusable-c-code-using-rcpp-part-2/
<p><em>Daniel Hanson is a full-time lecturer in the Computational Finance & Risk Management program within the Department of Applied Mathematics at the University of Washington. His appointment followed over 25 years of experience in private sector quantitative development in finance and data science.</em></p>
<p>In the <a href="https://rviews.rstudio.com/2020/07/08/r-package-integration-with-modern-reusable-c-code-using-rcpp">first post</a> in this series, we looked at configuring a Windows 10 environment for using the <code>Rcpp</code> package. However, what follows below, and going forward, is applicable to an up-to-date R, RStudio, and <code>Rcpp</code> configuration on any operating system.</p>
<p>Today, we will examine design considerations in integrating standard and portable C++ code in an R package, using <code>Rcpp</code> at the interface level alone. This will ensure no R-related dependencies are introduced into the C++ code base. In general, of course, best programming practices say we should strive to keep interface and implementation separate.</p>
<div id="design-considerations" class="section level2">
<h2>Design Considerations</h2>
<p>For this discussion, we will assume the package developer has access to a repository of standard C++ code that is intended for use with other mathematical or scientific applications and interfaces. The goal is to integrate this code into an R package, and then export functions to R that will use this existing C++ code. The end users need not be concerned that they are using C++ code; they will only see the exported functions that can be used and called like any other R function.</p>
<p>The package developer, at this stage, has two components that cannot communicate with each other, at least yet:</p>
<p><img src="Fig_1_R_and_Cpp_Only.png" alt = "R and C++ Components" height = "400" width="600"></p>
<div id="establishing-communication" class="section level3">
<h3>Establishing Communication</h3>
<p>This is where <code>Rcpp</code> comes in. We will create an interface layer that utilizes functions and objects in the <code>Rcpp</code> C++ namespace that facilitate communication between R and C++. This interface will ensure that no dependence on R or <code>Rcpp</code> is introduced into our reusable code base.</p>
<p><img src="Fig_2_R_RcppIF_Cpp.png" alt = "The Rcpp interface connects R and C++" height = "400" width="600"></p>
<p>The <code>Rcpp</code> namespace contains a treasure trove of functions and objects that abstract away the terse underlying C interface provided by R, making our job far less painful. At this initial stage, however, to keep the discussion focused on a basic interface example, we will limit our use of <code>Rcpp</code> functions to those that facilitate the transfer of <code>numeric</code> vector data from R into the workhorse STL container <code>std::vector<double></code>, which is of course ubiquitous in quantitative C++ code.</p>
<div id="tags-to-indicate-interface-functions" class="section level4">
<h4>Tags to Indicate Interface Functions</h4>
<p>C++ interface functions are indicated by a tag that needs to be placed just above the function name and signature. It is written as</p>
<p><strong><code>// [[Rcpp::export]]</code></strong></p>
<p>This tag instructs the package build process to export a function of the exact same name to R. As an interface function, it will take arguments from an R session, route them in a call to a function or class in the C++ code base, and then take the results that are returned and pass them back to the calling function in R.</p>
</div>
<div id="conversion-between-r-vectors-and-c-vectors" class="section level4">
<h4>Conversion between R Vectors and C++ Vectors</h4>
<p>The <code>Rcpp::NumericVector</code> class, as its name suggests, stores data taken from an R numeric vector, but what makes Rcpp even more powerful here is its inclusion of the C++ template function <code>Rcpp::as<T>(.)</code>. This function safely and efficiently copies the contents of an <code>Rcpp::NumericVector</code> to a <code>std::vector<double></code> object, as demonstrated in Figure 3, below.</p>
<p><em>Remark:</em> Rcpp also has the function <code>Rcpp::wrap(.)</code>, which copies values from an STL vector back into an <code>Rcpp::NumericVector</code> object, so that the results can then be returned to R; this function will be covered in the next article in this series.</p>
<p><img src="Fig_3_Export_Function.png" alt = "C++ interface function to be exported to R" height = "300" width="500"></p>
</div>
<div id="a-c-interface-function" class="section level4">
<h4>A C++ Interface Function</h4>
<p>Figure 3 shows a mythical <code>Rcpp</code> interface function at the top, called <code>fcn(.)</code>, and a function in our reusable standard C++ code base called <code>doSomething(.)</code>. Note first the tag that appears just above the interface function signature. It must be exactly one line above, and there must be a single space only between the second forward slash and the first left square bracket.</p>
<p>This interface function will be exported to R, where it can be called by the same function name, <code>fcn(.)</code>, taking in an R <code>numeric</code> vector input. The <code>Rcpp::NumericVector</code> object takes this data in as the input to C++. The contents are then transferred to a C++ <code>std::vector<double></code>, using the Rcpp template function <code>Rcpp::as<vector<double>>(.)</code>.</p>
<p>The data can then be passed to the <code>doSomething(.)</code> function in the standard C++ code base, as it is expecting a <code>std::vector<double></code> input. This function returns the C++ <code>double</code> variable <code>ret</code>, which is assigned to the variable <code>y</code> in the interface function. A C++ <code>double</code> requires no special conversion and can be passed directly back to R.</p>
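<p>Assembled in source form, the interface file that Figure 3 depicts might look roughly like this. This is a sketch only, using the post's placeholder names <code>fcn</code> and <code>doSomething</code>; it requires the <code>Rcpp</code> package, and would be compiled as part of the package build or interactively via <code>Rcpp::sourceCpp()</code>:</p>

```cpp
#include <Rcpp.h>
#include <vector>

// Declared here; defined in the reusable, standard C++ code base,
// which has no R or Rcpp dependencies.
double doSomething(const std::vector<double>& v);

// Rcpp interface layer: exported to R under the same name, fcn.
// [[Rcpp::export]]
double fcn(Rcpp::NumericVector x) {
    // Copy the R numeric vector into the STL container the
    // C++ code base expects:
    std::vector<double> v = Rcpp::as<std::vector<double>>(x);

    // Delegate the real work to the standard C++ code base:
    double y = doSomething(v);

    // A C++ double needs no special conversion on the way back to R:
    return y;
}
```

<p>From the R side, the call is then simply <code>fcn(c(1, 2, 3))</code>, just like any other R function.</p>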
</div>
</div>
<div id="putting-the-high-level-design-together" class="section level3">
<h3>Putting the High-Level Design Together</h3>
<p>With the C++ interface in place, an R user can call an R function that has been exported from C++. When the results are returned, they can be used in other R functions; where we get an extraordinarily complementary benefit, however, is with R’s powerful data visualization capabilities. Unlike languages such as Python, Java, or VB.NET, C++ does not have a standard GUI, but we can use cutting-edge R packages such as <code>ggplot2</code>, <code>plotly</code>, <code>shiny</code>, and <code>xts</code> – among many others – to generate a massive variety of plots and visualizations that are simply not available in other general-purpose languages.</p>
<p><img src="Fig_4_All_together_now.png" alt = "R and C++ Components" height = "400" width="600"></p>
</div>
</div>
<div id="next-steps" class="section level2">
<h2>Next Steps</h2>
<p>This concludes our discussion of high-level design considerations. Coming next, we will look at examples of writing actual interface functions to simple, but real, standard C++ functions and classes.</p>
</div>
R Package Integration with Modern Reusable C++ Code Using Rcpp
https://rviews.rstudio.com/2020/07/08/r-package-integration-with-modern-reusable-c-code-using-rcpp/
Wed, 08 Jul 2020 00:00:00 +0000https://rviews.rstudio.com/2020/07/08/r-package-integration-with-modern-reusable-c-code-using-rcpp/
<p><em>Daniel Hanson is a full-time lecturer in the Computational Finance & Risk Management program within the Department of Applied Mathematics at the University of Washington. His appointment followed over 25 years of experience in private sector quantitative development in finance and data science.</em></p>
<p>One of the most time-consuming, tedious, and thankless tasks a quantitative developer frequently confronts is writing interfaces to C++ code from applications such as Excel, Python, and R. Fortunately, this process has been made far less painful when interfacing to a front-end in R, thanks to the <a href="https://cran.r-project.org/web/packages/Rcpp/index.html"><code>Rcpp</code> package</a>.</p>
<p><code>Rcpp</code> was first developed by <a href="https://dirk.eddelbuettel.com/">Dirk Eddelbuettel</a> and <a href="https://github.com/romainfrancois">Romain Francois</a> about ten years ago. It has since evolved significantly under Dirk’s direction, with rapid development and with build and documentation tools that are now conveniently integrated into RStudio. <code>Rcpp</code> represents a major breakthrough in allowing a programmer to farm out computationally-intensive tasks to C++, return the results to R, and then use them as arguments to other R functions, including R’s powerful data visualization tools. Although there is quite a bit of documentation on using <code>Rcpp</code>, there does not seem to be much written, either in print or online, regarding C++ best practices.</p>
<p>This post is the first in a series in which I will address the best practice of keeping reusable, standard C++ code separate from the R interface in <code>Rcpp</code> implementations. I will also cover a subtheme that does not seem to get a lot of press, namely using <code>Rcpp</code> in a Windows environment. Linux of course is where the action is, and for good reason, but the reality remains that when a quant developer takes a new job, on the new hire’s desk will almost surely be a Windows notebook or desktop computer, and s/he will be expected to produce applications that can run on Windows machines used by managers and colleagues. So, we will first look at how to set up an <code>Rcpp</code> development environment on Windows 10, including some of the peculiarities and limitations involved, none of which will significantly prevent a developer from being highly productive in writing R packages with integrated reusable, standard C++ code.</p>
<h2 id="setting-up-on-windows-10">Setting up on Windows 10</h2>
<p>As with an <code>Rcpp</code> development environment on Linux or the Mac, in order to take advantage of the latest features in C++, as well as the convenient package build tools in RStudio, a Windows user will need the following:</p>
<ul>
<li>An R installation</li>
<li>A C++ compiler</li>
<li>RStudio</li>
</ul>
<p>For Windows users, however, the R interface is not compatible with Microsoft’s Visual Studio C++ compiler, due to R itself being built with the GNU gcc compiler. Fortunately, there is a very easy way to make one’s Windows environment compatible. This is where <a href="https://cran.r-project.org/bin/windows/Rtools/"><code>Rtools</code></a> comes in.</p>
<h3 id="rtools">Rtools</h3>
<p><code>Rtools</code> contains a collection of utilities and libraries for building R packages on Windows. More specifically, for integrating C++ code with <code>Rcpp</code>, beginning with version 4.0, Rtools now contains <a href="http://mingw-w64.org/">support for the gcc 8.3.0 compiler on Windows</a>. The <code>Rtools</code> installation provides reliable configuration of the gcc compiler that is quick and painless; however, it does come with a small compromise.</p>
<p>The gcc 8.3.0 compiler was released in February 2019 and has since been superseded by more recent stable releases (9.3 as of this writing). It is fully up to date with the C++14 specification, but it supports only a partial list of C++17 language and Standard Library features. Two specific items very useful for quant development are available:</p>
<ul>
<li>Special math functions (Bessel functions, Legendre polynomials, etc)</li>
<li>New types <code>std::variant</code>, <code>std::optional</code>, and <code>std::any</code></li>
</ul>
<p>However, the gcc compiler lacks the big daddy of them all, parallel STL algorithms. A comprehensive list of C++17 features supported in gcc 8.3.0 can be found <a href="https://gcc.gnu.org/onlinedocs/gcc-8.3.0/libstdc++/manual/manual/status.html#status.iso.2017">here</a>.</p>
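<p>As a quick sanity check of the new C++17 types listed above, the following standalone snippet (a hypothetical example, not from the post) compiles with gcc 8.3.0 under <code>-std=gnu++17</code>:</p>

```cpp
#include <cmath>
#include <optional>
#include <variant>

// std::optional (C++17): a return value that may legitimately be absent,
// e.g. the real square root of a negative number.
std::optional<double> safeSqrt(double x) {
    if (x < 0.0) return std::nullopt;
    return std::sqrt(x);
}

// std::variant (C++17): a type-safe union; here, a rate quoted either in
// whole basis points (int) or as a decimal level (double).
using RateQuote = std::variant<int, double>;

double asDecimal(const RateQuote& q) {
    if (std::holds_alternative<int>(q))
        return std::get<int>(q) / 10000.0;  // basis points -> decimal
    return std::get<double>(q);
}
```

<p>The parallel STL algorithms, by contrast, will not compile with this toolchain.</p>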
<p>Still, since <code>Rtools</code> saves Windows users considerable time and effort, both in configuring the gcc compiler and in integrating (mostly) modern C++ code into R packages, newcomers are advised to stick with <code>Rtools</code> and gcc 8.3.0.</p>
<p>It is <strong>very important</strong> to note that:</p>
<ul>
<li>You will need version 4.0.0 of R or above to install <code>Rtools</code> 4.0</li>
<li>Unless there is a compelling reason otherwise, you should install the 64-bit version of Rtools: <code>rtools40-x86_64.exe</code></li>
</ul>
<h3 id="configuration-in-rstudio">Configuration in RStudio</h3>
<p>After installing R and <code>Rtools</code>, as well as RStudio (v 1.3.959 or later recommended), you can now complete the configuration of your development environment for integrated R/C++ package development.</p>
<h4 id="enable-c-17">Enable C++17</h4>
<p>In order to use the C++17 features available in <code>Rtools</code>, as well as C++14, you need to change the settings in the <code>Makeconf</code> file, located under your R installation (not Rtools), as follows:</p>
<ul>
<li>In Windows File Explorer, go to your R installation and click down to the <code>.../R-4.0.0/etc/x64</code> subdirectory</li>
<li>Locate the <code>Makeconf</code> file, and copy it to a backup file (always a prudent idea), e.g. <code>Makeconf.bak</code></li>
<li>Using a standard text editor (e.g., <a href="https://notepad-plus-plus.org/">Notepad++</a>), open the <code>Makeconf</code> file</li>
<li>Locate the line <code>CXX = $(BINPREF)g++ -std=gnu++11 $(M_ARCH)</code></li>
<li>Change the <code>11</code> to <code>17</code>; viz., <code>CXX = $(BINPREF)g++ -std=gnu++17 $(M_ARCH)</code></li>
<li>Save the updated file.</li>
</ul>
<p><img src="MakeconfMod.png" alt = "Setting for C++17 in the Makeconfig file" height = "400" width="600"></p>
<h4 id="install-the-rcpp-package">Install the Rcpp Package</h4>
<p>Installing Rcpp is no different from any other typical R package. Either use the R command:</p>
<pre><code class="language-r">install.packages("Rcpp")
</code></pre>
<p>Or, use the <code>Tools/Install Packages...</code> selection in RStudio.</p>
<h2 id="next-steps">Next Steps</h2>
<p>After completing the above, you will now have a development environment ready to start designing and developing an R package containing standard and reusable C++ code. In the next post, we will first spend a little time examining design considerations, before we begin actual code implementation.</p>
Open-Source Authorship of Data Science in Education Using R
https://rviews.rstudio.com/2020/07/01/open-source-authorship-of-data-science-in-education-using-r/
Wed, 01 Jul 2020 00:00:00 +0000https://rviews.rstudio.com/2020/07/01/open-source-authorship-of-data-science-in-education-using-r/
<p><em>Joshua M. Rosenberg, Ph.D., is Assistant Professor of STEM Education at the
University of Tennessee, Knoxville.</em></p>
<p><img src="alex-ware.jpg" alt = "Photo by Alex Ware" height = "400" width="100%"></p>
<p>In earlier posts, we shared how we wrote <a href="http://datascienceineducation.com/"><em>Data Science in Education Using
R</em></a> as an open book
(<a href="https://rviews.rstudio.com/2020/05/26/community-and-collaboration-writing-our-book-in-the-open/">Post 1</a>,
<a href="https://rviews.rstudio.com/2020/06/11/learning-r-with-education-datasets/">Post 2</a>).
In this post, we describe what we consider to be the <em>open-source authorship</em>
process we took to write the book.</p>
<p>We think of open-source authorship as a broader—and perhaps better—term for
describing what authors of some open books undertake. In our characterization,
open-source authorship draws upon:</p>
<ul>
<li>parts of open-source software (OSS) values and tools</li>
<li>parts of open science that establish the importance of scholarly work beyond
original, discovery research</li>
<li>the values surrounding the creation of open educational resources (OER)</li>
</ul>
<p>We believe that combining elements from OSS, open science, and OER is notable
because while OSS and open science emphasize the sharing of technical work
(including technology and code) and OER emphasizes the sharing of resources,
technical books have not been as much a focus of the conversation. Moreover, the
way in which the conversation about open books has taken place in different
communities and contexts means that some books that are open do not fully
receive the attention (for being openly available) that they merit from those
interested in OER. This also might mean that those involved with OSS development
and open science may fail to recognize the creation of a book as a substantial
contribution.</p>
<p>In this way, we argue for open-source authorship as an important, new type of
work, one that we increasingly see by the authors of other books, especially in
the R community<sup class="footnote-ref" id="fnref:https-bookdown-o"><a href="#fn:https-bookdown-o">1</a></sup>
<sup class="footnote-ref" id="fnref:https-geocompr-r"><a href="#fn:https-geocompr-r">2</a></sup> <sup class="footnote-ref" id="fnref:http-adv-r-had-c"><a href="#fn:http-adv-r-had-c">3</a></sup>
<sup class="footnote-ref" id="fnref:https-r4ds-had-c"><a href="#fn:https-r4ds-had-c">4</a></sup>.</p>
<p>After describing how we wrote our book in an open way, we elaborate on these
ideas and draw connections to the process we undertook.</p>
<h2 id="how-we-wrote-data-science-in-education-using-r-as-an-open-book">How We Wrote Data Science in Education Using R as an Open Book</h2>
<p>Early in our process, we determined that we wanted to share the book in an open
way. Since we were using GitHub as a <a href="https://github.com/data-edu/data-science-in-education/">repository for the
book</a>, it was easy for
the contents of the book to be available for anyone to view–even before and as
the book was being written. Despite the benefits of using GitHub, GitHub can be
difficult to navigate for those who are unfamiliar with it, and so sharing the
book in a more widely-accessible way was also important. To do this, we used
<a href="https://bookdown.org/">{bookdown}</a> and <a href="https://www.netlify.com/">Netlify</a> to
share the book as a website. Additionally, we chose an easy-to-remember URL
(<a href="http://datascienceineducation.com/">http://datascienceineducation.com/</a>) to help others (and us!) to be able to
access it easily.</p>
<p>Making the book available for others to contribute to was important. Because we used GitHub, we were able to receive feedback at a very early stage on <a href="https://github.com/data-edu/data-science-in-education/issues/20">issues such as how we
referred to data (as data or
datum)</a>. Other
<a href="https://github.com/data-edu/data-science-in-education/issues/9">issues (by non-authors) raised questions about whether certain content was in
scope—such as content on
gradebooks</a>,
which we included a chapter on. We found that apart from the five of us as
authors, fifteen individuals made contributions, and another one hundred forty-four individuals starred
the
repository<sup class="footnote-ref" id="fnref:https-joshuamros"><a href="#fn:https-joshuamros">5</a></sup>.
Moreover, we received feedback through Twitter and an email account we created
for the book for those unfamiliar with GitHub (or Twitter) to be able to provide
feedback directly to us. In this way, making the book available for others to contribute to made the book better, and points to the importance of sharing work at more than one stage of the writing process.</p>
<p>Lastly, we shared products that could be seen as tangential to the book, but
which were important given its focus on data science and R. Namely, we created
an R package, <a href="https://data-edu.github.io/dataedu/">{dataedu}</a>, to accompany the
book. This package includes code to install the packages necessary to reproduce
the book as well as all of the data sets used in it. By doing so, we invited
others to contribute to the book in ways not related to its prose. This also led
to (pleasantly) surprising contributions, including the creation of <a href="https://colab.research.google.com/drive/1f7CpetOWP9T2XaJCNrcwWj3CMKsQNmtw">an IPython Notebook with Python code that carries out steps comparable to those in a walkthrough chapter of our book</a>.</p>
<p>Collectively, these practices—involving not only making the book open, but also
planning for others to contribute and creating other, shared (open) products—
comprise what we think of as the results of open-source authorship.</p>
<h2 id="drawing-inspiration-from-other-related-ideas-and-efforts">Drawing Inspiration from Other, Related Ideas and Efforts</h2>
<p>Originally a niche effort, open-source software (OSS) and OSS development are
(likely not to the surprise of R users!) now widespread
<sup class="footnote-ref" id="fnref:https-books-goog"><a href="#fn:https-books-goog">6</a></sup>.
There are some insights that can be gained from efforts to understand how OSS
development proceeds. For example, in foundational work, Mockus et al. found
that OSS is often characterized by a core group of 10-15 individuals
contributing around 80% of the code, but that a group around one order of magnitude
larger than that core will repair specific problems, and a group another order
of magnitude larger will report
issues<sup class="footnote-ref" id="fnref:https-dl-acm-org"><a href="#fn:https-dl-acm-org">7</a></sup>; proportions
(generally) similar to those we found for those who contributed to our book.</p>
<p>Second, open science is both a perspective on how science should operate and
a set of practices that reflect this perspective
<sup class="footnote-ref" id="fnref:https-www-nap-ed"><a href="#fn:https-www-nap-ed">8</a></sup>
<sup class="footnote-ref" id="fnref:https-journals-s"><a href="#fn:https-journals-s">9</a></sup>. Related to
open science are open scholarly practices. Others trace the origin of the idea of
open scholarly practices to <a href="https://eric.ed.gov/?id=ED326149">a book by Boyer</a>,
who shared a broad description of intellectual (especially academic) work. This suggests that
scholarly work is not only original, discovery research; it also includes the
applications of advances in one’s own discipline (or “translational research”)
and sharing the results of research with multiple stakeholders. Open science and
open scholarly practices point to the scientific or scholarly contributions of
open books; while different from original, scientific research, books such as
our own—which focused on providing a language for data science in education—may
serve as helpful examples (of open science) or forms of a broader view of
scholarship.</p>
<p>Last, OER are “teaching, learning, and research resources that reside in the
public domain or have been released under an intellectual property license that
permits their free use and re-purposing by others”
<sup class="footnote-ref" id="fnref:https-hewlett-or"><a href="#fn:https-hewlett-or">10</a></sup>. These resources range from
courses and books to tests and technologies. By being open, they are not only
available to others to use, but also to reuse, redistribute (or share), revise
(adapt or change the work), and remix (combining existing resources to create a
new one)
<sup class="footnote-ref" id="fnref:https-www-tandfo"><a href="#fn:https-www-tandfo">11</a></sup>.
OER can serve as an inspiration for authors of open books, especially those who
see their books as being used to teach and learn from. At the moment, OER and
traditional publishing modes are largely separate: For most books that are
published, the publisher retains the copyright, and authors are typically not
allowed to share their book in the open, though this may be changing. Many
authors of books about R have negotiated with their publisher to share their
books in the open (often only as a website, as we have) in addition to sharing
them through print and e-book formats. In addition, a number of platforms for
creating books that are OER are emerging; one example is <a href="https://edtechbooks.org">EdTech
Books</a>. There are increasing conversations related to
making materials, resources, and even education as an enterprise more open; OER
may be an area in which authors of books about R and other technical books can
both learn from the work of authors as well as advance the conversation.</p>
<h2 id="fin"><em>fin</em></h2>
<p>This post was an effort to step back from what we did to write our book to
reflect on what we meant by open-source authorship and to attempt to situate what
we did (and what others have done) in broader conversations about OSS, open
science, and OER. In this open mode, we invite others to revise or remix these
ideas to advance other, new forms of authorship of books.</p>
<p>You can reach us on Twitter: Emily <a href="https://twitter.com/ebovee09">@ebovee09</a>,
Jesse <a href="https://twitter.com/kierisi">@kierisi</a>, Joshua
<a href="https://twitter.com/jrosenberg6432">@jrosenberg6432</a>, Isabella
<a href="https://twitter.com/ivelasq3">@ivelasq3</a>, and me
<a href="https://twitter.com/RyanEs">@RyanEs</a>.</p>
<p>See you in two weeks for our next post! Josh, with help from Ryan, Emily, Jesse,
Joshua, and Isabella</p>
<ul>
<li><p><em>Ryan A. Estrellado is a public education leader and data scientist helping
administrators use practical data analysis to improve the student
experience.</em></p></li>
<li><p><em>Emily A. Bovee, Ph.D., is an educational data scientist working in dental
education.</em></p></li>
<li><p><em>Jesse Mostipak, M.Ed., is a community advocate, Kaggle educator, and data
scientist.</em></p></li>
<li><p><em>Isabella C. Velásquez, MS, is a data analyst committed to nonprofit work
with the aim of reducing racial and socioeconomic inequities.</em></p></li>
</ul>
<div class="footnotes">
<hr />
<ol>
<li id="fn:https-bookdown-o"><a href="https://bookdown.org/yihui/rmarkdown/">https://bookdown.org/yihui/rmarkdown/</a> <a class="footnote-return" href="#fnref:https-bookdown-o">↩</a></li>
<li id="fn:https-geocompr-r"><a href="https://geocompr.robinlovelace.net/">https://geocompr.robinlovelace.net/</a> <a class="footnote-return" href="#fnref:https-geocompr-r">↩</a></li>
<li id="fn:http-adv-r-had-c"><a href="http://adv-r.had.co.nz/">http://adv-r.had.co.nz/</a> <a class="footnote-return" href="#fnref:http-adv-r-had-c">↩</a></li>
<li id="fn:https-r4ds-had-c"><a href="https://r4ds.had.co.nz/">https://r4ds.had.co.nz/</a> <a class="footnote-return" href="#fnref:https-r4ds-had-c">↩</a></li>
<li id="fn:https-joshuamros"><a href="https://joshuamrosenberg.com/posts/data-science-in-education-using-r-by-and-beyond-the-numbers/">https://joshuamrosenberg.com/posts/data-science-in-education-using-r-by-and-beyond-the-numbers/</a> <a class="footnote-return" href="#fnref:https-joshuamros">↩</a></li>
<li id="fn:https-books-goog"><a href="https://books.google.com/books?hl=en&lr=&id=bjMsCKvV9I4C&oi=fnd&pg=PR5&dq=DIBONA,+C.,+OCKMAN,+S.,+AND+STONE,+M.+1999.+Open+Sources:+Voices+from+the+Open+Source+Revolution.+O%E2%80%99Reilly,+Sebastopol,+Calif.&ots=D_l_LXcDtB&sig=zu1hkYJlSrqCUaxe3nYbProHlg8">https://books.google.com/books?hl=en&lr=&id=bjMsCKvV9I4C&oi=fnd&pg=PR5&dq=DIBONA,+C.,+OCKMAN,+S.,+AND+STONE,+M.+1999.+Open+Sources:+Voices+from+the+Open+Source+Revolution.+O%E2%80%99Reilly,+Sebastopol,+Calif.&ots=D_l_LXcDtB&sig=zu1hkYJlSrqCUaxe3nYbProHlg8</a> <a class="footnote-return" href="#fnref:https-books-goog">↩</a></li>
<li id="fn:https-dl-acm-org"><a href="https://dl.acm.org/doi/abs/10.1145/567793.567795">https://dl.acm.org/doi/abs/10.1145/567793.567795</a> <a class="footnote-return" href="#fnref:https-dl-acm-org">↩</a></li>
<li id="fn:https-www-nap-ed"><a href="https://www.nap.edu/catalog/25116/open-science-by-design-realizing-a-vision-for-21st-century">https://www.nap.edu/catalog/25116/open-science-by-design-realizing-a-vision-for-21st-century</a> <a class="footnote-return" href="#fnref:https-www-nap-ed">↩</a></li>
<li id="fn:https-journals-s"><a href="https://journals.sagepub.com/doi/full/10.1177/2332858418787466">https://journals.sagepub.com/doi/full/10.1177/2332858418787466</a> <a class="footnote-return" href="#fnref:https-journals-s">↩</a></li>
<li id="fn:https-hewlett-or"><a href="https://hewlett.org/strategy/open-education/">https://hewlett.org/strategy/open-education/</a> <a class="footnote-return" href="#fnref:https-hewlett-or">↩</a></li>
<li id="fn:https-www-tandfo"><a href="https://www.tandfonline.com/doi/full/10.1080/02680510903482132?casa_token=S0sRaVJZiA4AAAAA%3ABO-fx7uNOQoNEdXl5-aQ8ooYpfTFohZdefU-ZJROwFDo3XL-W2oAbaOb3Un_DwRItNN4gj8eBXUo9A">https://www.tandfonline.com/doi/full/10.1080/02680510903482132?casa_token=S0sRaVJZiA4AAAAA%3ABO-fx7uNOQoNEdXl5-aQ8ooYpfTFohZdefU-ZJROwFDo3XL-W2oAbaOb3Un_DwRItNN4gj8eBXUo9A</a> <a class="footnote-return" href="#fnref:https-www-tandfo">↩</a></li>
</ol>
</div>
May 2020: "Top 40" New CRAN Packages
https://rviews.rstudio.com/2020/06/24/may-2020-top-40-new-cran-packages/
Wed, 24 Jun 2020 00:00:00 +0000https://rviews.rstudio.com/2020/06/24/may-2020-top-40-new-cran-packages/
<p>One hundred eighty-four new packages stuck to CRAN in May. The following are my “Top 40” picks in eleven categories: Data, Finance, Genomics, Marketing, Machine Learning, Medicine, Science, Statistics, Time Series, Utilities, and Visualization.</p>
<h3 id="data">Data</h3>
<p><a href="https://CRAN.R-project.org/package=covid19nytimes">covid19nytimes</a> v0.1.3: Provides access to the NY Times Covid-19 <a href="https://www.nytimes.com/article/coronavirus-county-data-us.html">county-level data</a> for the US, which is also available <a href="https://github.com/nytimes/covid-19-data">here</a>. There is a <a href="https://cran.r-project.org/web/packages/covid19nytimes/vignettes/ny-times-bubble-map.html">vignette</a>.</p>
<p><img src="covid19nytimes.png" height = "400" width="600"></p>
<p><a href="https://CRAN.R-project.org/package=geodaData">geodaData</a> v0.1.0: Contains small spatial datasets used to teach basic spatial analysis concepts. Datasets are based on the <a href="https://geodacenter.github.io/data-and-lab/">GeoDa</a> software workbook and data site.</p>
<p><a href="https://CRAN.R-project.org/package=GermaParl">GermaParl</a> v1.4.2: Provides access to the <a href="http://www.lrec-conf.org/proceedings/lrec2018/pdf/1024.pdf">GermaParl</a> corpus of parliamentary debates of the German Bundestag maintained by the <a href="https://polmine.github.io/">PolMine Project</a>. The <a href="https://cran.r-project.org/web/packages/GermaParl/vignettes/GermaParl.html">vignette</a> introduces the corpus and package.</p>
<p><a href="https://CRAN.R-project.org/package=nhlapi">nhlapi</a> v0.1.2: Retrieves and processes the data exposed by the open <a href="https://github.com/dword4/nhlapi">NHL API</a>, including information on players, teams, games, tournaments, drafts, standings, schedules and other endpoints. There are vignettes on a <a href="https://cran.r-project.org/web/packages/nhlapi/vignettes/low_level_api.html">Low-level API</a>, <a href="https://cran.r-project.org/web/packages/nhlapi/vignettes/nhl_players_api.html">Retrieving Player Data</a>, and <a href="https://cran.r-project.org/web/packages/nhlapi/vignettes/nhl_teams_api.html">Retrieving Team Data</a>.</p>
<p><a href="https://CRAN.R-project.org/package=polAr">polAr</a> v0.1.3: Implements a toolbox for the analysis of political and electoral data from Argentina. There are vignettes on <a href="https://cran.r-project.org/web/packages/polAr/vignettes/compute.html">Computing</a>, <a href="https://cran.r-project.org/web/packages/polAr/vignettes/data.html">Data Access</a>, and <a href="https://cran.r-project.org/web/packages/polAr/vignettes/results.html">Displaying Results</a>.</p>
<p><a href="https://cran.r-project.org/package=rKolada">rKolada</a> v0.1.3: Provides methods for downloading and processing data and metadata from <a href="https://www.kolada.se/">Kolada</a>, the official Swedish regions and municipalities database. There is an <a href="https://cran.r-project.org/web/packages/rKolada/vignettes/introduction-to-rkolada.html">Introduction</a> and a <a href="https://cran.r-project.org/web/packages/rKolada/vignettes/quickstart-rkolada.html">Quick Start Guide</a>.</p>
<p><img src="rKolada.png" height = "400" width="600"></p>
<h3 id="finance">Finance</h3>
<p><a href="https://cran.r-project.org/package=strand">strand</a> v0.1.3: Provides a framework for performing discrete (share-level) simulations of investment strategies. Simulated portfolios optimize exposure to an input signal subject to constraints such as position size and factor exposure. The vignette on <a href="https://cran.r-project.org/web/packages/strand/vignettes/strand.html">Backtesting with strand</a> is nicely done.</p>
<p><img src="strand.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=TwitterAutomatedTrading">TwitterAutomatedTrading</a> v0.1.0: Provides access to the <a href="https://www.metatrader5.com/en">MetaTrader 5</a> platform, enabling users to carry out automated trading using sentiment indexes computed from Twitter and/or <a href="https://stocktwits.com/">StockTwits</a>. See <a href="https://repositorio.ufpb.br/jspui/handle/123456789/15198">Godeiro (2018)</a> for background, and the <a href="https://cran.r-project.org/web/packages/TwitterAutomatedTrading/vignettes/TwitterAutomatedTrading.html">vignette</a> for how to use the package.</p>
<h3 id="genomics">Genomics</h3>
<p><a href="https://cran.r-project.org/package=immunarch">immunarch</a> v0.6.5: Provides a framework for bioinformatics exploratory analysis of bulk and single-cell T-cell receptor and antibody repertoires that includes data loading, analysis and visualization for bulk and single-cell AIRR (Adaptive Immune Receptor Repertoire) data. There is an <a href="https://cran.r-project.org/web/packages/immunarch/vignettes/v1_introduction.html">Introduction</a> and a vignette on <a href="https://cran.r-project.org/web/packages/immunarch/vignettes/v2_data.html">Working with Data</a>.</p>
<p><a href="https://CRAN.R-project.org/package=SubtypeDrug">SubtypeDrug</a> v0.1.0: Implements a tool to prioritize cancer subtype-specific drugs by integrating genetic perturbation, drug action, biological pathway, and cancer subtype. See <a href="https://academic.oup.com/bioinformatics/article-abstract/36/7/2303/5671692?redirectedFrom=fulltext">Han et al. (2019)</a> for background and the <a href="https://cran.r-project.org/web/packages/SubtypeDrug/vignettes/vignette.html">vignette</a> for details on the package.</p>
<p><img src="SubtypeDrug.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=TransPhylo">TransPhylo</a> v1.4.4: Provides functions to reconstruct infectious disease transmission using genomic data. See <a href="https://academic.oup.com/mbe/article/31/7/1869/2925708">Didelot et. al (2014)</a> and <a href="https://academic.oup.com/mbe/article/34/4/997/2919386">Didelot et. al (2017)</a> for background. See the <a href="https://cran.r-project.org/web/packages/TransPhylo/vignettes/TransPhylo.html">Introduction</a> and the vignettes: <a href="https://cran.r-project.org/web/packages/TransPhylo/vignettes/infer.html">Inference of transmission tree from a dated phylogeny</a>, <a href="https://cran.r-project.org/web/packages/TransPhylo/vignettes/multitree.html">Simultaneous Inference of Multiple Transmission Trees</a> and <a href="https://cran.r-project.org/web/packages/TransPhylo/vignettes/simulate.html">Simulation of outbreak data</a>.</p>
<p><img src="TransPhylo.png" height = "400" width="600"></p>
<h3 id="marketing">Marketing</h3>
<p><a href="https://cran.r-project.org/package=CLVTools">CLVTools</a> v0.5.0: Implements various probabilistic latent customer attrition models for non-contractual settings (e.g., retail business) with and without time-invariant and time-varying covariates. See <a href="https://pubsonline.informs.org/doi/abs/10.1287/mnsc.33.1.1">Schmittlein et al. (1987)</a> and <a href="https://journals.sagepub.com/doi/10.1509/jmkr.2005.42.4.415">Fader et al. (2005)</a> for background and the <a href="https://cran.r-project.org/web/packages/CLVTools/vignettes/CLVTools.pdf">vignette</a> to get started.</p>
<p><img src="CLVTools.png" height = "400" width="600"></p>
<p><a href="https://CRAN.R-project.org/package=grizbayr">grizbayr</a> v1.2.2: Provides functions to implement Bayesian A / B and Bandit marketing tests. See <a href="http://cdn2.hubspot.net/hubfs/310840/VWO_SmartStats_technical_whitepaper.pdf">Stucchio (2015)</a> for background and the <a href="https://cran.r-project.org/web/packages/grizbayr/vignettes/intro.html">vignette</a> to get started.</p>
<h3 id="machine-learning">Machine Learning</h3>
<p><a href="https://cran.r-project.org/package=applicable">applicable</a> v0.0.1.1: Provides functions that measure the amount of extrapolation new samples can have from the training set which are based on the concept of applicability domains. See <a href="https://journals.sagepub.com/doi/10.1177/026119290503300209">Netzeva et al (2005)</a>. There are vignettes for <a href="https://cran.r-project.org/web/packages/applicable/vignettes/binary-data.html">binary</a> and <a href="https://cran.r-project.org/web/packages/applicable/vignettes/continuous-data.html">continuous</a> data.</p>
<p><a href="https://CRAN.R-project.org/package=piRF">piRF</a> v0.1.0: Implements multiple state-of-the-art prediction interval methodologies for random forests including quantile regression intervals, out-of-bag intervals, bag-of-observations intervals, one-step boosted random forest intervals, bias-corrected intervals, high-density intervals, and split-conformal intervals. Look <a href="https://github.com/chancejohnstone/piRF">here</a> for an example.</p>
<p><img src="piRF.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=rules">rules</a> v0.0.2: Provides bindings that allow <a href="https://projecteuclid.org/euclid.aoas/1223908046">prediction rule ensembles</a>, <a href="https://www.rulequest.com/see5-unix.html#:~:text=C5.,chosen%20as%20the%20final%20prediction.">C5.0 rules</a>, and <a href="https://link.springer.com/book/10.1007%2F978-1-4614-6849-3">Cubist</a> to be used with the <a href="https://CRAN.R-project.org/package=parsnip">parsnip</a> package.</p>
<h3 id="medicine">Medicine</h3>
<p><a href="https://CRAN.R-project.org/package=AdhereRViz">AdhereRViz</a> v0.1.0: Implements a Shiny based GUI to the <a href="https://CRAN.R-project.org/package=AdhereR">AdhereR</a> package to allow users to access different data sources, explore patterns of medication, and compute various measures of adherence. See the <a href="https://cran.r-project.org/web/packages/AdhereRViz/vignettes/adherer_interctive_plots.html">vignette</a> for details.</p>
<p><img src="AdhereRViz.jpeg" height = "400" width="600"></p>
<p><a href="https://CRAN.R-project.org/package=MrSGUIDE">MrSGUIDE</a> v0.1.1: provides functions to facilitate subgroup analysis for single and multiple responses in both randomized trials and observational studies based on the <a href="http://pages.stat.wisc.edu/~loh/guide.html">GUIDE</a> algorithm. See the <a href="https://cran.r-project.org/web/packages/MrSGUIDE/vignettes/UsageOfMrSGUIDE.html">Vignette</a>.</p>
<p><img src="guide.png" height = "200" width="400"></p>
<h3 id="science">Science</h3>
<p><a href="https://CRAN.R-project.org/package=ldsr">ldsr</a> v0.0.2: Provides functions to reconstruct streamflow and climate information using linear dynamical systems. See <a href="https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/2017WR022114">Nguyen and Galelli (2018)</a> for background and the <a href="https://cran.r-project.org/web/packages/ldsr/vignettes/ldsr.html">vignette</a> for examples.</p>
<p><img src="ldsr.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=rties">rties</a> v5.0.0: Provides tools for investigating temporal processes in bivariate (e.g., dyadic) systems. The theoretical background can be found in <a href="https://journals.sagepub.com/doi/10.1177/1088868311411164">Butler (2011)</a> and <a href="https://journals.lww.com/psychosomaticmedicine/Abstract/2019/10000/Quantifying_Interpersonal_Dynamics_for_Studying.10.aspx">Butler & Barnard (2019)</a>. There is an <a href="https://cran.r-project.org/web/packages/rties/vignettes/overview_data_prep_V05.html">Overview</a>, and vignettes on <a href="https://cran.r-project.org/web/packages/rties/vignettes/inertia_coordination_V05.html">Intertia Coordination</a>, <a href="https://cran.r-project.org/web/packages/rties/vignettes/overview_data_prep_V05.html">Data Preparation</a>, and <a href="https://cran.r-project.org/web/packages/rties/vignettes/sysVar_inOut_V05.html">System Varibles</a>.</p>
<p><img src="rties.png" height = "400" width="600"></p>
<h3 id="statistics">Statistics</h3>
<p><a href="https://CRAN.R-project.org/package=Compack">Compack</a> v0.1.0: Implements regression methodologies with compositional covariates, including sparse log-contrast regression with compositional covariates proposed by <a href="https://academic.oup.com/biomet/article-abstract/101/4/785/1775476?redirectedFrom=fulltext">Lin et al. (2014)</a>, and sparse log-contrast regression with functional compositional predictors proposed by <a href="https://arxiv.org/abs/1808.02403">Sun et al. (2020)</a>. There is a <a href="https://cran.r-project.org/web/packages/Compack/vignettes/Introduction_to_Compack_package.html">vignette</a>.</p>
<p><img src="Compack.png" height = "400" width="600"></p>
<p><a href="https://CRAN.R-project.org/package=ghypernet">ghypernet</a> v1.0.0: Provides functions for model fitting and selection of generalized hypergeometric ensembles of random graphs (gHypEG). The package is based on the research by Casiraghi and collaborators. For example, see <a href="https://arxiv.org/abs/1607.02441">Casiraghi et al. (2016)</a>, <a href="https://arxiv.org/abs/1702.02048">Casiraghi (2017)</a> and <a href="https://arxiv.org/abs/1810.06495">Casiraghi and Nanumyan (2018)</a>. There is an <a href="https://cran.r-project.org/web/packages/ghypernet/vignettes/Tutorial_NRM.html">Introduction</a>, a short <a href="https://cran.r-project.org/web/packages/ghypernet/vignettes/tutorial.html">Tutorial</a>, and a vignette on <a href="https://cran.r-project.org/web/packages/ghypernet/vignettes/Significantlinks.html">Finding Significant Links</a>.</p>
<p><a href="https://CRAN.R-project.org/package=motifcluster">motifcluster</a> v0.1.0: Provides tools for spectral clustering of weighted directed networks using motif adjacency matrices. These methods, which perform well on large and sparse networks, are based on the methodology described in <a href="https://arxiv.org/abs/2004.01293">Underwood et al. (2020)</a>. See the <a href="https://cran.r-project.org/web/packages/motifcluster/vignettes/motifcluster_vignette.pdf">vignette</a>.</p>
<p><a href="https://CRAN.R-project.org/package=regmedint">regmedint</a> v0.1.0: Implements the regression-based causal mediation analysis with a treatment-mediator interaction term, as originally implemented in the <code>SAS</code> macro described in <a href="https://doi.apa.org/fulltext/2013-03476-001.html">Valeri and VanderWeele (2013)</a> and <a href="https://journals.lww.com/epidem/Fulltext/2015/03000/SAS_Macro_for_Causal_Mediation_Analysis_with.32.aspx">Valeri and VanderWeele (2015)</a>. There is an <a href="https://cran.r-project.org/web/packages/regmedint/vignettes/vig_01_introduction.html">Introduction</a>, and vignettes on <a href="https://cran.r-project.org/web/packages/regmedint/vignettes/vig_02_formulas.html">Implementing Formulas</a>, <a href="https://cran.r-project.org/web/packages/regmedint/vignettes/vig_03_bootstrap.html">Bootstrapping</a>, and <a href="https://cran.r-project.org/web/packages/regmedint/vignettes/vig_04_mi.html">Multiple Imputation</a>.</p>
<h3 id="time-series">Time Series</h3>
<p><a href="https://CRAN.R-project.org/package=DeCAFS">DeCAFS</a> v3.1.5: Provides functions to detect abrupt changes in time series with local fluctuations as a random walk process and autocorrelated noise as an AR(1) process. See <a href="https://arxiv.org/abs/2005.01379">Romano et al. (2020)</a> for the theory.</p>
<p><a href="https://cran.r-project.org/package=Rdrw">Rdrw</a> v1.0.1: Provides functions to fit and simulate a univariate or multivariate damped random walk process (also known as an Ornstein-Uhlenbeck process or a continuous-time autoregressive model of the first order) which is suitable for analyzing time series data with irregularly-spaced observation times and heteroscedastic measurement errors. See <a href="https://arxiv.org/abs/2005.08049">Hu and Tak (2020)</a> for background.</p>
<p><a href="https://cran.r-project.org/package=statespacer">statespacer</a> v0.1.0: Provides functions for estimating time series using the state space method. For background see <a href="https://www.jstatsoft.org/issue/view/v041">JSS Vol 41</a>. The package has an <a href="https://cran.r-project.org/web/packages/statespacer/vignettes/intro.html">Introduction</a>, a <a href="https://cran.r-project.org/web/packages/statespacer/vignettes/dictionary.html">Dictionary</a> for the model object and vignettes on <a href="https://cran.r-project.org/web/packages/statespacer/vignettes/boxjenkins.html">Fitting and ARIMA Model</a>, an <a href="https://cran.r-project.org/web/packages/statespacer/vignettes/seatbelt.html">Example</a> and on <a href="https://cran.r-project.org/web/packages/statespacer/vignettes/selfspec.html">Specifying a new model component</a>.</p>
<p><img src="statespacer.png" height = "400" width="600"></p>
<h3 id="utilities">Utilities</h3>
<p><a href="https://CRAN.R-project.org/package=almanac">almanac</a> v0.1.1: Provides tools for implementing recurrence rules, i.e. functions for defining recurring events. There is an <a href="https://cran.r-project.org/web/packages/almanac/vignettes/almanac.html">Introduction</a> and vignettes on <a href="https://cran.r-project.org/web/packages/almanac/vignettes/adjust-and-shift.html">Adjusting and Shifting Dates</a>, <a href="https://cran.r-project.org/web/packages/almanac/vignettes/icalendar.html">iCalendar Specification</a>, and <a href="https://cran.r-project.org/web/packages/almanac/vignettes/quarterly.html">Quarterly Rules</a>.</p>
<p><a href="https://cran.r-project.org/package=gdiff">gdiff</a> v0.2-1: Provides functions for performing graphical difference testing. Look <a href="https://stattech.wordpress.fos.auckland.ac.nz/2020/01/06/2020-01-visual-testing-for-graphics-in-r/">here</a> for more information.</p>
<p><a href="https://CRAN.R-project.org/package=i2dash">i2dash</a> v0.2.1: Provides functions for creating web-based dashboards. See the <a href="https://cran.r-project.org/web/packages/i2dash/vignettes/i2dash-intro.html">vignette</a>.</p>
<p><a href="https://CRAN.R-project.org/package=pkgndep">pkgndep</a> v1.0.0: Provides functions to check and visualize the “heaviness” of <code>R</code> packages. See the <a href="https://cran.r-project.org/web/packages/pkgndep/vignettes/pkgndep.html">vignette</a>.</p>
<p><img src="pkgndep.png" height = "400" width="600"></p>
<p><a href="https://CRAN.R-project.org/package=presser">presser</a> v1.0.0: Implements the <a href="https://httpbin.org/">httpbin.org</a> web service and functions to test web clients without using the internet.</p>
<p><a href="https://CRAN.R-project.org/package=stringfish">stringfish</a> v0.12.1: Implements a framework for performing string and sequence operations using the alt-rep system to speed up the computation of common string operations. See the <a href="https://cran.r-project.org/web/packages/stringfish/vignettes/vignette.html">vignette</a>.</p>
<p><img src="stringfish.png" height = "400" width="600"></p>
<p><a href="https://CRAN.R-project.org/package=worcs">worcs</a> Implements the Workflow for Open Reproducible Code in Science, <a href="https://osf.io/zcvbs/">WORCS</a>. There is an <a href="https://cran.r-project.org/web/packages/worcs/vignettes/workflow.html">Introduction</a>, and vignettes on <a href="https://cran.r-project.org/web/packages/worcs/vignettes/citation.html">citing</a>, <a href="https://cran.r-project.org/web/packages/worcs/vignettes/git_cloud.html">git_cloud</a>, and <a href="https://cran.r-project.org/web/packages/worcs/vignettes/setup.html">setup</a>.</p>
<h3 id="visualization">Visualization</h3>
<p><a href="https://cran.r-project.org/package=ggpacman">ggpacman</a> v0.1.0: Reproduces the game Pac-Man using <code>ggplot2</code> and <code>gganimate</code>. Look <a href="https://github.com/mcanouil/ggpacman">here</a> for more information.</p>
<p><img src="ggpacman.gif" height = "400" width="600"></p>
<p><a href="https://CRAN.R-project.org/package=iNZightTS">iNZightTS</a> v1.5.2: Provides tools for working with time series data, including functions for drawing, decomposing, and forecasting, comparing multiple series, and fitting both additive and multiplicative models. Look <a href="https://www.stat.auckland.ac.nz/~wild/iNZight/">here</a> for more information.</p>
<p><img src="iNZightTS.gif" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=prismadiagramR">prismadiagramR</a> v1.0.0: Provides functions to create <a href="http://prisma-statement.org/">PRISMA</a> diagrams used to track the identification, screening, eligibility, and inclusion of studies in a systematic review. See the <a href="https://cran.r-project.org/web/packages/prismadiagramR/vignettes/PRISMA.html">vignette</a>.</p>
<p><img src="prism.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=sketcher">sketcher</a> v0.1.3: Implements image processing effects that convert a photo into a line drawing image. See <a href="https://psyarxiv.com/svmw5/">Tsuda (2020)</a> for background and look <a href="https://htsuda.net/sketcher/">here</a> for examples.</p>
<p><img src="sketcher.png" height = "400" width="600"></p>
<p><a href="https://CRAN.R-project.org/package=upsetjs">upsetjs</a> v1.3.1: Provides an <code>htmlwidget</code> wrapper for the JavaScript <code>UpSet.js</code> library. There is an <a href="https://cran.r-project.org/web/packages/upsetjs/vignettes/basic.html">Introduction</a>, and vignettes on <a href="https://cran.r-project.org/web/packages/upsetjs/vignettes/colors.html">Coloring</a>, <a href="https://cran.r-project.org/web/packages/upsetjs/vignettes/combinationModes.html">Combination Modes</a>, and <a href="https://cran.r-project.org/web/packages/upsetjs/vignettes/venn.html">Venn and Euler Diagrams</a>.</p>
<p><img src="upsetjs.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=xaringanthemer">xaringanthemer</a> v0.3.0: Provides functions to create custom <code>CSS</code> themes. There is and <a href="https://cran.r-project.org/web/packages/xaringanthemer/vignettes/xaringanthemer.html">Overview</a>, and vignettes on <a href="https://cran.r-project.org/web/packages/xaringanthemer/vignettes/ggplot2-themes.html">ggplot2 Themes</a>, and <a href="https://cran.r-project.org/web/packages/xaringanthemer/vignettes/template-variables.html">Template Variables</a>.</p>
<p><img src="xaringanthemer.png" height = "400" width="600"></p>
R Can Pull the Fire Alarm!
https://rviews.rstudio.com/2020/06/18/how-to-have-r-notify-you/
Thu, 18 Jun 2020 00:00:00 +0000https://rviews.rstudio.com/2020/06/18/how-to-have-r-notify-you/
<p><em>Brian Law is a customer success representative at RStudio and a new R Views contributor.</em></p>
<p><em>Jason Rich manages all US Data Engineering for PRA Group and studies computer science at Old Dominion University.</em></p>
<p>There are times when it would be really nice to get an email from R. Maybe you have a long-running job that you would like to leave alone while you go off and do other things. When this happens, it would be nice if R notified you when it was done; something like calling out “dinner’s ready!”, but for work. Other times you may be monitoring a process for anomalies. Usually everything’s fine, except when it’s not, and then you want a “fire alarm” to go off in R. Below we’ll walk through how to automate having R send you an email, a text message, a <code>Slack</code> message, or a <code>Microsoft Teams</code> message.</p>
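<p>Whichever channel you choose, the overall pattern is the same: run the job, then call a send function on success or failure. Below is a minimal base-R sketch of that pattern; <code>run_with_notification()</code> and <code>notify</code> are hypothetical names invented here, and <code>notify</code> stands in for any of the channel-specific senders covered in the rest of this post.</p>

```r
# Generic "dinner's ready / fire alarm" wrapper (a sketch, not a package API).
# `notify` is a placeholder for any sender: blastula's smtp_send(),
# an httr::POST() to a Slack webhook, teamr's con$send(), etc.
run_with_notification <- function(job, notify) {
  result <- tryCatch(job(), error = function(e) e)
  if (inherits(result, "error")) {
    notify(paste("Fire alarm! The job failed:", conditionMessage(result)))
  } else {
    notify("Dinner's ready! The job finished.")
  }
  invisible(result)
}

# Example: here notify() just prints; swap in a real sender in practice
run_with_notification(
  job    = function() Sys.sleep(1),   # stand-in for a long-running job
  notify = function(msg) message(msg)
)
```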
<h3 id="email">Email</h3>
<pre><code class="language-r">library(blastula)
# First let's build a rich HTML email using library(blastula)
owl <- compose_email(body = md(c("Hello from Hogwarts. <br><br> The polyjuice potion is complete!")))
# Second, store credentials to later send our message via smtp (using gmail here but could be others)
create_smtp_creds_file(
file = "gmail_creds",
user = "name@gmail.com",
provider = "gmail"
) # Note, a pop up will ask for the pwd for the user you provided
# Third, send the email (using gmail here but could be others)
owl %>%
smtp_send(
to = "someone@email.com",
from = "name@gmail.com",
subject = "Mischief Managed",
credentials = creds_file("gmail_creds")
)
</code></pre>
<h3 id="text">Text</h3>
<p>If you prefer to get notifications via text messages, you can use the old trick of sending an email that gets converted into a text message to a phone number. Each mobile carrier uses a slightly different address format, so the first step is to search for “how to send email to text” and look yours up; e.g., AT&T’s is <code>ten-digit-phone-number@mms.att.net</code>. Let’s run through an example.</p>
<pre><code class="language-r">owl2 <- ""
owl2 %>%
smtp_send(
to = "xxxxxxxxxx@mms.att.net",
from = "name@gmail.com",
subject = "Mischief Managed",
credentials = creds_file("gmail_creds")
)
</code></pre>
<p>Note that the text message above will only render the subject line currently but you can tinker further.</p>
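<p>Since the only carrier-specific piece is the gateway domain, you can also build the address programmatically. Here is a minimal base-R sketch; <code>sms_address()</code> is a hypothetical helper invented for illustration, and <code>mms.att.net</code> is just the AT&T example from above, so substitute your own carrier’s gateway.</p>

```r
# Build an email-to-SMS gateway address from a phone number.
# The gateway domain varies by carrier; look up your own.
sms_address <- function(number, gateway = "mms.att.net") {
  digits <- gsub("[^0-9]", "", number)  # strip punctuation and spaces
  stopifnot(nchar(digits) == 10)        # expect a ten-digit number
  paste0(digits, "@", gateway)
}

sms_address("(555) 123-4567")
#> [1] "5551234567@mms.att.net"
```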
<h3 id="slack">Slack</h3>
<p>Who knew that chat rooms would make such a comeback? If you use <code>Slack</code> and want to send a message there, you can do so using R. Here we’ll walk through a stripped-down method that posts messages directly to the <code>Slack</code> API using the <code>httr</code> package. If you want to get fancy, there is also <code>library(slackr)</code>, which offers more control and features.</p>
<p>There are a few steps to set things up. First, you will need a Slack account. Second, go to <code>api.slack.com/apps</code> and log in.</p>
<p><img src="slack_1.png" height = "400" width="600"></p>
<p>Third, click on “Create New App”. Fourth, on the left sidebar, click on “Incoming Webhooks” and make sure “Activate Incoming Webhooks” is “On”.</p>
<p><img src="slack_2.png" height = "400" width="600"></p>
<p>Then, scroll to the bottom of the page and click “Add New Webhook to Workspace”.</p>
<p><img src="slack_3.png" height = "400" width="600"></p>
<p>Next, choose what <code>Slack</code> channel you want to generate a webhook to connect with, for example, your own <code>Slack</code> username, or a more general <code>Slack</code> channel at your company, like “cat_photos”.</p>
<p><img src="slack_4.png" height = "400" width="600"></p>
<p>Lastly, click on the “Copy” button to copy the webhook, which is what we’ll use in our R code below to actually send the message.</p>
<p><img src="slack_3.png" height = "400" width="600"></p>
<pre><code class="language-r">library(httr)
test_msg <- list(text="hello world!")
hook_to_me <- "https://hooks.slack.com/services/some_long_hash"
POST(hook_to_me, encode = "json", body = test_msg)
if (2 < 3) { # placeholder condition (always TRUE): swap in a real anomaly check to make a "fire alarm"
hook_to_cats_channel <- "https://hooks.slack.com/services/some_long_hash"
POST(hook_to_cats_channel, encode = "json", body = test_msg)
}
</code></pre>
<h3 id="microsoft-teams">Microsoft Teams</h3>
<p>The process of connecting to <code>Microsoft Teams</code> is similar to connecting to <code>Slack</code> under the covers, using a webhook and leveraging the Incoming Webhook app, installed from the <code>Teams</code> app store. To get started, let’s install the <code>teamr</code> package from <code>CRAN</code>, call the library, and create our connection.</p>
<p>Just like with <code>Slack</code>, you will first need an organizational <code>Teams</code> account. Secondly, navigate to the <code>add more apps</code> button from within the desktop client, which is the ellipsis below the pinned apps on the left task bar.</p>
<p><img src="teams_6.png" height = "400" width="600"></p>
<p>Third, use the search bar, located in the upper left corner, to search for the <code>Incoming Webhook</code> app.</p>
<p><img src="teams_1.png" height = "400" width="600"></p>
<p>Click on the icon to bring up the install screen, where you can name the webhook and assign it to a specific channel within <code>Teams</code>.</p>
<p><img src="teams_2.png" height = "400" width="600"></p>
<p>After following the prompts to name the connection and assign it to a channel, you are given the option of uploading an image so the webhook is easily recognizable when notifications are posted to your channel. Here, I have chosen to use an <code>RStudio</code> logo, which, as we will see later, makes it easier to know who, and/or from where, the post originates. After you are satisfied, click the done button in the bottom left corner.</p>
<p><img src="teams_4.png" height = "400" width="600"></p>
<p>The code to test your webhook is reasonably simple, requiring only six lines of code. In practice, though, we leverage many, if not all, of the customizable features provided by the <code>teamr</code> package. You can read more about these features either in the official documentation or on the <code>GitHub</code> page <a href="https://github.com/wwwjk366/teamr">https://github.com/wwwjk366/teamr</a>.</p>
<pre><code class="language-r">library(teamr)
con <- connector_card$new(hookurl = "https://outlook.office.com/webhook/...")
con$text("This is the notification body!")
con$title("Message Title")
con$add_link_button("The GitHub repo for the teamr package", "https://github.com/wwwjk366/teamr")
con$send()
#[1] TRUE
</code></pre>
<p>The final step, after running this code, is to verify that everything is working as it should. The code snippet above sends a notification to <code>Teams</code>, with a link button pointing to the package’s GitHub repo, from a webhook bot displaying the <code>RStudio</code> logo.</p>
<p><img src="teams_5.png" height = "400" width="600"></p>
<p>We hope you found this discussion helpful. The next time you wish there was a way to have <code>R</code> notify you, you can incorporate one of these into your workflow.</p>
Learning R With Education Datasets
https://rviews.rstudio.com/2020/06/11/learning-r-with-education-datasets/
Thu, 11 Jun 2020 00:00:00 +0000https://rviews.rstudio.com/2020/06/11/learning-r-with-education-datasets/
<p><em>Ryan A. Estrellado is a public education leader and data scientist helping administrators use practical data analysis to improve the student experience.</em></p>
<p>Timothy Gallwey wrote in <em>The Inner Game of Tennis</em>:</p>
<blockquote>
<p>…There is a natural learning process which operates within everyone, if it is allowed to. This process is waiting to be discovered by all those who do not know of its existence … It can be discovered for yourself, if it hasn’t been already. If it has been experienced, trust it.</p>
</blockquote>
<p>Discovering a new R concept like a function or package is exciting. You never know if you’re about to learn something that fundamentally changes the way you code or solve data science problems. But I get even more excited when I see somebody <em>use</em> new R concepts. For example, I learned about random forest models when I read about them in <a href="https://www.amazon.com/Introduction-Statistical-Learning-Applications-Statistics/dp/1461471370">An Introduction to Statistical Learning (ISL)</a>. Then I imagined myself using them when I watched <a href="https://youtu.be/LPptRkGoYMg">Julia Silge fit a random forest model</a> to predict attendance at NFL games. I need the reading to give me language for what I see data scientists do. Then I need to see what data scientists do for me to imagine myself doing what I’ve read.</p>
<p>Still, for most people using R in their jobs, there’s another step. They have to imagine how to apply what they’ve read and seen to the problems they’re solving at work. But what if we used education datasets to help them imagine using R on the job, just as the authors of ISL use words and code to teach about models and Julia Silge uses video to inspire coding?</p>
<p>We learned from writing <a href="https://datascienceineducation.com"><em>Data Science in Education Using R (DSIEUR)</em></a> that we can combine words, code, and professional context. Professional context includes scenarios, language, and data that readers will recognize in their education jobs. We wanted readers to feel motivated and engaged by seeing words and data that remind them of their everyday work tasks. This connection to their professional lives is a hook for readers as they engage with R syntax, which is, if you’ve never used it, literally a foreign language.</p>
<p>Let’s use <code>pivot_longer()</code> as an example. We’ll describe this process in three steps: discovering the concept, seeing how the concept is used, and seeing how the concept is used <em>in education</em>.</p>
<p><strong>Step 1: See the concept</strong></p>
<p>When I read something like “Use <code>pivot_longer()</code> to transform a dataset from wide to long”, I can imagine the shape of a dataset changing. But it’s harder to imagine what happens with the variables and their contents as the dataset’s shape changes. I’ve been using R for over five years and I still struggle to visualize the contents of many columns rearranging themselves into one.</p>
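<p>A toy example, invented here for illustration, can make the rearrangement easier to picture: two score columns fold into a single name column and a single value column.</p>

```r
library(tidyr)

# Invented data: one row per student, one column per subject (wide)
scores <- tibble::tibble(
  student = c("A", "B"),
  math    = c(90, 85),
  art     = c(70, 95)
)

# The two subject columns collapse into "subject"/"score" pairs (long)
pivot_longer(scores, cols = c(math, art),
             names_to = "subject", values_to = "score")
```

<p>The result has four rows, one per student-subject pair, with the old column names now living in the <code>subject</code> column.</p>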
<p><strong>Step 2: See how the concept is used</strong></p>
<p>The concept gets much clearer when you add an example—even one with little context—to the explanation. Here’s one from the <code>pivot_longer()</code> vignette, which you can view with <code>vignette("pivot")</code>:</p>
<pre class="r"><code>library(tidyverse)</code></pre>
<pre class="r"><code># Simplest case where column names are character data
relig_income</code></pre>
<pre><code>#> # A tibble: 18 x 11
#> religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Agnostic 27 34 60 81 76 137 122
#> 2 Atheist 12 27 37 52 35 70 73
#> 3 Buddhist 27 21 30 34 33 58 62
#> 4 Catholic 418 617 732 670 638 1116 949
#> 5 Don’t k… 15 14 15 11 10 35 21
#> 6 Evangel… 575 869 1064 982 881 1486 949
#> 7 Hindu 1 9 7 9 11 34 47
#> 8 Histori… 228 244 236 238 197 223 131
#> 9 Jehovah… 20 27 24 24 21 30 15
#> 10 Jewish 19 19 25 25 30 95 69
#> 11 Mainlin… 289 495 619 655 651 1107 939
#> 12 Mormon 29 40 48 51 56 112 85
#> 13 Muslim 6 7 9 10 9 23 16
#> 14 Orthodox 13 17 23 32 32 47 38
#> 15 Other C… 9 7 11 13 13 14 18
#> 16 Other F… 20 33 40 46 49 63 46
#> 17 Other W… 5 2 3 4 2 7 3
#> 18 Unaffil… 217 299 374 365 341 528 407
#> # … with 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>, `Don't
#> # know/refused` <dbl></code></pre>
<pre class="r"><code>relig_income %>%
pivot_longer(-religion, names_to = "income", values_to = "count")</code></pre>
<pre><code>#> # A tibble: 180 x 3
#> religion income count
#> <chr> <chr> <dbl>
#> 1 Agnostic <$10k 27
#> 2 Agnostic $10-20k 34
#> 3 Agnostic $20-30k 60
#> 4 Agnostic $30-40k 81
#> 5 Agnostic $40-50k 76
#> 6 Agnostic $50-75k 137
#> 7 Agnostic $75-100k 122
#> 8 Agnostic $100-150k 109
#> 9 Agnostic >150k 84
#> 10 Agnostic Don't know/refused 96
#> # … with 170 more rows</code></pre>
<p>Sharing an idea by pairing an abstract programming concept with a reproducible example is a common practice for experienced R programmers. <a href="https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example">Community guidelines for Stack Overflow posts</a> and the <a href="https://www.tidyverse.org/help/">{reprex}</a> package are two artifacts of a popular R community norm: help folks understand an idea by using words <em>and</em> code.</p>
<p><strong>Step 3: See how the concept is used in education</strong></p>
<p>Combining the explanation with a reproducible example makes <code>pivot_longer()</code> more concrete by showing how it works. What happens when we connect the explanation and reproducible example to the everyday work of a data scientist in education?</p>
<p>In <a href="https://datascienceineducation.com/c07.html">chapter seven</a> of <em>DSIEUR</em>, we use <code>pivot_longer()</code> to transform a dataset of coursework survey responses from wide to long. Before using <code>pivot_longer()</code>, the dataset had a column for each survey question. When we use <code>pivot_longer()</code>, the name of each survey question moves to a new column called “question”. Another new column is added, “response”, which contains the corresponding response to each survey question.</p>
<p>To run this code, you’ll need the <em>DSIEUR</em> companion R package, <a href="https://github.com/data-edu/dataedu">{dataedu}</a>:</p>
<pre class="r"><code># Install the {dataedu} package if you don't have it
# devtools::install_github("data-edu/dataedu")
library(dataedu)</code></pre>
<p>Here’s the survey data in its original, wide format:</p>
<pre class="r"><code># Wide format
pre_survey</code></pre>
<pre><code>#> # A tibble: 1,102 x 12
#>    opdata_username opdata_CourseID Q1Maincellgroup… Q1Maincellgroup…
#>    &lt;chr&gt;           &lt;chr&gt;                      &lt;dbl&gt;            &lt;dbl&gt;
#>  1 _80624_1        FrScA-S116-01                  4                4
#>  2 _80623_1        BioA-S116-01                   4                4
#>  3 _82588_1        OcnA-S116-03                  NA               NA
#>  4 _80623_1        AnPhA-S116-01                  4                3
#>  5 _80624_1        AnPhA-S116-01                 NA               NA
#>  6 _80624_1        AnPhA-S116-02                  4                2
#>  7 _80624_1        AnPhA-T116-01                 NA               NA
#>  8 _80624_1        BioA-S116-01                   5                3
#>  9 _80624_1        BioA-T116-01                  NA               NA
#> 10 _80624_1        PhysA-S116-01                  4                4
#> # … with 1,092 more rows, and 8 more variables: Q1MaincellgroupRow3 &lt;dbl&gt;,
#> #   Q1MaincellgroupRow4 &lt;dbl&gt;, Q1MaincellgroupRow5 &lt;dbl&gt;,
#> #   Q1MaincellgroupRow6 &lt;dbl&gt;, Q1MaincellgroupRow7 &lt;dbl&gt;,
#> #   Q1MaincellgroupRow8 &lt;dbl&gt;, Q1MaincellgroupRow9 &lt;dbl&gt;,
#> #   Q1MaincellgroupRow10 &lt;dbl&gt;</code></pre>
<p>The third through twelfth columns are named after each survey question—“Q1MaincellgroupRow1”, “Q1MaincellgroupRow2”, “Q1MaincellgroupRow3”, and so on. These are the column names we’ll move into a single column called “question” when the dataset transforms from wide to long.</p>
<p>Here’s the new dataset, where a column called “question” contains the question names and a column called “response” contains the corresponding responses:</p>
<pre class="r"><code># Pivot the dataset from wide to long format
pre_survey %&gt;%
  pivot_longer(cols = Q1MaincellgroupRow1:Q1MaincellgroupRow10,
               names_to = "question",
               values_to = "response")</code></pre>
<pre><code>#> # A tibble: 11,020 x 4
#>    opdata_username opdata_CourseID question             response
#>    &lt;chr&gt;           &lt;chr&gt;           &lt;chr&gt;                   &lt;dbl&gt;
#>  1 _80624_1        FrScA-S116-01   Q1MaincellgroupRow1         4
#>  2 _80624_1        FrScA-S116-01   Q1MaincellgroupRow2         4
#>  3 _80624_1        FrScA-S116-01   Q1MaincellgroupRow3         4
#>  4 _80624_1        FrScA-S116-01   Q1MaincellgroupRow4         1
#>  5 _80624_1        FrScA-S116-01   Q1MaincellgroupRow5         5
#>  6 _80624_1        FrScA-S116-01   Q1MaincellgroupRow6         4
#>  7 _80624_1        FrScA-S116-01   Q1MaincellgroupRow7         1
#>  8 _80624_1        FrScA-S116-01   Q1MaincellgroupRow8         5
#>  9 _80624_1        FrScA-S116-01   Q1MaincellgroupRow9         5
#> 10 _80624_1        FrScA-S116-01   Q1MaincellgroupRow10        5
#> # … with 11,010 more rows</code></pre>
<p>When you put it all together, the learning thought process is something like this:</p>
<ul>
<li>There’s a function called <code>pivot_longer()</code>, which turns a wide dataset into a long dataset</li>
<li><code>pivot_longer()</code> does this by moving multiple column names into a single new column, then creating another column that pairs each column name with its value</li>
<li>I can use <code>pivot_longer()</code> to change an education survey dataset that has question names for columns into one that has a “question” column and a “response” column</li>
</ul>
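<p>If you’d like to try the reshaping logic without installing {dataedu}, the same idea works on a toy tibble; this sketch uses only {tibble} and {tidyr}, and the column names are made up:</p>
<pre class="r"><code>library(tibble)
library(tidyr)

# Two students, two survey questions, one column per question (wide format)
toy_survey &lt;- tibble(
  student = c("a", "b"),
  Q1 = c(4, 5),
  Q2 = c(2, 3)
)

# Move the question names into a "question" column and the values into a
# "response" column (long format): one row per student-question pair
pivot_longer(toy_survey,
             cols = Q1:Q2,
             names_to = "question",
             values_to = "response")
# The result has 4 rows (2 students x 2 questions) and 3 columns:
# student, question, and response</code></pre>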
<p>We’ll be back with the next post in about two weeks. Until then, do share with us about the people and tools that inspire you to work on collaborative projects. You can reach us on Twitter: Emily <a href="https://twitter.com/ebovee09">@ebovee09</a>, Jesse <a href="https://twitter.com/kierisi">@kierisi</a>, Joshua <a href="https://twitter.com/jrosenberg6432">@jrosenberg6432</a>, Isabella <a href="https://twitter.com/ivelasq3">@ivelasq3</a>, and me <a href="https://twitter.com/RyanEs">@RyanEs</a>.</p>
<script>window.location.href='https://rviews.rstudio.com/2020/06/11/learning-r-with-education-datasets/';</script>
More Select COVID-19 Resources
https://rviews.rstudio.com/2020/06/03/more-select-covid-19-resources/
Wed, 03 Jun 2020 00:00:00 +0000https://rviews.rstudio.com/2020/06/03/more-select-covid-19-resources/
<p>We are over five months into this pandemic, and it is pretty clear that almost everyone is really tired of hearing about it. I myself am totally zoomed out and have already seen too many dashboards. Nevertheless, we are in this for the long run. So, from time to time, I think it worthwhile to continue to look for tools that can help us make some sense of the continuing stream of incoming data.</p>
<p>First, I would like to draw your attention to the <a href="https://aatishb.com/covidtrends/?doublingtime=7">Covid Trends</a> animated dashboard from physics teacher <a href="https://aatishb.com/">Aatish Bhatia</a>. The epidemiologists are the experts in this domain, but it is just like a physicist to deliver insight.</p>
<p><img src="CovidTrends.png" height="600" width="100%"></p>
<p>What’s unique about this dashboard is how beautifully it illustrates the consequences of exponential growth. Notice that there is no time axis on the graph. Total confirmed cases are plotted on the x axis, and new confirmed cases in the past week are plotted on the y axis. In this setup, doubling times are represented as straight lines. As you run the animation, time passes and you observe the various countries hugging the seven-day doubling-time line and then dropping down as whatever countermeasures they are taking bring the epidemic under control. This plot makes it clear that while things are opening up in the U.S., we do not quite have the disease under control. Please do watch the short video explaining the graph.</p>
<iframe width="848" height="500" src="https://www.youtube.com/embed/54XLXg4fYsc" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>Next, please have a look at the <a href="https://covid19datahub.io/">COVID-19 Data Hub</a>, an open-source project started by finance Ph.D. student <a href="https://guidotti.dev/">Emanuele Guidotti</a> with initial support from <a href="https://ivado.ca/en/">IVADO</a> arranged by <a href="https://ardiad.github.io/website/">David Ardia</a>, which may very well become the main repository for epidemiologists working with COVID-19 case data. Currently over sixty data sets are available.</p>
<iframe width="848" height="500" src="https://www.youtube.com/embed/Uj6zTnZWJWA" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>All data sets are in a standardized format and are <a href="https://covid19datahub.io/articles/doc/data.html">well documented</a>. Additionally, the site provides R, Python, MATLAB, Julia, Node.js, Scala, and Excel code to access the data. This project is an extraordinary effort that deserves community support.</p>
<p>Finally for today, I recommend the <a href="https://www.youtube.com/watch?v=6N1p99bLXjk&feature=youtu.be">video recording</a> from the first <a href="https://covid19-data-forum.org/">COVID-19 Data Forum</a> webinar held on May 14, 2020. After the opening remarks by Michael Kane, Assistant Professor, Department of Biostatistics, Yale University, which begin at one minute and forty seconds (1:40) into the video, there are four talks, each approximately fifteen minutes long.</p>
<iframe width="848" height="500" src="https://www.youtube.com/embed/6N1p99bLXjk" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>The first talk, <em>Modeling COVID19 spread and control: Data needs and challenges</em>, by Alison Hill of the Department of Organismic & Evolutionary Biology at Harvard University, begins at (5:33). The second talk, <em>Collecting and Visualizing COVID-19 Case Count Data from Multiple Open Sources</em>, by independent consultant Ryan Hafen begins at (21:26). The third talk, <em>Spatial and Space-Time Data on COVID-19</em>, by Orhun Aydin of <a href="https://www.esri.com/en-us/home">esri</a> (the Environmental Systems Research Institute) and the University of Southern California begins at (38:49). The final talk, by Noam Ross of the <a href="https://www.ecohealthalliance.org/">EcoHealth Alliance</a> and <a href="https://ropensci.org/">rOpenSci</a>, begins at (56:23) and focuses on the genomic data that enables scientists to study the emergence of new diseases.</p>
<p>Enjoy the videos.</p>
<script>window.location.href='https://rviews.rstudio.com/2020/06/03/more-select-covid-19-resources/';</script>
April 2020: "Top 40" New CRAN Packages
https://rviews.rstudio.com/2020/05/28/april-2020-top-40-new-cran-packages/
Thu, 28 May 2020 00:00:00 +0000https://rviews.rstudio.com/2020/05/28/april-2020-top-40-new-cran-packages/
<p>One hundred forty-eight new packages made it to CRAN in April. Here are my “Top 40” picks in nine categories: Computational Methods, Data, Machine Learning, Medicine, Science, Statistics, Time Series, Utilities, and Visualization.</p>
<h3 id="computational-methods">Computational Methods</h3>
<p><a href="https://cran.r-project.org/package=JuliaConnectoR">JuliaConnectoR</a> v0.6.0: Allows users to import <code>Julia</code> packages and functions in such a way that they can be called directly as <code>R</code> functions.</p>
<p><a href="https://cran.r-project.org/package=RcppBigIntAlgos">RcppBigIntAlgos</a>: v0.2.2: Implements the multiple polynomial quadratic sieve (MPQS) algorithm for factoring large integers and a vectorized factoring function that returns the complete factorization of an integer. See <a href="https://link.springer.com/chapter/10.1007%2F3-540-39757-4_17">Pomerance (1984)</a> and <a href="https://www.ams.org/journals/mcom/1987-48-177/S0025-5718-1987-0866119-8/home.html">Silverman (1987)</a> for background and this <a href="https://docs.microsoft.com/en-us/archive/blogs/devdev/factoring-large-numbers-with-quadratic-sieve">Microsoft post</a> for an explanation.</p>
<p><a href="https://cran.r-project.org/package=smoothedLasso">smoothedLasso</a> v1.0: Implements the smoothed LASSO regression using the method of <a href="https://link.springer.com/article/10.1007%2Fs10107-004-0552-5">Nesterov (2005)</a>.</p>
<h3 id="data">Data</h3>
<p><a href="https://cran.r-project.org/package=daqapo">daqapo</a> v0.3.0: Provides a variety of methods to identify data quality issues in process-oriented data. There is an <a href="https://cran.r-project.org/web/packages/daqapo/vignettes/Introduction-to-DaQAPO.html">Introduction</a>.</p>
<p><a href="https://cran.r-project.org/package=DSOpal">DSOpal</a> v1.1.0: Provides the <a href="https://www.datashield.ac.uk/">DataShield</a> implementation for <a href="https://www.obiba.org/pages/products/opal/">Opal</a>, the data integration application for biobanks by <a href="https://www.obiba.org/">OBiBa</a>, an open-source software project for epidemiology.</p>
<p><a href="https://cran.r-project.org/package=epuR">epuR</a> v0.1: Provides functions to collect data from the <a href="https://www.policyuncertainty.com/index.html">economic policy uncertainty</a> website. See the <a href="https://cran.r-project.org/web/packages/epuR/vignettes/epuR_intro.html">vignette</a>.</p>
<p><img src="epuR.png" height="400" width="400"></p>
<p><a href="https://cran.r-project.org/package=hystReet">hystReet</a> v0.0.1: Implements an API wrapper for the <a href="https://hystreet.com">Hystreet project</a> which provides pedestrian counts for various cities in Germany. See the <a href="https://cran.r-project.org/web/packages/hystReet/vignettes/Getting_started_with_the_R_package_hystReet.html">vignette</a> to get started.</p>
<p><img src="hystreet.png" height = "600" width="400"></p>
<p><a href="https://cran.r-project.org/package=rGEDI">rGEDI</a> v0.1.7: Provides a set of tools for downloading, reading, visualizing and processing <a href="https://gedi.umd.edu/">GEDI</a> Level1B, Level2A and Level2B data. See the <a href="https://cran.r-project.org/web/packages/rGEDI/vignettes/tutorial.html">vignette</a> to get started.</p>
<p><img src="rGEDI.png" height = "600" width="400"></p>
<h3 id="machine-learning">Machine Learning</h3>
<p><a href="https://cran.r-project.org/package=catsim">catsim</a> v0.2.1: Computes structural similarity metrics for binary and categorical 2D and 3D images, including Cohen’s kappa, the Rand index, adjusted Rand index, Jaccard index, Dice index, normalized mutual information, and adjusted mutual information. See <a href="https://arxiv.org/abs/2004.09073">Thompson & Maitra (2020)</a> for background and the <a href="https://cran.r-project.org/web/packages/catsim/vignettes/two-dimensional-example.html">vignette</a> for an introduction.</p>
<p><img src="catsim.png" height = "400" width="400"></p>
<p><a href="https://cran.r-project.org/package=klic">klic</a> v1.0.2: Implements a kernel learning integrative clustering algorithm which allows combining multiple kernels, each representing a different measure of the similarity between a set of observations. There is an <a href="https://cran.r-project.org/web/packages/klic/vignettes/klic-vignette.html">Introduction</a>.</p>
<p><img src="klic.png" height = "600" width="400"></p>
<p><a href="https://cran.r-project.org/package=MIDASwrappeR">MIDASwrappeR</a> v0.5.1: Provides a wrapper for the C++ implementation of the <code>MIDAS</code> algorithm described in <a href="https://www.comp.nus.edu.sg/~sbhatia/assets/pdf/midas.pdf">Bhatia et al. (2020)</a> for graph-like data. See the <a href="https://cran.r-project.org/web/packages/MIDASwrappeR/vignettes/Introduction.html">Introduction</a>.</p>
<p><img src="MIDAS.png" height = "600" width="400"></p>
<p><a href="https://cran.r-project.org/package=VUROCS">VUROCS</a> v1.0: Calculates the volume under the ROC surface and its (co)variance for ordered multi-class ROC analysis as well as certain bivariate ordinal measures of association.</p>
<p><a href="https://cran.r-project.org/package=WeightSVM">WeightSVM</a> v1.7-4: Provides functions for subject/instance weighted support vector machines (SVM). It uses a modified version of <code>libsvm</code> and is compatible with the <code>e1071</code> package. Look <a href="https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#weights_for_data_instances">here</a> for some background.</p>
<h3 id="medicine">Medicine</h3>
<p><a href="https://cran.r-project.org/package=covid19.analytics">covid19.analytics</a> v1.1: Provides functions to load and analyze COVID-19 data from the Johns Hopkins University <a href="https://github.com/CSSEGISandData/COVID-19">CSSE data repository</a>. It includes functions to visualize cases for specific geographical locations, generate interactive visualizations and produce a SIR model. See the <a href="https://cran.r-project.org/web/packages/covid19.analytics/vignettes/covid19.analytics.html">vignette</a> for an introduction.</p>
<p><a href="https://cran.r-project.org/package=covid19france">covid19france</a>: Provides functions to import, clean and update French COVID-19 data from <a href="https://github.com/opencovid19-fr/data">opencovid19-fr</a>.</p>
<p><a href="https://cran.r-project.org/package=interactionR">interactionR</a> v0.1.1: Produces a publication-ready table that includes all effect estimates necessary for fully reporting effect modification and interaction analyses as recommended by <a href="https://academic.oup.com/ije/article/41/2/514/692957">Knol & Vanderweele (2012)</a>, and estimates confidence intervals for additive interaction measures using the delta method of <a href="https://journals.lww.com/epidem/Abstract/1992/09000/Confidence_Interval_Estimation_of_Interaction.12.aspx">Hosmer & Lemeshow (1992)</a>, the variance recovery method of <a href="https://academic.oup.com/aje/article/168/2/212/100828">Zou (2008)</a>, or percentile bootstrapping per <a href="https://www.jstor.org/stable/3702864?seq=1">Assmann et al. (1996)</a>.</p>
<p><a href="https://cran.r-project.org/package=RCT">RCT</a> v1.0.2: Provides tools to facilitate the process of designing and evaluating randomized control trials, including methods to handle misfits, power calculations, balance regressions, and more. For background see <a href="https://arxiv.org/abs/1607.00698">Athey et al. (2017)</a>. The <a href="https://cran.r-project.org/web/packages/RCT/vignettes/my-vignette.html">vignette</a> describes how to use the package.</p>
<h3 id="science">Science</h3>
<p><a href="https://cran.r-project.org/package=rasterdiv">rasterdiv</a>: Provides functions to calculate indices of diversity on numerical matrices based on information theory. The rationale behind the package is described in <a href="https://www.sciencedirect.com/science/article/abs/pii/S1470160X16304319?via%3Dihub">Rocchini et al. (2017)</a>. See the <a href="https://cran.r-project.org/web/packages/rasterdiv/vignettes/vignettes_rasterdiv.html">vignette</a> for an extended example.</p>
<p><a href="https://cran.r-project.org/package=SSHAARP">SSHAARP</a> v1.0.0: Processes amino acid alignments from the <a href="https://www.ebi.ac.uk/ipd/imgt/hla/">IPD-IMGT/HLA</a> database to identify user-defined amino acid residue motifs shared across HLA alleles, calculate the frequencies of those motifs, and generate global frequency heat maps that illustrate the distribution of each user-defined map around the globe. See the <a href="https://cran.r-project.org/web/packages/SSHAARP/vignettes/vignette.html">vignette</a> for an introduction.</p>
<p><img src="SSHAARP.jpeg" height = "600" width="400"></p>
<h3 id="statistics">Statistics</h3>
<p><a href="https://cran.r-project.org/package=BayesSampling">BayesSampling</a> v1.0.0: Provides functions for applying the Bayes Linear approach to finite populations under simple random sampling and stratified simple random sampling designs, and to the ratio estimator. See <a href="https://www150.statcan.gc.ca/n1/en/catalogue/12-001-X201400111886">Gonçalves et al. (2014)</a> for background and the vignettes: <a href="https://cran.r-project.org/web/packages/BayesSampling/vignettes/BLE_Ratio.html">BLE_Ratio</a>, <a href="https://cran.r-project.org/web/packages/BayesSampling/vignettes/BLE_Reg.html">BLE_Reg</a>, <a href="https://cran.r-project.org/web/packages/BayesSampling/vignettes/BLE_SRS.html">BLE_SRS</a>, <a href="https://cran.r-project.org/web/packages/BayesSampling/vignettes/BLE_SSRS.html">BLE_SSRS</a>, and <a href="https://cran.r-project.org/web/packages/BayesSampling/vignettes/BayesSampling.html">BayesSampling</a>.</p>
<p><a href="https://cran.r-project.org/package=cort">cort</a> v0.3.1: Provides S4 classes and methods to fit several copula models, including the empirical checkerboard copula of <a href="https://www.tandfonline.com/doi/abs/10.1080/03610926.2019.1586936?journalCode=lsta20">Cuberos et al. (2019)</a> and the Copula Recursive Tree algorithm proposed by <a href="https://arxiv.org/abs/2005.02912">Laverny et al. (2020)</a>. There are vignettes on the <a href="https://cran.r-project.org/web/packages/cort/vignettes/vignette01_ecb.html">Empirical Checkerboard Copula</a>, the <a href="https://cran.r-project.org/web/packages/cort/vignettes/vignette02_cort_clayton.html">Copula Recursive Tree</a>, the <a href="https://cran.r-project.org/web/packages/cort/vignettes/vignette03_ecbkm.html">Empirical Checkerboard Copula with known margins</a>, and the <a href="https://cran.r-project.org/web/packages/cort/vignettes/vignette04_bootstrap_varying_m.html">convex mixture of m-randomized checkerboards</a>.</p>
<p><a href="https://cran.r-project.org/package=ExpertChoice">ExpertChoice</a> v0.2.0: Implements tools for designing efficient discrete choice experiments. See <a href="https://www.sciencedirect.com/science/article/abs/pii/S0167811605000510?via%3Dihub">Street et al. (2005)</a> for some background. There is a <a href="https://cran.r-project.org/web/packages/ExpertChoice/vignettes/practical.html">Practical Introduction</a> and a vignette with some <a href="https://cran.r-project.org/web/packages/ExpertChoice/vignettes/include_theory.pdf">theory</a>.</p>
<p><a href="https://cran.r-project.org/package=genscore">genscore</a> v1.0.2: Implements the generalized score matching estimator from <a href="http://jmlr.org/papers/v20/18-278.html">Yu et al. (2019)</a> for non-negative graphical models with truncated distributions, and the estimator of <a href="https://projecteuclid.org/euclid.ejs/1459967424">Lin et al. (2016)</a> for untruncated Gaussian graphical models. See the <a href="https://cran.r-project.org/web/packages/genscore/vignettes/gen_vignette.html">vignette</a>.</p>
<p><img src="genscore.png" height = "600" width="400"></p>
<p><a href="https://cran.r-project.org/package=hmma">hmma</a> v1.0.0: Provides functions to fit Bayesian asymmetric hidden Markov models (HMM-As), which are similar to regular HMMs. See <a href="https://www.sciencedirect.com/science/article/abs/pii/S0888613X17303419?via%3Dihub">Bueno et al. (2017)</a> for background and the <a href="https://cran.r-project.org/web/packages/hmma/vignettes/intro.html">vignette</a> for an introduction.</p>
<p><img src="hmma.png" height = "600" width="400"></p>
<p><a href="https://cran.r-project.org/package=lmeInfo">lmeInfo</a> v0.1.1: Provides analytic derivatives and information matrices for fitted linear mixed effects models and generalized least squares models estimated using <code>lme()</code> and <code>gls()</code> as well as functions for estimating the sampling variance-covariance of variance component parameters and standardized mean difference effect sizes. See <a href="https://journals.sagepub.com/home/jeb">Pustejovsky et al. (2014)</a> and the <a href="https://cran.r-project.org/web/packages/lmeInfo/index.html">vignette</a>.</p>
<p><a href="https://cran.r-project.org/package=metapower">metapower</a> v0.1.0: Implements a tool for computing meta-analytic statistical power for main effects, tests of homogeneity, and categorical moderator models. Have a look at <a href="https://link.springer.com/book/10.1007%2F978-1-4614-2278-5">Pigott (2012)</a>, <a href="https://psycnet.apa.org/doiLanding?doi=10.1037%2F1082-989X.9.4.426">Hedges & Pigott (2004)</a>, or <a href="https://onlinelibrary.wiley.com/doi/book/10.1002/9780470743386">Borenstein et al. (2009)</a> for background and the <a href="https://cran.r-project.org/web/packages/metapower/vignettes/Using-metapower.html">vignette</a> to get started.</p>
<p><img src="metapower.png" height="400" width="400"></p>
<p><a href="https://cran.r-project.org/package=sasLM">sasLM</a> v0.1.3: Implements the <code>SAS</code> procedures for linear models: GLM, REG, ANOVA. The <code>sasLM</code> functions produce the same results as the corresponding SAS procedures for nested and complex models.</p>
<p><a href="https://cran.r-project.org/package=sdglinkage">sdglinkage</a> v0.1.0: Provides a tool for synthetic data generation that can be used for linkage method development. There is an <a href="https://cran.r-project.org/web/packages/sdglinkage/vignettes/sdglinkage_README.html">Overview</a> and vignettes on <a href="https://cran.r-project.org/web/packages/sdglinkage/vignettes/From_Sensitive_Real_Identifiers_to_Synthetic_Identifiers.html">Real and Synthetic Identifiers</a>, <a href="https://cran.r-project.org/web/packages/sdglinkage/vignettes/Generation_of_Gold_Standard_File_and_Linkage_Files.html">Gold Standard File and Linkage Files</a>, and <a href="https://cran.r-project.org/web/packages/sdglinkage/vignettes/Synthetic_Data_Generation_and_Evaluation.html">Synthetic Data Generation and Evaluation</a>.</p>
<p><img src="sdglinkage.png" height="600" width="600"></p>
<p><a href="https://cran.r-project.org/package=starm">starm</a> v0.1.0: Estimates the coefficients of the two-time centered autologistic regression model described in <a href="https://arxiv.org/abs/1811.06782">Gegout-Petit et al. (2019)</a>. The <a href="https://cran.r-project.org/web/packages/starm/vignettes/estima.pdf">vignette</a> describes the theory.</p>
<h3 id="time-series">Time Series</h3>
<p><a href="https://cran.r-project.org/package=ConsReg">ConsReg</a> v0.1.0: Provides functions to fit regression and generalized linear models with autoregressive moving-average (ARMA) errors for time series data. There is a <a href="https://cran.r-project.org/web/packages/ConsReg/vignettes/GetStarted.html">vignette</a>.</p>
<p><img src="ConsReg.png" height="600" width="600"></p>
<p><a href="https://cran.r-project.org/package=simITS">simITS</a> v0.1.1: Implements the method of <a href="https://arxiv.org/abs/2002.05746">Miratrix (2020)</a> to create prediction intervals for post-policy outcomes in interrupted time series. It provides methods to fit ITS models with lagged outcomes and variables to account for temporal dependencies and then to simulate a set of plausible counterfactual post-policy series to compare to the observed post-policy series. See the <a href="https://cran.r-project.org/web/packages/simITS/vignettes/simple_ITS_example.html">vignette</a>.</p>
<p><img src="simITS.png" height="400" width="400"></p>
<h3 id="utilities">Utilities</h3>
<p><a href="https://cran.r-project.org/package=dreamerr">dreamerr</a> v1.1.0: Implements tools to facilitate package development by providing a flexible way to check the arguments passed to functions. See the <a href="https://cran.r-project.org/web/packages/dreamerr/vignettes/dreamerr_introduction.html">vignette</a> for details.</p>
<p><img src="dreamerr.png" height="400" width="400"></p>
<p><a href="https://cran.r-project.org/package=flair">flair</a> v0.0.2: Facilitates formatting and highlighting of <code>R</code> source code in an R Markdown-based presentation. The <a href="https://cran.r-project.org/web/packages/flair/vignettes/how_to_flair.html">vignette</a> shows how.</p>
<p><a href="https://cran.r-project.org/package=J4R">J4R</a> v1.0.7: Makes it possible to create <code>Java</code> objects and to execute <code>Java</code> methods from the <code>R</code> environment. The JVM is handled by a gateway server which relies on the <code>Java</code> library <code>j4r.jar</code>.</p>
<p><a href="https://cran.r-project.org/package=waldo">waldo</a> v0.1.0: Provides functions to compare complex R objects and reveal the key differences. It was designed primarily for use in testing packages.</p>
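<p>For a quick sense of what that looks like, <code>waldo::compare()</code> reports only the elements that differ between two objects. A minimal sketch, assuming {waldo} is installed from CRAN and using made-up data:</p>
<pre class="r"><code># install.packages("waldo")  # if needed
library(waldo)

# compare() walks two objects recursively and prints only the differences,
# which is far easier to scan than two full print() dumps
compare(
  list(a = 1, b = c("x", "y")),
  list(a = 1, b = c("x", "z"))
)
# Reports the difference in b ("y" vs "z") and says nothing about a</code></pre>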
<h3 id="visualization">Visualization</h3>
<p><a href="https://cran.r-project.org/package=anglr">anglr</a> v0.6.0: Extends <code>rgl</code> conversion and visualization functions to <code>mesh3d</code> to give direct access to generic 3D tools and provide a full suite of mesh-creation and 3D plotting functions. See the <a href="https://cran.r-project.org/web/packages/anglr/vignettes/anglr.html">vignette</a>.</p>
<p><img src="anglr.png" height="400" width="400"></p>
<p><a href="https://cran.r-project.org/package=brickr">brickr</a> v0.3.4: Uses <code>tidyverse</code> functions to generate digital LEGO models and convert image files into 2D and 3D LEGO mosaics. There are vignettes for building <a href="https://cran.r-project.org/web/packages/brickr/vignettes/mosaics.html">mosaics</a> and for generating models from <a href="https://cran.r-project.org/web/packages/brickr/vignettes/models-from-other.html">mosaics</a>, <a href="https://cran.r-project.org/web/packages/brickr/vignettes/models-from-program.html">programs</a>, <a href="https://cran.r-project.org/web/packages/brickr/vignettes/models-from-tables.html">tables</a>, and <a href="https://cran.r-project.org/web/packages/brickr/vignettes/models-piece-type.html">by piece type</a>.</p>
<p><img src="brickr.png" height="400" width="400"></p>
<p><a href="https://cran.r-project.org/package=survCurve">survCurve</a> v1.0: Provides functions to enhance plots created with the <a href="https://cran.r-project.org/package=survival"><code>survival</code></a> and <a href="https://cran.r-project.org/package=mstate"><code>mstate</code></a> packages. See the <a href="https://cran.r-project.org/web/packages/survCurve/vignettes/survCurve.html">vignette</a> for examples.</p>
<p><img src="survCurve.png" height="400" width="400"></p>
<p><a href="https://cran.r-project.org/package=textplot">textplot</a> v0.1.2: Provides functions to visualize complex relations in texts by displaying text co-occurrence networks, text correlation networks, dependency relationships and text clustering. The <a href="https://cran.r-project.org/web/packages/textplot/vignettes/textplot-examples.pdf">vignette</a> provides examples.</p>
<p><img src="textplot.gif" height="600" width="600"></p>
<script>window.location.href='https://rviews.rstudio.com/2020/05/28/april-2020-top-40-new-cran-packages/';</script>
Community and Collaboration: Writing Our Book in the Open
https://rviews.rstudio.com/2020/05/26/community-and-collaboration-writing-our-book-in-the-open/
Tue, 26 May 2020 00:00:00 +0000https://rviews.rstudio.com/2020/05/26/community-and-collaboration-writing-our-book-in-the-open/
<p><em>Ryan A. Estrellado is a public education leader and data scientist helping administrators use practical data analysis to improve the student experience.</em></p>
<p><img src="chicken.jpeg" alt="" /></p>
<p style="text-align: center;"> <a href="https://datascienceineducation.com/">Chicken Farm in the Open</a> </p>
<p>In 2017, Emily Bovee, Jesse Mostipak, Joshua Rosenberg, Isabella Velásquez, and I started work on our book, <em>Data Science in Education Using R</em> (<em>DSIEUR</em>). We had two goals for <em>DSIEUR</em>. First, we aimed to write a practical reference for data scientists in education that helps them learn and apply R skills in their jobs. Second, we wanted to share the process with the R community by writing the book in the open on GitHub. After working together for almost three years, my co-authors and I submitted the manuscript for <em>DSIEUR</em> to Routledge and are now gearing up to begin editing the print version. The print version will be out from Routledge in late 2020, but you can read the online version of <em>DSIEUR</em> now at <a href="https://datascienceineducation.com/">https://datascienceineducation.com/</a>.</p>
<p>With the writing done, we’re reflecting on lessons we’ve learned from writing DSIEUR. In the coming weeks, we’ll share these reflections on R Views as a series of blog posts. These posts are about the people and tools in the R community that inspired us to do a book like DSIEUR. Think of these as our personal notes, typed up to help us organize our thoughts about what made this project possible. We’ll share in four parts:</p>
<h3 id="part-1-teaching-r-using-everyday-examples-in-education">Part 1: Teaching R Using Everyday Examples in Education</h3>
<p>Learning R on the job presents many challenges, but one in particular sticks out. Once you start coding, it’s not obvious how to apply that code in everyday tasks at the office. We wrote DSIEUR to answer the question, “How would it feel to have a book that taught programming concepts, provided reproducible code, and used scenarios that data scientists in education recognize?” In this first post, we’ll explore how we put these elements together and what we learned in the process.</p>
<h3 id="part-2-how-the-r-community-inspired-us-to-write-about-data-science-in-education">Part 2: How the R Community Inspired Us to Write About Data Science in Education</h3>
<p>It wasn’t long before our team encountered our first writing challenge: do we describe our audience as “data scientists in education” or “education data scientists”? The debate was a symbol for a larger dilemma: what common language do you use when projects like ours aren’t yet common? It helped that the community we were writing for inspired us to explore the topic. The things we love the most about the R community–welcoming folks from different backgrounds, a collective love of side projects, and a willingness to work in the open–made it safe for us to try new things and learn. We listened to stories from data scientists in education, spent a lot of time reading the Twitter #rstats hashtag, and invited community members to join the conversation. In this post, we’ll explore how community participation empowered our writing process.</p>
<h3 id="part-3-writing-in-the-open">Part 3: Writing In the Open</h3>
<p>This post is about coordinating people and tools to write an open book, a challenging proposition for five writers who had only just met on Twitter. For instance, how would five people in different time zones write instructional materials and code together? And if coordinating five authors wasn’t hard enough, how would they invite the rest of the community to join the mission? Fortunately, people and programming tools encouraged us to believe that this project was possible. R, RStudio, {bookdown}, and Git had already solved publishing and collaboration problems for many. Except for some initial coding gaffes you’d expect from a team finding their feet and the occasional <a href="https://happygitwithr.com/burn.html">burned down fork</a>, these tools freed us to focus on the larger task at hand: finding a common language for data science in education. We’ll close this post by discussing how books, authored through an open-source approach, can serve as an innovative platform for sharing knowledge with a wider audience.</p>
<h3 id="conclusion-one-writer-five-authors">Conclusion: One Writer, Five Authors</h3>
<p>How do you get five points of view to sound like a single voice? You’ll need a flexible sense of clarity, which I think is what Jesse meant when she said in a recent team call, “I have strong opinions, loosely held.” And it helps to have some basic rules as guardrails to flank your team as you march towards your writing deadlines. In this last post, we’ll share the workflows and processes we leaned on to discover what we wanted this book to be. We’ll also share our go-to tactics to keep the work going for the long haul, like managing meeting agendas, creating flexible norms for participation, and playing to individual strengths.</p>
<p>We’ll be back with that first post in about two weeks. Until then, do tell us about the people and tools that inspire you to work on collaborative projects. You can reach us on Twitter: Emily <a href="https://twitter.com/ebovee09">@ebovee09</a>, Jesse <a href="https://twitter.com/kierisi">@kierisi</a>, Joshua <a href="https://twitter.com/jrosenberg6432">@jrosenberg6432</a>, Isabella <a href="https://twitter.com/ivelasq3">@ivelasq3</a>, and me <a href="https://twitter.com/RyanEs">@RyanEs</a>.</p>
<p>See you in two weeks!</p>
<p>Ryan, with help from Emily, Jesse, Joshua, and Isabella</p>
<ul>
<li><p><em>Emily A. Bovee, Ph.D., is an educational data scientist working in dental education.</em></p></li>
<li><p><em>Jesse Mostipak, M.Ed., is a community advocate, Kaggle educator and data scientist.</em></p></li>
<li><p><em>Joshua M. Rosenberg, Ph.D., is Assistant Professor of STEM Education at the University of Tennessee, Knoxville.</em></p></li>
<li><p><em>Isabella C. Velásquez, MS, is a data analyst committed to nonprofit work with the aim of reducing racial and socioeconomic inequities.</em></p></li>
</ul>
Modern Rule-Based Models
https://rviews.rstudio.com/2020/05/21/modern-rule-based-models/
Thu, 21 May 2020 00:00:00 +0000
<div id="modern-rule-based-models" class="section level2">
<h2>Modern Rule-Based Models</h2>
<p>Machine learning models come in many shapes and sizes. While deep learning models currently have the lion’s share of coverage, there are many other classes of models that are effective across many different problem domains. This post gives a short summary of several <em>rule-based models</em> that are closely related to tree-based models (but are less widely known).</p>
<p>While this post focuses on explaining how these models work, it coincides with the release of the <code>rules</code> package, a tidymodels package that provides a user interface to these models. A <a href="https://www.tidyverse.org/blog/2020/05/rules-0-0-1/">companion post</a> on the tidyverse blog describes the usage of the package.</p>
<p>To start, let’s discuss the concept of rules more generally.</p>
<div id="what-is-a-rule" class="section level3">
<h3>What is a rule?</h3>
<p>Rules in machine learning have been around for a long time (Quinlan, 1979). The focus of this article is using rules for traditional supervised learning (as opposed to <a href="https://en.wikipedia.org/wiki/Association_rule_learning"><em>association rule</em></a> mining). In the context of feature engineering, a rule is a <strong>conditional logical statement</strong>. It can be attached to some sort of predicted value too, such as</p>
<pre class="r"><code>if (chance_of_rain > 0.75) {
  umbrella <- "yes"
} else {
  umbrella <- "no"
}</code></pre>
<p>There are various ways to create rules from data. The most popular method is to create a tree-based model and then “flatten” the model structure into a set of rules. This is called a “separate and conquer” approach.</p>
<p>To demonstrate this, let’s use a data set with housing prices from Sacramento CA. Two predictors with a large number of levels are removed to make the rule output more readable:</p>
<pre class="r"><code>for (pkg in c('dplyr', 'modeldata', 'rpart')) {
  if (!requireNamespace(pkg)) {
    install.packages(pkg)
  }
}

library(dplyr)

data(Sacramento, package = "modeldata")

Sacramento <-
  Sacramento %>%
  mutate(price = log10(price)) %>%
  select(-zip, -city)

str(Sacramento)</code></pre>
<pre><code>## tibble [932 × 7] (S3: tbl_df/tbl/data.frame)
## $ beds : int [1:932] 2 3 2 2 2 3 3 3 2 3 ...
## $ baths : num [1:932] 1 1 1 1 1 1 2 1 2 2 ...
## $ sqft : int [1:932] 836 1167 796 852 797 1122 1104 1177 941 1146 ...
## $ type : Factor w/ 3 levels "Condo","Multi_Family",..: 3 3 3 3 3 1 3 3 1 3 ...
## $ price : num [1:932] 4.77 4.83 4.84 4.84 4.91 ...
## $ latitude : num [1:932] 38.6 38.5 38.6 38.6 38.5 ...
## $ longitude: num [1:932] -121 -121 -121 -121 -121 ...</code></pre>
<p>Consider a basic CART model created using <code>rpart</code>:</p>
<pre class="r"><code>library(rpart)
rpart(price ~ ., data = Sacramento) </code></pre>
<pre><code>## n= 932
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 932 48.5000 5.334
## 2) sqft< 1594 535 18.7200 5.211
## 4) sqft< 1170 235 8.0440 5.107
## 8) sqft< 934.5 71 2.2760 5.002 *
## 9) sqft>=934.5 164 4.6500 5.152 *
## 5) sqft>=1170 300 6.1390 5.292
## 10) longitude< -121.3 243 4.8850 5.270 *
## 11) longitude>=-121.3 57 0.6081 5.388 *
## 3) sqft>=1594 397 10.6000 5.501
## 6) sqft< 2317 245 4.7420 5.432
## 12) longitude< -121.3 205 3.3190 5.407 *
## 13) longitude>=-121.3 40 0.6179 5.562 *
## 7) sqft>=2317 152 2.8450 5.611
## 14) longitude< -121.2 110 1.6790 5.570 *
## 15) longitude>=-121.2 42 0.4864 5.720 *</code></pre>
<p>The splits in this particular tree involve the same two predictors (<code>sqft</code> and <code>longitude</code>). Each path to a terminal node is composed of a set of <code>if-then</code> rules. Consider the path to the eighth terminal node:</p>
<pre class="r"><code>if (sqft < 1594 & sqft < 1169.5 & sqft < 934.5) then pred = 5.001820</code></pre>
<p>It is easy to <em>prune</em> this rule down to just <code>sqft < 934.5</code>. In all, the rules generated from this model would be:</p>
<pre><code>## price
## 5.0 when sqft < 935
## 5.2 when sqft is 935 to 1170
## 5.3 when sqft is 1170 to 1594 & longitude < -121
## 5.4 when sqft is 1170 to 1594 & longitude >= -121
## 5.4 when sqft is 1594 to 2317 & longitude < -121
## 5.6 when sqft is 1594 to 2317 & longitude >= -121
## 5.6 when sqft >= 2317 & longitude < -121
## 5.7 when sqft >= 2317 & longitude >= -121</code></pre>
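<p>A rule listing like the one above can be generated programmatically. As a sketch (assuming the <code>rpart.plot</code> package is available; this is one way to produce such a listing, not necessarily how the one above was made):</p>
<pre class="r"><code>library(rpart)
library(rpart.plot)
library(dplyr)

data(Sacramento, package = "modeldata")
Sacramento <-
  Sacramento %>%
  mutate(price = log10(price)) %>%
  select(-zip, -city)

cart_mod <- rpart(price ~ ., data = Sacramento)

# Flatten the tree into one rule per terminal node, with redundant
# conditions pruned away
rpart.rules(cart_mod)</code></pre>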
<p>What are some more modern models that can use or generate rules? We’ll walk through three (which are all included in the <code>rules</code> package).</p>
</div>
</div>
<div id="c5.0-rules" class="section level2">
<h2>C5.0 rules</h2>
<p>The C4.5 algorithm (Quinlan, 1993b) was an early tree-based model that was released not long after the better-known CART model. One cool aspect of this model is that it could generate a classification tree <em>or</em> a set of rules. These rules are derived from the original tree in much the same way as was shown above for <code>rpart</code>.</p>
<p>Over the years, the author (Ross Quinlan) kept evolving the next generation of the model called <a href="https://www.rulequest.com/see5-unix.html">C5.0</a>. About 10 years ago, he open-sourced that model and the C50 R package was born. Like its predecessor, C5.0 could be used for trees or for rules. There are a variety of advances to this model (detailed in Kuhn and Johnson (2013)), but the most significant was the inclusion of boosting. In effect, you could create an ensemble of classification rules. Rather than approaching the problem via the more modern stochastic gradient boosting paradigm, it is more similar to the classical AdaBoost methodology.</p>
<p>Since C5.0 is classification only, an example of a single rule set for the iris data is:</p>
<pre><code>Rule 1: (50, lift 2.9)
    Petal.Length <= 1.9
    -> class setosa [0.981]

Rule 2: (48/1, lift 2.9)
    Petal.Length > 1.9
    Petal.Length <= 4.9
    Petal.Width <= 1.7
    -> class versicolor [0.960]

Rule 3: (46/1, lift 2.9)
    Petal.Width > 1.7
    -> class virginica [0.958]

Rule 4: (46/2, lift 2.8)
    Petal.Length > 4.9
    -> class virginica [0.938]

Default class: setosa</code></pre>
<p>If used on the iris data, a boosting ensemble with 50 constituent models used a total of 276 rules and each iteration averaged 5.52 rules per rule set.</p>
<p>The primary method of controlling the complexity of each rule set is to adjust the minimum number of data points required to make additional splits within a node. The default for this parameter is two data points. If this value were increased to 20, the mean number of rules would drop from 5.52 to 3.2.</p>
<p>The other main tuning parameter is the number of boosting iterations. The C50 R package enables <a href="https://tune.tidymodels.org/articles/extras/optimizations.html">sub-model predictions</a> across boosting iterations, so tuning this parameter over many values is not very computationally expensive.</p>
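<p>As a sketch of how these options map onto code (assuming the <code>C50</code> package is installed; the exact rules depend on the package version):</p>
<pre class="r"><code>library(C50)

# `rules = TRUE` requests a rule-based model instead of a tree,
# `trials` is the number of boosting iterations, and `minCases`
# raises the minimum number of data points needed for a split.
c5_rules <- C5.0(
  Species ~ ., data = iris,
  rules = TRUE,
  trials = 50,
  control = C5.0Control(minCases = 20)
)

# Prints the rule sets for each boosting iteration
summary(c5_rules)</code></pre>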
</div>
<div id="cubist" class="section level2">
<h2>Cubist</h2>
<p>The aforementioned Ross Quinlan also developed <em>model trees</em> (Quinlan, 1992). These are regression tree-based models that contain <em>linear regression models</em> in the terminal nodes. This model was called M5 and, much like C5.0, there was a rule-based analog. Most tree-based models, especially ensembles of trees, tend to produce models that underfit in the tails (much like regression to the mean). Model trees do not suffer from this issue since their terminal regression models could make predictions across the whole range of the outcome data.</p>
<p>After an initial set of papers in the 1990s, Quinlan didn’t publish much on the methodology as he evolved it. The modern version of model rules was called <a href="https://www.rulequest.com/cubist-unix.html"><em>Cubist</em></a>. There were a number of small technical differences between Cubist and M5 rules (enumerated in Kuhn and Johnson (2013)) but the main improvements were:</p>
<ul>
<li><p>An ensemble method for predictions called <em>committees</em>.</p></li>
<li><p>A nearest-neighbor adjustment that occurs after the model predictions.</p></li>
</ul>
<p>We’ll summarize these approaches in turn.</p>
<p>Committee ensembles are similar to boosting. In boosting, a set of models is created sequentially, and the model for the current iteration uses case weights defined by the results of the previous models. In committees, case weights are not changed; instead, the outcome values are modified for each iteration. For example, if a sample was under-predicted previously, its outcome value is changed to be larger so that the model will be pulled upward in an effort to stop under-predicting.</p>
<p>In committees, the first model uses the original outcome value <span class="math inline">\(y\)</span>. On further iterations, a modified value <span class="math inline">\(y^*\)</span> is used instead. For iteration <span class="math inline">\(m\)</span>, the model uses this adjustment formula:</p>
<p><span class="math display">\[
y^*_{(m)} = y - (\widehat{y}_{(m-1)} - y)
\]</span></p>
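<p>The adjustment is easy to verify with a small numeric sketch (the values below are hypothetical):</p>
<pre class="r"><code># Observed (log10) price and a hypothetical first-iteration prediction
y       <- 5.3
y_hat_1 <- 5.1   # the first committee member under-predicts

# Pseudo-outcome used to fit the second committee member
y_star_2 <- y - (y_hat_1 - y)
y_star_2
## [1] 5.5</code></pre>
<p>Because the first model under-predicted by 0.2, the pseudo-outcome moves 0.2 <em>above</em> the observed value, pulling the next model upward.</p>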
<p>As a demonstration, the plot below shows how the pseudo-outcome changes over iterations. The observed price of the house in question is 5.3. On the first iteration, the model under-predicts (shown as the solid black dot). The vertical line represents the residual from this model. On the next iteration, the open blue circle shows the value of <span class="math inline">\(y^*_{(2)}\)</span>. On the second iteration, the prediction for this data point becomes <em>worse</em> but, as iterations proceed, the residuals generally become smaller.</p>
<p><img src="/post/2020-05-15-rule-based-models/index_files/figure-html/committee-plot-1.png" width="672" /></p>
<p>Ensemble predictions are made by averaging over the committee model predictions. This constitutes the <em>model-based</em> prediction made by Cubist.</p>
<p>After the model prediction, there is an option to conduct a post-model nearest-neighbor adjustment (Quinlan, 1993a). When predicting a new data point, the <em>K</em>-nearest neighbors are found in the training set (along with their original predictions). If the training set predictions for the neighbors are denoted as <span class="math inline">\(t\)</span>, the adjustment is:</p>
<p><span class="math display">\[
\widehat{y}_{adj} = \frac{1}{K}\sum_{\ell=1}^K w_\ell \left[t_\ell + \left(\widehat{y} - \widehat{t}_\ell \right)\right]
\]</span></p>
<p>where the weights <span class="math inline">\(w_\ell\)</span> are based on the inverse distances (so that far points contribute less to the adjustment). The adjustment is large when the model’s prediction for the new sample differs substantially from the original predictions of its neighbors, and it grows as that difference increases.</p>
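<p>A sketch of this adjustment in base R, using hypothetical values for the neighbors and a simple inverse-distance weighting (Cubist’s exact weighting scheme differs in its details):</p>
<pre class="r"><code># Hypothetical K = 3 neighbors of a new data point
t_obs <- c(5.2, 5.4, 5.1)   # neighbors' observed outcomes
t_hat <- c(5.3, 5.3, 5.2)   # neighbors' original model predictions
d     <- c(0.5, 1.0, 2.0)   # distances to the new data point
y_hat <- 5.35               # model prediction for the new data point

# Inverse-distance weights, scaled so the 1/K in the formula
# yields a weighted mean
w <- (1 / d) / sum(1 / d) * length(d)

y_adj <- mean(w * (t_obs + (y_hat - t_hat)))</code></pre>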
<p>Suppose our model used only square footage and longitude as predictors. The training data are shown below with a new prediction point represented as a large red square. The 6-closest neighbors in the training set are shown as red circles. The size of the circle represents the magnitude of the weight. This shows that the nearby points influence the adjustment more than distant points.</p>
<p><img src="/post/2020-05-15-rule-based-models/index_files/figure-html/nn-plot-1.png" width="672" /></p>
<p>This adjustment generally improves performance of the model. Interestingly, there is often a pattern when tuning where the 1-nearest neighbor model does much worse than using no adjustment but two or more neighbors do a much better job. One idea is that the use of a single neighbor is likely overfitting to the training set (as would occur with a more traditional K-NN model).</p>
<p>Using both of these techniques, Cubist tends to produce models that are <em>very</em> competitive in terms of performance. The two primary tuning parameters are the number of committee members and the number of nearest neighbors to use in the adjustment. The Cubist package can use the same fitted model to make predictions across different numbers of neighbors, so there is little computational cost when tuning this parameter.</p>
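<p>As a sketch with the Cubist package (assuming it and <code>modeldata</code> are installed); note that the neighbor adjustment is requested at prediction time:</p>
<pre class="r"><code>library(Cubist)

data(Sacramento, package = "modeldata")
Sacramento$price <- log10(Sacramento$price)
predictors <- subset(Sacramento, select = -c(price, zip, city))

# Fit an ensemble of 50 committee members
cb_mod <- cubist(x = predictors, y = Sacramento$price, committees = 50)

# The same fitted model serves any number of neighbors
pred_plain <- predict(cb_mod, predictors)                # no adjustment
pred_adj   <- predict(cb_mod, predictors, neighbors = 5) # 5-NN adjustment</code></pre>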
<p>For the Sacramento data, a single model (e.g. one committee) consisted of 6 rules, each with its own linear regression model. For example, the first four rules are:</p>
<pre><code>## Rule 1: [198 cases, mean 5.187264, range 4.681241 to 5.759668, est err 0.131632]
##
## if
## sqft <= 1712
## latitude > 38.46639
## latitude <= 38.61377
## longitude > -121.5035
## then
## outcome = -48.838393 + 0.000376 sqft + 1.39 latitude
##
## Rule 2: [254 cases, mean 5.220439, range 4.477121 to 5.923762, est err 0.105572]
##
## if
## sqft <= 1712
## latitude > 38.61377
## longitude > -121.5035
## longitude <= -121.0504
## then
## outcome = 93.155414 + 0.000431 sqft + 0.78 longitude + 0.16 latitude
##
## Rule 3: [90 cases, mean 5.273133, range 4.851258 to 5.580444, est err 0.078920]
##
## if
## sqft <= 1712
## latitude <= 38.46639
## then
## outcome = 15.750124 + 0.000344 sqft + 0.09 longitude - 0.005 beds
## + 0.005 baths
##
## Rule 4: [35 cases, mean 5.340909, range 5.018076 to 5.616476, est err 0.086056]
##
## if
## sqft <= 1712
## longitude <= -121.5035
## then
## outcome = 4.865655 + 0.000357 sqft</code></pre>
<p>New samples being predicted will fall into one or more rule conditions and the final prediction is the average of all of the corresponding linear model predictions.</p>
<p>If the model is run for 50 committees, a total of 271 rules were used across the committees, with an average of 5.42 rules per committee member.</p>
</div>
<div id="rulefit" class="section level2">
<h2>RuleFit</h2>
<p>RuleFit models (Friedman and Popescu, 2008) are fairly simple in concept: use a tree ensemble to create a large set of rules, use the rules as binary predictors, then fit a regularized model that only includes the most important rule features.</p>
<p>For example, if a boosted tree were used to generate rules, each path through each tree would generate a conditional statement that can be used to define a model predictor (as was shown above for <code>rpart</code>). If an <code>xgboost</code> model with 100 boosting iterations with a limit of three splits were used on the Sacramento data, an initial set of 609 rules were generated. Some examples:</p>
<pre><code>sqft >= 1594</code></pre>
<pre><code>longitude < -121.2 & latitude >= 38.73 & latitude >= 38.86</code></pre>
<pre><code>longitude >= -121.2 & longitude < -121 & baths < 2.75</code></pre>
<pre><code>longitude < -121.3 & sqft >= 1246 & type == "Multi_Family"</code></pre>
<p>Clearly, there tends to be a significant amount of redundancy and similarity among the rules.</p>
<p>These predictors are added to a regularized regression model (e.g. linear or logistic) that will conduct feature selection to remove unhelpful or redundant rules. Based on how much the model is penalized, the user can choose the number of rules that are contained in the final model. For example, depending on the penalty, the final rule set for the Sacramento data can be as large as hundreds or as small as a handful.</p>
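<p>Using the tidymodels interface from the <code>rules</code> package, a RuleFit model along these lines might be specified as follows (a sketch; it assumes the <code>xrf</code> engine that <code>rules</code> registers for <code>rule_fit()</code>):</p>
<pre class="r"><code>library(rules)    # provides the rule_fit() model specification
library(parsnip)
library(dplyr)

data(Sacramento, package = "modeldata")
Sacramento <-
  Sacramento %>%
  mutate(price = log10(price)) %>%
  select(-zip, -city)

# 100 boosting iterations, trees of depth 3, and a lasso penalty that
# controls how many rules survive feature selection
rulefit_spec <-
  rule_fit(trees = 100, tree_depth = 3, penalty = 0.005) %>%
  set_engine("xrf") %>%
  set_mode("regression")

rulefit_fit <- fit(rulefit_spec, price ~ ., data = Sacramento)</code></pre>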
<p><img src="/post/2020-05-15-rule-based-models/index_files/figure-html/num-rules-1.png" width="672" /></p>
<p>We can generate glmnet variable importance scores and then parameterize the importance in terms of the original predictors (instead of the rules). For example, with a penalty value of <span class="math inline">\(\lambda\)</span> = 0.005:</p>
<p><img src="/post/2020-05-15-rule-based-models/index_files/figure-html/xrf-preds-small-1.png" width="672" /></p>
<p>RuleFit has many tuning parameters; it inherits them from the boosting model as well as one from <code>glmnet</code> (the amount of lasso regularization). Fortunately, multiple predictions can be made across the lasso penalty using the same model object.</p>
<p>All-in-all, RuleFit is a neat and powerful method for using rules as features. It is interesting to contrast this model and Cubist:</p>
<ul>
<li><p>Cubist creates rules as data subsets, then estimates a linear regression model within each.</p></li>
<li><p>RuleFit creates rules as predictors then fits one (generalized) linear model.</p></li>
</ul>
</div>
<div id="acknowledgments" class="section level2">
<h2>Acknowledgments</h2>
<p><a href="https://www.rulequest.com/Personal/">Ross Quinlan</a> has been supportive of our efforts to publish the inner workings of C5.0 and Cubist. I’d like to thank him for his help and all of the excellent work he has done over the years.</p>
</div>
<div id="references" class="section level2">
<h2>References</h2>
<p>Friedman JH; Popescu BE (2008) “<a href="https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=%22Predictive+learning+via+rule+ensembles%22&btnG=">Predictive learning via rule ensembles</a>.” <em>Annals of Applied Statistics</em>, pp. 916-954.</p>
<p>Kuhn M; Johnson K (2013) <em><a href="https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=%22Applied+Predictive+Modeling%22+author%3Akuhn&btnG=">Applied Predictive Modeling</a></em>, Springer. New York.</p>
<p>Quinlan R (1979). “Discovering rules by induction from large collections of examples.” <em>Expert Systems in the Micro Electronics Age</em>.</p>
<p>Quinlan R (1992). “<a href="https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=%22Learning+with+continuous+classes%22&btnG=">Learning with continuous classes</a>.” <em>Proceedings of the 5th Australian Joint Conference On Artificial Intelligence</em>, pp. 343-348.</p>
<p>Quinlan R (1993a). “<a href="https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=%22Combining+instance-based+and+model-based+learning%22&btnG=">Combining instance-based and model-based learning</a>.” <em>Proceedings of the Tenth International Conference on Machine Learning</em>, pp. 236-243.</p>
<p>Quinlan R (1993b). <em><a href="https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=%22C4.5%3A+Programs+for+Machine+Learning%22&btnG=">C4.5: Programs for Machine Learning</a></em>. Morgan Kaufmann Publishers.</p>
</div>