R Views
https://rviews.rstudio.com/
Recent content on R Views. Thu, 26 Mar 2020 00:00:00 +0000

February 2020: "Top 40" New R Packages
https://rviews.rstudio.com/2020/03/26/february-2020-top-40-new-r-packages/
Thu, 26 Mar 2020 00:00:00 +0000
<p>One hundred sixty-four new packages made it to CRAN in February. Here are my “Top 40” picks in eleven categories: Computational Methods, Data, Genomics, Machine Learning, Mathematics, Medicine, Science, Statistics, Time Series, Utilities, and Visualizations.</p>
<h3 id="computational-methods">Computational Methods</h3>
<p><a href="https://cran.r-project.org/package=delayed">delayed</a> v0.3.0: Implements mechanisms to parallelize dependent tasks in a manner that optimizes the computational resources. Functions produce “delayed computations” which may be parallelized using futures. See the <a href="https://cran.r-project.org/web/packages/delayed/vignettes/delayed.html">vignette</a> for details.</p>
<p><img src="delayed.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=tergmLite">tergmLite</a> v2.1.7: Provides functions to efficiently simulate dynamic networks estimated with the framework for temporal exponential random graph models implemented in the <a href="https://cran.r-project.org/package=tergm"><code>tergm</code></a> package.</p>
<h3 id="data">Data</h3>
<p><a href="https://cran.r-project.org/package=crsmeta">crsmeta</a> v0.2.0: Provides functions to obtain coordinate system metadata from various data formats including: <a href="https://en.wikipedia.org/wiki/Spatial_reference_system">CRS</a> (Coordinate Reference System), <a href="http://www.epsg.org/">EPSG</a> (European Petroleum Survey Group), <a href="https://proj.org/">PROJ4</a> and <a href="http://docs.opengeospatial.org/is/12-063r5/12-063r5.html">WKT</a> (Well-Known Text 2).</p>
<p><a href="https://cran.r-project.org/package=danstat">danstat</a> v0.1.0: Implements an interface into the <a href="https://www.dst.dk/en/Statistik/statistikbanken/">Statistics Denmark Databank</a> API. The vignette provides an <a href="https://cran.r-project.org/web/packages/danstat/vignettes/Introduction_to_danstat.html">Introduction</a>.</p>
<p><a href="https://cran.r-project.org/package=osfr">osfr</a> v0.2.8: Implements an interface for interacting with <a href="https://osf.io">OSF</a> which enables users to access open research materials and data, or to create and manage private or public projects. There is a <a href="https://cran.r-project.org/web/packages/osfr/vignettes/getting_started.html">Getting Started Guide</a> and a vignette on <a href="https://cran.r-project.org/web/packages/osfr/vignettes/auth.html">Authentication</a>.</p>
<h3 id="genomics">Genomics</h3>
<p><a href="https://cran.r-project.org/package=selectSNPs">selectSNPs</a> v1.0.1: Provides a method using unified local functions to select low-density SNPs. See the <a href="https://cran.r-project.org/web/packages/selectSNPs/vignettes/Tuitorial_selectSNPs.html">Vignette</a> for a tutorial.</p>
<p><img src="selectSNPS.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=varitas">varitas</a> v0.0.1: Implements a multi-caller variant analysis pipeline for targeted analysis sequencing data. There is an <a href="https://cran.r-project.org/web/packages/varitas/vignettes/introduction.html">Introduction</a> and a vignette on <a href="https://cran.r-project.org/web/packages/varitas/vignettes/errors.html">Errors</a>.</p>
<h3 id="machine-learning">Machine Learning</h3>
<p><a href="https://cran.r-project.org/package=autokeras">autokeras</a> v1.0.1: Implements an interface to <a href="https://autokeras.com/">AutoKeras</a>, an open source software library for automated machine learning. See <a href="https://cran.r-project.org/web/packages/autokeras/readme/README.html">README</a> for an example.</p>
<p><a href="https://cran.r-project.org/package=MTPS">MTPS</a> v0.1.9: Implements functions to predict simultaneous multiple outcomes based on revised stacking algorithms as described in <a href="doi:10.1093/bioinformatics/btz531">Xing et al. (2019)</a>. See the <a href="https://cran.r-project.org/web/packages/MTPS/vignettes/Guide.html">vignette</a> to get started.</p>
<p><a href="https://cran.r-project.org/package=quanteda.textmodels">quanteda.textmodels</a> v0.9.1: Implements methods for scaling models and classifiers based on sparse matrix objects representing textual data. It includes implementations of the <a href="doi:10.1017/S0003055403000698">Laver et al. (2003)</a> wordscores model, the <a href="arXiv:1710.08963">Perry & Benoit (2017)</a> class affinity scaling model, and the <a href="doi:10.1111/j.1540-5907.2008.00338.x">Slapin & Proksch (2008)</a> wordfish model. See the <a href="https://cran.r-project.org/web/packages/quanteda.textmodels/vignettes/textmodel_performance.html">vignette</a> to get started.</p>
<p><a href="https://cran.r-project.org/package=SeqDetect">SeqDetect</a> v1.0.7: Implements the automaton model found in <a href="https://ieeexplore.ieee.org/document/8910574">Krleža, Vrdoljak & Brčić (2019)</a> to detect and process sequences. See the <a href="https://cran.r-project.org/web/packages/SeqDetect/vignettes/SequentialDetector.pdf">vignette</a> for examples and theory.</p>
<p><img src="SeqDetect.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=studyStrap">studyStrap</a> v1.0.0: Implements multi-study learning algorithms such as Merging, Study-Specific Ensembling (Trained-on-Observed-Studies Ensemble), the Study Strap, and the Covariate-Matched Study Strap, and offers over 20 similarity measures. See <a href="doi:10.1101/856385">Kishida et al. (2019)</a> for background and the <a href="https://cran.r-project.org/web/packages/studyStrap/vignettes/vignette.html">vignette</a> for how to use the package.</p>
<h3 id="mathematics">Mathematics</h3>
<p><a href="https://cran.r-project.org/package=PlaneGeometry">PlaneGeometry</a> v1.1.0: Provides R6 classes representing triangles, circles, circular arcs, ellipses, elliptical arcs and lines, plot methods, transformations and more. The <a href="https://cran.r-project.org/web/packages/PlaneGeometry/vignettes/examples.html">vignette</a> offers multiple examples.</p>
<p><img src="PlaneGeometry.gif" height = "200" width="400"></p>
<h3 id="medicine">Medicine</h3>
<p><a href="https://cran.r-project.org/package=beats">beats</a> v0.1.1: Provides functions to import data from UFI devices and process electrocardiogram (ECG) data. It also includes a Shiny app for finding and exporting heart beats. See <a href="https://cran.r-project.org/web/packages/beats/readme/README.html">README</a> to get started.</p>
<p><img src="beats.gif" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=NMADiagT">NMADiagT</a> v0.1.2: Implements the hierarchical summary receiver operating characteristic model developed by <a href="doi:10.1093/biostatistics/kxx025">Ma et al. (2018)</a> and the hierarchical model developed by <a href="doi:10.1080/01621459.2018.1476239">Lian et al. (2019)</a> for performing meta-analysis. It is able to simultaneously compare one to five diagnostic tests within a missing data framework.</p>
<p><a href="https://cran.r-project.org/package=SAMBA">SAMBA</a> v0.9.0: Implements several methods, as proposed in <a href="doi:10.1101/2019.12.26.19015859">Beesley & Mukherjee (2020)</a>, for obtaining bias-corrected point estimates along with valid standard errors using electronic health records data with misclassified EHR-derived disease status. See the <a href="https://cran.r-project.org/web/packages/SAMBA/vignettes/UsingSAMBA.html">vignette</a> for details.</p>
<p><img src="SAMBA.png" height = "200" width="400"></p>
<h3 id="science">Science</h3>
<p><a href="https://cran.r-project.org/package=baRulho">baRulho</a> v1.0.1: Provides functions to facilitate acoustic analysis of (animal) sound transmission experiments, including functions for data preparation, analysis and visualization. See <a href="doi:10.1121/1.406682">Dabelsteen et al. (1993)</a> for background and the <a href="https://cran.r-project.org/web/packages/baRulho/vignettes/baRulho_quantifying_sound_degradation.html">vignette</a> for an introduction.</p>
<p><img src="baRulho.gif" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=CBSr">CBSr</a> v1.0.3: Uses monotonically constrained <a href="https://pomax.github.io/bezierinfo/">Cubic Bezier Splines</a> to approximate latent utility functions in intertemporal choice and risky choice data. See <a href="doi:10.31234/osf.io/2ugwr">Lee et al. (2019)</a> for details.</p>
<p><img src="CBSr.png" height = "400" width="600"></p>
<h3 id="statistics">Statistics</h3>
<p><a href="https://cran.r-project.org/package=blockCV">blockCV</a> v2.1.1: Provides functions for creating spatially or environmentally separated folds for cross-validation in spatially structured environments and methods for visualizing the effective range of spatial autocorrelation to separate training and testing datasets as described in <a href="doi:10.1111/2041-210X.13107">Valavi, R. et al. (2019)</a>. See the <a href="https://cran.r-project.org/web/packages/blockCV/vignettes/BlockCV_for_SDM.html">vignette</a> for examples.</p>
<p><img src="blockCV.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=BGGM">BGGM</a> v1.0.0: Implements the methods for fitting Bayesian Gaussian graphical models recently introduced in <a href="doi:10.31234/osf.io/x8dpr">Williams (2019)</a>, <a href="doi:10.31234/osf.io/ypxd8">Williams & Mulder (2019)</a> and <a href="doi:10.31234/osf.io/yt386">Williams et al. (2019)</a>. There are vignettes on <a href="https://cran.r-project.org/web/packages/BGGM/vignettes/credible_intervals.html">Credible Intervals</a>, <a href="https://cran.r-project.org/web/packages/BGGM/vignettes/network_plot_1.html">Plotting Network Structure</a>, <a href="https://cran.r-project.org/web/packages/BGGM/vignettes/ppc1.html">Comparing GGMs with the Posterior Predictive Distributions</a>, and <a href="https://cran.r-project.org/web/packages/BGGM/vignettes/predict.html">Predictability</a>.</p>
<p><img src="BGGM.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=metagam">metagam</a> v0.1.0: Provides a method to perform the meta-analysis of generalized additive models and generalized additive mixed models, including functionality for removing individual participant data from models computed using the <code>mgcv</code> and <code>gamm4</code> packages. A typical use case is a situation where data cannot be shared across locations, and an overall meta-analytic fit is sought. For details see <a href="arXiv:2002.02627">Sorensen et al. (2020)</a>, <a href="http://www.jstor.org/stable/3703820">Zanobetti (2000)</a>, and <a href="doi:10.6000/1929-6029.2018.07.02.1">Crippa et al. (2018)</a>. There is an <a href="https://cran.r-project.org/web/packages/metagam/vignettes/introduction.html">Introduction</a> and vignettes on <a href="https://cran.r-project.org/web/packages/metagam/vignettes/dominance.html">Dominance</a>, <a href="https://cran.r-project.org/web/packages/metagam/vignettes/heterogeneity.html">Heterogeneity Plots</a>, and <a href="https://cran.r-project.org/web/packages/metagam/vignettes/multivariate.html">Multivariate Smooth Terms</a>.</p>
<p><img src="metagam.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=MKpower">MKpower</a> v0.4: Provides functions for power analysis and sample size calculations for Welch and Hsu t-tests, Wilcoxon rank sum tests and diagnostic tests. See <a href="doi:10.1016/j.jclinepi.2004.12.009">Flahault et al. (2005)</a> and <a href="doi:10.1093/biostatistics/kxj036">Dobbin & Simon (2007)</a> for background, and the <a href="https://cran.r-project.org/web/packages/MKpower/vignettes/MKpower.html">vignette</a> for examples.</p>
<p><img src="MKpower.png" height = "200" width="400"></p>
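<p>For orientation, base R’s <code>power.t.test()</code> performs a closely related calculation. The snippet below is a generic base-R illustration, not the <code>MKpower</code> API: it finds the per-group sample size for a two-sample t-test detecting a standardized effect of 0.5 with 80% power.</p>

```r
# Base-R power calculation (not the MKpower API): per-group sample size
# for a two-sample t-test, effect size d = 0.5, 80% power, alpha = 0.05.
res <- power.t.test(delta = 0.5, sd = 1, power = 0.8, sig.level = 0.05)
ceiling(res$n)  # n per group: 64
```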
<p><a href="https://cran.r-project.org/package=mvrsquared">mvrsquared</a> v0.0.3: Implements a method to compute the coefficient of determination for outcomes in n-dimensions. See <a href="arXiv:1911.11061">Jones (2019)</a> for the theory and the <a href="https://cran.r-project.org/web/packages/mvrsquared/vignettes/getting_started_with_mvrsquared.html">vignette</a> to get started.</p>
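<p>The underlying idea can be sketched in base R by pooling squared errors across all outcome dimensions. The helper below is a generic definition for illustration, not the <code>mvrsquared</code> API.</p>

```r
# Coefficient of determination pooled over an n x d outcome matrix:
# R^2 = 1 - SSE/SST, summing over all dimensions (illustrative sketch,
# not the mvrsquared package API).
mv_rsq <- function(Y, Yhat) {
  Ybar <- matrix(colMeans(Y), nrow(Y), ncol(Y), byrow = TRUE)
  1 - sum((Y - Yhat)^2) / sum((Y - Ybar)^2)
}
Y <- cbind(1:5, c(2, 4, 6, 8, 10))
mv_rsq(Y, Y)  # perfect predictions give R^2 = 1
```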
<p><a href="https://cran.r-project.org/package=pdynmc">pdynmc</a> v0.8.0: Provides functions to model linear dynamic panel data based on linear and nonlinear moment conditions as proposed by <a href="doi:10.2307/1913103">Holtz-Eakin et al.(1988)</a>, <a href="doi:10.1016/0304-4076(94)01641-C">Ahn & Schmidt (1995)</a>, and <a href="doi:10.1016/0304-4076(94)01642-D">Arellano & Bover (1995)</a>. See the <a href="https://cran.r-project.org/web/packages/pdynmc/vignettes/pdynmc.pdf">vignette</a> for the underlying theory and a sample session.</p>
<p><a href="https://cran.r-project.org/package=Superpower">Superpower</a> v0.0.3: Provides functions to simulate ANOVA designs of up to three factors, calculate the observed power and average observed effect size for all main effects and interactions. See <a href="doi:10.31234/osf.io/baxsf">Lakens & Caldwell (2019)</a> for background, and the <a href="https://cran.r-project.org/web/packages/Superpower/vignettes/intro_to_superpower.html">vignette</a> for an introduction.</p>
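<p>The general simulation approach behind such packages can be sketched in base R: draw many datasets under the assumed effect, test each, and count how often <em>p</em> &lt; 0.05. This is a hand-rolled illustration; <code>Superpower</code>’s own functions and arguments differ.</p>

```r
# Simulation-based power estimate for a two-group comparison
# (generic sketch; Superpower's own API differs).
set.seed(123)
pvals <- replicate(2000, t.test(rnorm(30), rnorm(30, mean = 0.5))$p.value)
mean(pvals < 0.05)  # estimated power; roughly 0.47 for d = 0.5, n = 30 per group
```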
<p><a href="https://cran.r-project.org/package=tune">tune</a> v0.0.1: Provides functions and classes for use in conjunction with other <code>tidymodels</code> packages for finding reasonable values of hyper-parameters in models, pre-processing methods, and post-processing steps. Look <a href="https://tidymodels.github.io/tune/articles/getting_started.html">here</a> for an example.</p>
<p><img src="tune.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=xrnet">xrnet</a> v0.1.7: Provides functions to fit hierarchical regularized regression models incorporating potentially informative external data as in <a href="doi:10.21105/joss.01761">Weaver & Lewinger (2019)</a>. See <a href="https://cran.r-project.org/web/packages/xrnet/readme/README.html">README</a> for examples.</p>
<p><img src="xrnet.png" height = "200" width="400"></p>
<h3 id="time-series">Time Series</h3>
<p><a href="https://cran.r-project.org/package=seer">seer</a> v1.4.1: Implements a framework for selecting time series forecast models based on features calculated from the time series. For details see <a href="https://www.monash.edu/business/econometrics-and-business-statistics/research/publications/ebs/wp06-2018.pdf">Talagala et al. (2018)</a>.</p>
<p><img src="seer.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=testcorr">testcorr</a> v0.1.2: Provides functions for computing test statistics for the significance of autocorrelation in univariate time series, cross-correlation in bivariate time series, Pearson correlations in multivariate series and test statistics for i.i.d. property of univariate series as described in <a href="https://cowles.yale.edu/sites/default/files/files/pub/d21/d2194.pdf">Dalla et al. (2019)</a>. See the <a href="https://cran.r-project.org/web/packages/testcorr/vignettes/testcorr.pdf">vignette</a> for the math and examples.</p>
<p><img src="testcorr.png" height = "200" width="400"></p>
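<p>For comparison, base R already ships a classic portmanteau test of serial correlation, the Ljung-Box test; <code>testcorr</code> implements its own, more robust statistics, but the basic workflow looks similar.</p>

```r
# Ljung-Box test of autocorrelation on a simulated AR(1) series
# (base-R orientation example; testcorr provides its own statistics).
set.seed(42)
x <- arima.sim(model = list(ar = 0.6), n = 200)  # strongly autocorrelated
Box.test(x, lag = 10, type = "Ljung-Box")        # tiny p-value: reject i.i.d.
```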
<h3 id="utility">Utility</h3>
<p><a href="https://cran.r-project.org/package=bioC.logs">bioC.logs</a> v1.1: Fetches download statistics from BioConductor.org. See the <a href="https://cran.r-project.org/web/packages/bioC.logs/vignettes/bioC.logs.html">vignette</a>.</p>
<p><a href="https://cran.r-project.org/package=matricks">matricks</a> v0.8.2: Provides functions to help with the creation of complex matrices, along with a plotting function. See the <a href="https://cran.r-project.org/web/packages/matricks/vignettes/policy_evaluation.html">vignette</a> for examples.</p>
<p><img src="matricks.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=rco">rco</a> v1.0.1: Provides functions to automatically apply different strategies to optimize R code. These functions take R code as input and return R code as output. There are vignettes on: <a href="https://cran.r-project.org/web/packages/rco/vignettes/contributing-an-optimizer.html">Contributing an optimizer</a>, <a href="https://cran.r-project.org/web/packages/rco/vignettes/docker-readme.html">Docker files</a>, <a href="https://cran.r-project.org/web/packages/rco/vignettes/opt-common-subexpr.html">Common Subexpression Elimination</a>, <a href="https://cran.r-project.org/web/packages/rco/vignettes/opt-constant-folding.html">Constant Folding</a>, <a href="https://cran.r-project.org/web/packages/rco/vignettes/opt-constant-propagation.html">Constant Propagation</a>, <a href="https://cran.r-project.org/web/packages/rco/vignettes/opt-dead-code.html">Dead Code Elimination</a>, <a href="https://cran.r-project.org/web/packages/rco/vignettes/opt-dead-expr.html">Dead Expression Elimination</a>, <a href="https://cran.r-project.org/web/packages/rco/vignettes/opt-dead-store.html">Dead Store Elimination</a>, and <a href="https://cran.r-project.org/web/packages/rco/vignettes/opt-loop-invariant.html">Loop-invariant Code Motion</a>.</p>
<p><a href="https://cran.r-project.org/package=slider">slider</a> v0.1.2: Provides type-stable rolling window functions over any R data type and supports both cumulative and expanding windows. See the <a href="https://cran.r-project.org/web/packages/slider/vignettes/rowwise.html">vignette</a> for examples.</p>
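<p>The trailing-window idea can be sketched in plain base R. The <code>roll_mean()</code> helper below is mine, written only to illustrate the computation; the <code>slider</code> call mentioned in the comment reflects my reading of its API.</p>

```r
# Hand-rolled trailing rolling mean over a window of up to 3 observations
# (roughly what slider::slide_dbl(x, mean, .before = 2) computes, to my
# understanding; incomplete windows at the start are allowed).
roll_mean <- function(x, before = 2) {
  vapply(seq_along(x),
         function(i) mean(x[max(1, i - before):i]),
         numeric(1))
}
roll_mean(1:5)  # 1.0 1.5 2.0 3.0 4.0
```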
<p><a href="https://cran.r-project.org/package=taxadb">taxadb</a> v0.1.0: Provides fast, consistent access to taxonomic data, and supports common tasks such as resolving taxonomic names to identifiers and looking up higher classification ranks of given species. There is an <a href="https://cran.r-project.org/web/packages/taxadb/vignettes/backends.html">Introduction</a> and a <a href="https://cran.r-project.org/web/packages/taxadb/vignettes/data-sources.html">Schema</a>.</p>
<p><a href="https://cran.r-project.org/package=tidyfst">tidyfst</a> v0.8.8: Provides a toolkit of tidy data manipulation verbs with <code>data.table</code> as the backend, combining the syntax elegance of <code>dplyr</code> with the computing performance of <code>data.table</code>. There is a <a href="https://cran.r-project.org/web/packages/tidyfst/vignettes/chinese_tutorial.html">vignette</a> written in Chinese, an English-language <a href="https://cran.r-project.org/web/packages/tidyfst/vignettes/example1_intro.html">Introduction</a> and vignettes on <a href="https://cran.r-project.org/web/packages/tidyfst/vignettes/example2_join.html">join</a>, <a href="https://cran.r-project.org/web/packages/tidyfst/vignettes/example3_reshape.html">reshape</a>, <a href="https://cran.r-project.org/web/packages/tidyfst/vignettes/example4_nest.html">nest</a>, <a href="https://cran.r-project.org/web/packages/tidyfst/vignettes/example5_fst.html">fst</a> and <a href="https://cran.r-project.org/web/packages/tidyfst/vignettes/example6_dt.html">dt</a>.</p>
<p><a href="https://cran.r-project.org/package=tidytable">tidytable</a> v0.3.2: Provides an <code>rlang</code> compatible interface to <code>data.table</code>. See <a href="https://cran.r-project.org/web/packages/tidytable/readme/README.html">README</a> for examples.</p>
<h3 id="visualization">Visualization</h3>
<p><a href="https://cran.r-project.org/package=iNZightTools">iNZightTools</a> v1.8.3: Provides wrapper functions for common variable and dataset manipulation workflows primarily used by <a href="https://www.stat.auckland.ac.nz/~wild/iNZight/">iNZight</a>, a graphical user interface providing easy exploration and visualization of data for students. Many functions return the <code>tidyverse</code> code used to obtain the result in an effort to bridge the gap between GUI and coding.</p>
<p><img src="iNZightTools.gif" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=IPV">IPV</a> v0.1.1: Provides functions to generate item pool visualizations, which are used to display the conceptual structure of a set of items. See <a href="doi:10.1177/2059799119884283">Dantlgraber et al. (2019)</a> for background and the <a href="https://cran.r-project.org/web/packages/IPV/vignettes/ipv-vignette.html">vignette</a> for examples.</p>
<p><img src="IPV.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=spacey">spacey</a> v0.1.1: Provides utilities to download <a href="https://www.usgs.gov/">USGS</a> and <a href="https://www.esri.com/">ESRI</a> geospatial data and produce high-quality <a href="https://www.rayshader.com/">rayshader</a> maps for locations in the United States. There is an <a href="https://cran.r-project.org/web/packages/spacey/vignettes/introduction-to-spacey.html">Introduction</a>.</p>
<p><img src="spacey.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=Tendril">Tendril</a> v2.0.4: Provides functions to compute and display tendril plots. See the <a href="https://cran.r-project.org/web/packages/Tendril/vignettes/TendrilUsage.html">vignette</a> for an introduction.</p>
<p><img src="Tendril.png" height = "200" width="400"></p>
<p><a href="https://cran.r-project.org/package=tidyHeatmap">tidyHeatmap</a> v0.99.9: Provides an implementation of the Bioconductor <a href="https://bioconductor.org/packages/release/bioc/html/ComplexHeatmap.html">ComplexHeatmap</a> package based on tidy data frames. See the <a href="https://cran.r-project.org/web/packages/tidyHeatmap/vignettes/introduction.html">vignette</a>.</p>
<p><img src="tidyHeatMap.png" height = "200" width="400"></p>
Comparing Machine Learning Algorithms for Predicting Clothing Classes: Part 4
https://rviews.rstudio.com/2020/03/24/comparing-machine-learning-algorithms-for-predicting-clothing-classes-part-4/
Tue, 24 Mar 2020 00:00:00 +0000
<p><em>Florianne Verkroost is a Ph.D. candidate at Nuffield College at the University of Oxford. She has a passion for data science and a background in mathematics and econometrics. She applies her interdisciplinary knowledge to computationally address societal problems of inequality.</em></p>
<p>This is the fourth and final post in a series devoted to comparing different machine learning methods for predicting clothing categories from images using the Fashion MNIST data by Zalando. In the <a href="https://rviews.rstudio.com/2019/11/11/a-comparison-of-methods-for-predicting-clothing-classes-using-the-fashion-mnist-dataset-in-rstudio-and-python-part-1/">first post</a>, we prepared the data for analysis and built a Python deep learning neural network model to predict the clothing categories of the Fashion MNIST data. In <a href="https://rviews.rstudio.com/2020/03/03/predicting-clothing-classes-part-2/">Part 2</a>, we used principal components analysis (PCA) to compress the clothing image data down from 784 to just 17 pixels. In <a href="https://rviews.rstudio.com/2020/03/10/comparing-machine-learning-algorithms-for-predicting-clothing-classes-part-3/">Part 3</a> we saw that gradient-boosted trees and random forests achieve relatively high accuracy on dimensionality-reduced data, although not as high as the neural network. In this post, we will fit a support vector machine, compare the findings from all models we have built and discuss the results. The R code for this post can be found on my <a href="https://github.com/fverkroost/RStudio-Blogs/blob/master/machine_learning_fashion_mnist_post234.R">Github</a> repository.</p>
<div id="support-vector-machine" class="section level1">
<h1>Support Vector Machine</h1>
<p>Support vector machines (SVMs) provide another method for classifying the clothing categories in the Fashion MNIST data. To understand what SVMs entail, we’ll have to work through some more involved explanations, mainly summarizing <a href="http://faculty.marshall.usc.edu/gareth-james/ISL/">James et al. (2013)</a>, so please bear with me! The figure below may help you keep track of the different classifiers discussed in the next sections (figures taken from <a href="https://slideplayer.com/slide/3266197/">here</a>, <a href="https://www.datasciencecentral.com/profiles/blogs/implementing-a-soft-margin-kernelized-support-vector-machine">here</a> and <a href="https://www.exlservice.com/optimizing-healthcare-analytics-by-choosing-the-right-predictive-model">here</a>).</p>
<p><img src="classifiers.png" /></p>
<p>For an <span class="math inline">\(n \times p\)</span> data matrix and binary outcome variable <span class="math inline">\(y_i \in \{-1, 1\}\)</span>, a hyperplane is a flat affine subspace of dimension <span class="math inline">\(p - 1\)</span> that divides the <span class="math inline">\(p\)</span>-dimensional space into two halves; it is defined by <span class="math inline">\(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p = 0\)</span>. A test observation is assigned an outcome class depending on which side of the perfectly separating hyperplane it lies on, assuming such a hyperplane exists. A cutoff <span class="math inline">\(t\)</span> for an observation’s score <span class="math inline">\(\hat{f}(X) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \dots + \hat{\beta}_p X_p\)</span> determines which class it is assigned to. The further an observation lies from the hyperplane (where the score is zero), the more confident the classifier is about the class assignment. If one separating hyperplane exists, infinitely many can be constructed. A good option in this case is the maximal margin classifier (MMC), which maximizes the margin around the midline of the widest strip that can be inserted between the two outcome classes.</p>
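<p>The side-of-the-hyperplane rule translates directly into code: compute each observation’s score and take its sign. The coefficients below are illustrative made-up values, not fitted to any data.</p>

```r
# Classify observations by which side of a hyperplane they fall on:
# the predicted class is sign(beta0 + X %*% beta), the sign of the score.
# (Illustrative coefficients, not fitted to any data.)
beta0 <- -1
beta  <- c(2, -1)
X     <- rbind(c(1, 0),   # score: -1 + 2*1 - 0 =  1 -> class +1
               c(0, 1))   # score: -1 + 0  - 1 = -2 -> class -1
as.vector(sign(beta0 + X %*% beta))  # 1 -1
```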
<p>If a perfectly separating hyperplane does not exist, an “almost separating” hyperplane can be used by means of the support vector classifier (SVC). The SVC extends the MMC: it does not require the classes to be perfectly separable by a linear boundary, because slack variables <span class="math inline">\(\epsilon_i\)</span> allow some observations to fall on the incorrect side of the margin or hyperplane. The total amount of slack permitted is bounded by the cost tuning parameter <span class="math inline">\(C \geq \sum_{i=1}^{n} \epsilon_i\)</span>, which thereby controls the bias-variance trade-off. The SVC is preferable to the MMC because its wider margins make it more confident in class assignments, and it is more robust because only observations on or violating the margin affect the hyperplane (James et al., 2013).</p>
<p>Both MMCs and SVCs assume a linear boundary between the two classes of the outcome variable. Non-linearity can be addressed by enlarging the feature space using functions of predictors. Support vector machines combine SVCs with non-linear (e.g. radial, polynomial or sigmoid) Kernels <span class="math inline">\(K(x_i, x_{i'})\)</span> to achieve efficient computations. Kernels are generalizations of inner products that quantify the similarity of two observations (James et al., 2013). Usually, the radial Kernel is selected for non-linear models as it provides a good default Kernel in the absence of prior knowledge of invariances regarding translations. The radial Kernel is defined as <span class="math inline">\(K(x_i, x_{i'})= \exp{(-\sigma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2)}\)</span>, where <span class="math inline">\(\sigma\)</span> is a positive constant that makes the fit more non-linear as it increases. Tuning <span class="math inline">\(C\)</span> and <span class="math inline">\(\sigma\)</span> is necessary to find the optimal trade-off between reducing the number of training errors and making the decision boundary more irregular (by increasing C). As SVMs only require the computation of <span class="math inline">\(\bigl(\begin{smallmatrix} n\\ 2 \end{smallmatrix}\bigr)\)</span> Kernels for all distinct observation pairs, they greatly improve efficiency.</p>
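<p>The radial kernel formula above translates directly into R, which makes its behavior easy to see: identical observations score 1, and the similarity decays toward 0 as the squared distance grows.</p>

```r
# The radial kernel from the text: K(x, x') = exp(-sigma * sum((x - x')^2)).
rbf_kernel <- function(x, xp, sigma = 1) exp(-sigma * sum((x - xp)^2))
rbf_kernel(c(0, 0), c(0, 0))  # identical points  -> 1
rbf_kernel(c(0, 0), c(3, 4))  # distant points    -> nearly 0
```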
<p>As mentioned above, the parameters that need to be tuned are cost <code>C</code> and, in the case of a radial Kernel, non-linearity constant <code>sigma</code>. Let’s start by tuning these parameters using a random search algorithm, again making use of the <code>caret</code> framework. We set the controls to perform 5-fold cross-validation and we use the <code>multiClassSummary()</code> function from the <code>MLmetrics</code> library to perform multi-class classification. We specify a radial Kernel, use accuracy as the performance metric<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> and let the algorithm perform a random search for the cost parameter <code>C</code> over <code>pca.dims</code> (= 17) random values. Note that the random search algorithm only searches for values of <code>C</code> while keeping a constant value for <code>sigma</code>. Also, contrary to previous calls to <code>trainControl()</code>, we now set <code>classProbs = FALSE</code> because the base package used for estimating SVMs in <code>caret</code>, <code>kernlab</code>, leads to lower accuracies when specifying <code>classProbs = TRUE</code> due to using a secondary regression model (also check <a href="https://github.com/topepo/caret/issues/386">this link for the Github issue</a>).</p>
<p>We begin with training the support vector machine using the PCA reduced training and test data sets <code>train.images.pca</code> and <code>test.images.pca</code> constructed in <a href="https://rviews.rstudio.com/2020/03/03/predicting-clothing-classes-part-2/">Part 2</a>.</p>
<pre class="r"><code>library(MLmetrics)
svm_control = trainControl(method = "repeatedcv",
number = 5,
repeats = 5,
classProbs = FALSE,
allowParallel = TRUE,
summaryFunction = multiClassSummary,
savePredictions = TRUE)</code></pre>
<pre class="r"><code>set.seed(1234)
svm_rand_radial = train(label ~ .,
data = cbind(train.images.pca, label = train.classes),
method = "svmRadial",
trControl = svm_control,
tuneLength = pca.dims,
metric = "Accuracy")
svm_rand_radial$results[, c("sigma", "C", 'Accuracy')]</code></pre>
<p><img src="svm_rand_radial_print_accuracy.png" /></p>
<p>We can check the model performance on both the training and test sets by means of different metrics using a custom function, <code>model_performance</code>, which can be found in <a href="https://github.com/fverkroost/RStudio-Blogs/blob/master/machine_learning_fashion_mnist_post234.R">this code on my Github</a>.</p>
<pre class="r"><code>mp.svm.rand.radial = model_performance(svm_rand_radial, train.images.pca, test.images.pca,
train.classes, test.classes, "svm_random_radial")</code></pre>
<p><img src="svm_rand_radial_mp.png" height = "1000" width="1000"></p>
<p>The results show that the model is achieving relatively high accuracies of 88% and 87% on the training and test sets respectively, selecting <code>sigma = 0.040</code> and <code>C = 32</code> as the optimal parameters. Let’s have a look at which clothing categories are best and worst predicted by visualizing the confusion matrix. First, let’s compute the predictions for the training data. We need to use the out-of-bag predictions contained in the model object (<code>svm_rand_radial$pred</code>) rather than the manually computed in-sample (non-out-of-bag) predictions for the training data computed using the <code>predict()</code> function. Object <code>svm_rand_radial$pred</code> contains the predictions for all tuning parameter values specified by the user. However, we only need those predictions belonging to the optimal tuning parameter values. Therefore, we subset <code>svm_rand_radial$pred</code> to only contain those predictions and observations in indices <code>rows</code>. Note that we convert <code>svm_rand_radial$pred</code> to a <code>data.table</code> object to find these indices as computations on <code>data.table</code> objects are much faster for large data (e.g. <code>svm_rand_radial$pred</code> has 4.5 million rows).</p>
<pre class="r"><code>library(data.table)
pred_dt = as.data.table(svm_rand_radial$pred[, names(svm_rand_radial$bestTune)])
names(pred_dt) = names(svm_rand_radial$bestTune)
index_list = lapply(1:ncol(svm_rand_radial$bestTune), function(x, DT, tune_opt){
return(which(DT[, Reduce(`&`, lapply(.SD, `==`, tune_opt[, x])), .SDcols = names(tune_opt)[x]]))
}, pred_dt, svm_rand_radial$bestTune)
rows = Reduce(intersect, index_list)
pred_train = svm_rand_radial$pred$pred[rows]
trainY = svm_rand_radial$pred$obs[rows]
conf = table(pred_train, trainY)</code></pre>
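<p>For the two radial-kernel tuning parameters specifically, the same row indices can be obtained more directly. This is a sketch, assuming (as is the case for <code>method = "svmRadial"</code>) that <code>svm_rand_radial$pred</code> contains <code>sigma</code> and <code>C</code> columns:</p>
<pre class="r"><code>rows_alt = with(svm_rand_radial$pred,
                which(sigma == svm_rand_radial$bestTune$sigma &
                      C == svm_rand_radial$bestTune$C))
identical(rows_alt, rows)  # should be TRUE</code></pre>
<p>The <code>data.table</code> approach above generalizes to any number of tuning parameters, which is why it is used here.</p>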
<p>Next, we reshape the confusion matrix into a data frame with three columns: one for the true categories (<code>trainY</code>), one for the predicted categories (<code>pred_train</code>), and one for the proportion of correct predictions for the true category (<code>Freq</code>). We plot this as a tile plot with a blue color scale where lighter values indicate larger proportions of matches between a particular combination of true and predicted categories, and darker values indicate a small proportion of matches between them. Note that we use the custom plotting theme <code>my_theme()</code> as defined in the <a href="https://rviews.rstudio.com/2019/11/11/a-comparison-of-methods-for-predicting-clothing-classes-using-the-fashion-mnist-dataset-in-rstudio-and-python-part-2/">second blog post of this series</a>.</p>
<pre class="r"><code>conf = data.frame(conf / rowSums(conf))
ggplot() +
geom_tile(data = conf, aes(x = trainY, y = pred_train, fill = Freq)) +
labs(x = "Actual", y = "Predicted", fill = "Proportion") +
my_theme() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_fill_continuous(breaks = seq(0, 1, 0.25)) +
coord_fixed()</code></pre>
<p><img src="svm_radial_conf_plot.png" /></p>
<p>We observe from this plot that most of the classes are predicted accurately, as the light blue tiles (high proportions of correct predictions) lie on the diagonal of the tile plot. We can also observe that the categories most often mixed up include shirts, tops, pullovers and coats, which makes sense because these are all upper-body garments with similar shapes. The model predicts trousers, bags, boots and sneakers well, given that their rows and columns are particularly dark except for the diagonal element. These results are in agreement with those from the random forest and gradient-boosted trees from <a href="https://rviews.rstudio.com/2019/11/11/a-comparison-of-methods-for-predicting-clothing-classes-using-the-fashion-mnist-dataset-in-rstudio-and-python-part-3/">the previous blog post of this series</a>.</p>
<p>Next, we repeat the above process for fitting a support vector machine but instead of a random search for the optimal parameters, we perform a grid search. As such, we can prespecify values to evaluate the model at, not only for <code>C</code> but also for <code>sigma</code>. We define the grid values in <code>grid_radial</code>.</p>
<pre class="r"><code>grid_radial = expand.grid(sigma = c(.01, 0.04, 0.1), C = c(0.01, 10, 32, 70, 150))
set.seed(1234)
svm_grid_radial = train(label ~ .,
data = cbind(train.images.pca, label = train.classes),
method = "svmRadial",
trControl = svm_control,
tuneGrid = grid_radial,
metric = "Accuracy")
svm_grid_radial$results[, c("sigma", "C", 'Accuracy')]</code></pre>
<p><img src="svm_grid_radial_print_accuracy.png" /></p>
<pre class="r"><code>mp.svm.grid.radial = model_performance(svm_grid_radial, train.images.pca, test.images.pca,
train.classes, test.classes, "svm_grid_radial")</code></pre>
<p><img src="svm_grid_radial_mp.png" /></p>
<p>The grid search selects the same optimal parameter values as the random search (<code>C=32</code> and <code>sigma = 0.040</code>), therefore also resulting in 88% and 87% training and test accuracies. To get an idea of how <code>C</code> and <code>sigma</code> influence the training set accuracy, we plot the cross-validation accuracy as a function of <code>C</code>, with separate lines for each value of <code>sigma</code>.</p>
<pre class="r"><code>ggplot() +
my_theme() +
geom_line(data = svm_grid_radial$results, aes(x = C, y = Accuracy, color = factor(sigma))) +
geom_point(data = svm_grid_radial$results, aes(x = C, y = Accuracy, color = factor(sigma))) +
labs(x = "Cost", y = "Cross-Validation Accuracy", color = "Sigma") +
ggtitle('Relationship between cross-validation accuracy and values of cost and sigma')</code></pre>
<p><img src="svm_grid_radial_cost_sigma.png" /></p>
<p>The plot shows that the green line (<code>sigma = 0.04</code>) has the highest cross-validation accuracy for all but the smallest values of <code>C</code> (0.01 and 10). Although the accuracy at <code>C=10</code> and <code>sigma = 0.1</code> (blue line) comes close, the highest overall accuracy is achieved at <code>C=32</code> and <code>sigma = 0.04</code> (green line).</p>
</div>
<div id="wrapping-up" class="section level1">
<h1>Wrapping Up</h1>
<p>To compare the models we have estimated throughout this series of blog posts, we can look at their resampled accuracies. We can do this in our case because we set the same seed of 1234 before training each model.<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a> Essentially, resampling is an important tool to validate our models and to assess to what extent they are generalizable to data they have not been trained on. We used five repeats of five-fold cross-validation: the training data was divided into five random subsets, and over five iterations (“folds”) the model was trained on four of these subsets and tested on the remaining subset (changing with every fold); this whole process was then repeated five times. The goal of these repetitions of k-fold cross-validation is to reduce the variance of the accuracy estimate, given that the folds in non-repeated cross-validation are not independent (as data used for training in one fold is used for testing in another). As we performed five repeats of five-fold cross-validation, we obtain 5*5=25 accuracies per model. Let’s compare these resampled accuracies visually by means of a boxplot. First, we create a list of all models estimated, including the random forests, gradient-boosted trees and support vector machines. We then compute the resampled accuracies using the <code>resamples()</code> function from the <code>caret</code> package. From the resulting object, <code>resamp</code>, we only keep the column containing the resample unit (e.g. <code>Fold1.Rep1</code>) and the five columns containing the accuracies for each of the five models. We melt this into a long format and from the result, <code>plotdf</code>, we remove the <code>~Accuracy</code> part from the strings in column <code>Model</code>.</p>
<pre class="r"><code>library(reshape2)
model_list = list(rf_rand, rf_grid, xgb_tune, svm_rand_radial, svm_grid_radial)
names(model_list) = c(paste0('Random forest ', c("(random ", "(grid "), "search)"), "Gradient-boosted trees",
paste0('Support vector machine ', c("(random ", "(grid "), "search)"))
resamp = resamples(model_list)
accuracy_variables = names(resamp$values)[grepl("Accuracy", names(resamp$values))]
plotdf = melt(resamp$values[, c('Resample', accuracy_variables)],
id = "Resample", value.name = "Accuracy", variable.name = "Model")
plotdf$Model = gsub("~.*","", plotdf$Model)</code></pre>
<p>Next, we create a boxplot with the estimated models on the x-axis and the accuracy on the y-axis.</p>
<pre class="r"><code>ggplot() +
geom_boxplot(data = plotdf, aes(x = Model, y = Accuracy, color = Model)) +
ggtitle('Resampled accuracy for machine learning models estimated') +
my_theme() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = NULL, color = NULL) +
guides(color = FALSE)</code></pre>
<p><img src="resampled_accuracy_models_color.png" height = "700" width="900"></p>
<p>We observe from these box plots that the support vector machines perform best, followed by the gradient-boosted trees and the random forests. Let’s also take a look at the other performance metrics from all models we have looked at.</p>
<pre class="r"><code>mp.df = rbind(mp.rf.rand, mp.rf.grid, mp.xgb, mp.svm.rand.radial, mp.svm.grid.radial, mp.svm.grid.linear)
mp.df[order(mp.df$accuracy_test, decreasing = TRUE), ]</code></pre>
<p><img src="combine_mp_models.png" /></p>
<p>After taking measures to reduce overfitting, the convolutional neural network from <a href="https://rviews.rstudio.com/2019/11/11/a-comparison-of-methods-for-predicting-clothing-classes-using-the-fashion-mnist-dataset-in-rstudio-and-python-part-1/">the first blog post of this series</a> achieved training and test set accuracies of 89.4% and 88.8% respectively. The random and grid search for the best value of <code>mtry</code> in the random forests both resulted in the selection of <code>mtry=5</code>. The grid search performed better on the training set than the random search on the basis of all metrics except recall (i.e. sensitivity), and better on the test set on all metrics except precision (i.e. positive predictive value). The test set accuracies achieved by the random search and grid search were 84.7% and 84.8% respectively. The gradient-boosted decision trees performed slightly better than the random forests on all metrics and achieved a test set accuracy of 85.5%. Both tree-based models most often misclassified pullovers, shirts and coats, while correctly classifying trousers, boots, bags and sneakers. The random forests and gradient-boosted trees are, however, outperformed by the support vector machine with radial kernel specification and tuning parameter values of <code>C=32</code> and <code>sigma=0.040</code>: this model achieved 86.9% test set accuracy upon a random search for the best parameters. The grid search resulted in slightly worse test set performance, but better training set performance in terms of all metrics except accuracy. Nonetheless, none of the models estimated beats the convolutional neural network from <a href="https://rviews.rstudio.com/2019/11/11/a-comparison-of-methods-for-predicting-clothing-classes-using-the-fashion-mnist-dataset-in-rstudio-and-python-part-1/">the first blog post of this series</a>, either in performance or in computational time and feasibility. However, the differences in test set performance are only small: the convolutional neural network achieved 88.8% test set accuracy, compared to 86.9% for the support vector machine with radial kernel. This shows that we do not always need to resort to deep learning to obtain high accuracies; we can also perform image classification to a reasonable standard using basic machine learning models with dimensionality-reduced data.</p>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Just as a side note, accuracy may not be a good model performance metric in some cases. As the Fashion MNIST data has balanced categories (i.e. each category has the same number of observations), accuracy can be a good measure of model performance. However, in the case of unbalanced data, accuracy may be a misleading metric (<a href="https://towardsdatascience.com/accuracy-paradox-897a69e2dd9b">“accuracy paradox”</a>). Imagine for example that in a binary classification problem of 100 instances, there are 99 observations of class 0 and 1 observation of class 1. If the predictions are 1 for each observation, the model performs with 99% accuracy. As this may be misleading, recall and precision are often used instead. Have a look at <a href="https://towardsdatascience.com/whats-the-deal-with-accuracy-precision-recall-and-f1-f5d8b4db1021">this blog post</a> if you are unsure what these performance metrics entail.<a href="#fnref1" class="footnote-back">↩</a></p></li>
<li id="fn2"><p>Note that in order to compare the resampled accuracies of different models, they need to have been trained with the same seed, and they need to have the same training method and control settings as specified in the <code>trainControl()</code> function. In our case, the method used is <code>repeatedcv</code>, and so all models should have been trained with five repeats (<code>repeats = 5</code>) of five-fold cross-validation (<code>number = 5</code>). Note that the gradient-boosted model in the <a href="https://rviews.rstudio.com/2020/03/10/comparing-machine-learning-algorithms-for-predicting-clothing-classes-part-3/">previous post of this series</a> was trained with non-repeated five-fold cross-validation (<code>method = "cv"</code>). In order to compare this model with the random forests and support vector machines, the method in <code>trainControl()</code> should be changed to <code>method = "repeatedcv"</code> and the number of repeats should be five: <code>repeats = 5</code>. This should be the same for all models trained in order to compute resampled accuracies.<a href="#fnref2" class="footnote-back">↩</a></p></li>
</ol>
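<p>The imbalanced-class example from the first footnote takes only a few lines of <code>R</code> to reproduce:</p>
<pre class="r"><code>truth = factor(c(rep(0, 99), 1), levels = c(0, 1))  # 99 observations of class 0, 1 of class 1
pred  = factor(rep(0, 100), levels = c(0, 1))       # always predict class 0
mean(pred == truth)          # accuracy = 0.99 despite a useless model
sum(pred == 1 & truth == 1)  # zero true positives, so recall for class 1 is 0</code></pre>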
</div>
<script>window.location.href='https://rviews.rstudio.com/2020/03/24/comparing-machine-learning-algorithms-for-predicting-clothing-classes-part-4/';</script>
Simulating COVID-19 interventions with R
https://rviews.rstudio.com/2020/03/19/simulating-covid-19-interventions-with-r/
Thu, 19 Mar 2020 00:00:00 +0000
https://rviews.rstudio.com/2020/03/19/simulating-covid-19-interventions-with-r/
<script src="/rmarkdown-libs/htmlwidgets/htmlwidgets.js"></script>
<script src="/rmarkdown-libs/viz/viz.js"></script>
<link href="/rmarkdown-libs/DiagrammeR-styles/styles.css" rel="stylesheet" />
<script src="/rmarkdown-libs/grViz-binding/grViz.js"></script>
<p><em>Tim Churches is a Senior Research Fellow at the UNSW Medicine South Western Sydney Clinical School at Liverpool Hospital, and a health data scientist at the Ingham Institute for Applied Medical Research. This post examines simulation of COVID-19 spread using <code>R</code>, and how such simulations can be used to understand the effects of various public health interventions designed to limit or slow its spread.</em></p>
<div id="disclaimer" class="section level1">
<h1>DISCLAIMER</h1>
<p>The simulation results in this blog post, or any other results produced by the <code>R</code> code described in it, should <strong>not</strong> be used as actual estimates of mortality or any other aspect of the COVID-19 pandemic. The simulations are intended to explain principles, and to permit exploration of the potential effects of various combinations and timings of interventions on spread. Furthermore, the code for these simulations was written hurriedly over just a few days; it has not yet been peer-reviewed and is considered <em>alpha</em> quality, and the simulation parameterisation presented here has not yet been validated against real-world COVID-19 data.</p>
</div>
<div id="introduction" class="section level1">
<h1>Introduction</h1>
<p>In a <a href="https://rviews.rstudio.com/2020/03/05/covid-19-epidemiology-with-r/">previous post</a>, we looked at the use of some <code>R</code> packages developed by the <a href="https://www.repidemicsconsortium.org"><strong>R</strong> <strong>E</strong>pidemics <strong>Con</strong>sortium (RECON)</a> to undertake epidemiological analyses of COVID-19 incidence data scraped from various web sources.</p>
<p>Undertaking such value-adding analyses of COVID-19 incidence data, as the full horror of the pandemic unfolds, is a worthwhile endeavour. But it would also be useful to be able to gain a better understanding of the likely effects of various public health interventions on COVID-19 spread.</p>
<p><a href="https://www.fastcompany.com/90476143/the-story-behind-flatten-the-curve-the-defining-chart-of-the-coronavirus">“Flattening-the-curve”</a> infographics such as the one shown below are now everywhere. They are a useful and succinct way of communicating a key public health message – that social distancing and related measures will help take the strain off our health care systems in the coming months.</p>
<div class="figure">
<img src="https://upload.wikimedia.org/wikipedia/commons/c/c5/Covid-19-curves-graphic-social-v3.gif" alt="Source: Siouxsie Wiles and Toby Morris" />
<p class="caption">Source: <a href="https://thespinoff.co.nz/society/09-03-2020/the-three-phases-of-covid-19-and-how-we-can-make-it-manageable/">Siouxsie Wiles and Toby Morris</a></p>
</div>
<p>However, as pointed out by several commentators, many of these infographics miss a crucial point: that public health measures can do more than just <strong>flatten</strong> the curve, they can also <strong>shrink</strong> it, thus reducing the total number of cases (and thus serious cases) of COVID-19 in a population, rather than just spread the same number of cases over a longer period such that the area under the curve remains the same.</p>
<p>This crucial point was beautifully illustrated using R in a <a href="https://staff.math.su.se/hoehle/blog/2020/03/16/flatteningthecurve.html">post by Michael Höhle</a>, which is highly recommended reading. Michael used a dynamic model of disease transmission, which is based on solving a system of ordinary differential equations (ODEs) with the tools found in base <code>R</code>.</p>
<p>Such mathematical approaches to disease outbreak simulation are elegant, and efficient to compute, but they can become unwieldy as the complexity of the model grows. An alternative is to use a more computational approach. In this post, we will briefly look at the individual contact model (ICM) simulation capability implemented in the excellent <a href="http://www.epimodel.org"><code>EpiModel</code></a> package by Samuel Jenness and colleagues at the Emory University Rollins School of Public Health, and some extensions to it. Note also that this post is based on <a href="https://timchurches.github.io/blog/">two more detailed posts</a> that provide more technical details and access to relevant source code.</p>
</div>
<div id="the-epimodel-package" class="section level1">
<h1>The <code>EpiModel</code> package</h1>
<p>The <code>EpiModel</code> package provides facilities to explore three types of disease transmission models (or simulations): deterministic compartmental models (DCMs), as used by <a href="https://staff.math.su.se/hoehle/blog/2020/03/16/flatteningthecurve.html">Michael Höhle</a>, stochastic individual contact models (ICMs) and stochastic network models. The last are particularly interesting, as they can accurately model disease transmission with shifting social contact networks – they were developed to model HIV transmission, but have been used to model transmission of other diseases, including Ebola, and even the propagation of <em>memes</em> in social media networks. These network models potentially have application to COVID-19 modelling – they could be used to model shifting household, workplace, school and wider community interactions, and thus opportunities for transmission of the virus. However, the network models as currently implemented are not quite suitable for such purposes, modifying them is complex, and they are also very computationally intensive to run. For these reasons, we will focus on the simpler ICM simulation facilities provided by <code>EpiModel</code>.</p>
<p>Interested readers should consult the extensive tutorials and other documentation for <code>EpiModel</code> for a fuller treatment, but in a nutshell, an <code>EpiModel</code> ICM simulation starts with a hypothetical population of individuals who are permitted to be in one of several groups, or <em>compartments</em>, at any particular time. Out-of-the-box, <code>EpiModel</code> supports several types of models, including the popular SIR model which uses <strong>S</strong>usceptible, <strong>I</strong>nfectious and <strong>R</strong>ecovered compartments. At each time step of the simulation, individuals randomly encounter and are exposed to other individuals in the population. The intensity of this population mixing is controlled by an <em>act rate</em> parameter, with each “act” representing an opportunity for disease transmission, or at least those “acts” between susceptible individuals and infectious individuals. Recovered individuals are no longer infectious and are assumed to be immune from further re-infection, so we are not interested in their interactions with others, nor are we interested in interactions between pairs of susceptible individuals, only the interactions between susceptible and infectious individuals. However, not every such opportunity for disease transmission will result in actual disease transmission. The probability of transmission at each interaction is controlled by an <em>infection probability</em> parameter.</p>
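<p>As a minimal sketch of this machinery (an illustrative out-of-the-box SIR ICM with assumed parameter values, not the extended model used below), an <code>EpiModel</code> ICM simulation can be set up and run as follows:</p>
<pre class="r"><code>library(EpiModel)

# Illustrative values only: 5% transmission probability per act,
# 10 acts per person per time step, mean infectious period of 20 days
param   <- param.icm(inf.prob = 0.05, act.rate = 10, rec.rate = 1/20)
init    <- init.icm(s.num = 997, i.num = 3, r.num = 0)
control <- control.icm(type = "SIR", nsteps = 100, nsims = 4)
sim     <- icm(param, init, control)
plot(sim)  # compartment sizes over time, averaged across the simulations</code></pre>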
<p>It is easy to see that decreasing the <code>act.rate</code> parameter is equivalent to increasing social distancing in the population, and that decreasing the <code>inf.prob</code> parameter equates to increased practice of hygiene measures such as hand washing, use of hand sanitisers, not touching one’s face, and mask wearing by the infectious. This was what I explored in some detail in my <a href="https://timchurches.github.io/blog/posts/2020-03-10-modelling-the-effects-of-public-health-interventions-on-covid-19-transmission-part-1/">first personal blog post on simulating COVID-19 spread</a>.</p>
<div id="extensions-to-epimodel" class="section level2">
<h2>Extensions to <code>EpiModel</code></h2>
<p>However, the SIR model type is a bit too simplistic if we want to use the model to explore the potential effect of various public health measures on COVID-19 spread. Fortunately, <code>EpiModel</code> provides a plug-in architecture that allows more elaborate models to be implemented. The full details of my recent extensions to <code>EpiModel</code> can be found in my <a href="https://timchurches.github.io/blog/posts/2020-03-18-modelling-the-effects-of-public-health-interventions-on-covid-19-transmission-part-2/">second personal blog post on COVID-19 simulation</a>, but the gist of it is that several new compartment types were added, as shown in the table below, with support for transition between them as shown in the diagram below the table. The dashed lines indicate infection interactions.</p>
<table>
<colgroup>
<col width="13%" />
<col width="86%" />
</colgroup>
<thead>
<tr class="header">
<th>Compartment</th>
<th>Functional definition</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>S</td>
<td>Susceptible individuals</td>
</tr>
<tr class="even">
<td>E</td>
<td>Exposed <strong>and</strong> infected, not yet symptomatic but potentially infectious</td>
</tr>
<tr class="odd">
<td>I</td>
<td>Infected, symptomatic <strong>and</strong> infectious</td>
</tr>
<tr class="even">
<td>Q</td>
<td>Infectious, but (self-)quarantined</td>
</tr>
<tr class="odd">
<td>H</td>
<td>Requiring hospitalisation (would normally be hospitalised if capacity available)</td>
</tr>
<tr class="even">
<td>R</td>
<td>Recovered, immune from further infection</td>
</tr>
<tr class="odd">
<td>F</td>
<td>Case fatality (death due to COVID-19, not other causes)</td>
</tr>
</tbody>
</table>
<div id="htmlwidget-1" style="width:672px;height:480px;" class="grViz html-widget"></div>
<script type="application/json" data-for="htmlwidget-1">{"x":{"diagram":"\ndigraph SEIQHRF {\n\n # a \"graph\" statement\n graph [overlap = false, fontsize = 10] #, rankdir = LR]\n\n # several \"node\" statements\n node [shape = box,\n fontname = Helvetica]\n S[label=\"S=Susceptible\"];\n E[label=\"E=Exposed and infected,\nasymptomatic,\npotentially infectious\"];\n I[label=\"I=Infected and infectious\"];\n Q[label=\"Q=(Self-)quarantined\n(infectious)\"];\n H[label=\"H=Requires\nhospitalisation\"];\n R[label=\"R=Recovered/immune\"];\n F[label=\"F=Case fatality\"]\n\n # several \"edge\" statements\n S->E\n I->S[style=\"dashed\"]\n E->I\n E->S[style=\"dashed\"]\n I->Q\n Q->S[style=\"dashed\"]\n I->R\n I->H\n H->F\n H->R\n Q->R\n Q->H\n}\n","config":{"engine":"dot","options":null}},"evals":[],"jsHooks":[]}</script>
<p>Another capability that has been added is the ability to specify time-variant parameters, as a vector of the same length as there are time steps in the simulation. This allows us to smoothly (or step-wise) introduce, and withdraw, various interventions at arbitrary times during the course of our simulation.</p>
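<p>For example, a hypothetical <code>inf.prob</code> vector for a 366-step simulation, ramping down from 0.05 to 0.02 between days 15 and 30 and constant elsewhere, could be built like this:</p>
<pre class="r"><code>nsteps <- 366
inf_prob_vec <- c(rep(0.05, 14),                     # days 1-14: baseline
                  seq(0.05, 0.02, length.out = 16),  # days 15-30: smooth ramp down
                  rep(0.02, nsteps - 30))            # days 31-366: new level
length(inf_prob_vec)  # 366</code></pre>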
<p>We won’t cover here the details of how to obtain these extensions, which at the time of writing should still be considered <em>alpha</em> quality code – please see the <a href="https://timchurches.github.io/blog/posts/2020-03-18-modelling-the-effects-of-public-health-interventions-on-covid-19-transmission-part-2/">blog post</a> for those. Let’s just proceed to running some simulations.</p>
</div>
</div>
<div id="baseline-simulation" class="section level1">
<h1>Baseline simulation</h1>
<p>First we’ll run a baseline simulation for a hypothetical population of 10,000 people, in which there are just three infectious COVID-19 cases at the outset. We’ll run it for 365 days, and we’ll set a very low rate at which infectious individuals enter self-quarantine (thereby dramatically lowering their rate of interactions with others) after they become symptomatic (or have been tested and found positive), and thus aware of their infectivity. Because it is stochastic, the simulation is run eight times, using parallel processing if available, and the results averaged.</p>
<pre class="r"><code>tic()
baseline_sim <- simulate(ncores = 4)
toc()</code></pre>
<pre><code>## 58.092 sec elapsed</code></pre>
<p>Let’s visualise the results as a set of time-series of the daily count of our 10,000 individuals in each compartment.</p>
<p><img src="/post/2020-03-19-simulating-covid-19-interventions-with-r/index_files/figure-html/unnamed-chunk-5-1.png" width="672" /></p>
<p>OK, that looks very reasonable. Note that almost the entire population ends up being infected. However, the <strong>S</strong> and <strong>R</strong> compartments dominate the plot (which is good, because it means humanity will survive!), so let’s re-plot leaving out those compartments so we can see a bit more detail.</p>
<p><img src="/post/2020-03-19-simulating-covid-19-interventions-with-r/index_files/figure-html/unnamed-chunk-6-1.png" width="672" /></p>
<p>Notice that the <strong>I</strong> compartment curve lags behind the <strong>E</strong> compartment curve – the lag is the incubation period, and that the <strong>Q</strong> curve lags still further as infected people only reluctantly and belatedly quarantine themselves (in this baseline scenario).</p>
</div>
<div id="running-intervention-experiments" class="section level1">
<h1>Running intervention experiments</h1>
<p>Now we are in a position to run an experiment, by altering some parameters of our baseline model.</p>
<p>Let’s model the effect of decreasing the infection probability at each exposure event by smoothly decreasing the <code>inf.prob</code> parameter for the <strong>I</strong> compartment. The infection probability at each exposure event (for the <strong>I</strong> compartment individuals) starts at 5%, and we’ll reduce it to 2% between days 15 and 30. This models the effect of symptomatic infected people adopting better hygiene practices such as wearing masks, coughing into their elbows, using hand sanitisers and not shaking hands, perhaps in response to a concerted public health advertising campaign by the government.</p>
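<p>The code for this first experiment is not shown here, but following the pattern of the ramp functions used in the later experiments, it would look something like this (the function and object names are illustrative):</p>
<pre class="r"><code>infectious_hygiene_ramp <- function(t) {
    ifelse(t < 15, 0.05, ifelse(t <= 30, 0.05 - (t - 15) * (0.05 -
        0.02)/15, 0.02))
}
infectious_hygiene_sim <- simulate(inf.prob.i = infectious_hygiene_ramp(1:366))</code></pre>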
<p>Let’s examine the results of experiment 1, alongside the baseline for comparison:</p>
<p><img src="exp-1.png" height = "500" width="700"></p>
<p>We can see from the plots on the left that by encouraging hygiene measures in symptomatic infectious individuals, we have not only substantially “flattened the curve”, but we have actually shrunk it. The result, as shown in the plots on the right, is that demand for hospital beds is substantially reduced, and only briefly exceeds our defined hospital capacity of 40 beds. This results in a substantially reduced mortality rate, shown by the black line.</p>
<div id="more-experiments" class="section level2">
<h2>More experiments</h2>
<p>We can now embark on a series of experiments, exploring various interventions singly, or in combination, and with different timings.</p>
<div id="experiment-2" class="section level3">
<h3>Experiment 2</h3>
<p>Let’s repeat experiment 1, but let’s delay the start of the hygiene campaign until day 30 and make it less intense so it takes until day 60 to achieve the desired increase in hygiene in the symptomatic infected.</p>
<pre class="r"><code>infectious_hygiene_delayed_ramp <- function(t) {
ifelse(t < 30, 0.05, ifelse(t <= 60, 0.05 - (t - 30) * (0.05 -
0.02)/30, 0.02))
}
infectious_hygiene_delayed_ramp_sim <- simulate(inf.prob.i = infectious_hygiene_delayed_ramp(1:366))</code></pre>
</div>
<div id="experiment-3" class="section level3">
<h3>Experiment 3</h3>
<p>Let’s repeat experiment 1, except this time instead of promoting hygiene measures in the symptomatic infected, we’ll promote, starting at day 15, prompt self-quarantine by anyone who is infected as soon as they become symptomatic. By “prompt”, we mean most such people will self-quarantine themselves immediately, but with an exponentially declining tail of such people taking longer to enter quarantine, with a few never complying. Those in self-quarantine won’t or can’t achieve complete social isolation, so we have set the <code>act.rate</code> parameter for the quarantined compartment to a quarter of that for the other compartments to simulate such a reduction in social mixing (an increase in social distancing) in that group.</p>
<pre class="r"><code>quarantine_ramp <- function(t) {
ifelse(t < 15, 0.0333, ifelse(t <= 30, 0.0333 + (t - 15) *
(0.3333 - 0.0333)/15, 0.3333))
}
quarantine_ramp_sim <- simulate(quar.rate = quarantine_ramp(1:366))</code></pre>
</div>
<div id="experiment-4" class="section level3">
<h3>Experiment 4</h3>
<p>Let’s add a moderate increase in social distancing for everyone (halving the <code>act.rate</code>), again ramping it down between days 15 and 30.</p>
<pre class="r"><code>social_distance_ramp <- function(t) {
ifelse(t < 15, 10, ifelse(t <= 30, 10 - (t - 15) * (10 -
5)/15, 5))
}
soc_dist_ramp_sim <- simulate(act.rate.i = social_distance_ramp(1:366),
act.rate.e = social_distance_ramp(1:366))</code></pre>
</div>
<div id="experiment-5" class="section level3">
<h3>Experiment 5</h3>
<p>Let’s combine experiments 3 and 4: we’ll add a moderate increase in social distancing for everyone, as well as prompt self-quarantining in the symptomatic.</p>
<pre class="r"><code>quar_soc_dist_ramp_sim <- simulate(quar.rate = quarantine_ramp(1:366),
act.rate.i = social_distance_ramp(1:366), act.rate.e = social_distance_ramp(1:366))</code></pre>
<p>Now let’s examine the results.</p>
<p><img src="exp-5A.png" height = "500" width="700"></p>
</div>
</div>
<div id="discussion" class="section level2">
<h2>Discussion</h2>
<p>The results of our experiments almost speak for themselves, but a few things are worth highlighting:</p>
<ul>
<li><p>Implementing interventions too late is almost worthless. Act early and decisively. You can always wind back the intervention later, whereas a failure to act early enough can never be recovered from.</p></li>
<li><p>Prompt self-quarantining of symptomatic cases is effective. In practice that means everyone with COVID-19-like symptoms, whether they actually have COVID-19 or something else, should immediately self-quarantine. Don’t wait to be tested.</p></li>
<li><p>A moderate increase in social distancing (decrease in social mixing) in everyone is also effective, mainly because it reduces exposure opportunities with both the asymptomatic-but-infected and the symptomatic infected.</p></li>
<li><p>Combining measures is even more effective, as can be seen in experiment 5. In fact, there are theoretical reasons to believe that the effect of combined measures is partially multiplicative, not just additive.</p></li>
<li><p>Public health interventions don’t just flatten the curve, they shrink it, and the result is very substantially reduced mortality due to COVID-19.</p></li>
</ul>
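<p>One way to see the multiplicative point in the fourth bullet is that independent interventions each scale the reproduction number by a factor, so their combined effect on R is the product of those factors. Here is a toy Python calculation with made-up numbers (the R0 and reduction percentages are illustrative assumptions, not outputs of the simulations above):</p>

```python
R0 = 2.5  # illustrative basic reproduction number (assumed)

def combined_R(R0, reductions):
    """Each independent intervention multiplies transmission by (1 - reduction)."""
    R = R0
    for r in reductions:
        R *= (1.0 - r)
    return R

# Either measure alone leaves R well above the epidemic threshold of 1 ...
r_distancing = combined_R(R0, [0.3])   # 2.5 * 0.7  = 1.75
r_quarantine = combined_R(R0, [0.4])   # 2.5 * 0.6  = 1.50
# ... but together their effects multiply, bringing R close to 1:
r_both = combined_R(R0, [0.3, 0.4])    # 2.5 * 0.42 = 1.05
```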
<p>None of these insights is novel, but it is nice to be able to independently confirm the recommendations of expert groups such as the WHO Collaborating Centre for Infectious Disease Modelling at Imperial College London (ICL), which recently released a <a href="https://www.imperial.ac.uk/mrc-global-infectious-disease-analysis/news--wuhan-coronavirus/">report on the impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand</a>. That report recommends strategies similar to those we have just discovered with our modest simulations in <code>R</code>.</p>
</div>
</div>
<div id="two-more-experiments" class="section level1">
<h1>Two more experiments</h1>
<p>What happens if we dramatically increase social distancing through a two-week lock-down that is then relaxed? We’ll use a step function to model this, reducing the <code>act.rate</code> parameters for all compartments from 10 to 2.5 during the lock-down. We test one lock-down lasting from day 15 to day 30, and separately one from day 30 to day 45.</p>
<pre class="r"><code>twoweek_lockdown_day15_vector <- c(rep(10, 15), rep(2.5, 15),
rep(10, 336))
twoweek_lockdown_day30_vector <- c(rep(10, 30), rep(2.5, 15),
rep(10, 321))
twoweek_lockdown_day15_sim <- simulate(act.rate.i = twoweek_lockdown_day15_vector,
act.rate.e = twoweek_lockdown_day15_vector)
twoweek_lockdown_day30_sim <- simulate(act.rate.i = twoweek_lockdown_day30_vector,
act.rate.e = twoweek_lockdown_day30_vector)</code></pre>
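<p>The same step vectors are easy to build programmatically. As a hedged sketch (the function name is mine), here is a Python equivalent that constructs a 366-day <code>act.rate</code> series with a lock-down window, mirroring the R vectors above:</p>

```python
def lockdown_vector(start_day, duration, baseline=10.0, lockdown_rate=2.5, days=366):
    """Step-function act.rate series: baseline everywhere, except
    lockdown_rate on days start_day+1 .. start_day+duration (1-indexed)."""
    rates = []
    for day in range(1, days + 1):
        if start_day < day <= start_day + duration:
            rates.append(lockdown_rate)
        else:
            rates.append(baseline)
    return rates

# Matches c(rep(10, 15), rep(2.5, 15), rep(10, 336)) from the R chunk above
twoweek_day15 = lockdown_vector(15, 15)
```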
<p><img src="exp-6A.png" height = "500" width="700"></p>
<p>Wow, that’s a bit surprising! The two-week lock-down starting at day 15 isn’t effective at all: it merely pauses the spread for two weeks, after which the epidemic resumes as before. A two-week lock-down starting at day 30 is somewhat more effective, presumably because more infected people are taken out of circulation from day 30 onwards, but the epidemic still partially bounces back after the lock-down ends.</p>
<p>What this tells us is that a single two-week lock-down isn’t effective. What about a whole month of lock-down instead, combined with prompt quarantine and more effective isolation and hygiene measures for those quarantined?</p>
<pre class="r"><code>fourweek_lockdown_day15_vector <- c(rep(10, 15), rep(2.5, 30),
rep(7.5, 321))
fourweek_lockdown_day30_vector <- c(rep(10, 30), rep(2.5, 30),
rep(7.5, 306))
fourweek_lockdown_day15_sim <- simulate(act.rate.i = fourweek_lockdown_day15_vector,
act.rate.e = fourweek_lockdown_day15_vector, quar.rate = quarantine_ramp(1:366),
inf.prob.q = 0.01)
fourweek_lockdown_day30_sim <- simulate(act.rate.i = fourweek_lockdown_day30_vector,
act.rate.e = fourweek_lockdown_day30_vector, quar.rate = quarantine_ramp(1:366),
inf.prob.q = 0.01)</code></pre>
<p><img src="exp-7A.png" height = "500" width="700"></p>
<p>Well, that’s satisfying! By acting early and decisively, we’ve managed to stop COVID-19 dead in its tracks in experiment 8, and in doing so have saved many lives – at least in our little simulated world. But even experiment 9 provides a much better outcome, indicating that decisive action, even if somewhat belated, is much better than none.</p>
<p>Of course, the real world is far more complex and messier, and COVID-19 may not behave in exactly the same way in real life. But at least we can see the principles of public health interventions in action in our simulation, and perhaps better understand, or question, what is being done, not done, or done too late to contain the spread of the virus in the real world.</p>
</div>
<div id="conclusion" class="section level1">
<h1>Conclusion</h1>
<p>Although a lot of work remains to be done on the extensions to <code>EpiModel</code> demonstrated here, they show promise as a tool for understanding real-world action and inaction on COVID-19, and for prompting legitimate questions about both.</p>
<p>One would hope that governments are using far more sophisticated simulation models than the one we have described here, which was built over the course of just a few days, to plan or inform their responses to COVID-19. If not, they ought to be.</p>
</div>
<script>window.location.href='https://rviews.rstudio.com/2020/03/19/simulating-covid-19-interventions-with-r/';</script>
Outlier Days with R and Python
https://rviews.rstudio.com/2020/03/16/outlier-days-with-r-and-python/
Mon, 16 Mar 2020 00:00:00 +0000https://rviews.rstudio.com/2020/03/16/outlier-days-with-r-and-python/
<script src="/rmarkdown-libs/htmlwidgets/htmlwidgets.js"></script>
<script src="/rmarkdown-libs/plotly-binding/plotly.js"></script>
<script src="/rmarkdown-libs/typedarray/typedarray.min.js"></script>
<script src="/rmarkdown-libs/jquery/jquery.min.js"></script>
<link href="/rmarkdown-libs/crosstalk/css/crosstalk.css" rel="stylesheet" />
<script src="/rmarkdown-libs/crosstalk/js/crosstalk.min.js"></script>
<link href="/rmarkdown-libs/plotly-htmlwidgets-css/plotly-htmlwidgets.css" rel="stylesheet" />
<script src="/rmarkdown-libs/plotly-main/plotly-latest.min.js"></script>
<p>Welcome to another installment of <a href="http://www.reproduciblefinance.com/">Reproducible Finance</a>. Today’s post is topical: we look at the historical behavior of the stock market after days of extreme returns. It also explores one of my favorite coding themes of 2020, the power of RMarkdown as an R/Python collaboration tool.</p>
<p>This post originated when <a href="https://www.linkedin.com/in/rishipsingh/">Rishi Singh</a>, the founder of <a href="https://api.tiingo.com/">tiingo</a> and one of the nicest people I have encountered in this crazy world, sent over a note about recent market volatility along with some Python code for analyzing that volatility. We thought it would be a nice project to post that Python code along with the equivalent R code for reproducing the same results. For me, it’s a great opportunity to use <code>RMarkdown</code>’s R and Python interoperability superpowers, fueled by the <code>reticulate</code> package. If you are an R coder and someone sends you Python code as part of a project, <code>RMarkdown</code> + <code>reticulate</code> makes it quite smooth to incorporate that Python code into your work. It was interesting to learn how a very experienced Python coder might tackle a problem, and then to think about how to tackle that problem with R. Unsurprisingly, I couldn’t resist adding a few elements of data visualization.</p>
<p>Before we get started, if you’re unfamiliar with using R and Python chunks throughout an <code>RMarkdown</code> file, have a quick look at the <code>reticulate</code> documentation <a href="https://rstudio.github.io/reticulate/articles/r_markdown.html#overview">here</a>.</p>
<p>Let’s get to it. Since we’ll be working with R and Python, we start with our usual R setup code chunk to load R packages, but we’ll also load the <code>reticulate</code> package and source a Python script. Here’s what that looks like.</p>
<pre class="r"><code>library(tidyverse)
library(tidyquant)
library(riingo)
library(timetk)
library(plotly)
library(roll)
library(slider)
library(reticulate)
riingo_set_token("your tiingo token here")
# Python file that holds my tiingo token
reticulate::source_python("credentials.py")
knitr::opts_chunk$set(message = FALSE, warning = FALSE, comment = NA)</code></pre>
<p>Note that I set my tiingo token twice: first using <code>riingo_set_token()</code> so I can use the <code>riingo</code> package in R chunks and then by sourcing the <code>credentials.py</code> file, where I have put <code>tiingoToken = 'my token'</code>. Now I can use the <code>tiingoToken</code> variable in my Python chunks. This is necessary because we will use both R and Python to pull in data from Tiingo.</p>
<p>Next we will use a Python chunk to load the necessary Python libraries. If you haven’t installed these yet, you can open the RStudio terminal and run <code>pip install</code>. Since we’ll be interspersing R and Python code chunks throughout, I will add a <code># Python Chunk</code> to each Python chunk and, um, <code># R Chunk</code> to each R chunk.</p>
<pre class="python"><code># Python chunk
import pandas as pd
import numpy as np
import tiingo</code></pre>
<p>Let’s get to the substance. The goal today is to look back at the last 43 years of S&amp;P 500 price history and analyze how the market has performed on the day following an extreme return. We will also take care with how we define an extreme return, using rolling volatility to normalize percentage moves.</p>
<p>We will use the mutual fund <code>VFINX</code> as a tradeable proxy for the S&P 500 because it has a much longer history than other funds like <code>SPY</code> or <code>VOO</code>.</p>
<p>Let’s start by passing a URL string from <a href="https://api.tiingo.com">tiingo</a> to the <code>pandas</code> function <code>read_csv</code>, along with our <code>tiingoToken</code>.</p>
<pre class="python"><code># Python chunk
pricesDF = pd.read_csv("https://api.tiingo.com/tiingo/daily/vfinx/prices?startDate=1976-1-1&format=csv&token=" + tiingoToken)</code></pre>
<p>We just created a Python object called <code>pricesDF</code>. We can look at that object in an R chunk by calling <code>py$pricesDF</code>.</p>
<pre class="r"><code># R chunk
py$pricesDF %>%
head()</code></pre>
<p>Next, let’s reformat the data frame so that the date column becomes the index, in datetime format.</p>
<pre class="python"><code># Python chunk
pricesDF = pricesDF.set_index(['date'])
pricesDF.index = pd.DatetimeIndex(pricesDF.index)</code></pre>
<p>Heading back to R for viewing, we see that the date column is no longer a column: it is the index of the data frame, and in <code>pandas</code> the index is more like a label than a new column. In fact, here’s what happens when we call the row names of this data frame.</p>
<pre class="r"><code># R chunk
py$pricesDF %>%
head() %>%
rownames()</code></pre>
<p><img src="second-r-chunk.png" /></p>
<p>We now have our prices, indexed by date. Let’s convert adjusted closing prices to log returns and save the results in a new column called <code>returns</code>. Note the use of the <code>shift(1)</code> operator here. That is analogous to the <code>lag(..., 1)</code> function in <code>dplyr</code>.</p>
<pre class="python"><code># Python chunk
pricesDF['returns'] = np.log(pricesDF['adjClose']/pricesDF['adjClose'].shift(1))</code></pre>
<p>Next, we want to calculate the 3-month rolling standard deviation of these daily log returns, and then divide daily returns by the <em>previous</em> rolling 3-month volatility in order to prevent look-ahead bias. We can think of this as normalizing today’s return by the previous 3 months’ rolling volatility, and we will label it <code>stdDevMove</code>.</p>
<pre class="python"><code># Python chunk
pricesDF['rollingVol'] = pricesDF['returns'].rolling(63).std()
pricesDF['stdDevMove'] = pricesDF['returns'] / pricesDF['rollingVol'].shift(1)</code></pre>
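<p>The shift-by-one is easy to get wrong, so here is a minimal pure-Python sketch of the same idea, using toy numbers and a 3-day window rather than 63 days (all values made up for illustration): the denominator for day <em>t</em> is the volatility computed through day <em>t − 1</em>, so no information from day <em>t</em> leaks into it.</p>

```python
import statistics

# Toy daily returns (made up for illustration)
returns = [0.01, -0.02, 0.015, -0.01, 0.005, -0.03]
window = 3  # stand-in for the 63-day window used in the post

# rolling_vol[t] = std dev of the `window` returns ending at day t
rolling_vol = [None] * len(returns)
for t in range(window - 1, len(returns)):
    rolling_vol[t] = statistics.stdev(returns[t - window + 1 : t + 1])

# Normalized move: today's return divided by *yesterday's* rolling vol,
# which is what the shift(1) accomplishes in the pandas chunk above.
std_dev_move = [
    returns[t] / rolling_vol[t - 1] if t >= window else None
    for t in range(len(returns))
]
```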
<p>Finally, we eventually want to calculate how the market has performed on the day following a large negative move. To prepare for that, let’s create a column of next day returns using <code>shift(-1)</code>.</p>
<pre class="python"><code># Python chunk
pricesDF['nextDayReturns'] = pricesDF.returns.shift(-1)</code></pre>
<p>Now we can filter on the <code>stdDevMove</code> and <code>returns</code> columns to isolate days where the return was at least 3 standard deviations below zero and also less than -3%. We use <code>mean()</code> to find the mean next-day return following such large events.</p>
<pre class="python"><code># Python chunk
nextDayPerformanceSeries = pricesDF.loc[(pricesDF['stdDevMove'] < -3) & (pricesDF['returns'] < -.03), ['nextDayReturns']].mean()</code></pre>
<p>Finally, let’s loop through and see how the mean next-day return changes as we filter on different extreme negative returns, which we will call drop tolerances. We label the drop tolerance <code>i</code>, set it at <code>-.03</code>, and then run a <code>while</code> loop that decrements <code>i</code> by .0025 on each pass. In this way we can look at the mean next-day return following different levels of negative returns.</p>
<pre class="python"><code># Python chunk
i = -.03
while i >= -.0525:
    nextDayPerformanceSeries = pricesDF.loc[(pricesDF['stdDevMove'] < -3) & (pricesDF['returns'] < i), ['nextDayReturns']]
    print(str(round(i, 5)) + ': ' + str(round(nextDayPerformanceSeries['nextDayReturns'].mean(), 6)))
    i -= .0025</code></pre>
<p><img src="while-loop-snapshot.png" height = "100" width="200"></p>
<p>It appears that as the size of the drop gets larger and more negative, the mean bounce back tends to get larger.</p>
<p>Let’s reproduce these results in R.</p>
<p>First, we import prices using the <code>riingo_prices()</code> function from the <a href="https://cran.r-project.org/package=riingo">riingo</a> package.</p>
<pre class="r"><code># R chunk
sp_500_prices <-
"VFINX" %>%
riingo_prices(start_date = "1976-01-01", end_date = today())</code></pre>
<p>We can use <code>mutate()</code> to add columns of daily returns, rolling volatility, standard deviation moves, and next-day returns.</p>
<pre class="r"><code># R chunk
sp_500_returns <-
sp_500_prices %>%
select(date, adjClose) %>%
mutate(daily_returns_log = log(adjClose/lag(adjClose)),
rolling_vol = roll_sd(as.matrix(daily_returns_log), 63),
sd_move = daily_returns_log/lag(rolling_vol),
next_day_returns = lead(daily_returns_log))</code></pre>
<p>Now let’s <code>filter()</code> on an <code>sd_move</code> less than -3 (a move of more than 3 standard deviations to the downside) and a <code>daily_returns_log</code> less than a drop tolerance of -.03.</p>
<pre class="r"><code># R chunk
sp_500_returns %>%
na.omit() %>%
filter(sd_move < -3 & daily_returns_log < -.03) %>%
select(date, daily_returns_log, sd_move, next_day_returns) %>%
summarise(mean_return = mean(next_day_returns)) %>%
add_column(drop_tolerance = scales::percent(.03), .before = 1)</code></pre>
<pre><code># A tibble: 1 x 2
drop_tolerance mean_return
<chr> <dbl>
1 3% 0.00625</code></pre>
<p>We used a <code>while</code> loop to iterate across different drop tolerances in Python; let’s see how to implement that using <code>map_dfr()</code> from the <code>purrr</code> package.</p>
<p>First, we will define a sequence of drop tolerances using the <code>seq()</code> function.</p>
<pre class="r"><code># R chunk
drop_tolerance <- seq(.03, .05, .0025)
drop_tolerance</code></pre>
<pre><code>[1] 0.0300 0.0325 0.0350 0.0375 0.0400 0.0425 0.0450 0.0475 0.0500</code></pre>
<p>Next, we will create a function called <code>outlier_mov_fun</code> that takes a data frame of returns, filters on a drop tolerance and gives us the mean return following large negative moves.</p>
<pre class="r"><code># R chunk
outlier_mov_fun <- function(drop_tolerance, returns) {
returns %>%
na.omit() %>%
filter(sd_move < -3 & daily_returns_log < -drop_tolerance) %>%
select(date, daily_returns_log, sd_move, next_day_returns) %>%
summarise(mean_return = mean(next_day_returns) %>% round(6)) %>%
add_column(drop_tolerance = scales::percent(drop_tolerance), .before = 1) %>%
add_column(drop_tolerance_raw = drop_tolerance, .before = 1)
}</code></pre>
<p>Notice how that function takes two arguments: a drop tolerance and a data frame of returns.</p>
<p>Next, we pass our sequence of drop tolerances, stored in a variable called <code>drop_tolerance</code> to <code>map_dfr()</code>, along with our function and our <code>sp_500_returns</code> object. <code>map_dfr</code> will iterate through our sequence of drops and apply our function to each one.</p>
<pre class="r"><code># R chunk
map_dfr(drop_tolerance, outlier_mov_fun, sp_500_returns) %>%
select(-drop_tolerance_raw)</code></pre>
<pre><code># A tibble: 9 x 2
drop_tolerance mean_return
<chr> <dbl>
1 3% 0.00625
2 3% 0.00700
3 3% 0.00967
4 4% 0.0109
5 4% 0.0122
6 4% 0.0132
7 4% 0.0149
8 5% 0.0149
9 5% 0.0162 </code></pre>
<p>Glance back up at the results of our Python <code>while</code> loop and you should see that the results are consistent.</p>
<p>Alright, let’s have some fun and get to visualizing these results with <code>ggplot</code> and <code>plotly</code>.</p>
<pre class="r"><code># R chunk
(
sp_500_returns %>%
map_dfr(drop_tolerance, outlier_mov_fun, .) %>%
ggplot(aes(x = drop_tolerance_raw, y = mean_return, text = str_glue("drop tolerance: {drop_tolerance}
mean next day return: {mean_return * 100}%"))) +
geom_point(color = "cornflowerblue") +
labs(title = "Mean Return after Large Daily Drop", y = "mean return", x = "daily drop") +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(labels = scales::percent) +
theme_minimal()
) %>% ggplotly(tooltip = "text")</code></pre>
<div id="htmlwidget-1" style="width:672px;height:480px;" class="plotly html-widget"></div>
<script type="application/json" data-for="htmlwidget-1">{"x":{"data":[{"x":[0.03,0.0325,0.035,0.0375,0.04,0.0425,0.045,0.0475,0.05],"y":[0.006251,0.007005,0.009671,0.010899,0.01223,0.013163,0.01486,0.01486,0.016152],"text":["drop tolerance: 3%<br />mean next day return: 0.6251%","drop tolerance: 3%<br />mean next day return: 0.7005%","drop tolerance: 3%<br />mean next day return: 0.9671%","drop tolerance: 4%<br />mean next day return: 1.0899%","drop tolerance: 4%<br />mean next day return: 1.223%","drop tolerance: 4%<br />mean next day return: 1.3163%","drop tolerance: 4%<br />mean next day return: 1.486%","drop tolerance: 5%<br />mean next day return: 1.486%","drop tolerance: 5%<br />mean next day return: 1.6152%"],"type":"scatter","mode":"markers","marker":{"autocolorscale":false,"color":"rgba(100,149,237,1)","opacity":1,"size":5.66929133858268,"symbol":"circle","line":{"width":1.88976377952756,"color":"rgba(100,149,237,1)"}},"hoveron":"points","showlegend":false,"xaxis":"x","yaxis":"y","hoverinfo":"text","frame":null}],"layout":{"margin":{"t":43.7625570776256,"r":7.30593607305936,"b":40.1826484018265,"l":54.7945205479452},"font":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187},"title":{"text":"Mean Return after Large Daily 
Drop","font":{"color":"rgba(0,0,0,1)","family":"","size":17.5342465753425},"x":0,"xref":"paper"},"xaxis":{"domain":[0,1],"automargin":true,"type":"linear","autorange":false,"range":[0.029,0.051],"tickmode":"array","ticktext":["3.00%","3.50%","4.00%","4.50%","5.00%"],"tickvals":[0.03,0.035,0.04,0.045,0.05],"categoryorder":"array","categoryarray":["3.00%","3.50%","4.00%","4.50%","5.00%"],"nticks":null,"ticks":"","tickcolor":null,"ticklen":3.65296803652968,"tickwidth":0,"showticklabels":true,"tickfont":{"color":"rgba(77,77,77,1)","family":"","size":11.689497716895},"tickangle":-0,"showline":false,"linecolor":null,"linewidth":0,"showgrid":true,"gridcolor":"rgba(235,235,235,1)","gridwidth":0.66417600664176,"zeroline":false,"anchor":"y","title":{"text":"daily drop","font":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187}},"hoverformat":".2f"},"yaxis":{"domain":[0,1],"automargin":true,"type":"linear","autorange":false,"range":[0.00575595,0.01664705],"tickmode":"array","ticktext":["0.60%","0.80%","1.00%","1.20%","1.40%","1.60%"],"tickvals":[0.006,0.008,0.01,0.012,0.014,0.016],"categoryorder":"array","categoryarray":["0.60%","0.80%","1.00%","1.20%","1.40%","1.60%"],"nticks":null,"ticks":"","tickcolor":null,"ticklen":3.65296803652968,"tickwidth":0,"showticklabels":true,"tickfont":{"color":"rgba(77,77,77,1)","family":"","size":11.689497716895},"tickangle":-0,"showline":false,"linecolor":null,"linewidth":0,"showgrid":true,"gridcolor":"rgba(235,235,235,1)","gridwidth":0.66417600664176,"zeroline":false,"anchor":"x","title":{"text":"mean 
return","font":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187}},"hoverformat":".2f"},"shapes":[{"type":"rect","fillcolor":null,"line":{"color":null,"width":0,"linetype":[]},"yref":"paper","xref":"paper","x0":0,"x1":1,"y0":0,"y1":1}],"showlegend":false,"legend":{"bgcolor":null,"bordercolor":null,"borderwidth":0,"font":{"color":"rgba(0,0,0,1)","family":"","size":11.689497716895}},"hovermode":"closest","barmode":"relative"},"config":{"doubleClick":"reset","showSendToCloud":false},"source":"A","attrs":{"ac865f4a2157":{"x":{},"y":{},"text":{},"type":"scatter"}},"cur_data":"ac865f4a2157","visdat":{"ac865f4a2157":["function (y) ","x"]},"highlight":{"on":"plotly_click","persistent":false,"dynamic":false,"selectize":false,"opacityDim":0.2,"selected":{"opacity":1},"debounce":0},"shinyEvents":["plotly_hover","plotly_click","plotly_selected","plotly_relayout","plotly_brushed","plotly_brushing","plotly_clickannotation","plotly_doubleclick","plotly_deselect","plotly_afterplot"],"base_url":"https://plot.ly"},"evals":[],"jsHooks":[]}</script>
<p>Here’s what happens when we expand the upper bound to a drop tolerance of -2% and make our intervals smaller, moving from .25% increments to .125% increments.</p>
<pre class="r"><code># R chunk
drop_tolerance_2 <- seq(.02, .05, .00125)
(
sp_500_returns %>%
map_dfr(drop_tolerance_2, outlier_mov_fun, .) %>%
ggplot(aes(x = drop_tolerance_raw, y = mean_return, text = str_glue("drop tolerance: {drop_tolerance}
mean next day return: {mean_return * 100}%"))) +
geom_point(color = "cornflowerblue") +
labs(title = "Mean Return after Large Daily Drop", y = "mean return", x = "daily drop") +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(labels = scales::percent) +
theme_minimal()
) %>% ggplotly(tooltip = "text")</code></pre>
<div id="htmlwidget-2" style="width:672px;height:480px;" class="plotly html-widget"></div>
<script type="application/json" data-for="htmlwidget-2">{"x":{"data":[{"x":[0.02,0.02125,0.0225,0.02375,0.025,0.02625,0.0275,0.02875,0.03,0.03125,0.0325,0.03375,0.035,0.03625,0.0375,0.03875,0.04,0.04125,0.0425,0.04375,0.045,0.04625,0.0475,0.04875,0.05],"y":[0.004042,0.004458,0.00505,0.005678,0.005703,0.005944,0.005607,0.005556,0.006251,0.006085,0.007005,0.008483,0.009671,0.009795,0.010899,0.010748,0.01223,0.013352,0.013163,0.013163,0.01486,0.01486,0.01486,0.013091,0.016152],"text":["drop tolerance: 2%<br />mean next day return: 0.4042%","drop tolerance: 2%<br />mean next day return: 0.4458%","drop tolerance: 2%<br />mean next day return: 0.505%","drop tolerance: 2%<br />mean next day return: 0.5678%","drop tolerance: 2%<br />mean next day return: 0.5703%","drop tolerance: 3%<br />mean next day return: 0.5944%","drop tolerance: 3%<br />mean next day return: 0.5607%","drop tolerance: 3%<br />mean next day return: 0.5556%","drop tolerance: 3%<br />mean next day return: 0.6251%","drop tolerance: 3%<br />mean next day return: 0.6085%","drop tolerance: 3%<br />mean next day return: 0.7005%","drop tolerance: 3%<br />mean next day return: 0.8483%","drop tolerance: 4%<br />mean next day return: 0.9671%","drop tolerance: 4%<br />mean next day return: 0.9795%","drop tolerance: 4%<br />mean next day return: 1.0899%","drop tolerance: 4%<br />mean next day return: 1.0748%","drop tolerance: 4%<br />mean next day return: 1.223%","drop tolerance: 4%<br />mean next day return: 1.3352%","drop tolerance: 4%<br />mean next day return: 1.3163%","drop tolerance: 4%<br />mean next day return: 1.3163%","drop tolerance: 4%<br />mean next day return: 1.486%","drop tolerance: 5%<br />mean next day return: 1.486%","drop tolerance: 5%<br />mean next day return: 1.486%","drop tolerance: 5%<br />mean next day return: 1.3091%","drop tolerance: 5%<br />mean next day return: 
1.6152%"],"type":"scatter","mode":"markers","marker":{"autocolorscale":false,"color":"rgba(100,149,237,1)","opacity":1,"size":5.66929133858268,"symbol":"circle","line":{"width":1.88976377952756,"color":"rgba(100,149,237,1)"}},"hoveron":"points","showlegend":false,"xaxis":"x","yaxis":"y","hoverinfo":"text","frame":null}],"layout":{"margin":{"t":43.7625570776256,"r":7.30593607305936,"b":40.1826484018265,"l":54.7945205479452},"font":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187},"title":{"text":"Mean Return after Large Daily Drop","font":{"color":"rgba(0,0,0,1)","family":"","size":17.5342465753425},"x":0,"xref":"paper"},"xaxis":{"domain":[0,1],"automargin":true,"type":"linear","autorange":false,"range":[0.0185,0.0515],"tickmode":"array","ticktext":["2.0%","3.0%","4.0%","5.0%"],"tickvals":[0.02,0.03,0.04,0.05],"categoryorder":"array","categoryarray":["2.0%","3.0%","4.0%","5.0%"],"nticks":null,"ticks":"","tickcolor":null,"ticklen":3.65296803652968,"tickwidth":0,"showticklabels":true,"tickfont":{"color":"rgba(77,77,77,1)","family":"","size":11.689497716895},"tickangle":-0,"showline":false,"linecolor":null,"linewidth":0,"showgrid":true,"gridcolor":"rgba(235,235,235,1)","gridwidth":0.66417600664176,"zeroline":false,"anchor":"y","title":{"text":"daily 
drop","font":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187}},"hoverformat":".2f"},"yaxis":{"domain":[0,1],"automargin":true,"type":"linear","autorange":false,"range":[0.0034365,0.0167575],"tickmode":"array","ticktext":["0.40%","0.80%","1.20%","1.60%"],"tickvals":[0.004,0.008,0.012,0.016],"categoryorder":"array","categoryarray":["0.40%","0.80%","1.20%","1.60%"],"nticks":null,"ticks":"","tickcolor":null,"ticklen":3.65296803652968,"tickwidth":0,"showticklabels":true,"tickfont":{"color":"rgba(77,77,77,1)","family":"","size":11.689497716895},"tickangle":-0,"showline":false,"linecolor":null,"linewidth":0,"showgrid":true,"gridcolor":"rgba(235,235,235,1)","gridwidth":0.66417600664176,"zeroline":false,"anchor":"x","title":{"text":"mean return","font":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187}},"hoverformat":".2f"},"shapes":[{"type":"rect","fillcolor":null,"line":{"color":null,"width":0,"linetype":[]},"yref":"paper","xref":"paper","x0":0,"x1":1,"y0":0,"y1":1}],"showlegend":false,"legend":{"bgcolor":null,"bordercolor":null,"borderwidth":0,"font":{"color":"rgba(0,0,0,1)","family":"","size":11.689497716895}},"hovermode":"closest","barmode":"relative"},"config":{"doubleClick":"reset","showSendToCloud":false},"source":"A","attrs":{"ac863a390034":{"x":{},"y":{},"text":{},"type":"scatter"}},"cur_data":"ac863a390034","visdat":{"ac863a390034":["function (y) ","x"]},"highlight":{"on":"plotly_click","persistent":false,"dynamic":false,"selectize":false,"opacityDim":0.2,"selected":{"opacity":1},"debounce":0},"shinyEvents":["plotly_hover","plotly_click","plotly_selected","plotly_relayout","plotly_brushed","plotly_brushing","plotly_clickannotation","plotly_doubleclick","plotly_deselect","plotly_afterplot"],"base_url":"https://plot.ly"},"evals":[],"jsHooks":[]}</script>
<p>Check out what happens when we expand the lower bound, to a -6% drop tolerance.</p>
<pre class="r"><code># R chunk
drop_tolerance_3 <- seq(.02, .06, .00125)
(
sp_500_returns %>%
map_dfr(drop_tolerance_3, outlier_mov_fun, .) %>%
ggplot(aes(x = drop_tolerance_raw, y = mean_return, text = str_glue("drop tolerance: {drop_tolerance}
mean next day return: {mean_return * 100}%"))) +
geom_point(color = "cornflowerblue") +
labs(title = "Mean Return after Large Daily Drop", y = "mean return", x = "daily drop") +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(labels = scales::percent) +
theme_minimal()
) %>% ggplotly(tooltip = "text")</code></pre>
<div id="htmlwidget-3" style="width:672px;height:480px;" class="plotly html-widget"></div>
<script type="application/json" data-for="htmlwidget-3">{"x":{"data":[{"x":[0.02,0.02125,0.0225,0.02375,0.025,0.02625,0.0275,0.02875,0.03,0.03125,0.0325,0.03375,0.035,0.03625,0.0375,0.03875,0.04,0.04125,0.0425,0.04375,0.045,0.04625,0.0475,0.04875,0.05,0.05125,0.0525,0.05375,0.055,0.05625,0.0575,0.05875,0.06],"y":[0.004042,0.004458,0.00505,0.005678,0.005703,0.005944,0.005607,0.005556,0.006251,0.006085,0.007005,0.008483,0.009671,0.009795,0.010899,0.010748,0.01223,0.013352,0.013163,0.013163,0.01486,0.01486,0.01486,0.013091,0.016152,0.017723,0.017723,0.0367,0.0367,0.0367,0.039321,0.039321,0.039949],"text":["drop tolerance: 2%<br />mean next day return: 0.4042%","drop tolerance: 2%<br />mean next day return: 0.4458%","drop tolerance: 2%<br />mean next day return: 0.505%","drop tolerance: 2%<br />mean next day return: 0.5678%","drop tolerance: 2%<br />mean next day return: 0.5703%","drop tolerance: 3%<br />mean next day return: 0.5944%","drop tolerance: 3%<br />mean next day return: 0.5607%","drop tolerance: 3%<br />mean next day return: 0.5556%","drop tolerance: 3%<br />mean next day return: 0.6251%","drop tolerance: 3%<br />mean next day return: 0.6085%","drop tolerance: 3%<br />mean next day return: 0.7005%","drop tolerance: 3%<br />mean next day return: 0.8483%","drop tolerance: 4%<br />mean next day return: 0.9671%","drop tolerance: 4%<br />mean next day return: 0.9795%","drop tolerance: 4%<br />mean next day return: 1.0899%","drop tolerance: 4%<br />mean next day return: 1.0748%","drop tolerance: 4%<br />mean next day return: 1.223%","drop tolerance: 4%<br />mean next day return: 1.3352%","drop tolerance: 4%<br />mean next day return: 1.3163%","drop tolerance: 4%<br />mean next day return: 1.3163%","drop tolerance: 4%<br />mean next day return: 1.486%","drop tolerance: 5%<br />mean next day return: 1.486%","drop tolerance: 5%<br />mean next day return: 1.486%","drop tolerance: 5%<br />mean next day return: 1.3091%","drop tolerance: 5%<br />mean next day return: 
1.6152%","drop tolerance: 5%<br />mean next day return: 1.7723%","drop tolerance: 5%<br />mean next day return: 1.7723%","drop tolerance: 5%<br />mean next day return: 3.67%","drop tolerance: 6%<br />mean next day return: 3.67%","drop tolerance: 6%<br />mean next day return: 3.67%","drop tolerance: 6%<br />mean next day return: 3.9321%","drop tolerance: 6%<br />mean next day return: 3.9321%","drop tolerance: 6%<br />mean next day return: 3.9949%"],"type":"scatter","mode":"markers","marker":{"autocolorscale":false,"color":"rgba(100,149,237,1)","opacity":1,"size":5.66929133858268,"symbol":"circle","line":{"width":1.88976377952756,"color":"rgba(100,149,237,1)"}},"hoveron":"points","showlegend":false,"xaxis":"x","yaxis":"y","hoverinfo":"text","frame":null}],"layout":{"margin":{"t":43.7625570776256,"r":7.30593607305936,"b":40.1826484018265,"l":48.9497716894977},"font":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187},"title":{"text":"Mean Return after Large Daily Drop","font":{"color":"rgba(0,0,0,1)","family":"","size":17.5342465753425},"x":0,"xref":"paper"},"xaxis":{"domain":[0,1],"automargin":true,"type":"linear","autorange":false,"range":[0.018,0.062],"tickmode":"array","ticktext":["2.0%","3.0%","4.0%","5.0%","6.0%"],"tickvals":[0.02,0.03,0.04,0.05,0.06],"categoryorder":"array","categoryarray":["2.0%","3.0%","4.0%","5.0%","6.0%"],"nticks":null,"ticks":"","tickcolor":null,"ticklen":3.65296803652968,"tickwidth":0,"showticklabels":true,"tickfont":{"color":"rgba(77,77,77,1)","family":"","size":11.689497716895},"tickangle":-0,"showline":false,"linecolor":null,"linewidth":0,"showgrid":true,"gridcolor":"rgba(235,235,235,1)","gridwidth":0.66417600664176,"zeroline":false,"anchor":"y","title":{"text":"daily 
drop","font":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187}},"hoverformat":".2f"},"yaxis":{"domain":[0,1],"automargin":true,"type":"linear","autorange":false,"range":[0.00224665,0.04174435],"tickmode":"array","ticktext":["1.0%","2.0%","3.0%","4.0%"],"tickvals":[0.01,0.02,0.03,0.04],"categoryorder":"array","categoryarray":["1.0%","2.0%","3.0%","4.0%"],"nticks":null,"ticks":"","tickcolor":null,"ticklen":3.65296803652968,"tickwidth":0,"showticklabels":true,"tickfont":{"color":"rgba(77,77,77,1)","family":"","size":11.689497716895},"tickangle":-0,"showline":false,"linecolor":null,"linewidth":0,"showgrid":true,"gridcolor":"rgba(235,235,235,1)","gridwidth":0.66417600664176,"zeroline":false,"anchor":"x","title":{"text":"mean return","font":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187}},"hoverformat":".2f"},"shapes":[{"type":"rect","fillcolor":null,"line":{"color":null,"width":0,"linetype":[]},"yref":"paper","xref":"paper","x0":0,"x1":1,"y0":0,"y1":1}],"showlegend":false,"legend":{"bgcolor":null,"bordercolor":null,"borderwidth":0,"font":{"color":"rgba(0,0,0,1)","family":"","size":11.689497716895}},"hovermode":"closest","barmode":"relative"},"config":{"doubleClick":"reset","showSendToCloud":false},"source":"A","attrs":{"ac8679f63715":{"x":{},"y":{},"text":{},"type":"scatter"}},"cur_data":"ac8679f63715","visdat":{"ac8679f63715":["function (y) ","x"]},"highlight":{"on":"plotly_click","persistent":false,"dynamic":false,"selectize":false,"opacityDim":0.2,"selected":{"opacity":1},"debounce":0},"shinyEvents":["plotly_hover","plotly_click","plotly_selected","plotly_relayout","plotly_brushed","plotly_brushing","plotly_clickannotation","plotly_doubleclick","plotly_deselect","plotly_afterplot"],"base_url":"https://plot.ly"},"evals":[],"jsHooks":[]}</script>
<p>I did not expect that gap upward when the daily drop passes 5.25%.</p>
<p>A quick addendum, one I would not have included had I gotten my act together and finished this post four days ago: I’m curious how this last week compares with other weeks in terms of volatility. I had in mind to visualize weekly return dispersion, and that seemed a mighty tall task until the brand new <code>slider</code> package came to the rescue! <code>slider</code> has a function called <code>slide_period()</code> that, among other things, allows us to break up a time series according to different periodicities.</p>
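<p>Before applying it to returns, here is a minimal sketch of what <code>slide_period()</code> does (assuming the <code>slider</code> package is installed): given fourteen consecutive days starting on a Monday, it splits them into Monday-anchored weekly groups. The <code>dates</code> vector and <code>weekly</code> name are illustrative, not from the post.</p>

```r
# A toy sketch of slide_period(): chop dates into Monday-anchored weeks.
library(slider)

dates <- as.Date("2016-12-05") + 0:13  # 14 consecutive days, starting a Monday

# One list element per week; each element holds that week's dates
weekly <- slide_period(dates, dates, "week",
                       ~ .x,
                       .origin = as.Date("2016-12-05"))

length(weekly)       # 2 weeks
length(weekly[[1]])  # 7 days in the first week
```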
<p>To break up our returns by week, we call <code>slide_period_dfr(., .$date, "week", ~ .x, .origin = first_monday_december, .names_to = "week")</code>, where <code>first_monday_december</code> is a date that falls on a Monday. We could use our eyeballs to check a calendar and find a date that’s a Monday, or we could use some good ol’ code. Let’s assume we want to find the first Monday in December of 2016.</p>
<p>We first filter our data with <code>filter(between(date, as_date("2016-12-01"), as_date("2016-12-31")))</code>. Then we create a column of weekday names with <code>wday(date, label = TRUE, abbr = FALSE)</code> and keep the first row whose value is “Monday”.</p>
<pre class="r"><code># R chunk
first_monday_december <-
  sp_500_returns %>%
  mutate(date = ymd(date)) %>%
  filter(between(date, as_date("2016-12-01"), as_date("2016-12-31"))) %>%
  mutate(day_week = wday(date, label = TRUE, abbr = FALSE)) %>%
  filter(day_week == "Monday") %>%
  slice(1) %>%
  pull(date)</code></pre>
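<p>For completeness, a calendar-only sketch that finds the same Monday without touching the returns data at all, using <code>lubridate</code> date arithmetic. The trick is that <code>wday()</code> defaults to Sunday&nbsp;=&nbsp;1, Monday&nbsp;=&nbsp;2, so <code>(2 - wday(x)) %% 7</code> is the number of days to roll forward to the next Monday; the variable names here are illustrative.</p>

```r
# Calendar-only way to find the first Monday of December 2016.
library(lubridate)

dec_start <- ymd("2016-12-01")

# wday(): Sunday = 1, Monday = 2; (2 - wday) %% 7 rolls forward to the next Monday
first_monday <- dec_start + (2 - wday(dec_start)) %% 7

first_monday                      # "2016-12-05"
wday(first_monday, label = TRUE)  # Mon
```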
<p>Now we run our <code>slide_period_dfr()</code> code and it will start on the first Monday in December of 2016, and break our returns into weeks. Since we set <code>.names_to = "week"</code>, the function will create a new column called <code>week</code> and give a unique number to each of our weeks.</p>
<pre class="r"><code># R chunk
sp_500_returns %>%
  select(date, daily_returns_log) %>%
  filter(date >= first_monday_december) %>%
  slide_period_dfr(.,
                   .$date,
                   "week",
                   ~ .x,
                   .origin = first_monday_december,
                   .names_to = "week") %>%
  head(10)</code></pre>
<pre><code># A tibble: 10 x 3
    week date                daily_returns_log
   <int> <dttm>                          <dbl>
 1     1 2016-12-05 00:00:00           0.00589
 2     1 2016-12-06 00:00:00           0.00342
 3     1 2016-12-07 00:00:00           0.0133 
 4     1 2016-12-08 00:00:00           0.00226
 5     1 2016-12-09 00:00:00           0.00589
 6     2 2016-12-12 00:00:00          -0.00105
 7     2 2016-12-13 00:00:00           0.00667
 8     2 2016-12-14 00:00:00          -0.00810
 9     2 2016-12-15 00:00:00           0.00392
10     2 2016-12-16 00:00:00          -0.00172</code></pre>
<p>From here, we can <code>group_by</code> that <code>week</code> column and treat each week as a discrete time period. Let’s use <code>ggplotly</code> to plot the start of each week on the x-axis and that week’s daily returns on the y-axis, so that the vertical spread shows us the dispersion of returns within each week. Hover over a point to see the exact date of the return.</p>
<pre class="r"><code># R chunk
(
  sp_500_returns %>%
    select(date, daily_returns_log) %>%
    filter(date >= first_monday_december) %>%
    slide_period_dfr(.,
                     .$date,
                     "week",
                     ~ .x,
                     .origin = first_monday_december,
                     .names_to = "week") %>%
    group_by(week) %>%
    mutate(start_week = ymd(min(date))) %>%
    ggplot(aes(x = start_week, y = daily_returns_log, text = str_glue("date: {date}"))) +
    geom_point(color = "cornflowerblue", alpha = .5) +
    scale_y_continuous(labels = scales::percent,
                       breaks = scales::pretty_breaks(n = 8)) +
    scale_x_date(breaks = scales::pretty_breaks(n = 10)) +
    labs(y = "", x = "", title = "Weekly Daily Returns") +
    theme_minimal()
) %>% ggplotly(tooltip = "text")</code></pre>
<div id="htmlwidget-4" style="width:672px;height:480px;" class="plotly html-widget"></div>
<script type="application/json" data-for="htmlwidget-4">{"x":{"data":[{"x":[17140,17140,17140,17140,17140,17147,17147,17147,17147,17147,17154,17154,17154,17154,17154,17162,17162,17162,17162,17169,17169,17169,17169,17175,17175,17175,17175,17175,17183,17183,17183,17183,17189,17189,17189,17189,17189,17196,17196,17196,17196,17196,17203,17203,17203,17203,17203,17210,17210,17210,17210,17210,17218,17218,17218,17218,17224,17224,17224,17224,17224,17231,17231,17231,17231,17231,17238,17238,17238,17238,17238,17245,17245,17245,17245,17245,17252,17252,17252,17252,17252,17259,17259,17259,17259,17259,17266,17266,17266,17266,17273,17273,17273,17273,17273,17280,17280,17280,17280,17280,17287,17287,17287,17287,17287,17294,17294,17294,17294,17294,17301,17301,17301,17301,17301,17308,17308,17308,17308,17308,17316,17316,17316,17316,17322,17322,17322,17322,17322,17329,17329,17329,17329,17329,17336,17336,17336,17336,17336,17343,17343,17343,17343,17343,17350,17350,17350,17350,17357,17357,17357,17357,17357,17364,17364,17364,17364,17364,17371,17371,17371,17371,17371,17378,17378,17378,17378,17378,17385,17385,17385,17385,17385,17392,17392,17392,17392,17392,17399,17399,17399,17399,17399,17406,17406,17406,17406,17406,17414,17414,17414,17414,17420,17420,17420,17420,17420,17427,17427,17427,17427,17427,17434,17434,17434,17434,17434,17441,17441,17441,17441,17441,17448,17448,17448,17448,17448,17455,17455,17455,17455,17455,17462,17462,17462,17462,17462,17469,17469,17469,17469,17469,17476,17476,17476,17476,17476,17483,17483,17483,17483,17483,17490,17490,17490,17490,17497,17497,17497,17497,17497,17504,17504,17504,17504,17504,17511,17511,17511,17511,17511,17518,17518,17518,17518,17518,17526,17526,17526,17526,17533,17533,17533,17533,17539,17539,17539,17539,17539,17547,17547,17547,17547,17553,17553,17553,17553,17553,17560,17560,17560,17560,17560,17567,17567,17567,17567,17567,17574,17574,17574,17574,17574,17582,17582,17582,17582,17588,17588,17588,17588,17588,17595,17595,17595,17595,17595,17602,17602,17602,1760
2,17602,17609,17609,17609,17609,17609,17616,17616,17616,17616,17623,17623,17623,17623,17623,17630,17630,17630,17630,17630,17637,17637,17637,17637,17637,17644,17644,17644,17644,17644,17651,17651,17651,17651,17651,17658,17658,17658,17658,17658,17665,17665,17665,17665,17665,17672,17672,17672,17672,17672,17680,17680,17680,17680,17686,17686,17686,17686,17686,17693,17693,17693,17693,17693,17700,17700,17700,17700,17700,17707,17707,17707,17707,17707,17714,17714,17714,17714,17721,17721,17721,17721,17721,17728,17728,17728,17728,17728,17735,17735,17735,17735,17735,17742,17742,17742,17742,17742,17749,17749,17749,17749,17749,17756,17756,17756,17756,17756,17763,17763,17763,17763,17763,17770,17770,17770,17770,17770,17778,17778,17778,17778,17784,17784,17784,17784,17784,17791,17791,17791,17791,17791,17798,17798,17798,17798,17798,17805,17805,17805,17805,17805,17812,17812,17812,17812,17812,17819,17819,17819,17819,17819,17826,17826,17826,17826,17826,17833,17833,17833,17833,17833,17840,17840,17840,17840,17840,17847,17847,17847,17847,17847,17854,17854,17854,17854,17861,17861,17861,17861,17861,17868,17868,17868,17868,17875,17875,17875,17875,17875,17882,17882,17882,17882,17882,17889,17889,17889,17889,17896,17896,17896,17896,17903,17903,17903,17903,17903,17910,17910,17910,17910,17910,17918,17918,17918,17918,17924,17924,17924,17924,17924,17931,17931,17931,17931,17931,17938,17938,17938,17938,17938,17946,17946,17946,17946,17952,17952,17952,17952,17952,17959,17959,17959,17959,17959,17966,17966,17966,17966,17966,17973,17973,17973,17973,17973,17980,17980,17980,17980,17980,17987,17987,17987,17987,17987,17994,17994,17994,17994,17994,18001,18001,18001,18001,18008,18008,18008,18008,18008,18015,18015,18015,18015,18015,18022,18022,18022,18022,18022,18029,18029,18029,18029,18029,18036,18036,18036,18036,18036,18044,18044,18044,18044,18050,18050,18050,18050,18050,18057,18057,18057,18057,18057,18064,18064,18064,18064,18064,18071,18071,18071,18071,18071,18078,18078,18078,18078,18085,18085,18085,18085,18085,
18092,18092,18092,18092,18092,18099,18099,18099,18099,18099,18106,18106,18106,18106,18106,18113,18113,18113,18113,18113,18120,18120,18120,18120,18120,18127,18127,18127,18127,18127,18134,18134,18134,18134,18134,18142,18142,18142,18142,18148,18148,18148,18148,18148,18155,18155,18155,18155,18155,18162,18162,18162,18162,18162,18169,18169,18169,18169,18169,18176,18176,18176,18176,18176,18183,18183,18183,18183,18183,18190,18190,18190,18190,18190,18197,18197,18197,18197,18197,18204,18204,18204,18204,18204,18211,18211,18211,18211,18211,18218,18218,18218,18218,18218,18225,18225,18225,18225,18232,18232,18232,18232,18232,18239,18239,18239,18239,18239,18246,18246,18246,18246,18246,18253,18253,18253,18253,18260,18260,18260,18260,18267,18267,18267,18267,18267,18274,18274,18274,18274,18274,18282,18282,18282,18282,18288,18288,18288,18288,18288,18295,18295,18295,18295,18295,18302,18302,18302,18302,18302,18310,18310,18310,18310,18316,18316,18316,18316,18316,18323,18323,18323,18323,18323,18330,18330,18330,18330,18330],"y":[0.00588872052062656,0.00341914073176226,0.0132721735089472,0.00225901900853694,0.00588786076653522,-0.00105057074793766,0.00666669135831657,-0.00810106884545037,0.00391576833894494,-0.00171722995863414,0.00200314847360021,0.00375698931987735,-0.00245233284504492,-0.00172479920094205,0.0013417675003589,0.00224821314001941,-0.00820404246501246,-0.000240900003399625,-0.0046365695232261,0.00848402292971288,0.00593445149173236,-0.000763759643845448,0.00381297826276414,-0.00357424070549143,0,0.00290801816260692,-0.00214444076602734,0.00185878088937569,-0.00300436573977333,0.00195624749237293,-0.00362924805020046,0.0033908860727472,-0.00267354308032826,0.00652802928183992,0.00799455522168045,-0.000706996942217174,-0.000801867923560657,-0.00601085653865374,-0.000902377112613031,0.00052252809800033,0.000664641118197784,0.007282044662885,-0.00212239202411445,0.000283246000939109,0.000990729682505355,0.0059238537128626,0.0036496390871896,0.0054496047677694,0.00431065118593242,
0.00512076407867814,-0.000782562702608663,0.00170244134332327,0.00600437234217965,-0.00100585231591872,0.00054877213594458,0.00169015411539618,0.00118596921230501,-0.00255614526587626,0.0137532522102737,-0.00578689620760803,0.00049862878142739,-0.00326827344912327,-0.0028229313470212,-0.00200821610204816,0.00077637982583811,0.00328153067844677,0.000682298918859142,-0.00337053382162209,0.00840512887580106,-0.00163006601623778,-0.00131504383527178,-0.00195308126522272,-0.0124389443206164,0.00194013364636671,-0.00106198786369776,-0.000831908353447792,-0.00101771762478404,0.00719427563376599,0.00128581942761204,0.00297857935916589,-0.00229042702075644,-0.00160642604827458,0.000734686414019835,-0.00294198980281332,0.00220730338879357,-0.000827167916050775,0.000735294150512251,-0.00133311900582045,-0.00373298126900934,-0.00681046900262903,0.00851738704678414,-0.00286189264747164,-0.00157290928060245,0.00751831220654058,-0.00303741777078138,0.0108188457836989,0.00604588518009576,-0.00045330916474442,0.000634575308832628,-0.00190493526361985,0.00172366911953168,0.0011776430384145,-0.00108700585895412,0.000634230337181814,0.00411272131943391,9.01997921521291e-05,-0.000947503849431371,0.0014885318159031,-0.00184972194138911,-0.00144606649986008,0.0048721168240439,-0.000495149793075669,-0.0180365155700006,0.00366082868882626,0.0067827732442793,0.00515861590163286,0.00184880444074872,0.00251968637226631,0.00457317870192712,0.000447227199002829,-0.00107368148804497,-0.000313374373112603,0.00767175930416791,0.00368112179750812,-0.0011959869427712,-0.00275176597224848,0.00173183284481314,0.000266169817994782,-0.000798722087432535,-0.000932649457863902,0.00478724318529026,-0.000929059726010033,-0.00208245716467827,0.000310428171725903,0.00830136793503116,-0.00670640060385071,-0.000531373169463083,-0.000443026766196309,0.00155635133801604,0.000310979811313093,-0.00807229269072506,0.00896027392347576,-0.00864687148999046,0.0016102342789773,0.0023659141004358,0.00164836426085094,-0.00
90324348659306,0.00635804058217186,0.000936851800492213,-0.000758336157758766,0.00733613765218236,0.00190303436436721,0.0046757914923088,-4.40092422524702e-05,0.000615953207126958,0.00543908701170366,-8.74928909429634e-05,-0.000350048135312302,-0.00105088020870407,0.0028872674151693,0.000305736937864124,-0.000873782221768919,-0.00131210655620911,-0.00074429197200847,0.00244969501324575,0.000655150598811398,-0.00196674087108766,0.00187941225329946,0.00165794104294428,-0.00226945499541703,-0.000262191925965314,-0.0142171016326582,0.0013290215927371,0.0100432524212556,-0.000263007935563413,0.00166454986087063,-0.0154816191697133,-0.00182412792776085,0.00115714998006848,0.00995825795550631,-0.00339678734952106,-0.00203476869437175,0.00176959876311367,0.000486091173402696,0.00101560956343203,0.00484305089635086,0.00573720013700744,0.0020939675231938,-0.00756767461931009,0.0031565129234988,0,-0.0014016647060674,0.0107251322350894,0.0034199907790328,0.000777605015779705,-0.000777605015779786,0.00194287995816762,0.00150852344190816,0.00111484411278496,0.000605039129934634,-0.00298552890338266,0.000649786675975387,-0.00225440137087842,0.000173596042646803,0.00407085573650316,0.00138205083737587,0.00374782354399301,0.00386233421537598,0.00218205242183077,0.00132402265916986,0.00578798909119554,-0.000764136562679528,-0.00182780438148529,0.00233729362924568,0.00182354114292986,-0.00161133054442417,0.000890793077036163,0.00177920914517331,0.000634665446583378,0.000803331755710563,0.000380300442036671,0.00518299624354316,-0.00395856661117379,0.00164428647627447,-0.0046870137529745,0.00126892834884783,0.00804130214542466,-0.00314987345964681,0.000967016409650522,0.00159563334282057,0.00029365495776713,0.00318271553030685,0.00137882070014931,-0.000208790062264938,0.00146061580462643,-0.00350906868524065,-0.000544172139558006,0.000962564681364832,-0.00213563448744618,-0.00533805357283116,0.0084771165246164,-0.00259425226015856,0.00129796739816107,0.00654782862812132,-0.0006653360191
19057,0.00211921709836015,-0.000290607163228265,0.00983357095650577,-0.000246720672401639,0.00851766001394787,-0.00200004148372136,-0.00102197249569697,-0.0037288248248273,-8.21085474258576e-05,0.0031153950179513,0.00559150896661265,0.0032101451811654,0.00154045757750115,-0.000445677944278044,-0.0038980072097564,0.00899119094269316,0.00538824738457709,-0.00325360291847984,-0.000684201020848296,0.00197084025610528,-0.000457371424830432,-0.00101008068998365,0.000848536325698429,0.00201743128440392,-0.00513225753225674,0.00831152656439213,0.00636830028477681,0.00422311384629617,0.00701229028713175,0.00165673975820409,0.00157529963245385,-0.00110244912334702,0.00710517425102419,0.00670541672733697,-0.00350303954673108,0.00935293516356029,-0.00154631235370503,0.00440070194873943,0.00805620653655563,0.00217553185923867,-0.000533902842146636,0.000610151412434649,0.0117864665779948,-0.00672894082467211,-0.0108690504251714,0.000498361168634869,-0.000460016875264185,-0.0213928480827801,-0.0419194313316355,0.0174931574475171,-0.00503029290660924,-0.0380739952452425,0.0151398308371826,0.013815375411172,0.00268347386063027,0.0135908369038433,0.012222219547165,0.000474721110130742,-0.00583092031065574,-0.00546503943096568,0.00103945957470873,0.0159749705841459,0.0117673712665498,-0.0126328996861261,-0.0110817614063063,-0.0132202870895453,0.00510851657011594,0.010973116508532,0.00269488531478032,-0.000435445249549667,0.00474009419702139,0.0172293442144125,-0.00124026216411052,-0.00634157006863497,-0.00547903687523453,-0.00070663057939746,0.00172643847327928,-0.0143330588289602,0.00151009407164588,-0.00174873857216107,-0.0254616340956171,-0.0211645777107907,0.026761869676189,-0.0174296351350718,-0.00274063787530736,0.0137107247933594,-0.0226070975543947,0.0125903193108707,0.0115741556074526,0.00697833842527081,-0.022163280499466,0.00361078646674249,0.0165985733587232,-0.00547498814004955,0.0083231814637994,-0.00284819332374994,0.00807585096496585,0.010614018351466,0.000839479572287
366,-0.00565007213264187,-0.00851515466098712,4.05276700059018e-05,-0.0134640956848953,0.00184672259337948,0.0104008279293104,0.00113548817921407,-0.00817992148772007,0.00253030376184733,-0.00724062806210534,-0.00213719205103062,0.012917646818221,0.00344639179901106,-0.000242885480774751,0.0097097920452428,0.00957670620739111,0.00206291976204359,0.000911089607048309,-0.00683355857495073,0.00433598547142331,-0.000595580807933976,-0.00254513777749041,0.00737893616679213,-0.00312741491219433,0.00324598495762866,-0.00197800523402202,-0.0022597087852631,-0.0115765134604598,0.0127664703657799,-0.00668207084148164,0.0108752808306137,0.00449014106665587,0.000746400616046096,0.00856326959495753,-0.000545235048807458,0.00315046642438082,0.00108674569175696,0.0017440850885456,-0.00399651322501282,0.00279535842209886,-0.000930918185142667,-0.0020977398725926,-0.00401348757023277,0.00171647073435334,-0.00629510788375295,0.00192010090534972,-0.0138362772605729,0.00218076651132764,-0.0086386795618383,0.00628014624342184,0.000837036938353746,0.00310287463510126,-0.00497721467872274,0.00882253467325116,0.00851001905185041,0.00906043845583175,0.00349284322959351,-0.0071539962970243,0.00881850207175707,0.00108237669510343,-0.00100502520990986,0.00397553450245544,0.00215484152929615,-0.00385119477761847,-0.000926497904525697,0.00185213820519133,0.00480723933178195,0.00905156369968265,-0.00300809365317419,-0.00661912006725051,-0.00569758239806661,0.00492953320509934,-0.00107626086658843,0.00502542536491861,0.00477182100706507,0.00361141598713065,0.00284193366874582,-0.000264905676229683,-0.00117399797629917,-0.00684413317971266,-0.00401415047274448,0.00645299018455877,-0.00744946428212201,0.00817235129599029,0.00341711927908646,0.00242286698139752,0.00211520381014133,-0.000377386977112972,-0.001624388850214,0.00618123481698807,0.00774776964491722,0.000260955471316173,0.00576091983695275,-0.00427105067008403,0.000260499043231031,-0.0016012813667249,-0.00279908372891054,-0.003369275424431
52,-0.00213976093823167,0.0018772297611243,0.00378128901896052,0.000373608313426386,0.00551314490112963,0.000445682458668617,-0.00558535857741556,0.00539968169178166,0.00122483081111181,0.00786983796090797,-0.000368093647210356,-0.00350366398716625,-0.00125619193740099,-0.00331181577964932,0.00294024813146625,3.71629781326438e-05,0.003635288520137,-0.000370342941811434,0.000703534363494577,-0.00799008282995308,-0.00549996709583203,-0.000412781213349286,-0.00112663373758661,-0.0334313191353684,-0.0207674254572238,0.0141796065939739,-0.00588375432372248,0.0212910640538393,-0.000231098102263174,-0.0143943201926949,-0.000351775492429108,-0.00430950715954468,-0.0054723468782407,-0.0313170878723618,0.018403941832028,-0.0174675168713135,-0.00657346451910667,0.0156478634297123,0.0107896680773494,0.0105165959236922,-0.00621722624366024,0.00558538955230693,0.00633972708744026,0.0209749008561237,-0.00200077019403978,-0.00905333687754058,-0.0199013496196551,-0.00146799209051544,-0.00733246291515558,0.0108599432383336,0.0022921287303344,-0.016758278545163,-0.0182696025942631,0.00310216992344155,-0.00650109618732963,0.0155079730448189,0.00326619778665917,0.0227662463390749,-0.00193008430158784,0.00828479128635069,0.0108891174729183,-0.0328297916475234,-0.00123982658305742,-0.0234437928825339,0.00176009541778109,-0.000286316135842199,0.00538535021101411,0,-0.0191685930779897,-0.0209363362344234,0.000127700329831578,-0.0153984470424081,-0.0158588258124477,-0.0208562760367049,-0.0274114301096395,0.048399443987988,0.00861301641082907,-0.00108892142207727,0.0085919382966517,0.00125224016211494,-0.0248166870893708,0.0337953748378913,0.00694698811093253,0.00967906286302529,0.00440686504231202,0.00451260532605916,-0.000166770898700756,-0.00518374138495843,0.0107140159838661,0.00223667408836545,0.00766603103120299,0.0130934099863744,-0.0142025813027834,0.00217610071303672,0.00139349995877061,0.00848292199448653,-0.00786875938574234,-0.00143363327661398,0.0155369782596732,0.008759649373106
27,0.000999660198747119,0.00677131560396137,0.004712786687626,-0.00213582330146885,-0.00926834932805236,0.000998582096458154,0.000718362165132452,0.012882434336555,0.00306724581801219,-0.00227987520134125,0.0109196599792625,0.00151694927862203,0.00198023711056987,-0.00345826149470922,0.00640194632263035,0.00139135834439765,-0.000811390419972009,-0.00038660790689864,-0.00251660120817442,0.00691509583698472,-0.00389593348940266,-0.00108275339283339,-0.00652125593405341,-0.00793701506020419,-0.00204330300177222,0.0145651650412811,0.00298051276396572,0.00704811406184524,-0.000575848913283651,0.00501791794678749,0.00369939973014294,-7.6138267249883e-05,-0.00289679454636915,0.0108802539277968,-0.019128518074068,-0.000735621536272455,0.00713971734050575,-0.00462535514727059,0.00370199404764494,0.00675159845656983,0.0115177553785334,3.77936091312978e-05,0.00215187806006586,0.00229776916245451,0.00465466305837547,0.00104806118288923,-0.00577797727969383,0.00353078532313754,3.74946101802421e-05,0.00672623140483425,-0.000633324040973703,0.000521590116498186,-0.00219997478592448,0.00160384958232733,0.00104298601420304,0.00885863574045208,-0.00221655859346665,-0.000369904568720126,0.00468765276572734,0.00110411847367189,0.000992665388174759,-0.00752465650680336,-0.00207553538050465,0.00967368347223155,-0.00441908286777586,-0.0165975825625453,-0.00157728739319513,-0.00270972278252964,0.00406183566437168,-0.0243914531240554,0.00808232235707736,0.006009679187852,0.00917281784119216,-0.00580345312865928,-0.00675012043491127,0.00848714979702704,-0.00283366582361291,-0.0118749968180793,0.00149210931582765,-0.00840776620059574,-0.00692499430636563,0.00224910906037149,-0.0131001132655509,-0.0027509252594546,0.0212206456164607,0.00824913074611323,0.0063988030173087,0.0105369481420559,0.00464768452827925,-0.00033660589401486,-0.00202201821295682,0.00437588356039615,-0.00145650129371009,0.000933898721081272,0.00969817169273144,0.00295388465731732,0.00954030646309462,-0.00124241775841392,-0
.00172000554716308,-0.00953176781483476,-0.00126619343780242,0.00397304794571282,0.00576434364112667,0.00767102915612445,0.0029572318517857,0.00787963190189599,-0.00173761989863138,-0.00483048862683501,0.0015279397617875,0.00453367477650045,0.00227720544454713,0.00468266830716849,0.000179665463499349,-0.00338312361541502,-0.00654672742458374,0.00365843023300098,-0.00616547124065043,0.00283358347041139,0.0068328748696599,0.00470876663091135,-0.00524935588631008,0.00736287783503274,-0.00161160389422099,-0.00251211328957508,-0.010910907042806,-0.00893987089158802,-0.00724680837034186,-0.030208795616614,0.0130424304129711,0.000788421508448197,0.0188853196402659,-0.0065026461306895,-0.0118574042970223,0.014707516344295,-0.0294442273974584,0.00266109261774477,0.0144733417480853,0.0120879979692974,-0.00786850364855841,0.00823813476076841,-0.000480547093309887,-0.0262985749771865,0.0110606816441489,-0.00323381495618574,0.00656946259655714,0.0127522873218828,0.000775494416547213,-0.00685251315560837,0.0108680649787098,0.0131489530968103,0.000942985707238418,-3.62522430598388e-05,0.000362463304776879,0.00725779061948734,0.00301756883794474,-0.000466409551974018,-0.0031269681332563,0.00258853281329045,0.000358989090439081,7.17823558255201e-05,-0.00485691084466101,-0.000108199736665373,-0.00836673418701712,0.00612328174660497,-0.00236299301069442,-0.00521832337116193,0.00507272587498624,-0.0122696431840528,-0.0180721089975081,0.00818405509640439,0.0141170119753536,-0.00448695125726428,-0.0156775630968053,0.0093543959865245,0.0064337437674376,0.0109599176360078,-0.00138625441726032,0.00991671615006132,-0.00195376161109631,0.00282087634141387,-0.0039079511304936,0.00686493283861717,-0.0035349746438483,0.0028866297018602,0.00190781361300472,0.00409130630993748,0.00553602499635452,-0.000320609874038817,0.00284626797611524,-0.00298879425447104,0.00971638849805277,0.00369855473656108,-0.00119613030954841,0.000774130022095565,0.00319573238213196,0.00273109413432751,-0.0019950655012040
3,0.00161036267957496,0.000769257704260529,0.00108295062203045,0.00782517426134182,0.000519507518566037,-0.000554150951087295,-0.00357471704860455,-0.0014961468321149,0.00219126711578988,0.00758012040264296,0.0022043890025671,0.00429148371292756,-0.00380992068770861,-0.00863443922046568,-0.00664731770618618,0.00640447716452307,0.00176791795758205,0.00913626185806468,-0.00309353011975973,-0.00103330701853796,0.00289057324430685,0.00869005434500262,0.000238423679983085,0.00712640453194408,0.000338089123284223,-0.00033808912328409,0.00448726271244726,0.0049868715957355,0.000840887353713678,-0.000100869155748439,0.00519830130066486,3.34498503506997e-05,-0.00566896843828003,0.00298943843223046,0.00844957853381644,-0.00697494859984315,0.00351024179914631,-0.00270681379678949,0.0049402597308271,0.00686900213730364,-0.00284806127080409,0.00694034116424453,-0.00141720110088719,0.0018781823403582,0.00835947106048487,0.00387729987416181,-0.00263753592298627,0.00032599837291488,0.00120527076170801,-0.0090587142670908,-0.0158943743447527,0.0100296413803616,-0.000826459988412042,0.00326878603192304,-0.0177931760775275,0.0072219378385445,0.0148802428175121,0.0111951819571812,0.00349922596571038,-0.00522092795178342,0.0074501510703482,0.00170893362228745,0.00645463938752394,-0.00128118911866612,0.00198514409440526,-0.00288295413963191,0.00486416524354664,-0.00380611676333751,-0.0105340428063456,-0.0339920089373183,-0.030721403622287,-0.00380768226310746,-0.0449721171175911,-0.00804943247366468,0.0449779106309723,-0.0284755867330371,0.0413628554230104,-0.034265797066975,-0.0171837294382038,-0.0788904037942658,0.0482220795960846,-0.0500007713435121,-0.0997191412471524,0.089099994786046],"text":["date: 2016-12-05","date: 2016-12-06","date: 2016-12-07","date: 2016-12-08","date: 2016-12-09","date: 2016-12-12","date: 2016-12-13","date: 2016-12-14","date: 2016-12-15","date: 2016-12-16","date: 2016-12-19","date: 2016-12-20","date: 2016-12-21","date: 2016-12-22","date: 2016-12-23","date: 
2016-12-27","date: 2016-12-28","date: 2016-12-29","date: 2016-12-30","date: 2017-01-03","date: 2017-01-04","date: 2017-01-05","date: 2017-01-06","date: 2017-01-09","date: 2017-01-10","date: 2017-01-11","date: 2017-01-12","date: 2017-01-13","date: 2017-01-17","date: 2017-01-18","date: 2017-01-19","date: 2017-01-20","date: 2017-01-23","date: 2017-01-24","date: 2017-01-25","date: 2017-01-26","date: 2017-01-27","date: 2017-01-30","date: 2017-01-31","date: 2017-02-01","date: 2017-02-02","date: 2017-02-03","date: 2017-02-06","date: 2017-02-07","date: 2017-02-08","date: 2017-02-09","date: 2017-02-10","date: 2017-02-13","date: 2017-02-14","date: 2017-02-15","date: 2017-02-16","date: 2017-02-17","date: 2017-02-21","date: 2017-02-22","date: 2017-02-23","date: 2017-02-24","date: 2017-02-27","date: 2017-02-28","date: 2017-03-01","date: 2017-03-02","date: 2017-03-03","date: 2017-03-06","date: 2017-03-07","date: 2017-03-08","date: 2017-03-09","date: 2017-03-10","date: 2017-03-13","date: 2017-03-14","date: 2017-03-15","date: 2017-03-16","date: 2017-03-17","date: 2017-03-20","date: 2017-03-21","date: 2017-03-22","date: 2017-03-23","date: 2017-03-24","date: 2017-03-27","date: 2017-03-28","date: 2017-03-29","date: 2017-03-30","date: 2017-03-31","date: 2017-04-03","date: 2017-04-04","date: 2017-04-05","date: 2017-04-06","date: 2017-04-07","date: 2017-04-10","date: 2017-04-11","date: 2017-04-12","date: 2017-04-13","date: 2017-04-17","date: 2017-04-18","date: 2017-04-19","date: 2017-04-20","date: 2017-04-21","date: 2017-04-24","date: 2017-04-25","date: 2017-04-26","date: 2017-04-27","date: 2017-04-28","date: 2017-05-01","date: 2017-05-02","date: 2017-05-03","date: 2017-05-04","date: 2017-05-05","date: 2017-05-08","date: 2017-05-09","date: 2017-05-10","date: 2017-05-11","date: 2017-05-12","date: 2017-05-15","date: 2017-05-16","date: 2017-05-17","date: 2017-05-18","date: 2017-05-19","date: 2017-05-22","date: 2017-05-23","date: 2017-05-24","date: 2017-05-25","date: 2017-05-26","date: 
2017-05-30","date: 2017-05-31","date: 2017-06-01","date: 2017-06-02","date: 2017-06-05","date: 2017-06-06","date: 2017-06-07","date: 2017-06-08","date: 2017-06-09","date: 2017-06-12","date: 2017-06-13","date: 2017-06-14","date: 2017-06-15","date: 2017-06-16","date: 2017-06-19","date: 2017-06-20","date: 2017-06-21","date: 2017-06-22","date: 2017-06-23","date: 2017-06-26","date: 2017-06-27","date: 2017-06-28","date: 2017-06-29","date: 2017-06-30","date: 2017-07-03","date: 2017-07-05","date: 2017-07-06","date: 2017-07-07","date: 2017-07-10","date: 2017-07-11","date: 2017-07-12","date: 2017-07-13","date: 2017-07-14","date: 2017-07-17","date: 2017-07-18","date: 2017-07-19","date: 2017-07-20","date: 2017-07-21","date: 2017-07-24","date: 2017-07-25","date: 2017-07-26","date: 2017-07-27","date: 2017-07-28","date: 2017-07-31","date: 2017-08-01","date: 2017-08-02","date: 2017-08-03","date: 2017-08-04","date: 2017-08-07","date: 2017-08-08","date: 2017-08-09","date: 2017-08-10","date: 2017-08-11","date: 2017-08-14","date: 2017-08-15","date: 2017-08-16","date: 2017-08-17","date: 2017-08-18","date: 2017-08-21","date: 2017-08-22","date: 2017-08-23","date: 2017-08-24","date: 2017-08-25","date: 2017-08-28","date: 2017-08-29","date: 2017-08-30","date: 2017-08-31","date: 2017-09-01","date: 2017-09-05","date: 2017-09-06","date: 2017-09-07","date: 2017-09-08","date: 2017-09-11","date: 2017-09-12","date: 2017-09-13","date: 2017-09-14","date: 2017-09-15","date: 2017-09-18","date: 2017-09-19","date: 2017-09-20","date: 2017-09-21","date: 2017-09-22","date: 2017-09-25","date: 2017-09-26","date: 2017-09-27","date: 2017-09-28","date: 2017-09-29","date: 2017-10-02","date: 2017-10-03","date: 2017-10-04","date: 2017-10-05","date: 2017-10-06","date: 2017-10-09","date: 2017-10-10","date: 2017-10-11","date: 2017-10-12","date: 2017-10-13","date: 2017-10-16","date: 2017-10-17","date: 2017-10-18","date: 2017-10-19","date: 2017-10-20","date: 2017-10-23","date: 2017-10-24","date: 2017-10-25","date: 
2017-10-26","date: 2017-10-27","date: 2017-10-30","date: 2017-10-31","date: 2017-11-01","date: 2017-11-02","date: 2017-11-03","date: 2017-11-06","date: 2017-11-07","date: 2017-11-08","date: 2017-11-09","date: 2017-11-10","date: 2017-11-13","date: 2017-11-14","date: 2017-11-15","date: 2017-11-16","date: 2017-11-17","date: 2017-11-20","date: 2017-11-21","date: 2017-11-22","date: 2017-11-24","date: 2017-11-27","date: 2017-11-28","date: 2017-11-29","date: 2017-11-30","date: 2017-12-01","date: 2017-12-04","date: 2017-12-05","date: 2017-12-06","date: 2017-12-07","date: 2017-12-08","date: 2017-12-11","date: 2017-12-12","date: 2017-12-13","date: 2017-12-14","date: 2017-12-15","date: 2017-12-18","date: 2017-12-19","date: 2017-12-20","date: 2017-12-21","date: 2017-12-22","date: 2017-12-26","date: 2017-12-27","date: 2017-12-28","date: 2017-12-29","date: 2018-01-02","date: 2018-01-03","date: 2018-01-04","date: 2018-01-05","date: 2018-01-08","date: 2018-01-09","date: 2018-01-10","date: 2018-01-11","date: 2018-01-12","date: 2018-01-16","date: 2018-01-17","date: 2018-01-18","date: 2018-01-19","date: 2018-01-22","date: 2018-01-23","date: 2018-01-24","date: 2018-01-25","date: 2018-01-26","date: 2018-01-29","date: 2018-01-30","date: 2018-01-31","date: 2018-02-01","date: 2018-02-02","date: 2018-02-05","date: 2018-02-06","date: 2018-02-07","date: 2018-02-08","date: 2018-02-09","date: 2018-02-12","date: 2018-02-13","date: 2018-02-14","date: 2018-02-15","date: 2018-02-16","date: 2018-02-20","date: 2018-02-21","date: 2018-02-22","date: 2018-02-23","date: 2018-02-26","date: 2018-02-27","date: 2018-02-28","date: 2018-03-01","date: 2018-03-02","date: 2018-03-05","date: 2018-03-06","date: 2018-03-07","date: 2018-03-08","date: 2018-03-09","date: 2018-03-12","date: 2018-03-13","date: 2018-03-14","date: 2018-03-15","date: 2018-03-16","date: 2018-03-19","date: 2018-03-20","date: 2018-03-21","date: 2018-03-22","date: 2018-03-23","date: 2018-03-26","date: 2018-03-27","date: 2018-03-28","date: 
2018-03-29","date: 2018-04-02","date: 2018-04-03","date: 2018-04-04","date: 2018-04-05","date: 2018-04-06","date: 2018-04-09","date: 2018-04-10","date: 2018-04-11","date: 2018-04-12","date: 2018-04-13","date: 2018-04-16","date: 2018-04-17","date: 2018-04-18","date: 2018-04-19","date: 2018-04-20","date: 2018-04-23","date: 2018-04-24","date: 2018-04-25","date: 2018-04-26","date: 2018-04-27","date: 2018-04-30","date: 2018-05-01","date: 2018-05-02","date: 2018-05-03","date: 2018-05-04","date: 2018-05-07","date: 2018-05-08","date: 2018-05-09","date: 2018-05-10","date: 2018-05-11","date: 2018-05-14","date: 2018-05-15","date: 2018-05-16","date: 2018-05-17","date: 2018-05-18","date: 2018-05-21","date: 2018-05-22","date: 2018-05-23","date: 2018-05-24","date: 2018-05-25","date: 2018-05-29","date: 2018-05-30","date: 2018-05-31","date: 2018-06-01","date: 2018-06-04","date: 2018-06-05","date: 2018-06-06","date: 2018-06-07","date: 2018-06-08","date: 2018-06-11","date: 2018-06-12","date: 2018-06-13","date: 2018-06-14","date: 2018-06-15","date: 2018-06-18","date: 2018-06-19","date: 2018-06-20","date: 2018-06-21","date: 2018-06-22","date: 2018-06-25","date: 2018-06-26","date: 2018-06-27","date: 2018-06-28","date: 2018-06-29","date: 2018-07-02","date: 2018-07-03","date: 2018-07-05","date: 2018-07-06","date: 2018-07-09","date: 2018-07-10","date: 2018-07-11","date: 2018-07-12","date: 2018-07-13","date: 2018-07-16","date: 2018-07-17","date: 2018-07-18","date: 2018-07-19","date: 2018-07-20","date: 2018-07-23","date: 2018-07-24","date: 2018-07-25","date: 2018-07-26","date: 2018-07-27","date: 2018-07-30","date: 2018-07-31","date: 2018-08-01","date: 2018-08-02","date: 2018-08-03","date: 2018-08-06","date: 2018-08-07","date: 2018-08-08","date: 2018-08-09","date: 2018-08-10","date: 2018-08-13","date: 2018-08-14","date: 2018-08-15","date: 2018-08-16","date: 2018-08-17","date: 2018-08-20","date: 2018-08-21","date: 2018-08-22","date: 2018-08-23","date: 2018-08-24","date: 2018-08-27","date: 
2018-08-28","date: 2018-08-29","date: 2018-08-30","date: 2018-08-31","date: 2018-09-04","date: 2018-09-05","date: 2018-09-06","date: 2018-09-07","date: 2018-09-10","date: 2018-09-11","date: 2018-09-12","date: 2018-09-13","date: 2018-09-14","date: 2018-09-17","date: 2018-09-18","date: 2018-09-19","date: 2018-09-20","date: 2018-09-21","date: 2018-09-24","date: 2018-09-25","date: 2018-09-26","date: 2018-09-27","date: 2018-09-28","date: 2018-10-01","date: 2018-10-02","date: 2018-10-03","date: 2018-10-04","date: 2018-10-05","date: 2018-10-08","date: 2018-10-09","date: 2018-10-10","date: 2018-10-11","date: 2018-10-12","date: 2018-10-15","date: 2018-10-16","date: 2018-10-17","date: 2018-10-18","date: 2018-10-19","date: 2018-10-22","date: 2018-10-23","date: 2018-10-24","date: 2018-10-25","date: 2018-10-26","date: 2018-10-29","date: 2018-10-30","date: 2018-10-31","date: 2018-11-01","date: 2018-11-02","date: 2018-11-05","date: 2018-11-06","date: 2018-11-07","date: 2018-11-08","date: 2018-11-09","date: 2018-11-12","date: 2018-11-13","date: 2018-11-14","date: 2018-11-15","date: 2018-11-16","date: 2018-11-19","date: 2018-11-20","date: 2018-11-21","date: 2018-11-23","date: 2018-11-26","date: 2018-11-27","date: 2018-11-28","date: 2018-11-29","date: 2018-11-30","date: 2018-12-03","date: 2018-12-04","date: 2018-12-06","date: 2018-12-07","date: 2018-12-10","date: 2018-12-11","date: 2018-12-12","date: 2018-12-13","date: 2018-12-14","date: 2018-12-17","date: 2018-12-18","date: 2018-12-19","date: 2018-12-20","date: 2018-12-21","date: 2018-12-24","date: 2018-12-26","date: 2018-12-27","date: 2018-12-28","date: 2018-12-31","date: 2019-01-02","date: 2019-01-03","date: 2019-01-04","date: 2019-01-07","date: 2019-01-08","date: 2019-01-09","date: 2019-01-10","date: 2019-01-11","date: 2019-01-14","date: 2019-01-15","date: 2019-01-16","date: 2019-01-17","date: 2019-01-18","date: 2019-01-22","date: 2019-01-23","date: 2019-01-24","date: 2019-01-25","date: 2019-01-28","date: 2019-01-29","date: 
2019-01-30","date: 2019-01-31","date: 2019-02-01","date: 2019-02-04","date: 2019-02-05","date: 2019-02-06","date: 2019-02-07","date: 2019-02-08","date: 2019-02-11","date: 2019-02-12","date: 2019-02-13","date: 2019-02-14","date: 2019-02-15","date: 2019-02-19","date: 2019-02-20","date: 2019-02-21","date: 2019-02-22","date: 2019-02-25","date: 2019-02-26","date: 2019-02-27","date: 2019-02-28","date: 2019-03-01","date: 2019-03-04","date: 2019-03-05","date: 2019-03-06","date: 2019-03-07","date: 2019-03-08","date: 2019-03-11","date: 2019-03-12","date: 2019-03-13","date: 2019-03-14","date: 2019-03-15","date: 2019-03-18","date: 2019-03-19","date: 2019-03-20","date: 2019-03-21","date: 2019-03-22","date: 2019-03-25","date: 2019-03-26","date: 2019-03-27","date: 2019-03-28","date: 2019-03-29","date: 2019-04-01","date: 2019-04-02","date: 2019-04-03","date: 2019-04-04","date: 2019-04-05","date: 2019-04-08","date: 2019-04-09","date: 2019-04-10","date: 2019-04-11","date: 2019-04-12","date: 2019-04-15","date: 2019-04-16","date: 2019-04-17","date: 2019-04-18","date: 2019-04-22","date: 2019-04-23","date: 2019-04-24","date: 2019-04-25","date: 2019-04-26","date: 2019-04-29","date: 2019-04-30","date: 2019-05-01","date: 2019-05-02","date: 2019-05-03","date: 2019-05-06","date: 2019-05-07","date: 2019-05-08","date: 2019-05-09","date: 2019-05-10","date: 2019-05-13","date: 2019-05-14","date: 2019-05-15","date: 2019-05-16","date: 2019-05-17","date: 2019-05-20","date: 2019-05-21","date: 2019-05-22","date: 2019-05-23","date: 2019-05-24","date: 2019-05-28","date: 2019-05-29","date: 2019-05-30","date: 2019-05-31","date: 2019-06-03","date: 2019-06-04","date: 2019-06-05","date: 2019-06-06","date: 2019-06-07","date: 2019-06-10","date: 2019-06-11","date: 2019-06-12","date: 2019-06-13","date: 2019-06-14","date: 2019-06-17","date: 2019-06-18","date: 2019-06-19","date: 2019-06-20","date: 2019-06-21","date: 2019-06-24","date: 2019-06-25","date: 2019-06-26","date: 2019-06-27","date: 2019-06-28","date: 
2019-07-01","date: 2019-07-02","date: 2019-07-03","date: 2019-07-05","date: 2019-07-08","date: 2019-07-09","date: 2019-07-10","date: 2019-07-11","date: 2019-07-12","date: 2019-07-15","date: 2019-07-16","date: 2019-07-17","date: 2019-07-18","date: 2019-07-19","date: 2019-07-22","date: 2019-07-23","date: 2019-07-24","date: 2019-07-25","date: 2019-07-26","date: 2019-07-29","date: 2019-07-30","date: 2019-07-31","date: 2019-08-01","date: 2019-08-02","date: 2019-08-05","date: 2019-08-06","date: 2019-08-07","date: 2019-08-08","date: 2019-08-09","date: 2019-08-12","date: 2019-08-13","date: 2019-08-14","date: 2019-08-15","date: 2019-08-16","date: 2019-08-19","date: 2019-08-20","date: 2019-08-21","date: 2019-08-22","date: 2019-08-23","date: 2019-08-26","date: 2019-08-27","date: 2019-08-28","date: 2019-08-29","date: 2019-08-30","date: 2019-09-03","date: 2019-09-04","date: 2019-09-05","date: 2019-09-06","date: 2019-09-09","date: 2019-09-10","date: 2019-09-11","date: 2019-09-12","date: 2019-09-13","date: 2019-09-16","date: 2019-09-17","date: 2019-09-18","date: 2019-09-19","date: 2019-09-20","date: 2019-09-23","date: 2019-09-24","date: 2019-09-25","date: 2019-09-26","date: 2019-09-27","date: 2019-09-30","date: 2019-10-01","date: 2019-10-02","date: 2019-10-03","date: 2019-10-04","date: 2019-10-07","date: 2019-10-08","date: 2019-10-09","date: 2019-10-10","date: 2019-10-11","date: 2019-10-14","date: 2019-10-15","date: 2019-10-16","date: 2019-10-17","date: 2019-10-18","date: 2019-10-21","date: 2019-10-22","date: 2019-10-23","date: 2019-10-24","date: 2019-10-25","date: 2019-10-28","date: 2019-10-29","date: 2019-10-30","date: 2019-10-31","date: 2019-11-01","date: 2019-11-04","date: 2019-11-05","date: 2019-11-06","date: 2019-11-07","date: 2019-11-08","date: 2019-11-11","date: 2019-11-12","date: 2019-11-13","date: 2019-11-14","date: 2019-11-15","date: 2019-11-18","date: 2019-11-19","date: 2019-11-20","date: 2019-11-21","date: 2019-11-22","date: 2019-11-25","date: 2019-11-26","date: 
2019-11-27","date: 2019-11-29","date: 2019-12-02","date: 2019-12-03","date: 2019-12-04","date: 2019-12-05","date: 2019-12-06","date: 2019-12-09","date: 2019-12-10","date: 2019-12-11","date: 2019-12-12","date: 2019-12-13","date: 2019-12-16","date: 2019-12-17","date: 2019-12-18","date: 2019-12-19","date: 2019-12-20","date: 2019-12-23","date: 2019-12-24","date: 2019-12-26","date: 2019-12-27","date: 2019-12-30","date: 2019-12-31","date: 2020-01-02","date: 2020-01-03","date: 2020-01-06","date: 2020-01-07","date: 2020-01-08","date: 2020-01-09","date: 2020-01-10","date: 2020-01-13","date: 2020-01-14","date: 2020-01-15","date: 2020-01-16","date: 2020-01-17","date: 2020-01-21","date: 2020-01-22","date: 2020-01-23","date: 2020-01-24","date: 2020-01-27","date: 2020-01-28","date: 2020-01-29","date: 2020-01-30","date: 2020-01-31","date: 2020-02-03","date: 2020-02-04","date: 2020-02-05","date: 2020-02-06","date: 2020-02-07","date: 2020-02-10","date: 2020-02-11","date: 2020-02-12","date: 2020-02-13","date: 2020-02-14","date: 2020-02-18","date: 2020-02-19","date: 2020-02-20","date: 2020-02-21","date: 2020-02-24","date: 2020-02-25","date: 2020-02-26","date: 2020-02-27","date: 2020-02-28","date: 2020-03-02","date: 2020-03-03","date: 2020-03-04","date: 2020-03-05","date: 2020-03-06","date: 2020-03-09","date: 2020-03-10","date: 2020-03-11","date: 2020-03-12","date: 2020-03-13"],"type":"scatter","mode":"markers","marker":{"autocolorscale":false,"color":"rgba(100,149,237,1)","opacity":0.5,"size":5.66929133858268,"symbol":"circle","line":{"width":1.88976377952756,"color":"rgba(100,149,237,1)"}},"hoveron":"points","showlegend":false,"xaxis":"x","yaxis":"y","hoverinfo":"text","frame":null}],"layout":{"margin":{"t":43.7625570776256,"r":7.30593607305936,"b":25.5707762557078,"l":46.027397260274},"font":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187},"title":{"text":"Weekly Daily 
Returns","font":{"color":"rgba(0,0,0,1)","family":"","size":17.5342465753425},"x":0,"xref":"paper"},"xaxis":{"domain":[0,1],"automargin":true,"type":"linear","autorange":false,"range":[17080.5,18389.5],"tickmode":"array","ticktext":["2017-01","2017-07","2018-01","2018-07","2019-01","2019-07","2020-01"],"tickvals":[17167,17348,17532,17713,17897,18078,18262],"categoryorder":"array","categoryarray":["2017-01","2017-07","2018-01","2018-07","2019-01","2019-07","2020-01"],"nticks":null,"ticks":"","tickcolor":null,"ticklen":3.65296803652968,"tickwidth":0,"showticklabels":true,"tickfont":{"color":"rgba(77,77,77,1)","family":"","size":11.689497716895},"tickangle":-0,"showline":false,"linecolor":null,"linewidth":0,"showgrid":true,"gridcolor":"rgba(235,235,235,1)","gridwidth":0.66417600664176,"zeroline":false,"anchor":"y","title":{"text":"","font":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187}},"hoverformat":".2f"},"yaxis":{"domain":[0,1],"automargin":true,"type":"linear","autorange":false,"range":[-0.109160098048812,0.0985409515877059],"tickmode":"array","ticktext":["-10.0%","-8.0%","-6.0%","-4.0%","-2.0%","0.0%","2.0%","4.0%","6.0%","8.0%"],"tickvals":[-0.1,-0.08,-0.06,-0.04,-0.02,0,0.02,0.04,0.06,0.08],"categoryorder":"array","categoryarray":["-10.0%","-8.0%","-6.0%","-4.0%","-2.0%","0.0%","2.0%","4.0%","6.0%","8.0%"],"nticks":null,"ticks":"","tickcolor":null,"ticklen":3.65296803652968,"tickwidth":0,"showticklabels":true,"tickfont":{"color":"rgba(77,77,77,1)","family":"","size":11.689497716895},"tickangle":-0,"showline":false,"linecolor":null,"linewidth":0,"showgrid":true,"gridcolor":"rgba(235,235,235,1)","gridwidth":0.66417600664176,"zeroline":false,"anchor":"x","title":{"text":"","font":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187}},"hoverformat":".2f"},"shapes":[{"type":"rect","fillcolor":null,"line":{"color":null,"width":0,"linetype":[]},"yref":"paper","xref":"paper","x0":0,"x1":1,"y0":0,"y1":1}],"showlegend":false,"legend":{"bgcolor":null
,"bordercolor":null,"borderwidth":0,"font":{"color":"rgba(0,0,0,1)","family":"","size":11.689497716895}},"hovermode":"closest","barmode":"relative"},"config":{"doubleClick":"reset","showSendToCloud":false},"source":"A","attrs":{"ac86495568f6":{"x":{},"y":{},"text":{},"type":"scatter"}},"cur_data":"ac86495568f6","visdat":{"ac86495568f6":["function (y) ","x"]},"highlight":{"on":"plotly_click","persistent":false,"dynamic":false,"selectize":false,"opacityDim":0.2,"selected":{"opacity":1},"debounce":0},"shinyEvents":["plotly_hover","plotly_click","plotly_selected","plotly_relayout","plotly_brushed","plotly_brushing","plotly_clickannotation","plotly_doubleclick","plotly_deselect","plotly_afterplot"],"base_url":"https://plot.ly"},"evals":[],"jsHooks":[]}</script>
<p>We can also plot the standard deviation of returns for each week.</p>
<pre class="r"><code># R chunk
(
  sp_500_returns %>%
    select(date, daily_returns_log) %>%
    filter(date >= first_monday_december) %>%
    slide_period_dfr(.,
                     .$date,
                     "week",
                     ~ .x,
                     .origin = first_monday_december,
                     .names_to = "week") %>%
    group_by(week) %>%
    summarise(first_of_week = first(date),
              sd = sd(daily_returns_log)) %>%
    ggplot(aes(x = first_of_week, y = sd, text = str_glue("week: {first_of_week}"))) +
    geom_point(aes(color = sd)) +
    labs(x = "", title = "Weekly Standard Dev of Returns", y = "") +
    theme_minimal()
) %>% ggplotly(tooltip = "text")</code></pre>
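<p>The core of the computation — grouping daily returns into weeks and taking the standard deviation of each group — can also be sketched in base R without the <code>slider</code> package. This is not the post’s code: the synthetic returns below stand in for <code>sp_500_returns</code>.</p>
<pre class="r"><code># Base-R sketch of the weekly standard-deviation computation.
# Synthetic daily log returns stand in for sp_500_returns.
set.seed(123)
dates = seq(as.Date("2019-12-02"), as.Date("2020-03-13"), by = "day")
dates = dates[!weekdays(dates) %in% c("Saturday", "Sunday")]
daily_returns_log = rnorm(length(dates), mean = 0, sd = 0.01)

# Assign each date to the Monday that starts its week (%u: 1 = Monday),
# then compute the per-week standard deviation of returns.
week_start = dates - (as.integer(format(dates, "%u")) - 1)
weekly_sd  = tapply(daily_returns_log, week_start, sd)
head(weekly_sd)</code></pre>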
<div id="htmlwidget-5" style="width:672px;height:480px;" class="plotly html-widget"></div>
<script type="application/json" data-for="htmlwidget-5">{"x":{"data":[{"x":[1480896000,1481500800,1482105600,1482796800,1483401600,1483920000,1484611200,1485129600,1485734400,1486339200,1486944000,1487635200,1488153600,1488758400,1489363200,1489968000,1490572800,1491177600,1491782400,1492387200,1492992000,1493596800,1494201600,1494806400,1495411200,1496102400,1496620800,1497225600,1497830400,1498435200,1499040000,1499644800,1500249600,1500854400,1501459200,1502064000,1502668800,1503273600,1503878400,1504569600,1505088000,1505692800,1506297600,1506902400,1507507200,1508112000,1508716800,1509321600,1509926400,1510531200,1511136000,1511740800,1512345600,1512950400,1513555200,1514246400,1514851200,1515369600,1516060800,1516579200,1517184000,1517788800,1518393600,1519084800,1519603200,1520208000,1520812800,1521417600,1522022400,1522627200,1523232000,1523836800,1524441600,1525046400,1525651200,1526256000,1526860800,1527552000,1528070400,1528675200,1529280000,1529884800,1530489600,1531094400,1531699200,1532304000,1532908800,1533513600,1534118400,1534723200,1535328000,1536019200,1536537600,1537142400,1537747200,1538352000,1538956800,1539561600,1540166400,1540771200,1541376000,1541980800,1542585600,1543190400,1543795200,1544400000,1545004800,1545609600,1546214400,1546819200,1547424000,1548115200,1548633600,1549238400,1549843200,1550534400,1551052800,1551657600,1552262400,1552867200,1553472000,1554076800,1554681600,1555286400,1555891200,1556496000,1557100800,1557705600,1558310400,1559001600,1559520000,1560124800,1560729600,1561334400,1561939200,1562544000,1563148800,1563753600,1564358400,1564963200,1565568000,1566172800,1566777600,1567468800,1567987200,1568592000,1569196800,1569801600,1570406400,1571011200,1571616000,1572220800,1572825600,1573430400,1574035200,1574640000,1575244800,1575849600,1576454400,1577059200,1577664000,1578268800,1578873600,1579564800,1580083200,1580688000,1581292800,1581984000,1582502400,1583107200,1583712000],"y":[0.00428542758271371,0.005688618094413
07,0.00260796479741234,0.00463955794747408,0.00391741481793808,0.00269741741395485,0.00351704941088118,0.00483308645419438,0.00474877415206346,0.00311258887176715,0.00264948870806455,0.00300703845118184,0.00742902894305432,0.00277434373390841,0.00461849866133215,0.00554449147735816,0.00371747382211484,0.00200920331663207,0.00324568935145387,0.00579427829778431,0.00529118971328767,0.00188720452501024,0.00134296587543603,0.0100826311872125,0.00194718811639403,0.00403435768384228,0.00167330728202289,0.00268409961777066,0.0053814863947088,0.00735001147793435,0.00657965023344988,0.00320277284662278,0.00244348282914921,0.00173341233786234,0.00182694746613941,0.00659512823944477,0.00921743782178293,0.00520108257894348,0.0023362211601529,0.00450002589746643,0.00447057468661814,0.00180698839992965,0.00262211305946341,0.0024901358468847,0.00193642611595415,0.00198760906022879,0.00513684015835072,0.00234250650318223,0.00201761429303829,0.00530318274327972,0.00304803399322164,0.00555381615098833,0.00363745757809798,0.00477618457516234,0.00325667459244177,0.00313404542426052,0.00170722276979584,0.00357316346616288,0.00584707324263508,0.00528158495465767,0.00891725092563695,0.0283656437260912,0.00644680930969736,0.010197568714238,0.0116325660191617,0.00705946191651277,0.00340059155871215,0.0118107180475284,0.0192565041451108,0.01807155350197,0.00886326437556684,0.00832553141187192,0.00857554846549429,0.00860643934736129,0.00451708956128494,0.00414401773579414,0.00451827900874718,0.0122838052182871,0.00355125688639771,0.0026822542682101,0.00358428118879697,0.00831022352103518,0.00645287911781114,0.0066631388121584,0.00304648234095383,0.00620342246518341,0.00482950351692109,0.00414547845924569,0.00676492408220786,0.00300557941172333,0.00481004112934702,0.00077039084883359,0.00222477657346596,0.00522813200972381,0.00265885056167373,0.00473831298598575,0.0187484488002918,0.0131839351075612,0.0183745210290145,0.0104518103764823,0.0112010313303377,0.0114762484504,0.00995575389439934,0.
009785725021822,0.0200338089594582,0.00960874980868858,0.00863736436641497,0.0314432154449168,0.0241164518850932,0.00363801407935785,0.00731639356428921,0.00964701766765257,0.00910961323548259,0.00631050002959617,0.00656164495496061,0.0040330202868636,0.00362899794664017,0.00290552944247068,0.00564796998240457,0.0111248305220034,0.00505651707501505,0.00444044960599729,0.00463682405863611,0.00162891702220012,0.00440947277880287,0.00624041059658424,0.00760088530262248,0.0141827236338453,0.00778639109465693,0.00642703026645009,0.00861334204435986,0.00322672475436481,0.0050108048551864,0.00597667780771788,0.00456023815692009,0.00387030293659101,0.0043499043589986,0.00510592326839474,0.00404473230258807,0.0192291783496026,0.0188582443716662,0.0152210778853089,0.00676803340448989,0.00925356973123229,0.00322440616798477,0.00296953869446457,0.0054906827759299,0.0138508559646498,0.0112501965114402,0.00550641382972188,0.00382393175423247,0.00496105222015224,0.00202734114896572,0.00361595998846953,0.00216200838610327,0.00479241843997551,0.00783474394872398,0.00454788204552359,0.00319777606587138,0.0025050092577236,0.00732020684174918,0.00447995127355144,0.00392070249037435,0.00464557648025062,0.0121569914328974,0.00772994685840006,0.00361996158123424,0.00630444840185479,0.0176530560760263,0.0387490341140696,0.0825608101381797],"text":["week: 2016-12-05","week: 2016-12-12","week: 2016-12-19","week: 2016-12-27","week: 2017-01-03","week: 2017-01-09","week: 2017-01-17","week: 2017-01-23","week: 2017-01-30","week: 2017-02-06","week: 2017-02-13","week: 2017-02-21","week: 2017-02-27","week: 2017-03-06","week: 2017-03-13","week: 2017-03-20","week: 2017-03-27","week: 2017-04-03","week: 2017-04-10","week: 2017-04-17","week: 2017-04-24","week: 2017-05-01","week: 2017-05-08","week: 2017-05-15","week: 2017-05-22","week: 2017-05-30","week: 2017-06-05","week: 2017-06-12","week: 2017-06-19","week: 2017-06-26","week: 2017-07-03","week: 2017-07-10","week: 2017-07-17","week: 2017-07-24","week: 
2017-07-31","week: 2017-08-07","week: 2017-08-14","week: 2017-08-21","week: 2017-08-28","week: 2017-09-05","week: 2017-09-11","week: 2017-09-18","week: 2017-09-25","week: 2017-10-02","week: 2017-10-09","week: 2017-10-16","week: 2017-10-23","week: 2017-10-30","week: 2017-11-06","week: 2017-11-13","week: 2017-11-20","week: 2017-11-27","week: 2017-12-04","week: 2017-12-11","week: 2017-12-18","week: 2017-12-26","week: 2018-01-02","week: 2018-01-08","week: 2018-01-16","week: 2018-01-22","week: 2018-01-29","week: 2018-02-05","week: 2018-02-12","week: 2018-02-20","week: 2018-02-26","week: 2018-03-05","week: 2018-03-12","week: 2018-03-19","week: 2018-03-26","week: 2018-04-02","week: 2018-04-09","week: 2018-04-16","week: 2018-04-23","week: 2018-04-30","week: 2018-05-07","week: 2018-05-14","week: 2018-05-21","week: 2018-05-29","week: 2018-06-04","week: 2018-06-11","week: 2018-06-18","week: 2018-06-25","week: 2018-07-02","week: 2018-07-09","week: 2018-07-16","week: 2018-07-23","week: 2018-07-30","week: 2018-08-06","week: 2018-08-13","week: 2018-08-20","week: 2018-08-27","week: 2018-09-04","week: 2018-09-10","week: 2018-09-17","week: 2018-09-24","week: 2018-10-01","week: 2018-10-08","week: 2018-10-15","week: 2018-10-22","week: 2018-10-29","week: 2018-11-05","week: 2018-11-12","week: 2018-11-19","week: 2018-11-26","week: 2018-12-03","week: 2018-12-10","week: 2018-12-17","week: 2018-12-24","week: 2018-12-31","week: 2019-01-07","week: 2019-01-14","week: 2019-01-22","week: 2019-01-28","week: 2019-02-04","week: 2019-02-11","week: 2019-02-19","week: 2019-02-25","week: 2019-03-04","week: 2019-03-11","week: 2019-03-18","week: 2019-03-25","week: 2019-04-01","week: 2019-04-08","week: 2019-04-15","week: 2019-04-22","week: 2019-04-29","week: 2019-05-06","week: 2019-05-13","week: 2019-05-20","week: 2019-05-28","week: 2019-06-03","week: 2019-06-10","week: 2019-06-17","week: 2019-06-24","week: 2019-07-01","week: 2019-07-08","week: 2019-07-15","week: 2019-07-22","week: 2019-07-29","week: 
2019-08-05","week: 2019-08-12","week: 2019-08-19","week: 2019-08-26","week: 2019-09-03","week: 2019-09-09","week: 2019-09-16","week: 2019-09-23","week: 2019-09-30","week: 2019-10-07","week: 2019-10-14","week: 2019-10-21","week: 2019-10-28","week: 2019-11-04","week: 2019-11-11","week: 2019-11-18","week: 2019-11-25","week: 2019-12-02","week: 2019-12-09","week: 2019-12-16","week: 2019-12-23","week: 2019-12-30","week: 2020-01-06","week: 2020-01-13","week: 2020-01-21","week: 2020-01-27","week: 2020-02-03","week: 2020-02-10","week: 2020-02-18","week: 2020-02-24","week: 2020-03-02","week: 2020-03-09"],"type":"scatter","mode":"markers","marker":{"autocolorscale":false,"color":["rgba(22,48,74,1)","rgba(23,50,77,1)","rgba(20,46,71,1)","rgba(22,49,74,1)","rgba(21,48,73,1)","rgba(20,46,71,1)","rgba(21,47,72,1)","rgba(22,49,75,1)","rgba(22,49,75,1)","rgba(21,46,72,1)","rgba(20,46,71,1)","rgba(21,46,71,1)","rgba(24,53,80,1)","rgba(20,46,71,1)","rgba(22,49,74,1)","rgba(22,50,76,1)","rgba(21,47,73,1)","rgba(20,45,69,1)","rgba(21,47,72,1)","rgba(23,50,77,1)","rgba(22,49,76,1)","rgba(20,45,69,1)","rgba(19,44,68,1)","rgba(26,57,85,1)","rgba(20,45,69,1)","rgba(21,48,73,1)","rgba(20,44,69,1)","rgba(20,46,71,1)","rgba(22,50,76,1)","rgba(24,53,80,1)","rgba(23,51,78,1)","rgba(21,46,72,1)","rgba(20,45,70,1)","rgba(20,44,69,1)","rgba(20,45,69,1)","rgba(23,51,78,1)","rgba(25,55,84,1)","rgba(22,49,76,1)","rgba(20,45,70,1)","rgba(22,48,74,1)","rgba(22,48,74,1)","rgba(20,44,69,1)","rgba(20,46,71,1)","rgba(20,45,70,1)","rgba(20,45,69,1)","rgba(20,45,69,1)","rgba(22,49,75,1)","rgba(20,45,70,1)","rgba(20,45,69,1)","rgba(22,50,76,1)","rgba(21,46,71,1)","rgba(23,50,76,1)","rgba(21,47,73,1)","rgba(22,49,75,1)","rgba(21,47,72,1)","rgba(21,46,72,1)","rgba(20,44,69,1)","rgba(21,47,72,1)","rgba(23,50,77,1)","rgba(22,49,76,1)","rgba(25,55,83,1)","rgba(40,85,123,1)","rgba(23,51,78,1)","rgba(26,57,85,1)","rgba(27,59,88,1)","rgba(24,52,79,1)","rgba(21,47,72,1)","rgba(27,59,89,1)","rgba(33,70,104,1)","rgba(32,
69,101,1)","rgba(25,55,83,1)","rgba(25,54,82,1)","rgba(25,54,82,1)","rgba(25,54,82,1)","rgba(22,48,74,1)","rgba(21,48,74,1)","rgba(22,48,74,1)","rgba(28,60,90,1)","rgba(21,47,72,1)","rgba(20,46,71,1)","rgba(21,47,72,1)","rgba(25,54,82,1)","rgba(23,51,78,1)","rgba(23,51,78,1)","rgba(21,46,71,1)","rgba(23,51,78,1)","rgba(22,49,75,1)","rgba(21,48,74,1)","rgba(23,52,79,1)","rgba(21,46,71,1)","rgba(22,49,75,1)","rgba(19,43,67,1)","rgba(20,45,70,1)","rgba(22,49,76,1)","rgba(20,46,71,1)","rgba(22,49,75,1)","rgba(32,70,103,1)","rgba(28,61,91,1)","rgba(32,69,102,1)","rgba(26,57,86,1)","rgba(27,58,87,1)","rgba(27,59,88,1)","rgba(26,56,85,1)","rgba(26,56,85,1)","rgba(33,72,106,1)","rgba(26,56,84,1)","rgba(25,54,82,1)","rgba(42,90,130,1)","rgba(37,78,114,1)","rgba(21,47,73,1)","rgba(24,52,80,1)","rgba(26,56,84,1)","rgba(25,55,83,1)","rgba(23,51,78,1)","rgba(23,51,78,1)","rgba(21,48,73,1)","rgba(21,47,73,1)","rgba(21,46,71,1)","rgba(23,50,76,1)","rgba(27,58,87,1)","rgba(22,49,75,1)","rgba(22,48,74,1)","rgba(22,49,74,1)","rgba(20,44,69,1)","rgba(22,48,74,1)","rgba(23,51,78,1)","rgba(24,53,80,1)","rgba(29,63,94,1)","rgba(24,53,81,1)","rgba(23,51,78,1)","rgba(25,54,82,1)","rgba(21,47,72,1)","rgba(22,49,75,1)","rgba(23,50,77,1)","rgba(22,48,74,1)","rgba(21,47,73,1)","rgba(22,48,74,1)","rgba(22,49,75,1)","rgba(21,48,73,1)","rgba(33,70,104,1)","rgba(33,70,103,1)","rgba(30,64,96,1)","rgba(23,52,79,1)","rgba(25,55,84,1)","rgba(21,47,72,1)","rgba(21,46,71,1)","rgba(22,50,76,1)","rgba(29,62,93,1)","rgba(27,58,88,1)","rgba(22,50,76,1)","rgba(21,47,73,1)","rgba(22,49,75,1)","rgba(20,45,69,1)","rgba(21,47,72,1)","rgba(20,45,70,1)","rgba(22,49,75,1)","rgba(24,53,81,1)","rgba(22,48,74,1)","rgba(21,46,72,1)","rgba(20,45,70,1)","rgba(24,52,80,1)","rgba(22,48,74,1)","rgba(21,48,73,1)","rgba(22,49,75,1)","rgba(27,60,89,1)","rgba(24,53,81,1)","rgba(21,47,73,1)","rgba(23,51,78,1)","rgba(32,68,101,1)","rgba(48,101,146,1)","rgba(86,177,247,1)"],"opacity":1,"size":5.66929133858268,"symbol":"circle","li
ne":{"width":1.88976377952756,"color":["rgba(22,48,74,1)","rgba(23,50,77,1)","rgba(20,46,71,1)","rgba(22,49,74,1)","rgba(21,48,73,1)","rgba(20,46,71,1)","rgba(21,47,72,1)","rgba(22,49,75,1)","rgba(22,49,75,1)","rgba(21,46,72,1)","rgba(20,46,71,1)","rgba(21,46,71,1)","rgba(24,53,80,1)","rgba(20,46,71,1)","rgba(22,49,74,1)","rgba(22,50,76,1)","rgba(21,47,73,1)","rgba(20,45,69,1)","rgba(21,47,72,1)","rgba(23,50,77,1)","rgba(22,49,76,1)","rgba(20,45,69,1)","rgba(19,44,68,1)","rgba(26,57,85,1)","rgba(20,45,69,1)","rgba(21,48,73,1)","rgba(20,44,69,1)","rgba(20,46,71,1)","rgba(22,50,76,1)","rgba(24,53,80,1)","rgba(23,51,78,1)","rgba(21,46,72,1)","rgba(20,45,70,1)","rgba(20,44,69,1)","rgba(20,45,69,1)","rgba(23,51,78,1)","rgba(25,55,84,1)","rgba(22,49,76,1)","rgba(20,45,70,1)","rgba(22,48,74,1)","rgba(22,48,74,1)","rgba(20,44,69,1)","rgba(20,46,71,1)","rgba(20,45,70,1)","rgba(20,45,69,1)","rgba(20,45,69,1)","rgba(22,49,75,1)","rgba(20,45,70,1)","rgba(20,45,69,1)","rgba(22,50,76,1)","rgba(21,46,71,1)","rgba(23,50,76,1)","rgba(21,47,73,1)","rgba(22,49,75,1)","rgba(21,47,72,1)","rgba(21,46,72,1)","rgba(20,44,69,1)","rgba(21,47,72,1)","rgba(23,50,77,1)","rgba(22,49,76,1)","rgba(25,55,83,1)","rgba(40,85,123,1)","rgba(23,51,78,1)","rgba(26,57,85,1)","rgba(27,59,88,1)","rgba(24,52,79,1)","rgba(21,47,72,1)","rgba(27,59,89,1)","rgba(33,70,104,1)","rgba(32,69,101,1)","rgba(25,55,83,1)","rgba(25,54,82,1)","rgba(25,54,82,1)","rgba(25,54,82,1)","rgba(22,48,74,1)","rgba(21,48,74,1)","rgba(22,48,74,1)","rgba(28,60,90,1)","rgba(21,47,72,1)","rgba(20,46,71,1)","rgba(21,47,72,1)","rgba(25,54,82,1)","rgba(23,51,78,1)","rgba(23,51,78,1)","rgba(21,46,71,1)","rgba(23,51,78,1)","rgba(22,49,75,1)","rgba(21,48,74,1)","rgba(23,52,79,1)","rgba(21,46,71,1)","rgba(22,49,75,1)","rgba(19,43,67,1)","rgba(20,45,70,1)","rgba(22,49,76,1)","rgba(20,46,71,1)","rgba(22,49,75,1)","rgba(32,70,103,1)","rgba(28,61,91,1)","rgba(32,69,102,1)","rgba(26,57,86,1)","rgba(27,58,87,1)","rgba(27,59,88,1)","rgba(26,56,85,1)"
,"rgba(26,56,85,1)","rgba(33,72,106,1)","rgba(26,56,84,1)","rgba(25,54,82,1)","rgba(42,90,130,1)","rgba(37,78,114,1)","rgba(21,47,73,1)","rgba(24,52,80,1)","rgba(26,56,84,1)","rgba(25,55,83,1)","rgba(23,51,78,1)","rgba(23,51,78,1)","rgba(21,48,73,1)","rgba(21,47,73,1)","rgba(21,46,71,1)","rgba(23,50,76,1)","rgba(27,58,87,1)","rgba(22,49,75,1)","rgba(22,48,74,1)","rgba(22,49,74,1)","rgba(20,44,69,1)","rgba(22,48,74,1)","rgba(23,51,78,1)","rgba(24,53,80,1)","rgba(29,63,94,1)","rgba(24,53,81,1)","rgba(23,51,78,1)","rgba(25,54,82,1)","rgba(21,47,72,1)","rgba(22,49,75,1)","rgba(23,50,77,1)","rgba(22,48,74,1)","rgba(21,47,73,1)","rgba(22,48,74,1)","rgba(22,49,75,1)","rgba(21,48,73,1)","rgba(33,70,104,1)","rgba(33,70,103,1)","rgba(30,64,96,1)","rgba(23,52,79,1)","rgba(25,55,84,1)","rgba(21,47,72,1)","rgba(21,46,71,1)","rgba(22,50,76,1)","rgba(29,62,93,1)","rgba(27,58,88,1)","rgba(22,50,76,1)","rgba(21,47,73,1)","rgba(22,49,75,1)","rgba(20,45,69,1)","rgba(21,47,72,1)","rgba(20,45,70,1)","rgba(22,49,75,1)","rgba(24,53,81,1)","rgba(22,48,74,1)","rgba(21,46,72,1)","rgba(20,45,70,1)","rgba(24,52,80,1)","rgba(22,48,74,1)","rgba(21,48,73,1)","rgba(22,49,75,1)","rgba(27,60,89,1)","rgba(24,53,81,1)","rgba(21,47,73,1)","rgba(23,51,78,1)","rgba(32,68,101,1)","rgba(48,101,146,1)","rgba(86,177,247,1)"]}},"hoveron":"points","showlegend":false,"xaxis":"x","yaxis":"y","hoverinfo":"text","frame":null},{"x":[1483228800],"y":[0],"name":"99_b3fb57b6de4dc5ce6cf75b6745e4a4c9","type":"scatter","mode":"markers","opacity":0,"hoverinfo":"skip","showlegend":false,"marker":{"color":[0,1],"colorscale":[[0,"#132B43"],[0.0526315789473684,"#16314B"],[0.105263157894737,"#193754"],[0.157894736842105,"#1D3E5C"],[0.210526315789474,"#204465"],[0.263157894736842,"#234B6E"],[0.315789473684211,"#275277"],[0.368421052631579,"#2A5980"],[0.421052631578947,"#2E608A"],[0.473684210526316,"#316793"],[0.526315789473684,"#356E9D"],[0.578947368421053,"#3875A6"],[0.631578947368421,"#3C7CB0"],[0.68421052631579,"#3F83BA"],[0
.736842105263158,"#438BC4"],[0.789473684210526,"#4792CE"],[0.842105263157895,"#4B9AD8"],[0.894736842105263,"#4EA2E2"],[0.947368421052632,"#52A9ED"],[1,"#56B1F7"]],"colorbar":{"bgcolor":null,"bordercolor":null,"borderwidth":0,"thickness":23.04,"title":"sd","titlefont":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187},"tickmode":"array","ticktext":["0.02","0.04","0.06","0.08"],"tickvals":[0.235108333203902,0.479635750641963,0.724163168080025,0.968690585518086],"tickfont":{"color":"rgba(0,0,0,1)","family":"","size":11.689497716895},"ticklen":2,"len":0.5}},"xaxis":"x","yaxis":"y","frame":null}],"layout":{"margin":{"t":43.7625570776256,"r":7.30593607305936,"b":25.5707762557078,"l":34.337899543379},"font":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187},"title":{"text":"Weekly Standard Dev of Returns","font":{"color":"rgba(0,0,0,1)","family":"","size":17.5342465753425},"x":0,"xref":"paper"},"xaxis":{"domain":[0,1],"automargin":true,"type":"linear","autorange":false,"range":[1475755200,1588852800],"tickmode":"array","ticktext":["2017","2018","2019","2020"],"tickvals":[1483228800,1514764800,1546300800,1577836800],"categoryorder":"array","categoryarray":["2017","2018","2019","2020"],"nticks":null,"ticks":"","tickcolor":null,"ticklen":3.65296803652968,"tickwidth":0,"showticklabels":true,"tickfont":{"color":"rgba(77,77,77,1)","family":"","size":11.689497716895},"tickangle":-0,"showline":false,"linecolor":null,"linewidth":0,"showgrid":true,"gridcolor":"rgba(235,235,235,1)","gridwidth":0.66417600664176,"zeroline":false,"anchor":"y","title":{"text":"","font":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187}},"hoverformat":".2f"},"yaxis":{"domain":[0,1],"automargin":true,"type":"linear","autorange":false,"range":[-0.00331913011563371,0.086650331102647],"tickmode":"array","ticktext":["0.00","0.02","0.04","0.06","0.08"],"tickvals":[0,0.02,0.04,0.06,0.08],"categoryorder":"array","categoryarray":["0.00","0.02","0.04","0.06","0.08"],"nticks":null,"t
icks":"","tickcolor":null,"ticklen":3.65296803652968,"tickwidth":0,"showticklabels":true,"tickfont":{"color":"rgba(77,77,77,1)","family":"","size":11.689497716895},"tickangle":-0,"showline":false,"linecolor":null,"linewidth":0,"showgrid":true,"gridcolor":"rgba(235,235,235,1)","gridwidth":0.66417600664176,"zeroline":false,"anchor":"x","title":{"text":"","font":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187}},"hoverformat":".2f"},"shapes":[{"type":"rect","fillcolor":null,"line":{"color":null,"width":0,"linetype":[]},"yref":"paper","xref":"paper","x0":0,"x1":1,"y0":0,"y1":1}],"showlegend":false,"legend":{"bgcolor":null,"bordercolor":null,"borderwidth":0,"font":{"color":"rgba(0,0,0,1)","family":"","size":11.689497716895}},"hovermode":"closest","barmode":"relative"},"config":{"doubleClick":"reset","showSendToCloud":false},"source":"A","attrs":{"ac86148a33a8":{"colour":{},"x":{},"y":{},"text":{},"type":"scatter"}},"cur_data":"ac86148a33a8","visdat":{"ac86148a33a8":["function (y) ","x"]},"highlight":{"on":"plotly_click","persistent":false,"dynamic":false,"selectize":false,"opacityDim":0.2,"selected":{"opacity":1},"debounce":0},"shinyEvents":["plotly_hover","plotly_click","plotly_selected","plotly_relayout","plotly_brushed","plotly_brushing","plotly_clickannotation","plotly_doubleclick","plotly_deselect","plotly_afterplot"],"base_url":"https://plot.ly"},"evals":[],"jsHooks":[]}</script>
<p>That’s all for today! Thanks for reading and stay safe out there.</p>
<script>window.location.href='https://rviews.rstudio.com/2020/03/16/outlier-days-with-r-and-python/';</script>
Comparing Machine Learning Algorithms for Predicting Clothing Classes: Part 3
https://rviews.rstudio.com/2020/03/10/comparing-machine-learning-algorithms-for-predicting-clothing-classes-part-3/
Tue, 10 Mar 2020 00:00:00 +0000https://rviews.rstudio.com/2020/03/10/comparing-machine-learning-algorithms-for-predicting-clothing-classes-part-3/
<p><em>Florianne Verkroost is a Ph.D. candidate at Nuffield College at the University of Oxford. She has a passion for data science and a background in mathematics and econometrics. She applies her interdisciplinary knowledge to computationally address societal problems of inequality.</em></p>
<p>This is the third post in a series devoted to comparing different machine learning methods for predicting clothing categories from images using the Fashion MNIST data by Zalando. In the <a href="https://rviews.rstudio.com/2019/11/11/a-comparison-of-methods-for-predicting-clothing-classes-using-the-fashion-mnist-dataset-in-rstudio-and-python-part-1/">first post</a> of this series, we prepared the data for analysis and used my “go-to” Python deep learning neural network model to predict the clothing categories of the Fashion MNIST data. In <a href="https://rviews.rstudio.com/2020/03/03/predicting-clothing-classes-part-2/">Part 2</a>, we used principal components analysis (PCA) to compress the clothing image data down from 784 pixels to just 17 principal components. In this post, we pick up where we left off in Part 2 and use the two data sets <code>train.data.pca</code> and <code>test.data.pca</code> to build and compare random forest and gradient-boosted models. The R code for this post can be found on my <a href="https://github.com/fverkroost/RStudio-Blogs/blob/master/machine_learning_fashion_mnist_post234.R">Github</a> repository.</p>
<h3 id="tree-based-methods">Tree-Based Methods</h3>
<p>Tree-based methods stratify or segment the predictor space into a number of simple regions using a set of decision rules that can be summarized in a decision tree. The focus here will be on classification trees, as the Fashion MNIST outcome variable is categorical with ten classes. Because single trees have a relatively low level of predictive accuracy compared to other classification approaches, I will not show you how to fit a single tree in this blog post, but you can find the code for this (as well as tree pruning) on my <a href="https://github.com/fverkroost/RStudio-Blogs/blob/master/machine_learning_fashion_mnist_post234.R">Github</a>.</p>
<p>Ensemble methods improve predictive accuracy and decrease variance by aggregating many single decision trees. Here, I show both random forests and gradient-boosted trees as ensemble methods because the former are easier to implement as they are more robust to overfitting and require less tuning, while the latter generally outperform other tree-based methods in terms of prediction accuracy. The models are estimated in supervised mode here as labeled data are available and the goal is to predict classes. For a more formal explanation of the tree-based methods, I refer you to <a href="http://faculty.marshall.usc.edu/gareth-james/ISL/">James et al. (2013)</a>.</p>
<h3 id="random-forest">Random Forest</h3>
<p>Random forests use bootstrap aggregating to reduce the variance of the outcomes. In the first step, bootstrapping (sampling with replacement) is used to create <code>B</code> training sets from the population with the same size as the original training set. Hereafter, a separate tree for each of these training sets is built. Trees are grown using recursive binary splitting on the training data until a node reaches some minimum number of observations. The idea is that the tree should go from impure (equal mixing of classes) to pure (each leaf corresponds to one class exactly). The splits are determined such that they decrease variance, error and impurity. Random forests decorrelate the trees by considering only <code>m</code> of all <code>p</code> predictors as split candidates, whereby often <code>m = sqrt(p)</code>.</p>
<p>Classification trees predict that each observation belongs to the most commonly occurring class (i.e. majority vote) of training observations in the region to which it belongs. The classification error rate is the fraction of the number of misclassified observations and the total number of classified observations. The Gini index and cross-entropy measures determine the level of impurity in order to decide on the best split at each node. In the final step, the average of the classification prediction results of all <code>B</code> trees is computed from the majority vote. The accuracy is computed as the out-of-bag (OOB) error and/or the test set error.</p>
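<p>To make the impurity measures concrete, here is a small illustrative sketch (in Python for brevity; not part of the original R analysis) computing the Gini index and cross-entropy from a node’s class proportions:</p>

```python
import math

def gini(p):
    # Gini index: sum over classes of p_k * (1 - p_k)
    return sum(pk * (1 - pk) for pk in p)

def cross_entropy(p):
    # Cross-entropy (deviance): -sum of p_k * log(p_k), skipping empty classes
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

pure = [1.0, 0.0]    # node containing a single class: perfectly pure
mixed = [0.5, 0.5]   # equal mixing of two classes: maximally impure

print(gini(pure), gini(mixed))                    # 0.0 and 0.5
print(cross_entropy(pure), cross_entropy(mixed))  # 0.0 and log(2) ≈ 0.693
```

<p>Both measures are zero for a pure node and maximal for an even mixture, which is why either can be used to score candidate splits.</p>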
<p>As each bootstrap samples from the training set with replacement, about one-third of the observations are not sampled at all, while some are sampled multiple times. With <code>B</code> trees in the forest, each observation is therefore left out of approximately <code>B</code>/3 trees. These non-sampled observations serve as a test set, and the corresponding <code>B</code>/3 trees are used for out-of-sample predictions. In random forests, pruning is not needed, as potential over-fitting is (partially) mitigated by the use of bootstrapped samples and multiple decorrelated random trees.</p>
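<p>The one-third figure follows directly from the bootstrap itself: the probability that a given observation is never drawn in <code>n</code> draws with replacement is (1 - 1/n)^n, which tends to 1/e ≈ 0.368 for large <code>n</code>. A quick check (illustrative Python, not part of the original analysis):</p>

```python
import math

n = 10_000  # training-set size; any reasonably large n gives the same picture
p_never_sampled = (1 - 1 / n) ** n
print(p_never_sampled)   # close to 1/e
print(math.exp(-1))      # 0.3678794...
```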
<p>We start by tuning <code>mtry</code>, the number of variables that are randomly sampled as candidates at each split. We make use of the <code>caret</code> framework, which makes it easy to train and evaluate a large number of different types of models. For the random forests, the <code>repeatedcv</code> method performs five-fold cross-validation with five repetitions. For now, we build a random forest containing 200 trees because previous analyses with these data showed that the error does not decrease substantially beyond 200 trees, while a larger forest does require more computational power. We will see later on that 200 trees is indeed sufficient for this analysis. We let the algorithm determine the best model based on the accuracy metric, and we ask it to run the model for <code>pca.dims</code> (= 17) different values of <code>mtry</code>. We first specify the controls in <code>rf_rand_control</code>: we perform five-fold cross-validation with five repeats (<code>method = "repeatedcv"</code>, <code>number = 5</code> and <code>repeats = 5</code>), allow parallel computation (<code>allowParallel = TRUE</code>) and save the predicted values (<code>savePredictions = TRUE</code>).</p>
<pre><code class="language-r">library(caret)
rf_rand_control = trainControl(method = "repeatedcv",
                               search = "random",
                               number = 5,
                               repeats = 5,
                               allowParallel = TRUE,
                               savePredictions = TRUE)
set.seed(1234)
rf_rand = train(x = train.images.pca,
                y = train.data.pca$label,
                method = "rf",
                ntree = 200,
                metric = "Accuracy",
                trControl = rf_rand_control,
                tuneLength = pca.dims)
</code></pre>
<pre><code class="language-r">print(rf_rand)
</code></pre>
<p><img src="rf_rand_print.png" alt="" /></p>
<p>We can check the model performance on both the training and test sets by means of different metrics using a custom function, <code>model_performance</code>, which can be found on my <a href="https://github.com/fverkroost/RStudio-Blogs/blob/master/machine_learning_fashion_mnist_post234.R">Github</a>.</p>
<pre><code class="language-r">mp.rf.rand = model_performance(rf_rand, train.images.pca, test.images.pca,
                               train.data.pca$label, test.data.pca$label,
                               "random_forest_random")
</code></pre>
<p><img src="rf_rand_mp.png" alt="" /></p>
<p>We can also use the <code>caret</code> framework to perform a grid search with pre-specified values for <code>mtry</code> rather than a random search as above.</p>
<pre><code class="language-r">rf_grid_control = trainControl(method = "repeatedcv",
                               search = "grid",
                               number = 5,
                               repeats = 5,
                               allowParallel = TRUE,
                               savePredictions = TRUE)
set.seed(1234)
rf_grid = train(x = train.images.pca,
                y = train.data.pca$label,
                method = "rf",
                ntree = 200,
                metric = "Accuracy",
                trControl = rf_grid_control,
                tuneGrid = expand.grid(.mtry = c(1:pca.dims)))
</code></pre>
<pre><code class="language-r">plot(rf_grid)
</code></pre>
<p><img src="rf_grid_plot.png" alt="" /></p>
<pre><code class="language-r">mp.rf.grid = model_performance(rf_grid, train.images.pca, test.images.pca,
                               train.data.pca$label, test.data.pca$label,
                               "random_forest_grid")
</code></pre>
<p><img src="rf_grid_mp.png" alt="" /></p>
<p>As shown by the results, the random search selects <code>mtry = 4</code> as the optimal parameter, resulting in 85% training and test set accuracies. The grid search selects <code>mtry = 5</code>, achieving very similar accuracies. Indeed, according to <code>rf_rand</code>, <code>mtry</code> values of 4 and 5 lead to very similar results, as do <code>mtry</code> values of 5 and 6 for <code>rf_grid</code>. Although the results of <code>rf_rand</code> and <code>rf_grid</code> are very similar, we choose the best model on the basis of accuracy and save it in <code>rf_best</code>. For this model, we’ll look at the relationship between the error and random forest size as well as the receiver operating characteristic (ROC) curves for every class. Let’s start by extracting the best-performing model from <code>rf_rand</code> and <code>rf_grid</code>.</p>
<pre><code class="language-r">rf_models = list(rf_rand$finalModel, rf_grid$finalModel)
rf_accs = unlist(lapply(rf_models, function(x) { sum(diag(x$confusion)) / sum(x$confusion) }))
rf_best = rf_models[[which.max(rf_accs)]]
</code></pre>
<p>Next, we plot the relationship between the size of the random forest and the error using the <code>plot()</code> function from the <code>randomForest</code> package.</p>
<pre><code class="language-r">library(randomForest)
plot(rf_best, main = "Relation between error and random forest size")
</code></pre>
<p><img src="rf_error_trees.png" alt="" /></p>
<p>We observe from this plot that the error does not decrease anymore for any of the classes after about 100 trees, and so we can conclude that our forest size of 200 is sufficient. We can also use the <code>varImpPlot()</code> function from the <code>randomForest</code> package to plot the importance for each variable. I will not show that here because it’s not as meaningful given that our variables are principal components of the actual pixels, but it’s good to keep in mind when extending these analyses to other data.</p>
<p>Finally, we plot the ROC curves for every class. On the x-axis of an ROC plot, we usually have the false positive rate (false positives / (true negatives + false positives)) and on the y-axis the true positive rate (true positives / (true positives + false negatives)). Essentially, the ROC plot helps us compare how well our model predicts the different classes. The area under each curve (AUC) equals the probability that the model ranks a randomly chosen observation from that class above a randomly chosen observation from the other classes. Therefore, the further a curve is “drawn” towards the top left, away from the 45-degree line, the better the classification for that class. We first need to obtain the data for the ROC curve for every class (or clothing category) in our data, which we bind together by rows, including a label for the classes.</p>
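<p>To make the TPR/FPR definitions concrete before computing full curves with <code>ROCR</code>, here is a tiny illustrative sketch (Python, with hypothetical scores and labels) that computes a single ROC point at one threshold:</p>

```python
def roc_point(scores, labels, threshold):
    # Classify as positive when score >= threshold, then count the four outcomes.
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    return fp / (fp + tn), tp / (tp + fn)  # (false positive rate, true positive rate)

# made-up class probabilities and one-vs-rest true labels
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]
print(roc_point(scores, labels, 0.5))  # (0.333..., 0.666...)
```

<p>Sweeping the threshold from 1 down to 0 traces out the whole curve; that sweep is what <code>performance(pred, "tpr", "fpr")</code> performs for us.</p>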
<pre><code class="language-r">library(ROCR)
library(plyr)
pred_roc = predict(rf_best, test.images.pca, type = "prob")
classes = unique(test.data.pca$label)
classes = classes[order(classes)]
plot_list = list()
for (i in 1:length(classes)) {
  actual = ifelse(test.data.pca$label == classes[i], 1, 0)
  pred = prediction(pred_roc[, i], actual)
  perf = performance(pred, "tpr", "fpr")
  plot_list[[i]] = data.frame(matrix(NA, nrow = length(perf@x.values[[1]]), ncol = 2))
  plot_list[[i]]['x'] = perf@x.values[[1]]
  plot_list[[i]]['y'] = perf@y.values[[1]]
}
plotdf = rbind.fill(plot_list)
plotdf["Class"] = rep(cloth_cats, unlist(lapply(plot_list, nrow)))
</code></pre>
<p>Next, we plot the ROC curves for every class. Note that we use the custom plotting theme <code>my_theme()</code> as defined in <a href="https://rviews.rstudio.com/2019/11/11/a-comparison-of-methods-for-predicting-clothing-classes-using-the-fashion-mnist-dataset-in-rstudio-and-python-part-2/">the second blog post of this series</a>.</p>
<pre><code class="language-r">ggplot() +
  geom_line(data = plotdf, aes(x = x, y = y, color = Class)) +
  labs(x = "False positive rate", y = "True positive rate", color = "Class") +
  ggtitle("ROC curve per class") +
  theme(legend.position = c(0.85, 0.35)) +
  coord_fixed() +
  my_theme()
</code></pre>
<p><img src="rf_roc.png" alt="" /></p>
<p>We observe from the ROC curves that shirts and pullovers are most often misclassified, whereas trousers, bags, boots and sneakers are most often correctly classified. A possible explanation for this could be that shirts and pullovers can be very similar in shape to other categories, such as tops, coats and dresses; whereas bags, trousers, boots and sneakers are more dissimilar to other categories in the data.</p>
<h3 id="gradient-boosted-trees">Gradient-Boosted Trees</h3>
<p>While in random forests each tree is fully grown and trained independently on a random sample of the data, in boosting every newly built tree incorporates the error of the previously built trees. That is, the trees are grown sequentially on an adapted version of the initial data, which does not require bootstrap sampling. Because of this, boosted trees are usually smaller and shallower than the trees in random forests: each new tree improves the ensemble where it does not yet perform well enough. Boosting is often said to outperform random forests, mainly because the approach learns slowly. The learning rate is controlled by the shrinkage parameter, which we’ll tune later.</p>
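<p>The effect of the shrinkage (learning rate) parameter can be seen in a deliberately stripped-down sketch (illustrative Python with a constant base learner, not what <code>xgboost</code> actually fits): each round adds <code>eta</code> times a fit to the current residuals, so the residuals shrink by a factor (1 - <code>eta</code>) per round, and a smaller <code>eta</code> means slower learning.</p>

```python
def boost_constant(y, eta, rounds):
    # Toy boosting: the "base learner" is just the mean of the residuals,
    # added with learning rate eta; residuals shrink by (1 - eta) each round.
    pred = [0.0] * len(y)
    for _ in range(rounds):
        residual = [yi - pi for yi, pi in zip(y, pred)]
        step = eta * sum(residual) / len(residual)
        pred = [pi + step for pi in pred]
    return pred

y = [10.0, 10.0, 10.0]
for eta in (0.1, 0.5):
    print(eta, boost_constant(y, eta, rounds=20)[0])  # smaller eta converges more slowly
```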
<p>In boosting, it’s important to tune the parameters well and to experiment with different parameter values, which can easily be done using the <code>caret</code> framework. These parameters include the learning rate, <code>eta</code>; the minimal loss reduction required to further partition a leaf node of the tree, <code>gamma</code>; the maximal depth of a tree, <code>max_depth</code>; the number of boosting rounds (i.e. trees), <code>nrounds</code>; the minimum number of observations in the trees’ nodes, <code>min_child_weight</code>; the fraction of the training set observations randomly selected to grow trees, <code>subsample</code>; and the proportion of independent variables used for each tree, <code>colsample_bytree</code>. An overview of all parameters can be found <a href="https://xgboost.readthedocs.io/en/latest/parameter.html#parameters-for-tree-booster">here</a>. Again, we use the <code>caret</code> framework to tune our boosting model.</p>
<pre><code class="language-r">xgb_control = trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  allowParallel = TRUE,
  savePredictions = TRUE
)
</code></pre>
<p>Next, we define the possible combinations of the tuning parameters in the form of a grid, named <code>xgb_grid</code>.</p>
<pre><code class="language-r">xgb_grid = expand.grid(
  nrounds = c(50, 100),
  max_depth = seq(5, 15, 5),
  eta = c(0.002, 0.02, 0.2),
  gamma = c(0.1, 0.5, 1.0),
  colsample_bytree = 1,
  min_child_weight = c(1, 2, 3),
  subsample = c(0.5, 0.75, 1)
)
</code></pre>
<p>We set the seed and then train the model onto the transformed principal components of the training data using <code>xgb_control</code> and <code>xgb_grid</code> as specified earlier. Note that because of the relatively large number of tuning parameters, and thus the larger number of possible combinations of these parameters (<code>nrow(xgb_grid) = 486</code>), this may take quite a long time to run.</p>
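<p>The 486 combinations mentioned above are simply the product of the grid dimensions, which can be verified directly (illustrative Python sketch mirroring the <code>expand.grid()</code> call):</p>

```python
from itertools import product

grid = {
    "nrounds": [50, 100],
    "max_depth": [5, 10, 15],        # seq(5, 15, 5)
    "eta": [0.002, 0.02, 0.2],
    "gamma": [0.1, 0.5, 1.0],
    "colsample_bytree": [1],
    "min_child_weight": [1, 2, 3],
    "subsample": [0.5, 0.75, 1],
}
combos = list(product(*grid.values()))
print(len(combos))  # 486 = 2 * 3 * 3 * 3 * 1 * 3 * 3
```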
<pre><code class="language-r">set.seed(1234)
xgb_tune = train(x = train.images.pca,
                 y = train.classes,
                 method = "xgbTree",
                 trControl = xgb_control,
                 tuneGrid = xgb_grid)
xgb_tune
</code></pre>
<p>(Note that the output of <code>xgb_tune</code> has been truncated for this post.)
<img src="xgb_tune_print_1.png" alt="" /></p>
<p><img src="xgb_tune_print_4.png" alt="" /></p>
<p>Let’s have a look at the tuning parameters resulting in the highest accuracy, and the model performance overall.</p>
<pre><code class="language-r">xgb_tune$results[which.max(xgb_tune$results$Accuracy), ]
</code></pre>
<p><img src="xgb_highest_accuracy.png" alt="" /></p>
<pre><code class="language-r">mp.xgb = model_performance(xgb_tune, train.images.pca, test.images.pca,
                           train.classes, test.classes, "xgboost")
</code></pre>
<p><img src="xgb_mp.png" alt="" /></p>
<p>The optimal combination of tuning parameter values resulted in 86.2% training and 85.5% testing accuracies. Although there may be some slight overfitting going on, the model performs a bit better than the random forest, as was expected. Let’s have a look at the confusion matrix for the test set predictions to observe what clothing categories are mostly correctly or wrongly classified.</p>
<pre><code class="language-r">table(pred = predict(xgb_tune, test.images.pca),
      true = test.classes)
</code></pre>
<p><img src="xgb_confusion.png" alt="" /></p>
<p>As we saw with the random forests, pullovers, shirts and coats are most often mixed up, while trousers, boots, bags and sneakers are most often correctly classified.</p>
<p>In the next and final post of this series, we will use the PCA reduced data again, but this time to estimate and assess support vector machines. Will these models be able to achieve similar results on the reduced data as neural networks on the full data? Let’s see!</p>
COVID-19 epidemiology with R
https://rviews.rstudio.com/2020/03/05/covid-19-epidemiology-with-r/
Thu, 05 Mar 2020 00:00:00 +0000https://rviews.rstudio.com/2020/03/05/covid-19-epidemiology-with-r/
<p><em>Tim Churches is a Senior Research Fellow at the UNSW Medicine South Western Sydney Clinical School at Liverpool Hospital, and a health data scientist at the Ingham Institute for Applied Medical Research, also located at Liverpool, Sydney. His background is in general medicine, general practice medicine, occupational health, public health practice, particularly population health surveillance, and clinical epidemiology.</em></p>
<div id="introduction" class="section level1">
<h1>Introduction</h1>
<p>As I write this on 4th March, 2020, the world is on the cusp of a global COVID-19 pandemic caused by the SARS-CoV-2 virus. Every news report is dominated by alarming, and ever-growing, cumulative counts of global cases and deaths due to COVID-19. <a href="https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html">Dashboards of global spread</a> are beginning to light up like Christmas trees.</p>
<p>For <code>R</code> users, an obvious question is: “Does <code>R</code> have anything to offer in helping to understand the situation?”.</p>
<p>The answer is: “Yes, a lot!”.</p>
<p>In fact, <code>R</code> is one of the tools of choice for outbreak epidemiologists, and a quick search will yield many <code>R</code> libraries on CRAN and elsewhere devoted to outbreak management and analysis. This post doesn’t seek to provide a review of the available packages – rather it illustrates the utility of a few of the excellent packages available in the <a href="https://www.repidemicsconsortium.org"><strong>R</strong> <strong>E</strong>pidemics <strong>Con</strong>sortium (<strong>RECON</strong>) suite</a>, as well as the use of base <code>R</code> and <a href="https://www.tidyverse.org"><code>tidyverse</code></a> packages for data acquisition, wrangling and visualization. This post is based on <a href="https://timchurches.github.io/blog/">two much longer and more detailed blog posts</a> I have published in the last few weeks on the same topic, but it uses US data.</p>
</div>
<div id="data-acquisition" class="section level1">
<h1>Data acquisition</h1>
<p>Obtaining detailed, accurate and current data for the COVID-19 epidemic is not as straightforward as it might seem. Various national and provincial/governmental web sites in affected countries provide detailed summary data on incident cases, recovered cases and deaths due to the virus, but these data tend to be in the form of counts embedded in (usually non-English) text.</p>
<p>There are several potential sources of data which have been abstracted and collated from such governmental sites. A widely-used source is a <a href="https://github.com/CSSEGISandData/COVID-19">dataset</a> which is being collated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) and which is used as the source for <a href="https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6">the dashboard</a> mentioned above. It is very easy to use – just read CSV files from the appropriate GitHub URL. However, it lacks detail (that wasn’t its intended purpose) and contains quite a few missing or anomalous data points when examined as a differenced daily time-series of incident cases, a relatively minor issue that is explored further <a href="https://timchurches.github.io/blog/posts/2020-02-18-analysing-covid-19-2019-ncov-outbreak-data-with-r-part-1/#checking-the-data">here</a>.</p>
<p>Another set of convenient sources are relevant wikipedia pages, such as <a href="https://en.wikipedia.org/wiki/Timeline_of_the_2019–20_Wuhan_coronavirus_outbreak">this one for China</a>. There are equivalent pages for <a href="https://en.wikipedia.org/wiki/Template:2019–20_coronavirus_outbreak_data/Japan_medical_cases">Japan</a>, <a href="https://en.wikipedia.org/wiki/Template:2019–20_coronavirus_outbreak_data/South_Korea_medical_cases">South Korea</a>, <a href="https://en.wikipedia.org/wiki/Template:2019–20_coronavirus_outbreak_data/Iran_medical_cases">Iran</a>, <a href="https://en.wikipedia.org/wiki/Template:2019–20_coronavirus_outbreak_data/Italy_medical_cases">Italy</a> and many other countries. These wikipedia pages tend to be much more detailed and are well-referenced back to the original source web pages, but they are quite challenging to web-scrape because the format of the tables in which the data appears changes quite often, as various wikipedia contributors adjust their appearance. Nonetheless, we’ll scrape some detailed data about the initial set of identified COVID-19 cases in the United States (as at 4th March) from a suitable wikipedia page. The saving grace is that wikipedia pages are versioned, and thus it is possible to scrape data from a specific version of a table. But if you want daily updates to your data, using wikipedia pages as a source will involve daily maintenance of your web scraping code.</p>
<div id="incidence-data-collated-by-john-hopkins-university" class="section level2">
<h2>Incidence data collated by Johns Hopkins University</h2>
<p>Acquiring these data is easy. The time-series format they provide is the most convenient for our purposes. We’ll also remove columns of US cases associated with the <em>Diamond Princess</em> cruise ship because we can assume that those cases were (home) quarantined on repatriation and were unlikely, or at least a lot less likely, to give rise to further cases. We also shift the dates in the JHU data back one day to reflect US time zones, somewhat approximately, because the original dates are with respect to midnight UTC (Greenwich time). That is necessary because we will be combining the JHU data with wikipedia-sourced US data, which is tabulated by dates referencing local US time zones.</p>
<p>We also need to difference the JHU data, which is provided as cumulative counts, to get daily incident counts. Incident counts of cases are a lot more epidemiologically useful than cumulative counts. <code>dplyr</code> makes short work of all that.</p>
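<p>The differencing step mirrors R’s <code>c(0, diff(cumulative_cases))</code>: set the first day to zero (there is no previous day to difference against) and take successive differences thereafter. A minimal sketch with made-up counts (illustrative Python, not the real JHU series):</p>

```python
def incident_from_cumulative(cumulative):
    # Mirrors R's c(0, diff(x)): first element 0, then successive differences.
    return [0] + [b - a for a, b in zip(cumulative, cumulative[1:])]

cumulative = [1, 1, 3, 6, 6, 10]             # made-up cumulative case counts
print(incident_from_cumulative(cumulative))  # [0, 0, 2, 3, 0, 4]
```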
<pre class="r"><code>jhu_url <- paste("https://raw.githubusercontent.com/CSSEGISandData/",
                 "COVID-19/master/csse_covid_19_data/",
                 "csse_covid_19_time_series/",
                 "time_series_19-covid-Confirmed.csv", sep = "")
us_confirmed_long_jhu <- read_csv(jhu_url) %>%
  rename(province = "Province/State", country_region = "Country/Region") %>%
  pivot_longer(-c(province, country_region, Lat, Long),
               names_to = "Date", values_to = "cumulative_cases") %>%
  # adjust JHU dates back one day to reflect US time, more or less
  mutate(Date = mdy(Date) - days(1)) %>%
  filter(country_region == "US") %>%
  arrange(province, Date) %>%
  group_by(province) %>%
  mutate(incident_cases = c(0, diff(cumulative_cases))) %>%
  ungroup() %>%
  select(-c(country_region, Lat, Long, cumulative_cases)) %>%
  filter(str_detect(province, "Diamond Princess", negate = TRUE))</code></pre>
<p>We can now visualize the JHU data using <code>ggplot2</code>, summarized for the whole of the US:</p>
<p><img src="/post/2020-03-05-covid-19-epidemiology-with-r/index_files/figure-html/plot_jhu_data-1.png" width="672" /></p>
<p>So, not a lot of data to work with as yet. One thing that is missing is information on whether those cases were <em>imported</em>, that is, the infection was most likely acquired outside the US, or whether they were <em>local</em>, as a result of local transmission (possibly from imported cases).</p>
</div>
<div id="scraping-wikipedia" class="section level2">
<h2>Scraping wikipedia</h2>
<p>The wikipedia page on COVID-19 in the US contains several tables, one of which contains a line-listing of all the initial US cases, which at the time of writing appeared to be fairly complete up to 2nd March. We’ll scrape that line-listing table from <a href="https://en.wikipedia.org/w/index.php?title=2020_coronavirus_outbreak_in_the_United_States&oldid=944107102">this version of the wikipedia page</a>, leveraging the excellent <code>rvest</code> package, part of the <code>tidyverse</code>. Once again we need to do a bit of data wrangling to get the parsed table into a usable form.</p>
<pre class="r"><code># the URL of the wikipedia page to use is in wp_page_url
wp_page_url</code></pre>
<pre><code>## [1] "https://en.wikipedia.org/w/index.php?title=2020_coronavirus_outbreak_in_the_United_States&oldid=944107102"</code></pre>
<pre class="r"><code># read the page using the rvest package
outbreak_webpage <- read_html(wp_page_url)
# parse the web page and extract the data from the eighth table
us_cases <- outbreak_webpage %>%
  html_nodes("table") %>%
  .[[8]] %>%
  html_table(fill = TRUE)
# The automatically assigned column names are OK except that instead of
# County/city and State columns we have two columns called Location, due
# to the unfortunate use of colspans in the header row. The tidyverse
# abhors duplicated column names, so we have to fix those, and make some
# of the other colnames a bit more tidyverse-friendly.
us_cases_colnames <- colnames(us_cases)
us_cases_colnames[min(which(us_cases_colnames == "Location"))] <- "CityCounty"
us_cases_colnames[min(which(us_cases_colnames == "Location"))] <- "State"
us_cases_colnames <- us_cases_colnames %>%
  str_replace("Location", "CityCounty") %>%
  str_replace("Location", "State") %>%
  str_replace("Case no.", "CaseNo") %>%
  str_replace("Date announced", "Date") %>%
  str_replace("CDC origin type", "OriginTypeCDC") %>%
  str_replace("Treatment facility", "TreatmentFacility")
colnames(us_cases) <- us_cases_colnames
# utility function to remove wikipedia references in square brackets
rm_refs <- function(x) stringr::str_split(x, "\\[", simplify = TRUE)[, 1]
# remove references from the CaseNo column, convert it to integer, convert
# the Date column to date type, then drop rows with NA in CaseNo or Date
us_cases <- us_cases %>%
  mutate(CaseNo = rm_refs(CaseNo)) %>%
  mutate(CaseNo = as.integer(CaseNo),
         Date = as.Date(parse_date_time(Date, c("%B %d, %Y", "%d %B, %Y")))) %>%
  filter(!is.na(CaseNo), !is.na(Date)) %>%
  # convert the various versions of unknown into NA in the OriginTypeCDC column
  mutate(OriginTypeCDC = if_else(OriginTypeCDC %in% c("Unknown", "Undisclosed"),
                                 NA_character_, OriginTypeCDC))</code></pre>
<p>At this point we have an acceptably clean table, that looks like this:</p>
<div id="aessoebrkv" style="overflow-x:auto;overflow-y:auto;width:auto;height:auto;"><table class="gt_table">
<thead class="gt_header">
<tr>
<th colspan="10" class="gt_heading gt_title gt_font_normal" style>US cases of COVID-19 as at 3rd March 2020</th>
</tr>
<tr>
<th colspan="10" class="gt_heading gt_subtitle gt_font_normal gt_bottom_border" style></th>
</tr>
</thead>
<thead class="gt_col_headings">
<tr>
<th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1">CaseNo</th>
<th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1">Date</th>
<th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1">Status</th>
<th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1">OriginTypeCDC</th>
<th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1">Origin</th>
<th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1">CityCounty</th>
<th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1">State</th>
<th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1">TreatmentFacility</th>
<th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1">Sex</th>
<th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1">Age</th>
</tr>
</thead>
<tbody class="gt_table_body">
<tr class="gt_group_heading_row">
<td colspan="10" class="gt_group_heading">Last 6 rows</td>
</tr>
<tr>
<td class="gt_row gt_left gt_stub">60</td>
<td class="gt_row gt_left">2020-03-04</td>
<td class="gt_row gt_left">Deceased</td>
<td class="gt_row gt_left">NA</td>
<td class="gt_row gt_left">Unknown</td>
<td class="gt_row gt_left">Placer County</td>
<td class="gt_row gt_left">California</td>
<td class="gt_row gt_left">Hospitalized</td>
<td class="gt_row gt_left">Male</td>
<td class="gt_row gt_left">Elderly adult</td>
</tr>
<tr>
<td class="gt_row gt_left gt_stub">61</td>
<td class="gt_row gt_left">2020-03-04</td>
<td class="gt_row gt_left">Confirmed</td>
<td class="gt_row gt_left">NA</td>
<td class="gt_row gt_left">Unknown</td>
<td class="gt_row gt_left">Santa Clara County</td>
<td class="gt_row gt_left">California</td>
<td class="gt_row gt_left">Hospitalized</td>
<td class="gt_row gt_left">Male</td>
<td class="gt_row gt_left">Undisclosed</td>
</tr>
<tr>
<td class="gt_row gt_left gt_stub">62</td>
<td class="gt_row gt_left">2020-03-04</td>
<td class="gt_row gt_left">Confirmed</td>
<td class="gt_row gt_left">Person-to-person spread</td>
<td class="gt_row gt_left">Undisclosed</td>
<td class="gt_row gt_left">Santa Clara County</td>
<td class="gt_row gt_left">California</td>
<td class="gt_row gt_left">In-home isolation</td>
<td class="gt_row gt_left">Male</td>
<td class="gt_row gt_left">Undisclosed</td>
</tr>
<tr>
<td class="gt_row gt_left gt_stub">63</td>
<td class="gt_row gt_left">2020-03-04</td>
<td class="gt_row gt_left">Confirmed</td>
<td class="gt_row gt_left">Person-to-person spread</td>
<td class="gt_row gt_left">Undisclosed</td>
<td class="gt_row gt_left">Santa Clara County</td>
<td class="gt_row gt_left">California</td>
<td class="gt_row gt_left">In-home isolation</td>
<td class="gt_row gt_left">Male</td>
<td class="gt_row gt_left">Undisclosed</td>
</tr>
<tr>
<td class="gt_row gt_left gt_stub">64</td>
<td class="gt_row gt_left">2020-03-04</td>
<td class="gt_row gt_left">Confirmed</td>
<td class="gt_row gt_left">NA</td>
<td class="gt_row gt_left">Unknown</td>
<td class="gt_row gt_left">Williamson County</td>
<td class="gt_row gt_left">Tennessee</td>
<td class="gt_row gt_left">In-home isolation</td>
<td class="gt_row gt_left">Male</td>
<td class="gt_row gt_left">44</td>
</tr>
<tr>
<td class="gt_row gt_left gt_stub">65</td>
<td class="gt_row gt_left">2020-03-05</td>
<td class="gt_row gt_left">Confirmed</td>
<td class="gt_row gt_left">Travel-related</td>
<td class="gt_row gt_left">Unknown</td>
<td class="gt_row gt_left">Clark County</td>
<td class="gt_row gt_left">Nevada</td>
<td class="gt_row gt_left">VA Southern Nevada Healthcare System</td>
<td class="gt_row gt_left">Male</td>
<td class="gt_row gt_left">50's</td>
</tr>
<tr>
<td class="gt_row gt_left gt_stub">66</td>
<td class="gt_row gt_left">2020-03-05</td>
<td class="gt_row gt_left">Confirmed</td>
<td class="gt_row gt_left">Travel-related</td>
<td class="gt_row gt_left">Unknown</td>
<td class="gt_row gt_left">Chicago</td>
<td class="gt_row gt_left">Illinois</td>
<td class="gt_row gt_left">Rush University Medical Center</td>
<td class="gt_row gt_left">Male</td>
<td class="gt_row gt_left">20's</td>
</tr>
<tr class="gt_group_heading_row">
<td colspan="10" class="gt_group_heading">First 6 rows</td>
</tr>
<tr>
<td class="gt_row gt_left gt_stub">1</td>
<td class="gt_row gt_left">2020-01-21</td>
<td class="gt_row gt_left">Recovered</td>
<td class="gt_row gt_left">Travel-related</td>
<td class="gt_row gt_left">Wuhan, China</td>
<td class="gt_row gt_left">Snohomish County</td>
<td class="gt_row gt_left">Washington</td>
<td class="gt_row gt_left">Providence Regional Medical Center Everett</td>
<td class="gt_row gt_left">Male</td>
<td class="gt_row gt_left">35</td>
</tr>
<tr>
<td class="gt_row gt_left gt_stub">2</td>
<td class="gt_row gt_left">2020-01-24</td>
<td class="gt_row gt_left">Recovered</td>
<td class="gt_row gt_left">Travel-related</td>
<td class="gt_row gt_left">Wuhan, China</td>
<td class="gt_row gt_left">Chicago</td>
<td class="gt_row gt_left">Illinois</td>
<td class="gt_row gt_left">St. Alexius Medical Center</td>
<td class="gt_row gt_left">Female</td>
<td class="gt_row gt_left">60s</td>
</tr>
<tr>
<td class="gt_row gt_left gt_stub">3</td>
<td class="gt_row gt_left">2020-01-25</td>
<td class="gt_row gt_left">Recovered</td>
<td class="gt_row gt_left">Travel-related</td>
<td class="gt_row gt_left">Wuhan, China</td>
<td class="gt_row gt_left">Orange County</td>
<td class="gt_row gt_left">California</td>
<td class="gt_row gt_left">In-home isolation</td>
<td class="gt_row gt_left">Male</td>
<td class="gt_row gt_left">50s</td>
</tr>
<tr>
<td class="gt_row gt_left gt_stub">4</td>
<td class="gt_row gt_left">2020-01-26</td>
<td class="gt_row gt_left">Confirmed</td>
<td class="gt_row gt_left">Travel-related</td>
<td class="gt_row gt_left">Wuhan, China</td>
<td class="gt_row gt_left">Los Angeles County</td>
<td class="gt_row gt_left">California</td>
<td class="gt_row gt_left">Undisclosed</td>
<td class="gt_row gt_left">Undisclosed</td>
<td class="gt_row gt_left">Undisclosed</td>
</tr>
<tr>
<td class="gt_row gt_left gt_stub">5</td>
<td class="gt_row gt_left">2020-01-26</td>
<td class="gt_row gt_left">Recovered</td>
<td class="gt_row gt_left">Travel-related</td>
<td class="gt_row gt_left">Wuhan, China</td>
<td class="gt_row gt_left">Tempe</td>
<td class="gt_row gt_left">Arizona</td>
<td class="gt_row gt_left">In-home isolation</td>
<td class="gt_row gt_left">Male</td>
<td class="gt_row gt_left">Under 60</td>
</tr>
<tr>
<td class="gt_row gt_left gt_stub">6</td>
<td class="gt_row gt_left">2020-01-30</td>
<td class="gt_row gt_left">Recovered</td>
<td class="gt_row gt_left">Person-to-person spread</td>
<td class="gt_row gt_left">Spouse</td>
<td class="gt_row gt_left">Chicago</td>
<td class="gt_row gt_left">Illinois</td>
<td class="gt_row gt_left">St. Alexius Medical Center</td>
<td class="gt_row gt_left">Male</td>
<td class="gt_row gt_left">60s</td>
</tr>
</tbody>
<tfoot class="gt_sourcenotes">
<tr>
<td class="gt_sourcenote" colspan="10">Source: <a href="https://en.wikipedia.org/w/index.php?title=2020_coronavirus_outbreak_in_the_United_States&oldid=944107102">wikipedia: 2020 coronavirus outbreak in the United States</a></td>
</tr>
</tfoot>
</table></div>
<p>We won’t bother cleaning all of the columns, because we’ll only be using a few of them here.</p>
<p>OK, so we’ll use the wikipedia data prior to 20th February and the JHU counts after that, but we’ll also use the wikipedia data after 20th February to divide the JHU counts into travel-related cases or not, more or less. That will give us a data set which distinguishes local from imported cases, at least to the extent of the completeness of our data sources. Outbreak epidemiology involves practicing the art of <em>good enough for now</em>.</p>
<p>Let’s visualize each of our two sources, and the combined data set.</p>
<p><img src="/post/2020-03-05-covid-19-epidemiology-with-r/index_files/figure-html/combine_us_data-1.png" width="768" /></p>
</div>
<div id="analysis-with-the-earlyr-and-epiestim-packages" class="section level2">
<h2>Analysis with the <code>earlyR</code> and <code>EpiEstim</code> packages</h2>
<p>The <a href="https://www.repidemicsconsortium.org/earlyR/"><code>earlyR</code></a> package, as its name suggests, is intended for use early in an outbreak to calculate several key statistics. In particular, the <code>get_R()</code> function in <code>earlyR</code> calculates a maximum-likelihood estimate for the reproduction number, which is the mean number of new cases each infected person gives rise to. The <code>overall_infectivity()</code> function in the <code>EpiEstim</code> package calculates <span class="math inline">\(\lambda\)</span> (lambda), which is a relative measure of the current “force of infection” or infectivity of an outbreak:</p>
<p><span class="math display">\[ \lambda = \sum_{s=1}^{t-1} {y_{s} w (t - s)} \]</span></p>
<p>where <span class="math inline">\(w()\)</span> is the probability mass function (PMF) of the serial interval, and <span class="math inline">\(y_s\)</span> is the incidence at time <span class="math inline">\(s\)</span>. If <span class="math inline">\(\lambda\)</span> is falling, then that’s good: if not, bad.</p>
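<p>To make the formula concrete, here is a minimal sketch of that sum in plain R. This is illustrative code, not code from the post: <code>y</code> and <code>w</code> are hypothetical vectors holding the daily incidence counts and the discretised serial interval PMF, respectively.</p>
<pre class="r"><code># lambda at time t is the serial-interval-weighted sum of all earlier incidence
lambda_t <- function(y, w, t) {
  s <- seq_len(t - 1)
  sum(y[s] * w[t - s])
}</code></pre>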
<p>The critical parameter for these calculations is the distribution of <em>serial intervals</em> (SI), which is the time between the date of onset of symptoms for a case and the dates of onset for any secondary cases that case gives rise to. Typically a discrete <span class="math inline">\(\gamma\)</span> distribution for these <em>serial intervals</em> is assumed, parameterised by a mean and standard deviation, although more complex distributions are probably more realistic. See the <a href="https://timchurches.github.io/blog/posts/2020-02-18-analysing-covid-19-2019-ncov-outbreak-data-with-r-part-1/#estimating-changes-in-the-effective-reproduction-number">previous post</a> for a more detailed discussion of the <em>serial interval</em>, and the paramount importance of line-listing data from which it can be empirically estimated.</p>
<p>For now, we’ll just use a discrete <span class="math inline">\(\gamma\)</span> distribution with a mean of 5.0 days and a standard deviation of 3.4 for the <em>serial interval</em> distribution. That mean is less than estimates published earlier in the outbreak in China, but appears to be closer to the behavior of the COVID-19 virus (based on a personal communication from an informed source who is party to WHO conference calls on COVID-19). Obviously a sensitivity analysis, using different but plausible <em>serial interval</em> distributions, should be undertaken, but we’ll omit that here for the sake of brevity.</p>
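<p>A discretised <span class="math inline">\(\gamma\)</span> <em>serial interval</em> with those parameters can be built with the <code>discr_si()</code> function in <code>EpiEstim</code>. This is a sketch under the stated assumptions, not the post’s hidden code:</p>
<pre class="r"><code>library(EpiEstim)
# PMF of the serial interval at 0, 1, ..., 20 days,
# assuming a discretised gamma with mean 5.0 and SD 3.4
si_pmf <- discr_si(0:20, mu = 5.0, sigma = 3.4)</code></pre>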
<p>Note that only local transmission is used to calculate <span class="math inline">\(\lambda\)</span>. If we just use the JHU data, which includes both local and imported cases, our estimates of <span class="math inline">\(\lambda\)</span> would be biased, upwards.</p>
<p><img src="/post/2020-03-05-covid-19-epidemiology-with-r/index_files/figure-html/us_lambda-1.png" width="672" /></p>
<p>The US is not winning the war against COVID-19, but it is early days yet.</p>
<p>We can also estimate the <em>reproduction number</em>.</p>
<p><img src="/post/2020-03-05-covid-19-epidemiology-with-r/index_files/figure-html/us_plot_r-1.png" width="672" /></p>
<p>That estimate of <span class="math inline">\(R_{0}\)</span> is consistent with those reported recently by WHO, although higher than some initial estimates from Wuhan. The key thing is that it is well above 1.0, meaning that the outbreak is growing, rapidly.</p>
</div>
<div id="fitting-a-log-linear-model-to-the-epidemic-curve" class="section level2">
<h2>Fitting a log-linear model to the <em>epidemic curve</em></h2>
<p>We can also use functions in the <strong>RECON</strong> <code>incidence</code> package to fit a log-linear model to our epidemic curve. Typically, two models are fitted, one for the growth-phase and one for the decay phase. Functions are provided in the package for finding the peak of the epidemic curve using naïve and optimizing methods. Examples of that can be found <a href="https://timchurches.github.io/blog/posts/2020-03-01-analysing-covid-19-2019-ncov-outbreak-data-with-r-part-2/#modelling-epidemic-trajectory-in-hubei-province-using-log-linear-models">here</a>.</p>
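<p>For an outbreak that has already peaked, such a two-phase fit might be sketched as follows (illustrative only; <code>epi_curve</code> is a hypothetical <code>incidence</code> object):</p>
<pre class="r"><code># find the optimal split date, then fit growth and decay models either side of it
best_fit <- incidence::fit_optim_split(epi_curve)
plot(epi_curve) %>% add_incidence_fit(best_fit$fit)</code></pre>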
<p>But for now, the US outbreak is still in growth phase, so we only fit one curve.</p>
<pre class="r"><code>us_incidence_fit <- incidence::fit(local_cases_obj, split = NULL)
# plot the incidence data and the model fit
plot(local_cases_obj) %>% add_incidence_fit(us_incidence_fit)</code></pre>
<p><img src="/post/2020-03-05-covid-19-epidemiology-with-r/index_files/figure-html/us_growth_fit-1.png" width="672" /></p>
<p>That’s clearly not a good fit, because we are including the handful of very early cases that did not appear to establish sustained chains of local transmission. Let’s exclude them.</p>
<p><img src="/post/2020-03-05-covid-19-epidemiology-with-r/index_files/figure-html/us_growth_refit-1.png" width="672" /></p>
<p>That’s a much better fit!</p>
<p>From the that model, we can extract various (very preliminary at this early stage) parameters of interest: the <strong>growth rate is 0.54</strong> (95% CI 0.32 - 0.77), which is equivalent to a <strong>doubling time of 1.3 days</strong> (95% CI 0.9 - 2.2 days).</p>
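<p>The doubling time follows directly from the estimated daily growth rate <span class="math inline">\(r\)</span> as <span class="math inline">\(\log(2)/r\)</span>:</p>
<pre class="r"><code># doubling time in days, from the point estimate of the growth rate
log(2) / 0.54  # approximately 1.28 days</code></pre>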
<p>That’s all a bit alarming, but these estimates are almost certainly biased because cases are being tabulated by their <strong>date of reporting</strong>, and not by their <strong>date of symptom onset</strong>. I discussed the extreme importance of health authorities reporting or providing data by <strong>date of onset</strong> <a href="https://timchurches.github.io/blog/posts/2020-03-01-analysing-covid-19-2019-ncov-outbreak-data-with-r-part-2/#data-limitations">in an earlier post</a>. Nonetheless, as the epidemic in the US spreads, the bias due to use of date of reporting should diminish, provided that testing and reporting of cases occurs swiftly and consistently. The ability to test and report cases promptly is a key indicator of the quality of public health intervention capability.</p>
<p>We can also project how many cases might be expected in the next week, assuming that public health controls don’t start to have an effect, and subject to the estimation biases discussed above, and bearing in mind that our model is fitted to just a few days of data, so far. We’ll plot on a log scale so the observed cases so far aren’t obscured by the predicted values.</p>
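<p>Such a projection might be sketched with the <strong>RECON</strong> <code>projections</code> package. This is illustrative only: <code>local_cases_obj</code> is the incidence object used above, and <code>R_hat</code> stands in for the reproduction number estimate, neither of which is redefined here.</p>
<pre class="r"><code>library(projections)
library(EpiEstim)
# serial interval PMF as assumed earlier (mean 5.0 days, SD 3.4 days)
si <- discr_si(1:20, mu = 5.0, sigma = 3.4)
proj <- project(local_cases_obj, R = R_hat, si = si,
                n_sim = 1000, n_days = 7)
plot(proj)</code></pre>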
<p><img src="/post/2020-03-05-covid-19-epidemiology-with-r/index_files/figure-html/us_growth__refit_predict-1.png" width="672" /></p>
<p>On a linear scale, that looks like this:</p>
<p><img src="/post/2020-03-05-covid-19-epidemiology-with-r/index_files/figure-html/us_growth_refit_predict_linear-1.png" width="672" /></p>
<p>So, we predict, on very sketchy preliminary data, over 2211 new cases per day by 10 March. That’s probably an overestimate, due to potential reporting-date-not-onset-date bias already discussed, but it nevertheless illustrates the exponential nature of typical epidemic behavior.</p>
<p>Humans tend to use linear heuristics when contemplating trends, and thus tend to be surprised by such exponential behavior, and fail to plan for it accordingly.</p>
</div>
<div id="estimating-the-instantaneous-effective-reproduction-ratio" class="section level2">
<h2>Estimating the instantaneous <em>effective reproduction ratio</em></h2>
<p>One other statistic which the <code>EpiEstim</code> package estimates is the instantaneous effective reproduction number, based on an adjustable sliding window. This is very useful for assessing how well public health interventions are working. There isn’t enough US data available, yet, to estimate this, but here is an example of a plot of the instantaneous <span class="math inline">\(R_{e}\)</span> for Hubei province in China, taken from an <a href="https://timchurches.github.io/blog/posts/2020-02-18-analysing-covid-19-2019-ncov-outbreak-data-with-r-part-1/#estimating-changes-in-the-effective-reproduction-number">earlier blog post</a>:</p>
<p><img src="https://timchurches.github.io/blog/posts/2020-02-18-analysing-covid-19-2019-ncov-outbreak-data-with-r-part-1/analysing-covid-19-2019-ncov-outbreak-data-with-r-part-1_files/figure-html5/Cori_empirical_si_model_fit_hubei_daily-1.png" alt="Instantaneous effective reproduction number for Hubei province" style="width:70.0%" />
You can clearly see the effect of the lock-down implemented in Hubei province and Wuhan city on or around 24th January, and the fact that the instantaneous <span class="math inline">\(R_{e}\)</span> started to fall a long time before the daily incidence of new cases reached its peak. Availability of such information helps governmental authorities to keep their nerve and to persist with unpopular public health measures, even in the face of rising incidence.</p>
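<p>Once enough local-transmission data accumulate, such an estimate might be obtained as sketched below (illustrative only; <code>daily_incidence</code> is a hypothetical vector of daily local case counts):</p>
<pre class="r"><code>library(EpiEstim)
res <- estimate_R(daily_incidence,
                  method = "parametric_si",
                  config = make_config(list(mean_si = 5.0, std_si = 3.4)))
plot(res, "R")  # instantaneous R_e over the default weekly sliding window</code></pre>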
</div>
</div>
<div id="conclusion" class="section level1">
<h1>Conclusion</h1>
<p>In this post we have seen how base <code>R</code>, the <code>tidyverse</code> packages, and libraries provided by <a href="https://www.repidemicsconsortium.org"><strong>R</strong> <strong>E</strong>pidemics <strong>Con</strong>sortium (<strong>RECON</strong>)</a> can be used to assemble COVID-19 outbreak data, visualize it, and estimate some key statistics from it which are vital for assessing and planning the public health response to this disease. There are several other libraries for <code>R</code> that can also be used for such purposes. It should only take a small team of data scientists a few days, using these and related tools, to construct ongoing reports or decision support tools, able to be updated continuously, or at least daily, to help support public health authorities in their (literally) life-and-death fight against COVID-19.</p>
<p>But you need to start right away: epidemic behavior is exponential.</p>
</div>
Comparing Machine Learning Algorithms for Predicting Clothing Classes: Part 2
https://rviews.rstudio.com/2020/03/03/predicting-clothing-classes-part-2/
Tue, 03 Mar 2020 00:00:00 +0000https://rviews.rstudio.com/2020/03/03/predicting-clothing-classes-part-2/
<p><em>Florianne Verkroost is a Ph.D. candidate at Nuffield College at the University of Oxford. She has a passion for data science and a background in mathematics and econometrics. She applies her interdisciplinary knowledge to computationally address societal problems of inequality.</em></p>
<p>This is the second post in a series devoted to comparing different machine and deep learning methods to predict clothing categories from images using the Fashion MNIST data by Zalando. In <a href="https://rviews.rstudio.com/2019/11/11/a-comparison-of-methods-for-predicting-clothing-classes-using-the-fashion-mnist-dataset-in-rstudio-and-python-part-1/">the first blog post of this series</a>, we explored the data, prepared the data for analysis and learned how to predict the clothing categories of the Fashion MNIST data using my go-to model: an artificial neural network in Python. In this second blog post, we will perform dimension reduction on the data in order to make it feasible to run standard machine learning models (including tree-based methods and support vector machines) in the future. The R code for the first post can be found on my <a href="https://github.com/fverkroost/RStudio-Blogs/blob/master/machine_learning_fashion_mnist_post234.R">Github</a>.</p>
<pre class="r"><code>library(keras)
library(magrittr)
library(ggplot2)</code></pre>
<div id="data-preparation" class="section level3">
<h3>Data Preparation</h3>
<p>Let’s fetch the data again and prepare the training and test sets.</p>
<pre class="r"><code>install_keras()
fashion_mnist = keras::dataset_fashion_mnist()
c(train.images, train.labels) %<-% fashion_mnist$train
c(test.images, test.labels) %<-% fashion_mnist$test</code></pre>
<p>Next, we normalize the image data by dividing the pixel values by the maximum value of 255.</p>
<pre class="r"><code>train.images = data.frame(t(apply(train.images, 1, c))) / max(fashion_mnist$train$x)
test.images = data.frame(t(apply(test.images, 1, c))) / max(fashion_mnist$train$x)</code></pre>
<p>Now, we combine the training images <code>train.images</code> and labels <code>train.labels</code> as well as test images <code>test.images</code> and labels <code>test.labels</code> in separate data sets, <code>train.data</code> and <code>test.data</code>, respectively.</p>
<pre class="r"><code>pixs = ncol(fashion_mnist$train$x)
names(train.images) = names(test.images) = paste0('pixel', 1:(pixs^2))
train.labels = data.frame(label = factor(train.labels))
test.labels = data.frame(label = factor(test.labels))
train.data = cbind(train.labels, train.images)
test.data = cbind(test.labels, test.images)</code></pre>
<p>As <code>train.labels</code> and <code>test.labels</code> contain integer values for the clothing category (i.e. 0, 1, 2, etc.), we also create objects <code>train.classes</code> and <code>test.classes</code> that contain factor labels (i.e. Top, Trouser, Pullover etc.) for the clothing categories. We will need these for some of the machine learning models later on.</p>
<pre class="r"><code>cloth_cats = c('Top', 'Trouser', 'Pullover', 'Dress', 'Coat',
'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Boot')
train.classes = factor(cloth_cats[as.numeric(as.character(train.labels$label)) + 1])
test.classes = factor(cloth_cats[as.numeric(as.character(test.labels$label)) + 1])</code></pre>
</div>
<div id="principal-components-analysis" class="section level3">
<h3>Principal Components Analysis</h3>
<p>Our training and test image data sets currently contain 784 pixels or variables. We may expect a large share of these pixels, especially those towards the boundaries of the images, to have relatively small variance, because most of the fashion items are centered in the images. In other words, there may be quite a few redundant pixels in our data set. To check whether this is the case, let’s plot the average pixel value on a 28 by 28 grid. We first obtain the average pixel values and store these in <code>train.images.ave</code>, after which we plot these values on the grid. We also define a custom plotting theme, <code>my_theme</code>, to make sure all our figures have the same aesthetics. Note that in the plot created, a higher cell (pixel) value means that the average value of that pixel is higher, and thus that the pixel is darker on average (as a pixel value of 0 refers to white and a pixel value of 255 refers to black).</p>
<pre class="r"><code>train.images.ave = data.frame(pixel = apply(train.images, 2, mean),
x = rep(1:pixs, each = pixs),
y = rep(1:pixs, pixs))
my_theme = function () {
theme_bw() +
theme(axis.text = element_text(size = 14),
axis.title = element_text(size = 14),
strip.text = element_text(size = 14),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
legend.position = "bottom",
strip.background = element_rect(fill = 'white', colour = 'white'))
}
ggplot() +
geom_raster(data = train.images.ave, aes(x = x, y = y, fill = pixel)) +
my_theme() +
labs(x = NULL, y = NULL, fill = "Average scaled pixel value") +
ggtitle('Average image in Fashion MNIST training data')</code></pre>
<p><img src="/post/2020-02-24-predicting-clothing-classes-part-2/index_files/figure-html/unnamed-chunk-6-1.png" width="672" /></p>
<p>As we can see from the plot, there are many pixels with a low average value, meaning that they are white in most of the images in our training data. These pixels are mostly redundant, yet they still contribute to computational cost and sparsity. Therefore, we might be better off reducing the dimensionality of our data to reduce redundancy, overfitting and computational cost. One method to do so is principal components analysis <a href="https://www.nature.com/articles/nmeth.4346.pdf">(PCA)</a>. Essentially, PCA statistically reduces the dimensions of a set of correlated variables by transforming them into a smaller number of linearly uncorrelated variables. The resulting “principal components” are linear combinations of the original variables. The first principal component explains the largest part of the variance, followed by the second principal component and so forth. For a more extensive explanation of PCA, I refer you to James et al. (2013).</p>
<p>Let’s have a look at how much of the variance in our data can be explained by how many principal components. We compute the 784 by 784 covariance matrix of our training images using the <code>cov()</code> function, after which we execute PCA on the covariance matrix using the <code>prcomp()</code> function in the <code>stats</code> library. Looking at the results, we observe that 50 principal components in our data explain 99.902% of the variance in the data. This can be nicely shown in a plot of the cumulative proportion of variance against component indices. Note that the component indices here are sorted by their ability to explain the variance in our data, and not based on their pixel position in the 28 by 28 image.</p>
<pre class="r"><code>library(stats)
cov.train = cov(train.images)
pca.train = prcomp(cov.train)
plotdf = data.frame(index = 1:(pixs^2),
cumvar = summary(pca.train)$importance["Cumulative Proportion", ])
t(head(plotdf, 50)) </code></pre>
<pre><code>## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
## index 1.0000 2.0000 3.0000 4.0000 5.0000 6.0000 7.0000 8.0000 9.0000
## cumvar 0.6491 0.8679 0.9107 0.9421 0.9611 0.9759 0.9816 0.9862 0.9885
## PC10 PC11 PC12 PC13 PC14 PC15 PC16 PC17
## index 10.0000 11.0000 12.0000 13.0000 14.0000 15.0000 16.000 17.0000
## cumvar 0.9906 0.9918 0.9928 0.9935 0.9941 0.9945 0.995 0.9954
## PC18 PC19 PC20 PC21 PC22 PC23 PC24 PC25
## index 18.0000 19.000 20.0000 21.0000 22.0000 23.0000 24.000 25.0000
## cumvar 0.9957 0.996 0.9962 0.9965 0.9967 0.9969 0.997 0.9972
## PC26 PC27 PC28 PC29 PC30 PC31 PC32 PC33
## index 26.0000 27.0000 28.0000 29.0000 30.0000 31.000 32.000 33.0000
## cumvar 0.9974 0.9975 0.9976 0.9978 0.9979 0.998 0.998 0.9981
## PC34 PC35 PC36 PC37 PC38 PC39 PC40 PC41
## index 34.0000 35.0000 36.0000 37.0000 38.0000 39.0000 40.0000 41.0000
## cumvar 0.9982 0.9983 0.9984 0.9984 0.9985 0.9986 0.9986 0.9987
## PC42 PC43 PC44 PC45 PC46 PC47 PC48 PC49
## index 42.0000 43.0000 44.0000 45.0000 46.0000 47.0000 48.000 49.000
## cumvar 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.999 0.999
## PC50
## index 50.000
## cumvar 0.999</code></pre>
<pre class="r"><code>ggplot() +
geom_point(data = plotdf, aes(x = index, y = cumvar), color = "red") +
labs(x = "Index of primary component", y = "Cumulative proportion of variance") +
my_theme() +
theme(strip.background = element_rect(fill = 'white', colour = 'black'))</code></pre>
<p><img src="/post/2020-02-24-predicting-clothing-classes-part-2/index_files/figure-html/unnamed-chunk-7-1.png" width="672" /></p>
<p>We also observe that 99.5% of the variance is explained by only 17 principal components. As 99.5% is already a large share of the variance, and we want to reduce the number of pixels (variables) by as much as we can to reduce computation time for the models coming up, we choose to select these 17 components for further analysis. (Although this is unlikely to influence our results hugely, if you have more time I’d suggest you select the 50 components explaining 99.9% of the variance, or execute the analyses on the full data set.)</p>
<p>We also save the relevant part of the rotation matrix created by the <code>prcomp()</code> function and stored in <code>pca.train</code>, such that its dimensions become 784 by 17. We then multiply our training and test image data by this rotation matrix called <code>pca.rot</code>. We further combine the transformed image data (<code>train.images.pca</code> and <code>test.images.pca</code>) with the integer labels for the clothing categories in <code>train.data.pca</code> and <code>test.data.pca</code>. We will use these reduced data in our further analyses to decrease computational time.</p>
<pre class="r"><code>pca.dims = which(plotdf$cumvar >= .995)[1]
pca.rot = pca.train$rotation[, 1:pca.dims]
train.images.pca = data.frame(as.matrix(train.images) %*% pca.rot)
test.images.pca = data.frame(as.matrix(test.images) %*% pca.rot)
train.data.pca = cbind(train.images.pca, label = factor(train.data$label))
test.data.pca = cbind(test.images.pca, label = factor(test.data$label))</code></pre>
<p>In the next post of this series, we will use the PCA reduced data to estimate and assess tree-based methods, including random forests and gradient-boosted trees.</p>
</div>
<div id="references" class="section level3">
<h3>References</h3>
<p>James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, p. 18). New York: Springer.</p>
</div>
January 2020: "Top 40" New R Packages
https://rviews.rstudio.com/2020/02/24/january-2020-top-40-new-r-packages/
Mon, 24 Feb 2020 00:00:00 +0000https://rviews.rstudio.com/2020/02/24/january-2020-top-40-new-r-packages/
<p>One hundred forty-seven new packages made it to CRAN in January. Here are my “Top 40” picks in nine categories: Computational Methods, Genomics, Machine Learning, Mathematics, Medicine, Statistics, Time Series, Utilities and Visualization.</p>
<h3 id="computational-methods">Computational Methods</h3>
<p><a href="https://cran.r-project.org/package=FSSF">FSSF</a> v0.1.1: Provides three methods proposed by <a href="doi:10.1080/00224065.2019.1705207">Shang & Apley (2019)</a> to generate fully-sequential space-filling designs inside a unit hypercube.</p>
<p><a href="https://cran.r-project.org/package=seagull">seagull</a> v1.0.5: Implements a proximal gradient descent solver for the operators lasso, group lasso, and sparse-group lasso. There is an <a href="https://cran.r-project.org/web/packages/seagull/vignettes/seagull.pdf">Introduction</a>.</p>
<h3 id="genomics">Genomics</h3>
<p><a href="https://cran.r-project.org/package=babette">babette</a> v2.1.2: Provides access to the <a href="https://www.beast2.org">BEAST2</a> Bayesian phylogenetic tool, which uses DNA/RNA/protein data and many model priors to create a posterior of jointly estimated phylogenies and parameters. There is a <a href="https://cran.r-project.org/web/packages/babette/vignettes/tutorial.html">Tutorial</a>, a <a href="https://cran.r-project.org/web/packages/babette/vignettes/demo.html">Basic Demo</a>, a <a href="https://cran.r-project.org/web/packages/babette/vignettes/step_by_step.html">Step by Step Demo</a>, a vignette on <a href="https://cran.r-project.org/web/packages/babette/vignettes/nested_sampling.html">Nested Sampling</a>, and another with <a href="https://cran.r-project.org/web/packages/babette/vignettes/examples.html">Examples</a>.</p>
<p><img src="babette.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=statgenGWAS">statgenGWAS</a> v1.0.3: Provides fast single trait Genome Wide Association Studies (GWAS) following the method described in <a href="doi:10.1038/ng.548">Kang et al. (2010)</a>. See the <a href="https://cran.r-project.org/web/packages/statgenGWAS/vignettes/GWAS.html">vignette</a> for details.</p>
<p><img src="statgenGWAS.png" height = "400" width="600"></p>
<h3 id="machine-learning">Machine Learning</h3>
<p><a href="https://cran.r-project.org/package=akc">akc</a> v0.9.4: Provides a tidy framework for automatic knowledge classification and visualization. There is a <a href="https://cran.r-project.org/web/packages/akc/vignettes/tutorial_raw_text.html">Tutorial</a> and a <a href="https://cran.r-project.org/web/packages/akc/vignettes/akc_vignette.html">vignette</a> on classification based on keyword co-occurrence.</p>
<p><img src="akc.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=ced">ced</a> v1.0.1: Provides R bindings for the Google <a href="https://github.com/google/compact_enc_det">Compact Encoding Detection library</a> which takes as input a source buffer of raw text bytes and probabilistically determines the most likely encoding for that text.</p>
<p><a href="https://cran.r-project.org/package=forestError">forestError</a> v0.1.1: Provides functions to estimate the conditional error distributions of random forest predictions and common parameters of those distributions, including conditional mean squared prediction errors, conditional biases, and conditional quantiles as proposed by <a href="arXiv:1912.07435">Lu & Hardin (2019)</a>.</p>
<p><a href="https://cran.r-project.org/package=ksharp">ksharp</a> v0.1.0.1: Provides functions to sharpen clusters by adjusting existing clusters to create contrast between groups. The <a href="https://cran.r-project.org/web/packages/ksharp/vignettes/ksharp.html">vignette</a> provides examples and references.</p>
<p><img src="ksharp.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=mosmafs">mosmafs</a> v0.1.1: Provides functions for simultaneous hyperparameter tuning and feature selection through both single-objective and multi-objective optimization as described in <a href="arXiv:1912.12912">Binder et al. (2019)</a>. There is an <a href="https://cran.r-project.org/web/packages/mosmafs/vignettes/demo.html">Introduction</a> and another vignette on <a href="https://cran.r-project.org/web/packages/mosmafs/vignettes/multifidelity.html">Multi-Fidelity</a>.</p>
<p><img src="mosmafs.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=themis">themis</a> v0.1.0: Provides recipes for dealing with unbalanced data sets including balancing by increasing the number of minority cases using <a href="arXiv:1106.1813">SMOTE</a>, <a href="doi:10.1007/11538059_91">Borderline-SMOTE</a> and <a href="https://ieeexplore.ieee.org/document/4633969">ADASYN</a>; or by decreasing the number of majority cases using <a href="https://www.site.uottawa.ca/~nat/Workshop2003/jzhang.pdf">NearMiss</a> or <a href="https://ieeexplore.ieee.org/document/430945">Tomek link removal</a>. Look <a href="https://github.com/tidymodels/themis">here</a> for examples.</p>
<h3 id="mathematics">Mathematics</h3>
<p><a href="https://cran.r-project.org/package=caracas">caracas</a> v0.1.0: Implements computer algebra by providing access to the Python <a href="https://www.sympy.org/"><code>SymPy</code></a> library, making it possible to solve equations symbolically and to find symbolic integrals, symbolic sums, and other important quantities. There is a <a href="https://cran.r-project.org/web/packages/caracas/vignettes/introduction.html">vignette</a>.</p>
<p><a href="https://cran.r-project.org/package=clifford">clifford</a> v1.0-1: Provides a suite of routines for <a href="https://en.wikipedia.org/wiki/Clifford_algebra">Clifford algebras</a>. Special cases include Lorentz transforms, quaternion multiplication, and Grassmann algebra. See the <a href="https://cran.r-project.org/web/packages/clifford/vignettes/clifford.pdf">vignette</a> for details.</p>
<h3 id="medicine">Medicine</h3>
<p><a href="https://cran.r-project.org/package=metan">metan</a> v1.3.0: Provides functions for the stability analysis of multi-environment trial data using parametric and non-parametric methods including additive main effects and multiplicative interaction analysis, <a href="doi:10.2135/cropsci2013.04.0241">Gauch (2013)</a>; genotype plus genotype-environment biplot analysis, <a href="doi:10.1201/9781420040371">Yan & Kang (2003)</a>; joint regression analysis, <a href="doi:10.2135/cropsci1966.0011183X000600010011x">Eberhart & Russell (1966)</a> and much more. See the <a href="https://cran.r-project.org/web/packages/metan/vignettes/metan_start.html">vignette</a> to get started.</p>
<p><img src="metan.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=nosoi">nosoi</a> v1.0.0: Implements a flexible agent-based stochastic transmission-chain epidemic simulator. There is a <a href="https://cran.r-project.org/web/packages/nosoi/vignettes/nosoi.html">Getting Started Guide</a> and vignettes on <a href="https://cran.r-project.org/web/packages/nosoi/vignettes/none.html">Homogeneous Populations</a>, <a href="https://cran.r-project.org/web/packages/nosoi/vignettes/discrete.html">Discrete Populations</a>, <a href="https://cran.r-project.org/web/packages/nosoi/vignettes/continuous.html">Continuous Populations</a>, and <a href="https://cran.r-project.org/web/packages/nosoi/vignettes/output.html">Visualization</a>.</p>
<p><img src="nosoi.png" height = "300" width="500"></p>
<p><a href="https://cran.r-project.org/package=shinySIR">shinySIR</a> v0.1.1: Provides interactive plotting for mathematical models of infectious disease spread. Users can choose from a variety of common built-in ordinary differential equation models or create their own. See <a href="doi:10.2307/j.ctvcm4gk0">Keeling & Rohani (2008)</a> and <a href="doi:10.1007/978-3-319-97487-3">Bjornstad (2018)</a> for background and the <a href="https://cran.r-project.org/web/packages/shinySIR/vignettes/Vignette.html">vignette</a> for details.</p>
<p><img src="shinySIR.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=transplantr">transplantr</a> v0.1.0: Provides a set of vectorised functions to calculate medical equations used in transplantation, focused mainly on transplantation of abdominal organs. There are vignettes on <a href="https://cran.r-project.org/web/packages/transplantr/vignettes/egfr.html">Estimated GFR</a>, <a href="https://cran.r-project.org/web/packages/transplantr/vignettes/hla_mismatch_grade.html">HLA Mismatch Level</a>, <a href="https://cran.r-project.org/web/packages/transplantr/vignettes/kidney_risk_scores.html">Kidney Risk Scores</a>, and <a href="https://cran.r-project.org/web/packages/transplantr/vignettes/liver_recipient_scoring.html">Liver Recipient Scoring</a>.</p>
<h3 id="statistics">Statistics</h3>
<p><a href="https://cran.r-project.org/package=bggum">bggum</a> v1.0.2: Provides a Metropolis-coupled Markov chain Monte Carlo sampler, post-processing, parameter estimation functions, and plotting utilities for the generalized graded unfolding model of <a href="doi:10.1177/01466216000241001">Roberts et al. (2000)</a>. See the <a href="https://cran.r-project.org/web/packages/bggum/vignettes/bggum.html">vignette</a> for the math and examples.</p>
<p><img src="bggum.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=cSEM">cSEM</a> v0.1.0: Provides functions to estimate, assess, test, and study linear, nonlinear, hierarchical and multigroup structural equation models using composite-based approaches and procedures, including estimation techniques such as partial least squares path modeling (PLS-PM) and its derivatives (PLSc, ordPLSc, robustPLSc), generalized structured component analysis (GSCA) and others. There is an <a href="https://cran.r-project.org/web/packages/cSEM/vignettes/cSEM.html">Introduction</a> and vignettes on <a href="https://cran.r-project.org/web/packages/cSEM/vignettes/Notation.html">Notation</a>, <a href="https://cran.r-project.org/web/packages/cSEM/vignettes/Terminology.html">Terminology</a>, and <a href="https://cran.r-project.org/web/packages/cSEM/vignettes/Using-assess.html">Post Estimation</a>.</p>
<p><a href="https://cran.r-project.org/package=mcp">mcp</a> v0.2.0: Implements regression with multiple change points which can be for means, variances, autocorrelation structure, and any combination of these. It provides a generalization of the approach described in <a href="doi:10.2307/2347570">Carlin et al. (1992)</a> and <a href="doi:10.2307/2986119">Stephens (1994)</a>. See <a href="https://cran.r-project.org/web/packages/mcp/readme/README.html">README</a> for examples.</p>
<p><img src="mcp.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=metropolis">metropolis</a> v0.1.5: Provides functions for learning how the <a href="https://doi.org/10.1063/1.1699114">Metropolis algorithm</a> works. The <a href="https://cran.r-project.org/web/packages/metropolis/vignettes/metropolis-vignette.html">vignette</a> includes examples of hand-coding a logistic model using several variants of the Metropolis algorithm.</p>
<p><img src="metropolis.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=miceRanger">miceRanger</a> v1.3.1: Implements multiple imputation by chained equations with Random Forests. There are vignettes on the MICE Algorithm, <a href="https://cran.r-project.org/web/packages/miceRanger/vignettes/usingMiceRanger.html">Filling in Missing Data</a>, and <a href="https://cran.r-project.org/web/packages/miceRanger/vignettes/diagnosticPlotting.html">Diagnostic Plotting</a>.</p>
<p><img src="miceRanger.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=momentfit">momentfit</a> v0.1-0: Provides functions to perform method of moments fits, including the Generalized Method of Moments, <a href="doi:10.2307/1912775">Hansen (1982)</a>, and the Generalized Empirical Likelihood, <a href="doi:10.1111/j.0013-0133.1997.174.x">Smith (1997)</a>. There are vignettes on <a href="https://cran.r-project.org/web/packages/momentfit/vignettes/gelS4.pdf">Generalized Empirical Likelihood</a> and <a href="https://cran.r-project.org/web/packages/momentfit/vignettes/gmmS4.pdf">Generalized Method of Moments</a>.</p>
<p><a href="https://cran.r-project.org/package=nlraa">nlraa</a> v0.53: Implements nonlinear regression functions using self-start algorithms including the Beta growth function proposed by <a href="doi:10.1093/aob/mcg029">Yin et al. (2003)</a>. There is an <a href="https://cran.r-project.org/web/packages/nlraa/vignettes/nlraa.html">Introduction</a>, a <a href="https://cran.r-project.org/web/packages/nlraa/vignettes/nlraa-AgronJ-paper.html">vignette</a> with examples from <a href="doi:10.2134/agronj2012.0506">Archontoulis & Miguez (2015)</a> and another <a href="https://cran.r-project.org/web/packages/nlraa/vignettes/nlraa-Oddi-LFMC.html">vignette</a> with examples from <a href="doi:10.1002/ece3.5543">Oddi et al. (2019)</a>.</p>
<p><img src="nlraa.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=PoissonBinomial">PoissonBinomial</a> v1.0.2: Implements multiple exact and approximate methods for computing the probability mass, cumulative distribution and quantile functions, as well as generating random numbers for the Poisson Binomial distribution as described in <a href="doi:10.1016/j.csda.2012.10.006">Hong (2013)</a> and <a href="doi:10.1016/j.csda.2018.01.007">Biscarri et al. (2018)</a>. There are vignettes on <a href="https://cran.r-project.org/web/packages/PoissonBinomial/vignettes/intro.html">Efficient Computation</a>, <a href="https://cran.r-project.org/web/packages/PoissonBinomial/vignettes/proc_approx.html">Approximate Procedures</a>, <a href="https://cran.r-project.org/web/packages/PoissonBinomial/vignettes/proc_exact.html">Exact Procedures</a>, and <a href="https://cran.r-project.org/web/packages/PoissonBinomial/vignettes/use_with_rcpp.html">Usage with Rcpp</a>.</p>
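<p>As a small, hedged sketch of the exact computation: the <code>dpbinom()</code> density function below follows the package's d/p/q/r naming convention; treat the exact signature as an assumption and consult the reference manual.</p>

```r
library(PoissonBinomial)

# Hypothetical success probabilities for five independent Bernoulli trials
probs <- c(0.1, 0.3, 0.5, 0.7, 0.9)

# Exact probability mass function over the possible counts 0..5
pmf <- dpbinom(0:5, probs)

sum(pmf)        # the probabilities sum to 1
sum(0:5 * pmf)  # the mean equals sum(probs) = 2.5
```

<p>The same d/p/q/r pattern should give the cumulative distribution, quantiles, and random draws.</p>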
<p><a href="https://cran.r-project.org/package=relgam">relgam</a> v1.0: Implements a method for fitting the entire regularization path of the reluctant generalized additive model (RGAM) for linear regression, logistic, Poisson and Cox regression models. See <a href="arXiv:1912.01808">Tay & Tibshirani (2019)</a> for details.</p>
<p><a href="https://cran.r-project.org/package=s2net">s2net</a> v1.0.1: Implements the generalized semi-supervised elastic-net, extending the supervised elastic-net to make it practical to perform feature selection in semi-supervised contexts. See <a href="doi:10.1080/10618600.2012.657139">Culp (2013)</a> for references on the Joint Trained Elastic-Net and the <a href="https://cran.r-project.org/web/packages/s2net/vignettes/supervised.html">vignette</a> for an example.</p>
<p><a href="https://cran.r-project.org/package=signnet">signnet</a> v0.5.1: Implements methods for the analysis of signed networks including several measures for structural balance as introduced by <a href="doi:10.1037/h0046049">Cartwright & Harary (1956)</a>, blockmodeling algorithms from <a href="doi:10.1016/j.socnet.2008.03.005">Doreian (2008)</a>, various centrality indices, and projections introduced by <a href="doi:10.1080/0022250X.2019.1711376">Schoch (2020)</a>. There are vignettes on <a href="https://cran.r-project.org/web/packages/signnet/vignettes/blockmodeling.html">Blockmodeling</a>, <a href="https://cran.r-project.org/web/packages/signnet/vignettes/centrality.html">Centrality</a>, <a href="https://cran.r-project.org/web/packages/signnet/vignettes/complex_matrices.html">Complex Matrices</a>, <a href="https://cran.r-project.org/web/packages/signnet/vignettes/signed_2mode.html">Signed Two-Mode Networks</a>, <a href="https://cran.r-project.org/web/packages/signnet/vignettes/signed_networks.html">Signed Networks</a>, and <a href="https://cran.r-project.org/web/packages/signnet/vignettes/structural_balance.html">Structural Balance</a>.</p>
<p><img src="signnet.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=survParamSim">survParamSim</a> v0.1.0: Provides functions to perform survival simulation with a parametric survival model generated from the <code>survreg</code> function in the <code>survival</code> package. See the <a href="https://cran.r-project.org/web/packages/survParamSim/vignettes/survParamSim.html">vignette</a>.</p>
<p><img src="survparamSim.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=xnet">xnet</a> v0.1.11: Provides functions to fit a two-step kernel ridge regression model for predicting edges in networks, and carry out cross-validation. See <a href="doi:10.1093/bib/bby095">Stock et al. (2018)</a> for background. There is an <a href="https://cran.r-project.org/web/packages/xnet/vignettes/xnet_ShortIntroduction.html">Introduction</a>, and vignettes on <a href="https://cran.r-project.org/web/packages/xnet/vignettes/Preparation_example_data.html">Data Preparation</a>, and the <a href="https://cran.r-project.org/web/packages/xnet/vignettes/xnet_ClassStructure.html">S4 Class Structure</a>.</p>
<p><img src="xnet.png" height = "400" width="600"></p>
<h3 id="time-series">Time Series</h3>
<p><a href="https://cran.r-project.org/package=pcts">pcts</a> v0.14-4: Provides classes and methods for modeling and simulating periodically correlated and periodically integrated time series. For background see <a href="doi:10.1111/j.1467-9892.2009.00617.x">Boshnakov & Iqelan (2009)</a> and <a href="doi:10.1111/j.1467-9892.1996.tb00281.x">Boshnakov (1996)</a>.</p>
<p><a href="https://cran.r-project.org/package=fdaACF">fdaACF</a> v0.1.0: Provides functions to compute autocorrelation functions for functional time series. Look <a href="https://github.com/GMestreM/fdaA">here</a> for examples.</p>
<p><img src="fdaACF.png" height = "400" width="600"></p>
<h3 id="utilities">Utilities</h3>
<p><a href="https://cran.r-project.org/package=dmdscheme">dmdScheme</a> v1.0.0: Provides a framework for developing domain specific metadata. There is an <a href="https://cran.r-project.org/web/packages/dmdScheme/vignettes/r_package_introduction.html">Introduction</a> and vignettes on <a href="https://cran.r-project.org/web/packages/dmdScheme/vignettes/Howto_create_new_scheme.html">Creating a New Scheme</a> and <a href="https://cran.r-project.org/web/packages/dmdScheme/vignettes/minimum_requirements_dmdscheme.html">Minimum Requirements</a>.</p>
<p><a href="https://cran.r-project.org/package=gridtext">gridtext</a> v0.1.0: Provides support for rendering of formatted text using <code>grid</code> graphics. Look <a href="https://wilkelab.org/gridtext/">here</a> for examples.</p>
<p><img src="gridtext.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=netstat">netstat</a> v0.1.1: Implements an interface to the <a href="https://en.wikipedia.org/wiki/Netstat">netstat</a> command line utility for retrieving and parsing network statistics from Transmission Control Protocol (TCP) ports. See <a href="http://man7.org/linux/man-pages/man8/netstat.8.html"><em>The Linux System Administrator’s Manual</em></a>, and the <a href="https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/netstat">Microsoft website</a> for basic information.</p>
<p><a href="https://cran.r-project.org/package=progressr">progressr</a> v0.4.0: Provides a minimal, unifying API for scripts and packages to report progress updates including when using parallel processing. The <a href="https://cran.r-project.org/web/packages/progressr/vignettes/progressr-intro.html">vignette</a> offers an introduction.</p>
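<p>As a brief sketch of that API (using the package's <code>progressor()</code> and <code>with_progress()</code> functions; the toy <code>slow_sum()</code> loop is our own illustration):</p>

```r
library(progressr)

# A hypothetical slow computation that reports one progress step per element
slow_sum <- function(xs) {
  p <- progressor(along = xs)  # create a progress signaller, one step per element
  total <- 0
  for (x in xs) {
    total <- total + x
    p()                        # signal one unit of progress
  }
  total
}

# Progress is only rendered when the caller opts in:
with_progress(result <- slow_sum(1:10))
result                         # 55
```

<p>The function itself stays silent unless the caller wraps it in <code>with_progress()</code>, which is the point of the unified API.</p>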
<p><a href="https://cran.r-project.org/package=PROJ">PROJ</a> v0.1.0: Implements a wrapper around the generic coordinate transformation software <a href="https://proj.org/">PROJ</a>, which transforms geospatial coordinates from one coordinate reference system to another and performs cartographic projections as well as geodetic transformations. See the <a href="https://cran.r-project.org/web/packages/PROJ/vignettes/PROJ.html">vignette</a>.</p>
<p><a href="https://cran.r-project.org/package=round">round</a> v0.12-1: Provides functions to explore differences between current and potential future versions of the base R <code>round()</code> function along with some partly related C99 math lib functions not in base R. See the <a href="https://cran.r-project.org/web/packages/round/vignettes/Rounding.html">vignette</a> for the details.</p>
<p><a href="https://cran.r-project.org/package=warp">warp</a> v0.1.0: Implements tooling to group dates by a variety of periods, including yearly, monthly, by second, by week of the month, and more. See the <a href="https://cran.r-project.org/web/packages/warp/vignettes/hour.html">vignette</a> for examples.</p>
<h3 id="visualization">Visualization</h3>
<p><a href="https://cran.r-project.org/package=apyramid">apyramid</a> v0.1.0: Provides a quick method for visualizing non-aggregated line-list or aggregated census data stratified by age and one or two categorical variables (e.g. gender and health status) with any number of values. This package is part of the <a href="https://r4epis.netlify.com">R4Epis Project</a>. See the <a href="https://cran.r-project.org/web/packages/apyramid/vignettes/intro.html">vignette</a> for examples.</p>
<p><img src="apyramid.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=mlr3viz">mlr3viz</a> v0.1.1: Provides visualizations for <a href="https://cran.r-project.org/package=mlr3"><code>mlr3</code></a> objects including barplots, boxplots, histograms, ROC curves, and Precision-Recall curves. The <a href="https://cran.r-project.org/web/packages/mlr3viz/readme/README.html">README</a> offers some examples.</p>
<p><img src="mlr3viz.png" height = "400" width="600"></p>
R, Public Health and Politics
https://rviews.rstudio.com/2020/02/19/r-public-health-and-politics/
Wed, 19 Feb 2020 00:00:00 +0000
<p>Last week, The Lancet published the paper <a href="https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(19)33019-3/fulltext#%20"><em>Improving the prognosis of health care in the USA</em></a> by Alison P Galvani, Alyssa S Parpia, Eric M Foster, Burton H Singer, and Meagan C Fitzpatrick of <a href="https://publichealth.yale.edu/cidma/">CIDMA</a>, the Center for Infectious Disease Modeling and Analysis at the Yale School of Public Health. The paper, which provides a detailed analysis of the single-payer system introduced by Senator Sanders in the <a href="https://www.sanders.senate.gov/download/medicare-for-all-act?id=6CA2351C-6EAE-4A11-BBE4-CE07984813C8&download=1&inline=file">Medicare for All Act</a>, was published with a Shiny application that allows readers to test key assumptions regarding health care budgets, projected revenue, and the projected expansion of health care use.</p>
<p>While the authors are clearly arguing the case for the single-payer system, the publication in a prestigious peer-reviewed journal, the detailed, documented data presented, and the Shiny app for testing assumptions should make this paper the basis for all serious, rational discussion about the economic viability of the single-payer system.</p>
<p>This Shiny app is also a milestone for R, as it demonstrates the ability of R to help experts interactively engage with informed citizens to help them develop their own insights on complex matters.</p>
<iframe src="https://shift-cidma.shinyapps.io/single-payer_healthcare_interactive_financing_tool/" width = 100% height = 1200></iframe>
rstudio::conf 2020 Videos
https://rviews.rstudio.com/2020/02/18/rstudio-conf-2020-videos/
Tue, 18 Feb 2020 00:00:00 +0000
<p><img src="conf.jpg" height = "600" width="800"></p>
<p><a href="https://rstudio.com/conference/2020-conf-schedule/">rstudio::conf 2020</a> is already receding in the rear view mirror, but the wealth of resources generated by the conference will be valuable for quite some time. All of the materials from the <a href="https://rviews.rstudio.com/2020/01/27/rstudio-conf-2020-workshopsr/">workshops</a>, and now all one hundred and four <a href="https://resources.rstudio.com/rstudio-conf-2020">videos of conference talks</a>, are available. This unique video collection offers valuable insight into how developers, data scientists, statisticians, journalists, physicians, educators and other R-savvy professionals are using their domain knowledge, analytical expertise and coding skills to make the world a better place. Talks range from very technical, R-specific topics to scientific contributions, best practices, applications of reproducible research, and the culture and sociology surrounding open source software.</p>
<p>There is a lot to see, but my advice is to begin with J.J. Allaire’s keynote, which presents open source software for data science through the lens of RStudio’s history and mission.</p>
<div style="padding-top:20px;padding-bottom: 80px;">
<h3>J.J. Allaire's Keynote at rstudio::conf 2020, announcing RStudio, PBC</h3>
<script src="https://fast.wistia.com/embed/medias/i7lqqlo6ng.jsonp" async></script><script src="https://fast.wistia.com/assets/external/E-v1.js" async></script><div class="wistia_responsive_padding" style="padding:56.25% 0 0 0;position:relative;"><div class="wistia_responsive_wrapper" style="height:100%;left:0;position:absolute;top:0;width:100%;"><div class="wistia_embed wistia_async_i7lqqlo6ng videoFoam=true" style="height:100%;position:relative;width:100%"><div class="wistia_swatch" style="height:100%;left:0;opacity:0;overflow:hidden;position:absolute;top:0;transition:opacity 200ms;width:100%;"><img src="https://fast.wistia.com/embed/medias/i7lqqlo6ng/swatch" style="filter:blur(5px);height:100%;object-fit:contain;width:100%;" alt="" aria-hidden="true" onload="this.parentNode.style.opacity=1;" /></div></div></div></div>
</div>
Photo Mosaics in R
https://rviews.rstudio.com/2020/02/13/photo-mosaics-in-r/
Thu, 13 Feb 2020 00:00:00 +0000
<p><em>Harrison Schramm, CAP, PStat, is a Senior Fellow at the Center for Strategic and Budgetary Assessments.</em></p>
<p>In this short piece, I’m going to discuss a fun photography project I did over the winter using R. I’m also going to touch on some of the implications of the R license, which underlies our entire ecosystem but which we don’t often think about.</p>
<p>I’ve been a dedicated useR for the past 4 years. I started by using R for all the things that I previously did with spreadsheets - a great way to learn your way around the <code>magrittr</code> and <code>dplyr</code> packages. From there, I replaced my word processor and slide software with <code>markdown</code>. All the while, I’ve been increasing the amount of time I spend on graphics, particularly within the <code>ggplot2</code> construct, as well as the color scales provided by the <code>ggsci</code> package (there are Simpsons and Futurama color palettes!). At some point over the summer, my interest in developing graphics began to inspire my R-tistic side, which I wrote about in <a href="https://pubsonline.informs.org/do/10.1287/LYTX.2019.06.12/full/">INFORMS/Analytics</a> magazine.</p>
<p>All of last year, I worked on a project titled ‘Mosaic’. When we finished our <a href="https://csbaonline.org/research/publications/mosaic-warfare-exploiting-artificial-intelligence-and-autonomous-systems-to-implement-decision-centric-operations">report</a> and my coauthors asked for suggestions on cover art, I naturally suggested we create a photo mosaic. While there are both free and commercially available solutions, my first choice, of course, was to find an R-centric solution. The advantage of using R for this project (as for other things) is that it allows for the creation of <em>bespoke</em> solutions; in other words, I don’t want just any photo mosaic, but rather one that has the attributes that I want.</p>
<p>After a quick Stack Overflow / CRAN search, I found the <code>RsimMosaic</code> <a href="https://cran.r-project.org/web/packages/RsimMosaic/RsimMosaic.pdf">package</a>, which gave me the tools I was looking for.</p>
<div id="making-a-photo-mosaic" class="section level3">
<h3>Making a photo mosaic</h3>
<p>Basically, the tools in this package take a base image and replace each pixel with a tile. If the tiles are properly chosen, a close-up view will focus on the tiles, but at a distance the base image will emerge. While this is simple enough, there are a few things to consider.</p>
<p>First, the dimensions of the resulting image (in pixels) will be approximately the base image expanded by the tiles; for instance, if the base is 150x150 and the tiles are 30x30, the resulting image will be 4500 x 4500. I (somewhat embarrassingly) discovered this by having my R instance return a <code>cannot allocate vector of size 9GB</code> error <a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>.</p>
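<p>The arithmetic is easy to check in R; the memory figure below assumes a single uncompressed three-channel array of 8-byte doubles, which is only a rough model of what an imaging package actually allocates:</p>

```r
base_px <- 150               # base image edge, in pixels
tile_px <- 30                # tile edge, in pixels

out_px <- base_px * tile_px  # mosaic edge length: 4500 pixels

# Rough memory for one uncompressed RGB copy (doubles, 8 bytes each)
gb <- out_px^2 * 3 * 8 / 1024^3
round(gb, 2)                 # roughly 0.45 GB per copy; intermediate copies add up fast
```
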
<p>Second, there is an R-tistic balance between the size of the base image and the size of the tiles; some experimentation is necessary. If you have an extensive library of tiles (I had over 600 in this instance), it is possible - but ill-advised - to try to adjust their sizes manually. Fortunately, the package has a utility for doing this. However, there’s a catch: not all pictures have a base resolution that is amenable to being scaled down.</p>
<p>Building a photo mosaic is really an R-tistic thing to do. The key is to collect a library of tiles that will allow sufficient diversity so that the program can make good contrast choices. For example, pictures of ships and airplanes are heavy in blue tones, etc.</p>
</div>
<div id="making-a-tile-library" class="section level3">
<h3>Making a tile library</h3>
<p>Once you have found your tiles, you will want to resize them. For example, I have found in most images I’ve been working with that a 30 x 30 (pixel) tile is a good size, balancing resolution with ‘mosaic-ness’. However, this is not the size of most raw images, and while you can resize them in MS Paint, this is a painstaking process. Fortunately <code>RsimMosaic</code> provides a handy method: <code>createTiles</code>. It’s <em>almost</em> perfect for my application.</p>
</div>
<div id="heres-the-bit-about-the-license" class="section level3">
<h3>Here’s the bit about the license</h3>
<p>Because R and its packages are distributed under the GPL license, you have the ability to adjust the functions that are in packages. This is straightforward; if you want to adjust a function <code>foo</code>, you can assign it a new name <code>foo2 <- foo</code> and then <code>fix(foo2)</code> (or use a script / markdown) to make changes. You can of course simply edit the original function, but I do not recommend this as it can cause confusion in subsequent sessions.</p>
<p>Remember in the preceding paragraph where I said it was <em>almost</em> perfect? Some tiles have base sizes that cannot be coerced to rectangular shapes. The method that comes with the package simply generates an error. What is useful when generating ~600 tiles is to have a list so that I know which ones threw an error <a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a>.</p>
<p>But because we can do whatever we want with existing functions, consider the following modification:</p>
<pre><code>createTiles2 = createTiles
fix(createTiles2)
# add the following directly after line 15:
print(filenameArray[i])</code></pre>
<p>And voila! Fast and artistic! I used this list to find the tiles that could not be resized, and removed them from my tile library.</p>
</div>
<div id="making-the-mosaic" class="section level2">
<h2>Making the Mosaic</h2>
<p>Now, for the part where we make the mosaic. The command:</p>
<p><code>composeMosaicFromImageRandomOptim("RLogo_small.jpg", "RMosaic.jpg", "OutputPathGoesHere")</code></p>
<p>generates the following:</p>
<p><img src="RMosaic.jpg" height = "600" width="800"></p>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Is it surprising that an accomplished professional would admit to such a silly mistake in public? I don’t think so! We would all be better off if we were as open about our failures as our successes: first, to prevent others from wasting precious time on similar mistakes, and second, to show that if we are going to work at the cutting edge, we are all in a sense students.<a href="#fnref1" class="footnote-back">↩</a></p></li>
<li id="fn2"><p>other programmers might suggest a <code>tryCatch()</code> environment, and I did think of that. Here was a case where I wanted something that worked fast, vice something that worked well.<a href="#fnref2" class="footnote-back">↩</a></p></li>
</ol>
</div>
Some 2020 R Conferences
https://rviews.rstudio.com/2020/02/05/some-2020-r-conferences/
Wed, 05 Feb 2020 00:00:00 +0000
<p>rstudio::conf kicked off the 2020 season for R conferences last week with record attendance somewhere north of twenty-one hundred. Session topics ranged from business to science and from marketing to medicine, and the conference attracted R users with very varied backgrounds including DevOps professionals, data scientists, journalists, physicians, statisticians, R package developers, Shiny developers and more. Although it is true that the San Francisco Bay Area is home to a large R Community, and that a great deal of planning and promotion went into making rstudio::conf a success, I don’t think that the enthusiasm and energy that permeated the conference was a local phenomenon. I expect 2020 to be a good year for R conferences worldwide. Here is my short, somewhat eclectic, and by no means complete list of upcoming 2020 R events.</p>
<ul>
<li>While not an R specific conference, the <a href="https://ww2.amstat.org/meetings/csp/2020/?utm_source=informz&utm_medium=email&utm_campaign=asa&_zs=JaUQh1&_zl=mo2I6">Conference on Statistical Practice</a> (Sacramento, February 20 - 22) coming up soon will have some interesting R content. I am particularly looking forward to the talk by Songtao Wang on <a href="https://ww2.amstat.org/meetings/csp/2020/onlineprogram/AbstractDetails.cfm?AbstractID=303990">Thinking Statistically in Social Science and Humanities</a>.</li>
<li>The <a href="https://www.ire.org/events-and-training/conferences/nicar-2020">NICAR Conference</a> produced by the nonprofit <a href="https://www.ire.org/">Investigative Reporters and Editors Inc.</a> (New Orleans, March 5 - 8) is a gathering of R and Python savvy journalists who use R as an everyday tool. Last year, I found the workshops and R training sessions to be outstanding.</li>
<li>The <a href="http://www.populationassociation.org/sidebar/annual-meeting/">PAA Annual Meeting</a> (Washington D.C., April 22 - 25) is an opportunity to meet demographers, economists, public health professionals, and sociologists using R.</li>
<li>The <a href="https://rstats.ai/nyr/#about">R Conference New York</a> (May, 7 - 9), the re-branded NY R Conference of previous years, promises to be the major East Coast R event of the year with a diverse and talented roster of speakers.</li>
<li>The European R Users meeting, <a href="https://2020.erum.io/">eRUM</a> (May, 27 - 30) will be the first major European R conference of the season. Note that Sharon Machlis, Director of Data & Analytics at IDG Communications, will be one of the keynote speakers. 2020 may be the year that journalism captures the attention of the R Community.</li>
<li>The Symposium on Data Science & Statistics, <a href="https://ww2.amstat.org/meetings/sdss/2020/onlineprogram/index.cfm">SDSS</a>, (Pittsburgh, June 3 - 6) has become one of my favorite statistics conferences. Smaller and more manageable than the JSM, SDSS is sure to feature talks infused with R content. Note that there will be a session on <a href="https://ww2.amstat.org/meetings/sdss/2020/onlineprogram/Program.cfm?date=06-05-20">Data Journalism and Visualization</a>.</li>
<li>Now in its twelfth year, <a href="http://uic.cvent.com/events/2020-r-finance-call-for-presentations/event-summary-add8ccef16bc42778b301c23ccab1a9e.aspx">R / Finance</a> (Chicago, June 5 - 6) has set the bar for small, single-session, collegial, technical R conferences. Focusing on Financial applications with a serious dose of advanced time series applications, R / Finance provides the opportunity to interact with professionals who put their money on R.</li>
<li>Always the pivotal event of the R Community, <a href="https://user2020.r-project.org/">useR! 2020</a> (St. Louis, July 7 -10) is sure to be a great event and a good time. The final program has not been published (Note that the <a href="https://user2020.r-project.org/news/2019/11/20/call-for-abstracts/">Call for papers</a> is still open.), but the <a href="https://user2020.r-project.org/program/tutorials/">tutorial sessions</a> alone indicate that useR! 2020 will be an outstanding educational opportunity.<br /></li>
<li>The <a href="https://user2020muc.r-project.org/">useR! 2020 European Hub Conference</a> (Munich, July 7 - 10) will feature live talks as well as video streaming sessions from useR! 2020 in St. Louis. This innovative format promises to be an important event in its own right and a satisfying community experience.</li>
<li>The <a href="https://ww2.amstat.org/meetings/jsm/2020/">JSM</a> (Philadelphia, August 1 - 6), the mother of all statistical conferences, is expecting more than 6,500 attendees from 52 countries. This is an event you have to train for, but with some preparation and a little planning the JSM can be an opportunity to interact with statisticians who depend on the deep statistical knowledge embedded in R.</li>
<li>The Bioconductor Conference, <a href="https://bioc2020.bioconductor.org/">BioC 2020</a>, the event where <em>Software and Biology Connect</em> (Boston, July 29 - 30) is the premier conference for R and Genomics. If you have an interest in learning about cutting edge statistical applications using the big data of modern Biology, this is the conference to attend.<br /></li>
<li>The <a href="http://whyr.pl/2020/">Why R? 2020 Conference</a> (Warsaw, August 27 - 30) is not only positioned to be the third significant European R conference of the year; the organizers have also developed an ambitious and innovative strategy of supporting a number of satellite pre-meetings that stretch from Warsaw to Limerick and South Africa. I think this is a fantastic community initiative that represents a commitment to building the R Community in underserved areas.</li>
</ul>
<p><img src="whyR.png" height = "400" width="600"></p>
<ul>
<li>There is no website yet, but it is breaking news that R / Medicine 2020 will be held in Philadelphia from August 27th to 29th. In its third year, R / Medicine is establishing itself as the R conference for physicians seeking to advance clinical practice with R fueled data science.<br /></li>
<li>The <a href="https://latin-r.com/en">LatinR 2020 Conference</a> (Montevideo, October 7 - 9) will be a major South American event. The <a href="https://latin-r.com/blog/call-for-papers">Call for papers</a> is open. Look <a href="https://latin-r.com/previous-editions/">here</a> for previous programs.<br /></li>
<li>The <a href="https://bioconductor.github.io/BiocAsia2020/">BiocAsia 2020 Conference</a> (Beijing, October 17 - 18), the yearly Asian Bioconductor event, has confirmed that Robert Gentleman will be speaking.</li>
</ul>
<p>Other conferences on my radar are:</p>
<ul>
<li>The <a href="https://rstats.ai/dublinr/">R Conference Dublin</a> (June)</li>
<li>The CascadiaRConf (Eugene, OR)</li>
<li>R / Pharma, which will most likely be held at Harvard University in August</li>
<li>BioCEurope, which will likely be held in December</li>
</ul>
<p>Please let me know what upcoming conferences I may have missed by adding them to the comments section of this post.</p>
<script>window.location.href='https://rviews.rstudio.com/2020/02/05/some-2020-r-conferences/';</script>
rstudio::conf 2020 Workshops
https://rviews.rstudio.com/2020/01/27/rstudio-conf-2020-workshopsr/
Mon, 27 Jan 2020 00:00:00 +0000https://rviews.rstudio.com/2020/01/27/rstudio-conf-2020-workshopsr/
<p>rstudio::conf 2020 got underway today with a huge training event featuring eighteen workshops taught by some of the most experienced and sought after instructors in the R Community.</p>
<p><img src="conf.png" height = "400" width="600"></p>
<p>The workshops covered a wide range of topics including the Tidyverse, machine learning, deep learning, JavaScript, Shiny, R Markdown, package building, geospatial statistics, visualization, teaching R and working as an RStudio professional products administrator.</p>
<ul>
<li>Tidy Time Series Analysis and Forecasting<br /></li>
<li>R for Excel Users</li>
<li>What They Forgot to Teach You about R<br /></li>
<li>RStudio Professional Products Administration<br /></li>
<li>Big Data with R<br /></li>
<li>Introduction to Data Science in the Tidyverse<br /></li>
<li>Text Mining with Tidy Data Principles<br /></li>
<li>R Markdown and Interactive Dashboards</li>
<li>A Practical Introduction to Data Visualization with ggplot2</li>
<li>Deep Learning with Keras and TensorFlow in R Workflow<br /></li>
<li>Modern Geospatial Data Analysis with R</li>
<li>My Organization’s First R Package</li>
<li>Introduction to Machine Learning with the Tidyverse<br /></li>
<li>JavaScript for Shiny Users</li>
<li>Designing the Data Science Classroom<br /></li>
<li>Shiny from Start to Finish</li>
<li>Applied Machine Learning</li>
<li>Building Tidy Tools</li>
</ul>
<p>If you were not able to be among the over 1,400 R users who attended, all is not lost. You can still immerse yourself in hundreds of hours of R study: all of the training materials are available.</p>
<p>For descriptions of each workshop look <a href="https://web.cvent.com/event/36ebe042-0113-44f1-8e36-b9bc5d0733bf/websitePage:34f3c2eb-9def-44a7-b324-f2d226e25011?RefId=conference&utm_campaign=Site%20Promo&utm_medium=Ste&utm_source=ConfPage">here</a>, and for the materials for each workshop, including slides and code, visit <a href="https://github.com/rstudio-conf-2020">the conference workshop repo</a>.</p>
<script>window.location.href='https://rviews.rstudio.com/2020/01/27/rstudio-conf-2020-workshopsr/';</script>
December 2019: "Top 40" New R Packages
https://rviews.rstudio.com/2020/01/20/december-2019-top-40-new-r-packages/
Mon, 20 Jan 2020 00:00:00 +0000https://rviews.rstudio.com/2020/01/20/december-2019-top-40-new-r-packages/
<p>One hundred fifty-two packages made it to CRAN in December. Here are my “Top 40” picks in ten categories: Data, Genomics, Machine Learning, Mathematics, Medicine, Science, Statistics, Time Series, Utilities, and Visualization.</p>
<h3 id="data">Data</h3>
<p><a href="https://cran.r-project.org/package=climate">climate</a> v0.3.0: Provides access to meteorological and hydrological data from <a href="http://ogimet.com/index.phtml.en">OGIMET</a>, University of Wyoming - <a href="http://weather.uwyo.edu/upperair">atmospheric vertical profiling data</a>, and Polish Institute of Meteorology and Water Management - <a href="https://dane.imgw.pl">National Research Institute</a>. Look <a href="https://www.mdpi.com/2071-1050/12/1/394">here</a> for more information as well as the <a href="https://cran.r-project.org/web/packages/climate/vignettes/getstarted.html">vignette</a>.</p>
<p><a href="https://cran.r-project.org/package=CCAMLRGIS">CCAMLRGIS</a> v3.0.1: Loads and creates spatial data, including layers and tools that are relevant to the activities of the Commission for the Conservation of Antarctic Marine Living Resources ( <a href="https://www.ccamlr.org/en/organisation/home-page">CCAMLR</a>). Have a look at the <a href="https://cran.r-project.org/web/packages/CCAMLRGIS/vignettes/CCAMLRGIS.html">vignette</a>.</p>
<p><img src="CCAMLRGIS.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=schrute">schrute</a> v0.1.1: Contains the complete scripts from the American version of the Office television show in tibble format. Have a look at the <a href="https://cran.r-project.org/web/packages/schrute/vignettes/theoffice.html">vignette</a> and practice NLP.</p>
<p><a href="https://cran.r-project.org/package=simfinR">simfinR</a> v0.1.0: Provides access to <a href="https://simfin.com/">SimFin</a> financial data including balance sheets, cash flow and income statements through the <a href="https://simfin.com/data/access/api">api</a>. Look <a href="https://www.msperlin.com/blog/post/2019-11-01-new-package-simfinr/">here</a> for details.</p>
<p><img src="simfin.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=statcanR">statcanR</a> v0.1.0: Provides access to Statistics Canada’s <a href="https://www.statcan.gc.ca/eng/developers/wds">Web Data Service</a>. See <a href="doi:10.6084/m9.figshare.10544735">Warin & Le Duc (2019)</a> and the <a href="https://cran.r-project.org/web/packages/statcanR/vignettes/statCanR.html">vignette</a>.</p>
<h3 id="genomics">Genomics</h3>
<p><a href="https://cran.r-project.org/package=ampir">ampir</a> v0.1.0: Implements a toolkit to predict antimicrobial peptides from protein sequences on a genome-wide scale, including an SVM model trained on publicly available antimicrobial peptide data using calculated physico-chemical and compositional sequence properties described in <a href="doi:10.1038/srep42362">Meher et al. (2017)</a>. There is a brief <a href="https://cran.r-project.org/web/packages/ampir/vignettes/ampir.html">Introduction</a>.</p>
<p><a href="https://cran.r-project.org/package=simplePHENOTYPES">simplePHENOTYPES</a> v1.0.5: Implements algorithms for simulating pleiotropy and Linkage Disequilibrium under additive, dominance and epistatic models. See <a href="https://academic.oup.com/bioinformatics/article/28/18/2397/252743">Lipka et al. (2012)</a> and <a href="https://dl.sciencesocieties.org/publications/tpg/articles/12/1/180052">Rice and Lipka (2019)</a> for background and the <a href="https://cran.r-project.org/web/packages/simplePHENOTYPES/vignettes/simplePHENOTYPES.html">vignette</a> for an introduction.</p>
<p><a href="https://cran.r-project.org/package=TreeTools">TreeTools</a> v0.1.3: Provides functions for the creation, modification and analysis of phylogenetic trees and the import and export of trees from Newick, Nexus <a href="doi:10.1093/sysbio/46.4.590">(Maddison et al. 1997)</a>, and <a href="http://www.lillo.org.ar/phylogeny/tnt/">TNT</a> formats. There are vignettes on <a href="https://cran.r-project.org/web/packages/TreeTools/vignettes/load-data.html">Loading Data</a>, <a href="https://cran.r-project.org/web/packages/TreeTools/vignettes/load-trees.html">Loading Trees</a>, and <a href="https://cran.r-project.org/web/packages/TreeTools/vignettes/filesystem-navigation.html">Navigating the File System</a>.</p>
<h3 id="machine-learning">Machine Learning</h3>
<p><a href="https://cran.r-project.org/package=AzureVision">AzureVision</a> v1.0.0: Implements an interface to <a href="https://docs.microsoft.com/azure/cognitive-services/Computer-vision/Home">Azure Computer Vision</a> and <a href="https://docs.microsoft.com/azure/cognitive-services/custom-vision-service/home">Azure Custom Vision</a> which allow users to leverage the cloud to carry out visual recognition tasks using advanced image processing models. There is a vignette on <a href="https://cran.r-project.org/web/packages/AzureVision/vignettes/computervision.html">Computer Vision</a> and another on <a href="https://cran.r-project.org/web/packages/AzureVision/vignettes/customvision.html">Custom Vision</a>.</p>
<p><a href="https://cran.r-project.org/package=dann">dann</a> v0.1.0: Implements Discriminant Adaptive Nearest Neighbor classification, a variation of k-nearest neighbors in which the neighborhood is elongated along class boundaries. See <a href="https://web.stanford.edu/~hastie/Papers/dann_IEEE.pdf">Hastie (1995)</a> for details. There is an <a href="https://cran.r-project.org/web/packages/dann/dann.pdf">Introduction</a> and a vignette on <a href="https://cran.r-project.org/web/packages/dann/vignettes/dann.html">Sub-dann</a>.</p>
<p><a href="https://cran.r-project.org/package=eventstream">eventstream</a> v0.1.0: Provides functions to extract and classify events in contiguous spatio-temporal data streams of 2 or 3 dimensions. For details see <a href="doi:10.13140/RG.2.2.10051.25129">Kandanaarachchi et al. 2018</a>. There is an example in <a href="https://cran.r-project.org/web/packages/eventstream/readme/README.html">README</a>.</p>
<p><img src="eventstream.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=isotree">isotree</a> v0.1.8: Provides multi-threaded implementations of <a href="doi:10.1109/ICDM.2008.17">isolation forest</a>, <a href="arXiv:1811.02141">extended isolation forest</a>, <a href="doi:10.1007/978-3-642-15883-4_18">SCiForest</a>, and <a href="arXiv:1911.06646">fair-cut forest</a> for isolation-based outlier detection, clustered outlier detection, distance or <a href="arXiv:1910.12362">similarity approximation</a>, and imputation of missing values as described in <a href="arXiv:1911.06646">Cortes (2019)</a>. Look <a href="https://github.com/david-cortes/isotree">here</a> for an example.</p>
<p><img src="isotree.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=mlr3proba">mlr3proba</a> v0.1.1: Extends <a href="https://cran.r-project.org/package=mlr3"><code>mlr3</code></a> for probabilistic supervised learning that includes probabilistic and interval regression, survival modeling, and other specialized models. There is a vignette on <a href="https://cran.r-project.org/web/packages/mlr3proba/vignettes/survival.html">Survival Analysis</a>.</p>
<p><a href="https://cran.r-project.org/package=NLPclient">NLPclient</a> v1.0: Implements an interface to the <a href="https://stanfordnlp.github.io/CoreNLP/index.html">Stanford CoreNLP</a> annotation client which includes a part-of-speech (POS) tagger, a named entity recognizer (NER), a parser, and a co-reference resolution system. See <a href="https://cran.r-project.org/web/packages/NLPclient/readme/README.html">README</a> for installation details.</p>
<p><a href="https://cran.r-project.org/package=stray">stray</a> v0.1.0: Modifies the <a href="https://cran.r-project.org/package=HDoutliers"><code>HDoutliers</code></a> package for outlier detection in high dimensional data to include the algorithm proposed in <a href="arXiv:1908.04000">Talagala, Hyndman and Smith-Miles (2019)</a>.</p>
<p><a href="https://cran.r-project.org/package=tfhub">tfhub</a> v0.7.0: Provides an interface to TensorFlow Hub, a library for the publication, discovery, and consumption of reusable parts of machine learning models. Modules comprise self-contained parts of <code>TensorFlow</code> graphs, along with weights and assets, that can be reused across different tasks in a process known as transfer learning. There is an <a href="https://cran.r-project.org/web/packages/tfhub/vignettes/intro.html">Overview</a> and vignettes on <a href="https://cran.r-project.org/web/packages/tfhub/vignettes/key-concepts.html">Key Concepts</a> and using <a href="https://cran.r-project.org/web/packages/tfhub/vignettes/hub-with-keras.html"><code>TensorFlow</code> with <code>Keras</code></a>.</p>
<h3 id="mathematics">Mathematics</h3>
<p><a href="https://cran.r-project.org/package=dual">dual</a> v0.0.3: Implements automatic differentiation using dual numbers and returns the output value of a mathematical function along with its exact first derivative (or gradient). For more details see <a href="http://jmlr.org/papers/volume18/17-468/17-468.pdf">Baydin et al. (2018)</a>.</p>
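<p>The dual-number idea itself fits in a few lines of plain R. The following is a conceptual sketch of forward-mode differentiation, not the <code>dual</code> package’s API:</p>

```r
# A dual number carries a value and the derivative of that value.
dual <- function(value, deriv = 0) list(value = value, deriv = deriv)

# Product rule: (uv)' = u'v + uv'
dual_mul <- function(a, b) {
  dual(a$value * b$value, a$deriv * b$value + a$value * b$deriv)
}

# Chain rule for sin: (sin u)' = cos(u) * u'
dual_sin <- function(a) dual(sin(a$value), cos(a$value) * a$deriv)

# Differentiate f(x) = x * sin(x) at x = 2 by seeding deriv = 1.
x <- dual(2, 1)
f <- dual_mul(x, dual_sin(x))
f$deriv  # equals sin(2) + 2 * cos(2), the exact derivative
```

<p>Because both the value and the derivative propagate through every operation, the result is exact to machine precision, with no finite-difference step size to tune.</p>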
<p><a href="https://cran.r-project.org/package=set6">set6</a> v0.1.0: Provides an object-oriented interface, built on <code>R6</code>, for constructing and manipulating mathematical sets, including (countably finite) sets, tuples, intervals (countably infinite or uncountable), and fuzzy variants. See the <a href="https://cran.r-project.org/web/packages/set6/vignettes/set6.html">vignette</a> for an introduction.</p>
<h3 id="medicine">Medicine</h3>
<p><a href="https://cran.r-project.org/package=LARisk">LARisk</a> v0.1.0: Provides functions to compute lifetime attributable risk of radiation-induced cancer. See <a href="https://doi.org/10.1088/0952-4746/32/3/205">Gonzalez et al. (2012)</a> for background and the <a href="https://cran.r-project.org/web/packages/LARisk/vignettes/vignette.pdf">vignette</a> for an example.</p>
<p><a href="https://cran.r-project.org/package=SCtools">SCtools</a> v0.3.0: Provides extensions to the synthetic controls analyses performed by the package <a href="https://cran.r-project.org/package=Synth"><code>Synth</code></a> as detailed in <a href="doi:10.18637/jss.v042.i13">Abadie et al. (2011)</a> that include generating and plotting placebos, post/pre-MSPE (mean squared prediction error) significance tests and plots, and calculating average treatment effects for multiple treated units. There is a vignette on replicating the <a href="https://cran.r-project.org/web/packages/SCtools/vignettes/replicating-basque.html">Basque Study</a> and another on <a href="https://cran.r-project.org/web/packages/SCtools/vignettes/case-study.html">Alcohol Consumption</a>.</p>
<p><img src="SCtools.png" height = "400" width="600"></p>
<h3 id="science">Science</h3>
<p><a href="https://cran.r-project.org/package=chronosphere">chronosphere</a> v0.2.0: Provides functions to facilitate spatial analyses in (paleo)environmental/ecological research, and serves as a gateway to plate tectonic reconstructions, deep-time global climate model results, as well as fossil occurrence datasets such as the <a href="https://www.paleobiodb.org/">Paleobiology Database</a> and the <a href="https://www.paleo-reefs.pal.uni-erlangen.de/">PaleoReefs Database</a>. See the <a href="https://cran.r-project.org/web/packages/chronosphere/vignettes/chronos.pdf">vignette</a> for an introduction.</p>
<p><img src="chronosphere.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=OCNet">OCNet</a> v0.1.1: Provides functions to generate and analyze Optimal Channel Networks (OCNs): oriented spanning trees reproducing all scaling features characteristic of real, natural river networks. See <a href="doi:10.1073/pnas.1322700111">Rinaldo et al. (2014)</a> for an overview of the OCN concept, <a href="doi:10.18637/jss.v036.i10">Furrer and Sain (2010)</a> for the construct used, and the <a href="https://cran.r-project.org/web/packages/OCNet/vignettes/OCNet.html">vignette</a> for examples.</p>
<p><img src="OCNet.png" height = "400" width="600"></p>
<h3 id="statistics">Statistics</h3>
<p><a href="https://cran.r-project.org/package=bnma">bnma</a> v1.0.0: Provides functions for network meta-analyses using the Bayesian framework of <a href="doi:10.1177/0272989X12458724">Dias et al. (2013)</a>. See the <a href="https://cran.r-project.org/web/packages/bnma/vignettes/my-vignette.html">vignette</a>.</p>
<p><a href="https://cran.r-project.org/package=npsurvSS">npsurvSS</a> v1.0.1: Provides sample size and power calculations for common non-parametric tests in survival analysis including the difference in (or ratio of) t-year survival, difference in (or ratio of) p-th percentile survival, difference in (or ratio of) restricted mean survival time, and the weighted log-rank test. There are vignettes on <a href="https://cran.r-project.org/web/packages/npsurvSS/vignettes/basic_functionalities.html">Basic Functions</a>, <a href="https://cran.r-project.org/web/packages/npsurvSS/vignettes/example1.html">Optimal Randomization Ratio</a>, and <a href="https://cran.r-project.org/web/packages/npsurvSS/vignettes/example2.html">Delayed Treatment Effect</a>.</p>
<p><img src="npsurvSS.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=sail">sail</a> v0.1.0: Implements sparse additive interaction learning with the strong heredity property, i.e., an interaction is selected only if its corresponding main effects are also included. See <a href="doi:10.1101/445304">Bhatnagar et al. (2019)</a> for background. There is also an <a href="https://cran.r-project.org/web/packages/sail/vignettes/introduction-to-sail.html">Introduction</a> and a <a href="https://cran.r-project.org/web/packages/sail/vignettes/user-defined-design.html">vignette</a> on supplying a user-defined design matrix.</p>
<p><img src="sail.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=SequenceSpikeSlab">SequenceSpikeSlab</a> v0.1.1: Implements the algorithms described in <a href="arXiv:1810.10883">Van Erven & Szabo (2018)</a> to calculate the exact Bayes posterior for the Sparse Normal Sequence Model. See the <a href="https://cran.r-project.org/web/packages/SequenceSpikeSlab/vignettes/SequenceSpikeSlab-vignette.html">vignette</a>.</p>
<p><a href="https://cran.r-project.org/package=tcensReg">tcensReg</a> v0.1.5: Implements maximum likelihood estimation (MLE) assuming an underlying left truncated normal distribution with left censoring described in <a href="arXiv:1911.11221">Williams et al. (2019)</a>. See the <a href="https://cran.r-project.org/web/packages/tcensReg/vignettes/tcensReg.html">vignette</a>.</p>
<p><a href="https://cran.r-project.org/package=univariateML">univariateML</a> v1.0.0: Looks back to the roots of maximum likelihood estimation (<a href="doi:10.1098/rsta.1922.0009">Fisher, 1922</a>) to provide functions for the ML estimation of univariate densities. There is an <a href="https://cran.r-project.org/web/packages/univariateML/vignettes/overview.html">Overview</a> and vignettes on <a href="https://cran.r-project.org/web/packages/univariateML/vignettes/copula.html">Copula Modeling</a> and <a href="https://cran.r-project.org/web/packages/univariateML/vignettes/distributions.html">Distributions</a>.</p>
<p><img src="univariateML.png" height = "400" width="600"></p>
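<p>A quick sense of the workflow, assuming the <code>mlnorm()</code> estimator described in the package’s Overview (treat the function name as an assumption if you are following along):</p>

```r
# Hedged sketch: fit a normal density to data by maximum likelihood.
# mlnorm() is one of the ml* estimators the package overview describes.
library(univariateML)

set.seed(42)
x <- rnorm(200, mean = 3, sd = 2)
fit <- mlnorm(x)  # ML estimates of the mean and standard deviation
fit
```

<p>The other <code>ml*</code> estimators follow the same one-call pattern, which is what makes the package convenient for quickly comparing candidate densities.</p>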
<h3 id="time-series">Time Series</h3>
<p><a href="https://cran.r-project.org/package=imputeFin">imputeFin</a> v0.1.0: Provides functions to impute missing values by modeling the time series with a random walk or an autoregressive (AR) model, which is convenient for modeling log-prices and log-volumes in financial data. See <a href="doi:10.1109/TSP.2019.2899816">Liu et al. (2019)</a> for background and the <a href="https://cran.r-project.org/web/packages/imputeFin/vignettes/ImputeFinancialTimeSeries.html">vignette</a> for examples.</p>
<p><img src="imputeFin.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=VLTimeCausality">VLTimeCausality</a> v0.1.0: Implements a framework to infer causality on a pair of time series of real numbers based on variable-lag <a href="https://www.statisticshowto.datasciencecentral.com/granger-causality/">Granger causality</a> and transfer entropy. See <a href="https://www.cs.uic.edu/~elena/pubs/amornbunchornvej-dsaa19.pdf">Zheleva & Berger-Wolf (2019)</a> for the details and the vignette for examples.</p>
<p><img src="VLTimeCausality.png" height = "400" width="600"></p>
<h3 id="utilities">Utilities</h3>
<p><a href="https://cran.r-project.org/package=asciicast">asciicast</a> v1.0.0: Implements tools to record screen casts from R scripts and convert them to animated SVG images for use in <code>README</code> files and blog posts. It includes <code>asciinema-player</code> as an <code>HTML</code> widget, and a <code>knitr</code> engine, to embed <code>ascii</code> screen casts in R Markdown documents. There is a <a href="https://cran.r-project.org/web/packages/asciicast/vignettes/asciicast-demo.html">vignette</a>.</p>
<p><a href="https://cran.r-project.org/package=funneljoin">funneljoin</a> v0.1.0: Implements time-based joins to analyze sequences of events, both in memory and out of memory. See the <a href="https://cran.r-project.org/web/packages/funneljoin/vignettes/funneljoin.html">vignette</a> for details.</p>
<p><a href="https://cran.r-project.org/package=hardhat">hardhat</a> v0.1.1: Provides tools to reduce the burden around building new modeling packages by providing functionality for preprocessing, predicting, and validating input. There is an <a href="https://cran.r-project.org/web/packages/hardhat/vignettes/package.html">Introduction</a> and vignettes on <a href="https://cran.r-project.org/web/packages/hardhat/vignettes/forge.html">Forging Data</a> and <a href="https://cran.r-project.org/web/packages/hardhat/vignettes/mold.html">Molding Data</a>.</p>
<p><img src="hardhat.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=proffer">proffer</a> v0.0.2: Builds on <a href="https://github.com/google/pprof"><code>pprof</code></a> to provide profiling tools capable of detecting sources of slowness in R code. Look <a href="https://r-prof.github.io/proffer/">here</a> for more information.</p>
<p><img src="proffer.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=ropenblas">ropenblas</a> v0.2.0: Facilitates downloading, compiling and linking the <a href="https://www.openblas.net/"><code>OpenBLAS</code> library</a> for users of any <code>GNU/Linux</code> distribution. See <a href="https://cran.r-project.org/web/packages/ropenblas/readme/README.html">README</a> for help.</p>
<p><a href="https://cran.r-project.org/package=sortable">sortable</a> v0.4.2: Provides functions to enable drag-and-drop behavior in Shiny apps by exposing the functionality of the <a href="https://sortablejs.github.io/Sortable/"><code>SortableJS</code></a> JavaScript library as an <a href="http://htmlwidgets.org"><code>htmlwidget</code></a>. There is a live demo on <a href="https://cran.r-project.org/web/packages/sortable/vignettes/novel_solutions.html">Using Sortable</a>, another on <a href="https://cran.r-project.org/web/packages/sortable/vignettes/shiny_apps.html">Using Sortable widgets</a>, and a vignette on the <a href="https://cran.r-project.org/web/packages/sortable/vignettes/understanding_sortable_js.html">Interface to SortableJS</a>.</p>
<p><img src="sortable.gif" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=sparkhail">sparkhail</a> v0.1.1: Implements a <code>sparklyr</code> interface to <a href="https://hail.is/"><code>Hail</code></a>, an open-source, general-purpose, <code>Python</code>-based data analysis tool with additional data types and methods for working with genomic data. Hail is built to scale and provides first-class support for the multi-dimensional structured data typical of genome-wide association studies. See <a href="https://cran.r-project.org/web/packages/sparkhail/readme/README.html">README</a> for information on how to use the package.</p>
<p><a href="https://cran.r-project.org/package=trimmer">trimmer</a> v0.8.1: Implements a lightweight toolkit to reduce the size of a list object based on user input. See the <a href="https://cran.r-project.org/web/packages/trimmer/index.html">vignette</a>.</p>
<h3 id="visualization">Visualization</h3>
<p><a href="https://cran.r-project.org/package=gggibbous">gggibbous</a> v0.1.0: Extends <code>ggplot2</code> to offer <em>moon charts</em>, pie charts where the proportions are shown as crescent or gibbous portions of a circle, like the lit and unlit portions of the moon. It is all illuminated in the <a href="https://cran.r-project.org/web/packages/gggibbous/vignettes/gggibbous.html">vignette</a>.</p>
<p><img src="gggibbous.png" height = "400" width="600"></p>
<p><a href="https://cran.r-project.org/package=patchwork">patchwork</a> v1.0.0: Extends the <code>ggplot2</code> API to allow for arbitrarily complex plot compositions by providing mathematical operators for combining multiple plots. See the <a href="https://cran.r-project.org/web/packages/patchwork/vignettes/patchwork.html">vignette</a> for examples.</p>
<p><img src="patchwork.png" height = "400" width="600"></p>
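<p>The operator syntax is compact enough to show in a couple of lines; this sketch composes three ordinary <code>ggplot2</code> plots:</p>

```r
library(ggplot2)
library(patchwork)

p1 <- ggplot(mtcars, aes(mpg, disp)) + geom_point()
p2 <- ggplot(mtcars, aes(factor(cyl))) + geom_bar()
p3 <- ggplot(mtcars, aes(hp, wt)) + geom_point()

# "|" places plots side by side; "/" stacks compositions vertically.
(p1 | p2) / p3
```

<p>Treating plots as terms in an expression is what makes arbitrarily nested layouts possible without any explicit grid bookkeeping.</p>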
<script>window.location.href='https://rviews.rstudio.com/2020/01/20/december-2019-top-40-new-r-packages/';</script>
No Framework, No Problem! Structuring your project folder and creating custom Shiny components
https://rviews.rstudio.com/2020/01/13/no-framework-no-problem-structuring-your-project-folder-and-creating-custom-shiny-components/
Mon, 13 Jan 2020 00:00:00 +0000https://rviews.rstudio.com/2020/01/13/no-framework-no-problem-structuring-your-project-folder-and-creating-custom-shiny-components/
<p><em>Pedro Coutinho Silva is a software engineer at Appsilon Data Science.</em></p>
<p>It is not always possible to create a dashboard that fully meets your expectations or requirements using only existing libraries. Maybe you want a specific function that needs to be custom built, or maybe you want to add your own style or company branding. Whatever the case, a moment might come when you need to expand and organize your code base, and dive into creating a custom solution for your project; but where to start? In this post, I will explain the relevant parts of my workflow for Shiny projects using our hiring funnel application as an example.</p>
<p><img src="dashboard.png" alt="Hiring Funnel Dashboard" /></p>
<p>I will cover:</p>
<ul>
<li>Structuring the project folder: what goes where?
<ul>
<li>Managers and extracting values into settings files.</li>
<li>Using modules to organize your code.</li>
</ul></li>
<li>What does it actually take to create a custom component?</li>
</ul>
<p>Hopefully, these topics will be as valuable to you as they have been to me!</p>
<h3 id="structuring-your-project-folder">Structuring your project folder</h3>
<p>Your typical dashboard project doesn’t have a very complex structure, but that can change a lot as your project grows. So I typically try to keep things as separate as possible and to provide some guidance for future collaborators.</p>
<p>I won’t go over styles, since these are basically an implementation of <a href="https://github.com/rstudio/sass">Sass</a>. Sass lets you keep your sanity while avoiding inline styling; you can read a bit more about it in my <a href="https://appsilon.com/how-to-make-your-css-awesome-with-sass/">previous post</a> on the subject.</p>
<p>So what does our project folder actually look like? Here is our <a href="https://appsilon.com/journey-from-basic-prototype-to-production-ready-shiny-dashboard/">hiring funnel</a> project folder for example:</p>
<pre><code>│   app.R
└─── app
     │   global.R
     │   server.R
     │   ui.R
     └─── managers
     │    │   constants_manager.R
     │    │   data_manager.R
     └─── settings
     │    │   app.json
     │    │   texts.json
     └─── modules
     │    │   header.R
     │    │   sidebar.R
     │    │   ui_components.R
     └─── styles
     │    │   main.scss
     │    │   ...
     └─── www
          │   sass.min.css
          └─── assets
          └─── scripts
</code></pre>
<p>Quite a lot to unpack, but let’s go over the important bits:</p>
<h3 id="managers-and-settings">Managers and Settings</h3>
<p>Managers are scripts that make use of R6 classes. The constants manager has already been covered by one of my colleagues in <a href="https://appsilon.com/super-solutions-for-shiny-architecture-3-5-softcoding-constants-in-the-app/">the super solutions series</a>. The data manager contains all of the abstraction for data loading and processing.</p>
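<p>As a rough sketch of the pattern (the class and method names below are invented for illustration, not the actual hiring funnel managers), a data manager might look like this:</p>

```r
# A hypothetical minimal data manager: it owns the data and the
# filtering logic, so server code only ever talks to the manager.
library(R6)

DataManager <- R6Class("DataManager",
  public = list(
    data = NULL,
    initialize = function(data) {
      # In the real app this would load and preprocess the raw data.
      self$data <- data
    },
    by_stage = function(stage) {
      # Return only the candidates at a given funnel stage.
      self$data[self$data$stage == stage, , drop = FALSE]
    }
  )
)

# Toy usage with a made-up candidates table:
dm <- DataManager$new(data.frame(
  name  = c("Ana", "Ben"),
  stage = c("phone", "onsite"),
  stringsAsFactors = FALSE
))
dm$by_stage("phone")
```

<p>Because the manager is a single object, it can be created once in <code>global.R</code> and shared by the server and any modules that need it.</p>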
<p>The settings files hold all of the values that should not be hard coded: constants, texts, and other values that can be extracted.</p>
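<p>To make that concrete, a settings file might look something like this (the keys and values below are invented for illustration, not taken from the actual hiring funnel app):</p>

```json
{
  "app": {
    "title": "Hiring Funnel",
    "default_tab": "overview"
  },
  "data": {
    "max_rows": 500
  }
}
```

<p>Such a file can be read once at startup, for example with <code>jsonlite::read_json(&quot;app/settings/app.json&quot;)</code>, and the resulting list passed to whichever manager needs it.</p>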
<h3 id="modules">Modules</h3>
<p>Modules let you easily create files for managing parts of your code. This means you can have modules for specific elements or even layout sections without bloating your main files too much.</p>
<p>They are great when it comes to code structure. Take our header, for example: instead of growing our <code>ui.R</code> with all of the header code, we can extract it to a separate file:</p>
<pre><code># header.R
import("shiny")
import("modules")
export("ui")

ui_components <- use("modules/ui_components.R")

ui <- function(id) {
  tags$header(
    ...
  )
}
</code></pre>
<p>All it takes is importing the libraries you plan to use, and exporting the functions you would like to make available. You can even call other modules from inside a module!</p>
<p>After this, we just instantiate the module in <code>ui.R</code>:</p>
<pre><code># ui.R
header <- use("modules/header.R")
</code></pre>
<p>We can now use it by simply calling the function we want:</p>
<pre><code>fluidPage(
  header$ui(),
  ...
)
</code></pre>
<h3 id="custom-components">Custom components</h3>
<p>When you cannot find a component that does what you want, sometimes the only option is to create it yourself. Since we are talking about HTML components, we can expect the average component to have three main parts:</p>
<ul>
<li>Layout</li>
<li>Style</li>
<li>Behavior</li>
</ul>
<p>We have already covered modules, but how do we deal with styling and behavior? Let’s take our navigation as an example. What we are looking for behaves like a tab system, but with the navigation split from the content, so we need two different <code>ui</code> functions:</p>
<pre><code>tabs_navigation <- function(id = "main-tabs-navigation", options) {
  tagList(
    tags$head(
      tags$script(src = "scripts/tab-navigation.js")
    ),
    tags$div(
      id = id,
      class = "tabs-navigation",
      `data-tab-type` = "navigation",
      lapply(options, tabs_single_navigation)
    )
  )
}

tabs_panel <- function(id = "main-tabs-container",
                       class = "tabs-container",
                       tabs_content) {
  div(
    id = id,
    class = class,
    `data-tab-type` = "tabs-container",
    lapply(tabs_content, single_tab),
    tags$script("init_tab_navigation()")
  )
}
</code></pre>
<p>By giving the different elements IDs and classes, we can use Sass to style these components easily. And by including a <code>JavaScript</code> file with the element, we can load and initialize browser behavior. In this case, our <code>tab-navigation.js</code> just activates the first tab and binds a click event that cycles through the tabs when they are clicked.</p>
<pre><code>// tab-navigation.js
function init_tab_navigation() {
  $(document).ready(function() {
    // Mark the first controller and the first tab panel as active on load.
    $("[data-tab-type='navigation']")
      .find("[data-tab-type='controller']")
      .first()
      .addClass("active");
    $("[data-tab-type='tabs-container']")
      .find("[data-tab-type='tab']")
      .first()
      .addClass("active");

    // Clicking a controller activates it (and its target tab)
    // while deactivating their siblings.
    $("[data-tab-type='controller']").on("click", function() {
      $(this)
        .addClass("active")
        .siblings("[data-tab-type='controller']")
        .removeClass("active");
      const target = $(this).data("target");
      $(`[data-tab-id="${target}"]`)
        .addClass("active")
        .siblings("[data-tab-type='tab']")
        .removeClass("active");
    });
  });
}
</code></pre>
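<p>The styling side can be just as small. Here is a minimal Sass sketch that hooks into the same data attributes and classes used above; the visual details are placeholder choices, not the app’s real styles:</p>

```scss
// Hypothetical styles/tabs.scss
.tabs-navigation {
  [data-tab-type="controller"] {
    cursor: pointer;
    opacity: 0.6;

    &.active {
      opacity: 1;
      border-bottom: 2px solid #337ab7;
    }
  }
}

.tabs-container [data-tab-type="tab"] {
  display: none; // panels stay hidden until the JS marks one active

  &.active {
    display: block;
  }
}
```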
<p>It takes a bit of effort, but the result is something truly custom.
<img src="tabs.png" alt="Custom tab navigation" /></p>
<p>We have barely scratched the surface of what can be done when it comes to custom solutions, but I hope this already gives you an idea of how to start or improve your next project!</p>
<p>Craving more, or have any questions? Feel free to reach out and ask!</p>
<h3 id="references">References</h3>
<ul>
<li>Modules (<a href="https://cran.r-project.org/web/packages/modules/vignettes/modulesInR.html">https://cran.r-project.org/web/packages/modules/vignettes/modulesInR.html</a>)</li>
<li>Sass (<a href="https://github.com/rstudio/sass">https://github.com/rstudio/sass</a>)</li>
<li>R6 classes (<a href="https://adv-r.hadley.nz/r6.html">https://adv-r.hadley.nz/r6.html</a>)</li>
</ul>