Education on R Views

Interview with Oscar Baruffa, Creator of the Big Book of R

Tue, 04 Jan 2022 00:00:00 +0000

Welcome to the new year! If you’re itching to improve your R skills in 2022, we have the resource for you.

We’re excited to share the Big Book of R. “Your last-ever bookmark”, the Big Book of R is an impressive collection of R-related books from a variety of subjects. Creator Oscar Baruffa first published the book in August 2020. Since then, it has grown from a list of 80 books to over 200, garnering 73,000 unique visitors and 195,000 pageviews from readers around the globe.

The Book is organized by different subjects that range from introduction to R programming to big data to archeology. Its organization and search functionality make it easy for newcomers to find books related to their topic of interest.

The Big Book of R is a wonderful example of collaboration in the R Community. Oscar wrote the book using the bookdown package. Contributors can file an issue or create a pull request on GitHub. And of course, the book would not be possible without the authors who have written books to guide others on their R journey.

While you are preparing for 2022, we encourage you to cozy up to one of the great books in the Big Book of R.

Check out the Big Book of R
Contribute to the GitHub Repository
Follow Oscar on Twitter
Subscribe to the newsletter

Interview with Oscar

Hello! Could you tell us a bit about yourself, please?

I’m a South African now living in the Netherlands, working as a Data Specialist in an international development non-profit focused on sustainable trade systems. My role is basically that of a senior analytics manager and I’m the first one, so I get to have all the fun directing the development of our data pipeline! No really, it’s a lot of fun :).

I studied Mechanical Engineering in my undergraduate and master’s degrees and have been dabbling in tech-related side projects and hobbies for many years.

How did you get started with the R Community?

I think it was sometime in late 2018 when I was busy learning a bit of Python and Jesse Mostipak started popping up on my Twitter feed. She made R sound quite fun so I thought I’d give it a try. After I got to the exercise in R for Data Science by Hadley Wickham and Garrett Grolemund where you’re introduced to faceting a plot, my mind was blown and I was hooked.

I then started participating in the #TidyTuesday challenge and recorded some screencasts on my “Other People’s RStats” YouTube channel. I took a bit of a leap of faith submitting a lightning talk for satRday Johannesburg in April 2019, which was being organised by Andrew Collier and Megan Beckett. I was selected, which was a welcome surprise for my little topic. They were so helpful when I was trying to figure out the pull request flow for submitting my presentation. I also attended my first-ever R workshop the day before (which Andrew presented), which was the first time I’d ever sat in a room with other people who love R. It was awesome!

While doing all of this, I was also following more and more people on Twitter who were tweeting about R, collecting bookmarks of packages, tutorials, and, of course, books. I was having a lot of fun. By early 2020, I wrote my first book with Veerle van Son called Twitter for R Programmers as a way to introduce others to the R community on the social platform. So by that point, you could say I’d become heavily invested in the community :).

What inspired you to start the Big Book of R?

I had been diligently collecting bookmarks of books as I was finding them. After about 2 years of doing so, I had an inkling that I must have a large collection. One day, I counted about 80. I compared it to other lists of books that I’d seen published and I had way more, so I figured this might be quite unique to have so many.

Having written my own short book, I also appreciated how much effort it took and I felt there must be better usability for readers and discoverability for authors if books were all listed in one place and grouped by topic (I invented a library - haha!). I also hoped (still do) that it might encourage more writing too. I put all the books together, spent some time categorising them, and in August 2020 published it. It made quite a splash!

Tell us about the design choices you made to make the book inclusive and open to the community.

I opted to use bookdown and git as I already had a bit of experience using them for Twitter for R Programmers. It felt very “meta” and fitting to use bookdown, which had in turn been used for almost all of the books in the collection. I was hoping this format would allow others to submit books as well — and they’ve generously done so. I kept the collection as a plain text format in the hope that it would slightly lower the barrier to people submitting books, but in the end, many people just tag me in a tweet which is welcome :).

What has surprised you the most about this project?

I knew people would like this but I didn’t expect how much it would be appreciated. Every now and then, I get a message of thanks for creating and maintaining it that really makes me feel warm and fuzzy inside. I hope people also reach out to the authors and do the same. Their effort in writing these books is immense and a little bit of appreciation will make their day — guaranteed!

What also surprises me is how there are very consistent spikes in views and a steady growth in daily visitors whenever someone else shares the Big Book of R. If you’re interested in seeing the analytics, I’ve made them publicly accessible.

What’s in store for the Big Book of R in 2022?

I’m sure the collection will keep growing :).

I’ve just remodeled how the content is generated. Instead of capturing the book entries directly into markdown, it’s now generated by reading the data from a Google sheet. This new setup gives me the flexibility to do things like alphabetize the books more easily, add additional fields and tags more easily, set up Twitter/LinkedIn bots to automatically post about books in the collection, set up scripts to detect book updates, etc. Basically, I want to open up more possibilities for further automation and discoverability. If anyone is building anything using this data, I’d love to hear about it.

If anyone has ideas of how to improve, please get in touch with me or submit an issue in the repo.

Do you have any ongoing or upcoming projects you’d like the R Community to know about?

If things work out, there’s a chance I’ll be a technical reviewer on two R books being worked on in 2022, so that’ll be a new experience for me that I’m looking forward to. I’m going to keep writing useful articles about R, data and data careers over on my blog (and releasing fun R-related products here and there). The best way to be notified of those is to sign up to my newsletter. I’m also looking to write some more on the topic of Project Management to build upon my other bit of work that I’m really proud of, Project Management Fundamentals for Data Analysts.

Closing question: what is your favorite R package right now?

I think I’d give that accolade to {dplyr}! I’m pretty sure that filter() and group_by() are my most-used functions. Nothing gives me greater pleasure than a good anti_join().
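For readers who have not met these functions yet, here is a minimal sketch of what they do. The two data frames are made up for illustration and are not the Big Book of R's actual data.

```r
library(dplyr)

# Hypothetical data: books already listed, and newly submitted books
listed <- data.frame(title = c("R for Data Science", "Advanced R"))
submitted <- data.frame(
  title = c("Advanced R", "Text Mining with R", "R Packages"),
  topic = c("programming", "text", "programming")
)

# anti_join() keeps the rows of `submitted` that have no match in `listed`
new_books <- anti_join(submitted, listed, by = "title")

# filter() keeps only the rows that satisfy a condition
new_programming <- filter(new_books, topic == "programming")

# group_by() plus summarise() counts the new books per topic
new_counts <- new_books %>%
  group_by(topic) %>%
  summarise(n = n(), .groups = "drop")
```

Here `new_books` holds the two submissions that are not yet listed, which is the kind of "what is missing?" question anti_join() answers in one line.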

We at RStudio would like to thank Oscar for his contribution to RViews and the creation of a great resource for the R Community. Happy reading in 2022!

Cheat Sheets

Wed, 10 Mar 2021 00:00:00 +0000

In a previous post, I described how I was captivated by the virtual landscape imagined by the RStudio education team while looking for resources on the RStudio website. In this post, I’ll take a look at cheat sheets, another amazing resource hiding in plain sight.

Apparently, some time ago when I wasn’t paying much attention, cheat sheets evolved from the homemade study notes of students with highly refined visual cognitive skills but a relatively poor grasp of algebra or history or whatever into an essential software learning tool. I don’t know how this happened in general, but master cheat sheet artist Garrett Grolemund has passed along some of the lore of the cheat sheet at RStudio. Garrett writes:

One day I put two and two together and realized that our Winston Chang, who I had known for a couple of years, was the same “W Chang” that made the LaTeX cheatsheet that I’d used throughout grad school. It inspired me to do something similarly useful, so I tried my hand at making a cheatsheet for Winston and Joe’s Shiny package. The Shiny cheatsheet ended up being the first of many. A funny thing about the first cheatsheet is that I was working next to Hadley at a co-working space when I made it. In the time it took me to put together the cheatsheet, he wrote the entire first version of the tidyr package from scratch.

It is now hard to imagine getting by without cheat sheets. It seems as if they are becoming an expected adjunct to the documentation. But, as Garrett explains in the README for the cheat sheets GitHub repository, they are not documentation!

RStudio cheat sheets are not meant to be text or documentation! They are scannable visual aids that use layout and visual mnemonics to help people zoom to the functions they need. … Cheat sheets fall squarely on the human-facing side of software design.

Cheat sheets live in the space where human factors engineering gets a boost from artistic design. If R packages were airplanes then pilots would want cheat sheets to help them master the controls.

The RStudio site contains sixteen RStudio-produced cheat sheets and nearly forty contributed efforts, some of which are displayed in the graphic above. The Data Transformation cheat sheet is a classic example of a straightforward mnemonic tool. It is likely that even someone who is just beginning to work with dplyr will immediately grok that it organizes functions that manipulate tidy data. The cognitive load then is to remember how functions are grouped by task. The cheat sheet offers a canonical set of classes: “manipulate cases”, “manipulate variables” etc. to facilitate the process. Users who work with dplyr on a regular basis will probably just need to glance at the cheat sheet after a relatively short time.

The Shiny cheat sheet is a little more ambitious. It works on multiple levels and goes beyond categories to also suggest process and workflow.

The Apply functions cheat sheet takes on an even more difficult task. For most of us, internally visualizing multi-level data structures is difficult enough; imagining how data elements flow under transformations is a serious cognitive load. I, for one, really appreciate the help.
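To make that flow of data elements a little more concrete, here is a small sketch using a few members of base R's apply family. The data are made up, and the selection of functions is mine rather than a summary of what the cheat sheet covers.

```r
# A small nested structure: a named list of numeric vectors (made-up scores)
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# lapply() applies a function to each element and always returns a list
means_list <- lapply(scores, mean)

# sapply() does the same but simplifies the result to a named vector
means_vec <- sapply(scores, mean)

# vapply() is like sapply() but checks each result against a template
lengths_vec <- vapply(scores, length, integer(1))

# apply() works over the rows (1) or columns (2) of a matrix
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)
```

The same input list goes in each time; what changes is the shape of the result (a list, a named vector, a type-checked vector), which is the part that is hard to picture without a diagram.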

Cheat sheets are immensely popular. And even in this ebook age where nearly everything you can look at is online, and conference-attending digital natives travel light, the cheat sheets as artifacts retain considerable appeal. Not only are they useful tools and geek art (take a look at cartography) for decorating a workplace; my guess is that they are also perceived as runes of power enabling the cognoscenti to grasp essential knowledge and project it in the world.

When in-person conferences resume, I fully expect the heavy paper copies to disappear soon after we put them out at the RStudio booth.

Learning R With Education Datasets

Thu, 11 Jun 2020 00:00:00 +0000

Ryan A. Estrellado is a public education leader and data scientist helping administrators use practical data analysis to improve the student experience.

Timothy Gallwey wrote in The Inner Game of Tennis:

…There is a natural learning process which operates within everyone, if it is allowed to. This process is waiting to be discovered by all those who do not know of its existence … It can be discovered for yourself, if it hasn’t been already. If it has been experienced, trust it.

Discovering a new R concept like a function or package is exciting. You never know if you’re about to learn something that fundamentally changes the way you code or solve data science problems. But I get even more excited when I see somebody use new R concepts. For example, I learned about random forest models when I read about them in An Introduction to Statistical Learning (ISL). Then I imagined myself using them when I watched Julia Silge fit a random forest model to predict attendance at NFL games. I need the reading to give me language for what I see data scientists do. Then I need to see what data scientists do for me to imagine myself doing what I’ve read.

Still, for most people using R in their jobs, there’s another step. They have to imagine how to apply what they’ve read and seen to the problems they’re solving at work. But what if we used education datasets to help them imagine using R on the job, just as the authors of ISL use words and code to teach about models and Julia Silge uses video to inspire coding?

We learned from writing Data Science in Education Using R (DSIEUR) that we can combine words, code, and professional context. Professional context includes scenarios, language, and data that readers will recognize in their education jobs. We wanted readers to feel motivated and engaged by seeing words and data that remind them of their everyday work tasks. This connection to their professional lives is a hook for readers as they engage R syntax which is, if you’ve never used it, literally a foreign language.

Let’s use pivot_longer() as an example. We’ll describe this process in three steps: discovering the concept, seeing how the concept is used, and seeing how the concept is used in education.

Step 1: See the concept

When I read something like “Use pivot_longer() to transform a dataset from wide to long”, I can imagine the shape of a dataset changing. But it’s harder to imagine what happens with the variables and their contents as the dataset’s shape changes. I’ve been using R for over five years and I still struggle to visualize the contents of many columns rearranging themselves into one.

Step 2: See how the concept is used

The concept gets much clearer when you add an example—even one with little context—to the explanation. Here’s one from the pivot_longer() vignette, which you can view with vignette("pivot"):

library(tidyverse)

# Simplest case where column names are character data
relig_income

#> # A tibble: 18 x 11
#>    religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
#>    <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
#>  1 Agnostic      27        34        60        81        76       137        122
#>  2 Atheist       12        27        37        52        35        70         73
#>  3 Buddhist      27        21        30        34        33        58         62
#>  4 Catholic     418       617       732       670       638      1116        949
#>  5 Don’t k…      15        14        15        11        10        35         21
#>  6 Evangel…     575       869      1064       982       881      1486        949
#>  7 Hindu          1         9         7         9        11        34         47
#>  8 Histori…     228       244       236       238       197       223        131
#>  9 Jehovah…      20        27        24        24        21        30         15
#> 10 Jewish        19        19        25        25        30        95         69
#> 11 Mainlin…     289       495       619       655       651      1107        939
#> 12 Mormon        29        40        48        51        56       112         85
#> 13 Muslim         6         7         9        10         9        23         16
#> 14 Orthodox      13        17        23        32        32        47         38
#> 15 Other C…       9         7        11        13        13        14         18
#> 16 Other F…      20        33        40        46        49        63         46
#> 17 Other W…       5         2         3         4         2         7          3
#> 18 Unaffil…     217       299       374       365       341       528        407
#> # … with 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>, `Don't
#> #   know/refused` <dbl>

relig_income %>%
 pivot_longer(-religion, names_to = "income", values_to = "count")

#> # A tibble: 180 x 3
#>    religion income             count
#>    <chr>    <chr>              <dbl>
#>  1 Agnostic <$10k                 27
#>  2 Agnostic $10-20k               34
#>  3 Agnostic $20-30k               60
#>  4 Agnostic $30-40k               81
#>  5 Agnostic $40-50k               76
#>  6 Agnostic $50-75k              137
#>  7 Agnostic $75-100k             122
#>  8 Agnostic $100-150k            109
#>  9 Agnostic >150k                 84
#> 10 Agnostic Don't know/refused    96
#> # … with 170 more rows

Sharing an idea by pairing an abstract programming concept with a reproducible example is a common practice for experienced R programmers. Community guidelines for Stack Overflow posts and the {reprex} package are two artifacts of a popular R community norm: help folks understand an idea by using words and code.
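In that spirit, here is an even smaller sketch that can be run without any packaged dataset. The data frame `wide` and its column names are made up for illustration; only pivot_longer() and pivot_wider() come from {tidyr}.

```r
library(tidyr)

# Made-up wide data: one row per group, one column per category
wide <- data.frame(
  group = c("a", "b", "c"),
  low   = c(1, 4, 7),
  mid   = c(2, 5, 8),
  high  = c(3, 6, 9)
)

# Move the three category columns into a name column and a value column
long <- pivot_longer(wide, -group, names_to = "level", values_to = "count")

# Each original row contributes one long row per pivoted column:
# 3 rows x 3 columns gives 9 rows, just as 18 x 10 gives 180 above
rows_match <- nrow(long) == nrow(wide) * (ncol(wide) - 1)

# pivot_wider() reverses the operation
back <- pivot_wider(long, names_from = level, values_from = count)
```

Checking the row count this way is a quick sanity test that the pivot selected the columns you meant it to.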

Step 3: See how the concept is used in education

Combining the explanation with a reproducible example makes pivot_longer() more concrete by showing how it works. What happens when we connect the explanation and reproducible example to the everyday work of a data scientist in education?

In chapter seven of DSIEUR, we use pivot_longer() to transform a dataset of coursework survey responses from wide to long. Before using pivot_longer(), the dataset had a column for each survey question. When we use pivot_longer(), the name of each survey question moves to a new column called “question”. Another new column is added, “response”, which contains the corresponding response to each survey question.

To run this code, you’ll need the DSIEUR companion R package, {dataedu}:

# Install the {dataedu} package if you don't have it
# devtools::install_github("data-edu/dataedu")
library(dataedu)

Here’s the survey data in its original, wide format:

# Wide format
pre_survey

#> # A tibble: 1,102 x 12
#>    opdata_username opdata_CourseID Q1Maincellgroup… Q1Maincellgroup…
#>    <chr>           <chr>                      <dbl>            <dbl>
#>  1 _80624_1        FrScA-S116-01                  4                4
#>  2 _80623_1        BioA-S116-01                   4                4
#>  3 _82588_1        OcnA-S116-03                  NA               NA
#>  4 _80623_1        AnPhA-S116-01                  4                3
#>  5 _80624_1        AnPhA-S116-01                 NA               NA
#>  6 _80624_1        AnPhA-S116-02                  4                2
#>  7 _80624_1        AnPhA-T116-01                 NA               NA
#>  8 _80624_1        BioA-S116-01                   5                3
#>  9 _80624_1        BioA-T116-01                  NA               NA
#> 10 _80624_1        PhysA-S116-01                  4                4
#> # … with 1,092 more rows, and 8 more variables: Q1MaincellgroupRow3 <dbl>,
#> #   Q1MaincellgroupRow4 <dbl>, Q1MaincellgroupRow5 <dbl>,
#> #   Q1MaincellgroupRow6 <dbl>, Q1MaincellgroupRow7 <dbl>,
#> #   Q1MaincellgroupRow8 <dbl>, Q1MaincellgroupRow9 <dbl>,
#> #   Q1MaincellgroupRow10 <dbl>

The third through twelfth columns are named after the survey questions: “Q1MaincellgroupRow1”, “Q1MaincellgroupRow2”, “Q1MaincellgroupRow3”, and so on through “Q1MaincellgroupRow10”. These are the column names we’ll be moving to a single column called “question” when the dataset transforms from wide to long.

Here’s the new dataset, where a column called “question” contains the question names and a column called “response” contains the corresponding responses:

# Pivot the dataset from wide to long format
pre_survey %>%
  pivot_longer(cols = Q1MaincellgroupRow1:Q1MaincellgroupRow10,
               names_to = "question",
               values_to = "response")

#> # A tibble: 11,020 x 4
#>    opdata_username opdata_CourseID question             response
#>    <chr>           <chr>           <chr>                   <dbl>
#>  1 _80624_1        FrScA-S116-01   Q1MaincellgroupRow1         4
#>  2 _80624_1        FrScA-S116-01   Q1MaincellgroupRow2         4
#>  3 _80624_1        FrScA-S116-01   Q1MaincellgroupRow3         4
#>  4 _80624_1        FrScA-S116-01   Q1MaincellgroupRow4         1
#>  5 _80624_1        FrScA-S116-01   Q1MaincellgroupRow5         5
#>  6 _80624_1        FrScA-S116-01   Q1MaincellgroupRow6         4
#>  7 _80624_1        FrScA-S116-01   Q1MaincellgroupRow7         1
#>  8 _80624_1        FrScA-S116-01   Q1MaincellgroupRow8         5
#>  9 _80624_1        FrScA-S116-01   Q1MaincellgroupRow9         5
#> 10 _80624_1        FrScA-S116-01   Q1MaincellgroupRow10        5
#> # … with 11,010 more rows

When you put it all together, the learning thought process is something like this:

There’s a function called pivot_longer(), which turns a wide dataset into a long dataset
pivot_longer() does this by putting multiple column names into their own column, then creating a new column that pairs each column name with a value
I can use pivot_longer() to change an education survey dataset that has question names for columns into one that has a “question” column and a “response” column
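The same thought process can be rehearsed on a toy dataset before trying it on real survey data. This is a minimal sketch: the `survey` data frame, its column names, and the per-question summary at the end are hypothetical illustrations, not code from DSIEUR.

```r
library(tidyr)
library(dplyr)

# Hypothetical survey: one row per student, one column per question
survey <- data.frame(
  student = c("s1", "s2", "s3"),
  q1 = c(4, NA, 5),
  q2 = c(3, 2, NA),
  q3 = c(5, 4, 4)
)

# Wide to long: question names go to "question", answers to "response"
survey_long <- survey %>%
  pivot_longer(cols = q1:q3, names_to = "question", values_to = "response")

# Long format makes a per-question summary a single grouped operation
question_means <- survey_long %>%
  group_by(question) %>%
  summarise(mean_response = mean(response, na.rm = TRUE), .groups = "drop")
```

Three students and three questions give nine long rows, in the same way that 1,102 rows and ten question columns give the 11,020 rows shown above.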

We’ll be back with the next post in about two weeks. Until then, do tell us about the people and tools that inspire you to work on collaborative projects. You can reach us on Twitter: Emily @ebovee09, Jesse @kierisi, Joshua @jrosenberg6432, Isabella @ivelasq3 and me @RyanEs.

A Few Old Books

Thu, 25 Apr 2019 00:00:00 +0000

Greg Wilson is a data scientist and professional educator at RStudio.

My previous column looked at a few new books about R. In this one, I’d like to explore a few books about programming that people coming from data science backgrounds may not have stumbled upon.

The first is Michael Nygard’s Release It!, which more than lives up to its subtitle, “Design and Deploy Production-Ready Software”. Most of us can write programs that work for us on our machines; this book explores what it takes to create software that will work reliably for other people, on machines you’ve never met, long after you’ve moved on to your next project. It focuses on software that’s deployed for general use rather than installed on individuals’ machines, and covers stability patterns and anti-patterns, designing software to meet production needs, security, and a range of other pragmatic issues. You might not need to take care of these things yourself, but whoever has to get your software running on the departmental cluster will be grateful that you thought about it, and can have a sensible conversation about trade-offs.

The second book is Andreas Zeller’s Why Programs Fail, which bills itself as “a guide to pragmatic debugging”, and has been turned into a Udacity course. Programmers spend anywhere from a quarter to three quarters of their time debugging, but most only get an in-passing overview of how to do this well, and are never shown tools more advanced than print statements and break-point debuggers. Zeller starts with that, but goes much further to look at automatic and semi-automatic ways of simplifying programs to localize problems, isolating values’ origins, program slicing, anomaly detection, and much more. Some of the methods he describes will seem very familiar to data scientists, though the domain is new; others will take readers without a computer-science background into new territory in the same way that Advanced R does.

Our third entry is Michael Feathers’ Working Effectively with Legacy Code. Feathers defines legacy code as software that we’re reluctant to modify because we don’t understand how it works and are afraid of breaking it. Having a comprehensive test suite allays this fear, but how can we construct one after the fact for a tangled mess of code? The bulk of the book explores answers to this question, including how to identify seams where code can be split, how to break dependencies so that parts can be improved incrementally, and so on. Some of the examples may seem a little out of date (the book is almost 15 years old), but they all apply directly to the unholy mixture of Perl, shell scripts, hundred-line SQL statements, and ten-page R scripts that you were just handed.

Number four is Jeff Johnson’s GUI Bloopers. I was in two startups in the 1990s, and in both of them, I was told after a few weeks that I was never allowed to work on the user interface again. It was the right decision, but this book might have made it unnecessary. Rather than trying to explain the rules for designing a good user interface, Johnson gives example after example of how to fix bad ones. The companion book, Web Bloopers, is less useful today because web interfaces have evolved so rapidly, but either will help you make an interface that is at least not bad.

The last entry for this post is Ashley Davis’s Data Wrangling with JavaScript. As its title suggests, it doesn’t spend very much time on statistical theory; instead, it covers the “other 90%” of squeezing answers out of data, from establishing your data pipeline and getting started with Node (a widely-used command-line version of JavaScript) to cleaning, analyzing, and visualizing data. There are lots of code samples and plenty of diagrams, and you can download both the data sets the author uses in examples and his Data-Forge library. I suspect readers will need some prior familiarity with JavaScript to dive into this, but Davis shows just how far you can go with what’s available today, and that the journey is a lot smoother than people might think.

A Few New R Books

Wed, 20 Feb 2019 00:00:00 +0000

Greg Wilson is a data scientist and professional educator at RStudio.

As a newcomer to R who prefers to read paper rather than pixels, I’ve been working my way through a more-or-less random selection of relevant books over the past few months. Some have discussed topics that I’m already familiar with in the context of R, while others have introduced me to entirely new subjects. This post describes four of them in brief; I hope to follow up with a second post in a few months as I work through the backlog on my desk.

First up is Sharon Machlis’ Practical R for Mass Communication and Journalism, which is based on the author’s workshops for journalists. This book dives straight into doing the kinds of things a busy reporter or news analyst needs to do to meet a 5:00 pm deadline: data cleaning, presentation-quality graphics, and maps take precedence over control flow or the niceties of variable scope. I particularly enjoyed the way each chapter starts with a realistic project and works through what’s needed to build it. People who’ve never programmed before will be a little intimidated by how many packages they need to download if they try to work through the material on their own, but the instructions are clear, and the author’s enthusiasm for her material shines through in every example. (If anyone is working on a similar tutorial for sports data, please let me know - I have more than a few friends it would make very happy.)

In contrast, Chris Beeley and Shitalkumar Sukhdeve’s Web Application Development with R Using Shiny focuses on a particular tool rather than an industry vertical. It covers exactly what its title promises, step by step from the basics through custom JavaScript functions and animations through persistent storage. Every example I ran was cleanly written and clearly explained, and it’s clear that the authors have tested their material with real audiences. I particularly appreciated the chapter on code patterns - while I’m still not sure I fully understand when and how to use isolate() and req(), I’m much less confused than I was.
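For readers in the same position, here is a minimal sketch of one common way the two functions are used together. It is not taken from the book; the input and output IDs (`name`, `go`, `greeting`) are made up for illustration.

```r
library(shiny)

ui <- fluidPage(
  textInput("name", "Name"),
  actionButton("go", "Greet"),
  textOutput("greeting")
)

server <- function(input, output, session) {
  output$greeting <- renderText({
    # Reading input$go makes the button a reactive dependency,
    # so this block re-runs on every click
    input$go
    # isolate() reads the text box without depending on it,
    # so typing alone does not update the greeting
    name <- isolate(input$name)
    # req() stops quietly here while the name is still empty
    req(name)
    paste("Hello,", name)
  })
}

# shinyApp(ui, server)  # uncomment to run the app interactively
```

In short, isolate() controls what triggers a re-run, while req() controls whether the rest of the block runs at all.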

Functional programming has been the next big thing in computing since I was a graduate student in the 1980s. It does finally seem to be getting some traction outside the craft-beer-and-Emacs community, and Functional Programming in R by Thomas Mailund looks at how these ideas can be used in R. Mailund writes clearly, and readers who don’t have a background in computer science may find this a gentle way into a complex subject. However, despite the subtitle “Advanced Statistical Programming for Data Science, Analysis and Finance”, there’s nothing particularly statistical or financial about the book’s content. Some parts felt rushed, such as the lightning coverage of point-free programming (which should have had either a detailed exposition or no mention at all), but my biggest complaint about the book is its price: I think $34 for 100 pages is more than most people will want to pay.

Finally, we have Stefano Allesina and Madlen Wilmes’ Computing Skills for Biologists. As the subtitle says, this book presents a toolbox that includes Python, Git, LaTeX, and SQL as well as R, and is aimed at graduate students in biology who have just realized that a few hundred megabytes of messy data are standing between them and their thesis. The authors present the basics of each subject clearly and concisely using real-world data analysis examples at every turn. They freely admit in the introduction that coverage will be broad and shallow, but that’s exactly what books like this should aim for, and they hit a bull’s-eye. The book’s only weakness - unfortunately, a significant one - is an almost complete lack of diagrams. There are only six figures in its 400 pages, and none in the material on visualization. I realize that readers who are coding along with the examples will be able to view some plots and charts as they go, but I would urge the authors to include these in a second edition.

R is growing by leaps and bounds, and so is the literature about it. If you have written or read a book on R recently that you think others would be interested in, please let us know - we’d enjoy checking it out.

Stefano Allesina and Madlen Wilmes: Computing Skills for Biologists: A Toolbox. Princeton University Press, 978-0691182759.

Chris Beeley and Shitalkumar Sukhdeve: Web Application Development with R Using Shiny (3rd ed.). Packt, 2018, 978-1788993128.

Sharon Machlis: Practical R for Mass Communication and Journalism. Chapman & Hall/CRC, 2018, 978-1138726918.

Thomas Mailund: Functional Programming in R: Advanced Statistical Programming for Data Science, Analysis and Finance. Apress, 2017, 978-1484227459.

Chapman University DataFest Highlights

Fri, 18 Aug 2017 00:00:00 +0000

Editor’s Note: The 2017 Chapman University DataFest was held during the weekend of April 21-23. The 2018 DataFest will be held during the weekend of April 27-29.

DataFest was founded by Rob Gould in 2011 at UCLA with 40 students. In just seven years, it has grown to 31 sites in three countries. Have a look at Mine Çetinkaya-Rundel’s post Growth of DataFest over the years for the details. In recent years, it has been difficult for UCLA to keep up with the growing interest and demand from southern California universities. This year, the Chapman DataFest became the second DataFest site in southern California, and the largest inaugural DataFest in the history of the event. We had 65 students who stayed the whole weekend from seven universities organized into 15 teams.

The event began on a Friday evening with Professor Rob Gould, the “founder” of DataFest, giving advice on goals for the weekend. He then introduced the Expedia dataset: nearly 11 million records representing users’ online searches for hotels, plus an associated file with detailed information about the hotel destinations.

Throughout the weekend, the organizers kept students motivated with data challenges (with cell phone chargers awarded as prizes), a mini-talk on tools for joining and merging data files, and a tutorial from bitScoop on using their API integration platform.

At noon on Sunday, the students submitted their two-slide presentations via email. At 1 pm, each team had five minutes to show their findings to the six-judge panel: Johnny Lin (UCLA), Joe Kurian (Mitsubishi UFG Union Bank, Irvine), Tao Song (Spectrum Pharmaceuticals), Pamela Hsu (Spectrum Pharmaceuticals), Lynn Langit (AWS, GCP IoT), and Brett Danaher (Chapman University).

The judges announced winners in three official categories:

Best Insight: CSU Northridge team “Mean Squares” (Jamie Decker, Matthew Jones, Collin Miller, Ian Postel, and Seyed Sajjadi). [See Seyed’s description of his team’s experience!]

Best Visualization: Chapman University team “Winners ‘); Drop Table” (Dylan Bowman, William Cortes, Shevis Johnson, and Tristan Tran).

Best Use of External Data: Chapman University team “BEST” (Brandon Makin, Sarah Lasman, and Timothy Kristedja).

Additionally, “Judges’ Choice” awards for “Best Use of Statistical Models” went to the USC “Big Data” team (Hsuanpei Lee, Omar Lopez, Yi Yang Tan, Grace Xu, and Xuejia Xu) and the USC “Quants” team (Cheng (Serena) Cheng, Chelsea Lee, and Hossein Shafii).

All winners were given certificates and medallions designed by Chapman’s Ideation Lab and printed on Chapman’s MLAT Lab 3D printer (see photo).

Winners also received free student memberships in the American Statistical Association.

Many thanks go to the Silver Sponsors: Children’s Hospital Orange County Medical Intelligence and Innovation Institute, Southern California Chapter of the American Statistical Association, and Chapman University MLAT Lab; and Bronze Sponsors: Experian, RStudio, Chapman University Computational and Data Sciences and Schmid College of Science and Technology, Orange County Long Beach ASA Chapter, the Missing Variables, USC Stats Club, Luke Thelen, and Google.

Thanks also to the 45 VIP consultants from BitScoop Labs, Chapman University, Compatiko, CSU Fullerton, CSU Long Beach, CSU San Bernardino, Education Management Services, Freelance Data Analysis, Hiner Partners, Mater Dei High School, Nova Technologies, Otoy, Southern California Edison, Sonikpass, Startup, SurEmail, UC Irvine, UCLA, USC, and Woodbridge High School, many of whom spent most of the weekend working with the students.

Overall, participants were enthusiastic about meeting students from other schools and the opportunity to work with the local professionals. (See the two student perspectives below.) DataFest will continue to grow as these students return to their schools and share their enthusiasm with their classmates!

The Mean Squares Perspective

by Seyed Sajjadi

For most of our team, this DataFest was only the first or second hackathon they had ever attended, but the group gelled instantly.

Culture is important for a hackathon group, but talent and preparation play key roles in the success or failure. Our group spent more than a month in advance preparing for this competition. We practiced, practiced, and practiced some more for this event. We had weekly workshops where we presented the assignments that we had worked on for the past week.

The next essential for the competition may come as a surprise to most: having an artist design and prepare the presentation took an enormous amount of work off our shoulders. During the entire competition, we had a very talented artist design a fabulous slideshow for the presentation. This may sound boastful, but allowing specialized talent to work on the slideshow for the entire competition is a lot better than designing it at the last minute.

The questions posed were not specific at all, and it was up to the participants to form and ask the proper questions. We focused on two questions: customer acquisition and retention/conversion. We showed that online targeting and marketing can be optimized using regional historical data, meaning that residents of a given state tend to have similar preferences when it comes to the same destination. For instance, most Californians go to Las Vegas to gamble, but most people from Texas go to Las Vegas for music events; these analyses can be used to better target potential customers from neighboring regions.

Regarding customer retention and conversion of lookers to bookers, we calculated the optimum point in time where Expedia can offer more special packages; this time frame happened to be around 14 sessions of interaction between the customer and the website. The biggest part of our analysis was achieved via hierarchical clustering.

A big aspect of the event had to do with the atmosphere and the organization. They invited people from industry to come and roam around the halls, which led to a great opportunity to meet professionals in the field of data science. We were situated in a huge room with all of the teams. We ended up crowding around a small table with everyone on their laptops and chairs. The room was big enough to have impromptu meetings, which allowed a lot of room to breathe. This hackathon was a huge growing experience for all of us on “The Mean Squares”.

Team Pineapples’ Perspective

by Annelise Hitchman

On day one, I could tell my enthusiasm to start working on the dataset was matched by the other dozens of students participating. The room was filled with interaction, and not just among the individual teams. I enjoyed talking with all the consultants in the room about the data, our approach, and even just learning about what they did for work. DataFest introduced me to real-world data that I had never seen in my classes. I learned quite a bit about data analysis from both my own team members and nearly everyone else at the event. Watching the final presentations was an inspiring and insightful end to DataFest. I really hope that DataFest is able to continue and be available to universities such as my own, so that all students interested in data analysis can participate.

Michael Fahy is Professor of Mathematics and Computer Science and Associate Dean, Schmid College of Science & Technology at Chapman University

Growth of DataFest over the years

Wed, 24 May 2017 00:00:00 +0000

In a previous post, I introduced DataFest and how one can streamline the organization of this event using Google Forms and tools from the tidyverse. In this post, I’ll walk through building a Shiny app that demonstrates the growth of DataFest over the years, both in terms of host locations and participating institutions, as well as in terms of the number of students who participated in each event.

Here is a list of all packages used in this article:

library(tidyverse)
library(googlesheets)
library(devtools)
library(ggmap)
library(stringr)
library(leaflet)

The data were contributed by the event organizers, and were collected using a Google Form.

To begin, the data are read using the googlesheets package.

datafest_wide <- gs_title("DataFest over the years (Responses)") %>%
  gs_read()

Data prep

Then minimal manipulation is applied to column names, and a new column concatenating city, state, and country is added to be used in geocoding.

# rename columns
yrs <- sort(rep(2011:2017, 3))
cols <- c("df_", "num_part_", "other_inst_")

names(datafest_wide) <- c("timestamp", "host", "city", "state", "country", "url",
                     paste0(cols, yrs))

# geocode host location
datafest_wide <- datafest_wide %>%
  mutate(address = paste(city, state, country)) %>% 
  mutate_geocode(address)

Note that we need to use the development version of the ggmap package for mutate_geocode() to play nicely with a tbl_df. You can install this version with install_github("dkahle/ggmap").

Next, we convert the data from wide to long format using functionality from the tidyr package. First, we gather the columns that contain yearly information (for each year, we have an indicator for whether an event was hosted at the location, the number of students that participated, and other participating institutions, if any). Then, we strip the year information from variable names, and instead save it as a variable in the dataset. Finally, we spread the key-value pair across three columns.

datafest_long <- datafest_wide %>% 
  gather(key, value, df_2011:other_inst_2017) %>%
  mutate(year = as.numeric(str_match(key, "[0-9]+"))) %>%
  mutate(key = str_replace(key, "_[0-9]+", "")) %>%
  spread(key, value) %>%
  mutate(num_part = as.numeric(num_part))
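For readers on a recent version of tidyr, the same wide-to-long reshaping can be sketched with pivot_longer(). This is an alternative to the gather()/spread() code above, not the code used for this post. The toy data frame and its values are made up for illustration, and the sketch assumes tidyr 1.1.0 or later.

```r
library(tidyr)

# Made-up stand-in for datafest_wide: two hosts, two years
toy_wide <- data.frame(
  host            = c("Host A", "Host B"),
  df_2016         = c("Yes", "No"),
  num_part_2016   = c(40, NA),
  other_inst_2016 = c(NA_character_, NA_character_),
  df_2017         = c("Yes", "Yes"),
  num_part_2017   = c(65, 30),
  other_inst_2017 = c("Host C", NA_character_),
  stringsAsFactors = FALSE
)

# Split each column name into a variable part and a year part.
# ".value" tells pivot_longer() to keep df, num_part, and other_inst
# as separate columns, so no spread() step is needed afterwards.
toy_long <- pivot_longer(
  toy_wide,
  cols = -host,
  names_to = c(".value", "year"),
  names_pattern = "(.*)_([0-9]+)",
  names_transform = list(year = as.integer)
)

toy_long
```

One difference from the gather()/spread() route is that each variable keeps its own type, so num_part does not pass through a character column on the way.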

Map of 2017 ASA DataFests

The eventual goal of this post is to make a Shiny app that maps DataFest spread and growth over the years; however, I’ll start by making a map for just one year, 2017, to develop the code for the map, and then use this code within a Shiny app.

Going forward, I’ll refer to the long dataset as datafest.

datafest <- datafest_long

First, I take a subset of the data for hosts that held an event in 2017:

datafest_2017 <- filter(datafest, year == 2017 & df == "Yes")

Then, I set a few colors to be used in the plot,

href_color <- "#A7C6C6"
marker_color <- "black"
part_color <- "#89548A"

as well as the bounds of the plot based on the min/max longitude/latitude.

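# na.rm = TRUE guards against hosts whose address could not be geocoded
left <- floor(min(datafest$lon, na.rm = TRUE))
right <- ceiling(max(datafest$lon, na.rm = TRUE))
bottom <- floor(min(datafest$lat, na.rm = TRUE))
top <- ceiling(max(datafest$lat, na.rm = TRUE))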

I will be making the map using the leaflet package, as this package allows for easily overlaying markers and popups to maps. The popups are text bubbles that appear when a point is clicked, and that contain additional information about that data point. This is a good place to add some event-specific information, such as name of host, and link to their event homepage, other participating institutions (if any), and number of participants.

host_text <- paste0(
  "<b><a href='", datafest_2017$url, "' style='color:", 
  href_color, "'>", datafest_2017$host, "</a></b>"
)

other_inst_text <- paste0(
  ifelse(is.na(datafest_2017$other_inst), 
         "", 
         paste0("<br>", "with participation from ", datafest_2017$other_inst))
)

part_text <- paste0(
  "<font color=", part_color,">", datafest_2017$num_part, 
  " participants</font>"
)

popups <- paste0(
  host_text, other_inst_text, "<br>", part_text
)

We’re finally ready to make our map! Note that the radii of the points are proportional to the log of the number of participants (times an arbitrary factor for visual appeal).

leaflet() %>%
  addTiles() %>%
  fitBounds(lng1 = left, lat1 = bottom, lng2 = right, lat2 = top) %>%
  addCircleMarkers(lng = datafest_2017$lon, lat = datafest_2017$lat,
                   radius = log(datafest_2017$num_part) * 1.2, 
                   fillColor = marker_color,
                   color = marker_color,
                   weight = 1,
                   fillOpacity = 0.5,
                   popup = popups)
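To get a feel for what the log scaling does to marker sizes, here is a small sketch that applies the same radius formula to three made-up participant counts.

```r
# Made-up participant counts for three events
num_part <- c(20, 65, 360)

# Same formula as in the addCircleMarkers() call above
radius <- log(num_part) * 1.2

# The largest event has 18 times the participants of the smallest,
# but its marker radius is only about twice as large
round(radius, 1)
radius[3] / radius[1]
```

The log keeps large events from swamping the map while still ordering markers by size.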

Shiny app

Next, we build upon our earlier plot to create a Shiny app that has the following three components:

A slider input with animation for values between 2011 and 2017 (DataFest years, so far)
A line plot that shows the increase in the number of participants over the years
A map that shows the spread of DataFest geographically over the years

You can find and interact with the app at https://gallery.shinyapps.io/datafest-map-all-years/, and the code for the app, as well as all steps up to this point, can be found at this GitHub repo.
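For orientation, here is a minimal sketch of how the three components could be wired together. It is not the code of the published app (that lives in the GitHub repo); the toy data frame, the input and output ids, and the use of a base R plot are all assumptions made for illustration.

```r
library(shiny)
library(leaflet)

# Made-up stand-in for the long `datafest` data frame
datafest_toy <- data.frame(
  host     = c("UCLA", "UCLA", "Duke", "Duke"),
  year     = c(2011, 2012, 2011, 2012),
  df       = c("Yes", "Yes", "No", "Yes"),
  num_part = c(40, 60, NA, 23),
  lon      = c(-118.44, -118.44, -78.94, -78.94),
  lat      = c(34.07, 34.07, 36.00, 36.00),
  stringsAsFactors = FALSE
)

# Total participants per year, used by the line plot
yearly_totals <- aggregate(
  num_part ~ year,
  data = subset(datafest_toy, df == "Yes"),
  FUN = sum
)

ui <- fluidPage(
  # Component 1: slider with a play button that steps through the years
  sliderInput("year", "Year", min = 2011, max = 2012, value = 2011,
              step = 1, sep = "", animate = TRUE),
  # Component 2: line plot of participants up to the selected year
  plotOutput("growth", height = "200px"),
  # Component 3: map of hosts in the selected year
  leafletOutput("map")
)

server <- function(input, output, session) {
  # Hosts that held an event in the selected year
  hosts <- reactive({
    subset(datafest_toy, year == input$year & df == "Yes")
  })

  output$growth <- renderPlot({
    shown <- subset(yearly_totals, year <= input$year)
    plot(shown$year, shown$num_part, type = "b",
         xlim = range(yearly_totals$year),
         ylim = c(0, max(yearly_totals$num_part)),
         xlab = "Year", ylab = "Participants")
  })

  output$map <- renderLeaflet({
    leaflet(hosts()) %>%
      addTiles() %>%
      addCircleMarkers(lng = ~lon, lat = ~lat,
                       radius = ~log(num_part) * 1.2,
                       popup = ~host)
  })
}

# Build the app object; print it or call runApp(app) to launch it
app <- shinyApp(ui, server)
```

Keeping the year filter in a single reactive() means further outputs could reuse the same subset without repeating the filtering logic.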

Organizing DataFest the tidy way

Wed, 05 Apr 2017 00:00:00 +0000

Organizing an event can be a full-time task in and of itself. I have been organizing ASA DataFest for six years at Duke, and over this time, the number of participants has grown from 23 students from Duke only, to 360 students from seven area schools this year!

First, a bit about ASA DataFest: ASA DataFest is a data “hackathon” for students around the U.S., Canada, and Germany (for now; this list has been growing each year). Students spend a weekend working in small teams, around the clock, to find insight and meaning in a large, messy, and rich data set. For almost all students, it is the most complex data they have encountered, and they push themselves to master new skills, resurrect forgotten knowledge, and bring everything they’ve got to compete for the honor of being declared the best by a panel of expert judges.

As an educator, statistician, and data scientist, growth in student interest in this event sounds fantastic to me. However, as the person responsible for running the event at Duke, it has also meant that for the couple of months leading up to DataFest, I have almost an additional full-time job dealing with everything from student registrations to promoting the event to putting in food orders. While I have not found an R-based solution for ordering food (yet!), this year I incorporated R and R Markdown in my organization workflow for grabbing, processing, and reporting registration information.

This post highlights using Google Forms for data collection (e.g., registration), the googlesheets package to pull that data into R, and packages from the tidyverse to manipulate, summarise, and visualize that data. Then, we use RPubs for publishing documents to be shared with participants and other constituents.

Here is a list of all packages used in this article:

library(googlesheets)
library(tidyverse)
library(stringr)
library(DT)
library(knitr)

In an effort to make it easier for others organizing DataFests to replicate my workflow, I have created a Google Drive containing all forms needed for registering participants and collecting information from consultants (mentors), and judges. I have also populated these forms with randomly generated names to showcase how these data are processed to yield the rosters and reports that are useful for organizing the event and disseminating registration information. All Google Forms mentioned can be found in the DataFest Organization Google Drive, which is available for public viewing. You can make a copy for your own use.

Additionally, all R scripts and R Markdown documents used to process these data are available on the datafest GitHub repo.

Team sign ups

If a group of students has already formed a team, it makes sense for them to sign up as a team to ensure that they use the same team name and that everyone registers at once. This Google Form is used to sign such students up.

One issue with registering each team as a single entry is that we end up with what we call “wide” data: each row represents a team, and within that row we have information on all students in that team. However for most practical purposes (counting participants, plotting distributions of years and majors, figuring out how many of each size t-shirt to order, etc.) we need the data to be in “long” format, where each row represents a student.

To accomplish this transformation, we first read the data in using the googlesheets package:

part_wide <- gs_title("DataFest [YEAR] @ [HOST] - Team Sign up (Responses)") %>%
  gs_read()

Then, we realize that the variable names are a mess since they come directly from questions on the Google form!

names(part_wide)

##  [1] "timestamp"       "team_name"       "last_name_1"    
##  [4] "first_name_1"    "school_1"        "tshirt_size_1"  
##  [7] "class_year_1"    "major_1"         "email_1"        
## [10] "participation_1" "diet_1"          "last_name_2"    
## [13] "first_name_2"    "school_2"        "tshirt_size_2"  
## [16] "class_year_2"    "major_2"         "email_2"        
## [19] "participation_2" "diet_2"          "last_name_3"    
## [22] "first_name_3"    "school_3"        "tshirt_size_3"  
## [25] "class_year_3"    "major_3"         "email_3"        
## [28] "participation_3" "diet_3"          "last_name_4"    
## [31] "first_name_4"    "school_4"        "tshirt_size_4"  
## [34] "class_year_4"    "major_4"         "email_4"        
## [37] "participation_4" "diet_4"          "last_name_5"    
## [40] "first_name_5"    "school_5"        "tshirt_size_5"  
## [43] "class_year_5"    "major_5"         "email_5"        
## [46] "participation_5" "diet_5"          "photo"

Using stringr, we can get these variable names in concise snake_case shape:

names(part_wide) <- names(part_wide) %>%
  str_replace(" of team member", "") %>%
  str_replace(" in DataFest before", "") %>%
  str_replace(" Check all that apply.", "") %>%
  str_replace("Email address", "email") %>%
  str_replace("Dietary restrictions", "diet") %>%
  str_replace("Check if you agree", "photo") %>%
  str_replace("\\:", "") %>%
  str_replace("-", "") %>%
  str_replace_all(" ", "_") %>%
  tolower()
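To see what each replacement does, here is a sketch that runs the same kind of pipeline on a handful of hypothetical form questions. The question wording below is invented to match the patterns being replaced; the real headers on the form may differ.

```r
library(stringr)

# Hypothetical question text, as it might arrive from a Google Form
messy <- c(
  "Timestamp",
  "Team name",
  "Last name of team member 1",
  "T-shirt size of team member 1",
  "Email address of team member 1",
  "Dietary restrictions of team member 1: Check all that apply."
)

clean <- messy %>%
  str_replace(" of team member", "") %>%          # drop the repeated phrase
  str_replace(" Check all that apply.", "") %>%   # drop the instruction text
  str_replace("Email address", "email") %>%       # shorten long labels
  str_replace("Dietary restrictions", "diet") %>%
  str_replace(fixed(":"), "") %>%                 # remove a literal colon
  str_replace("-", "") %>%                        # "T-shirt" becomes "Tshirt"
  str_replace_all(" ", "_") %>%                   # spaces to underscores
  tolower()

clean
```

The order matters: the phrase-level replacements run first, while the spaces they depend on are still in place, and the conversion to underscores and lower case comes last.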

We can see that things look a lot better now:

names(part_wide)

##  [1] "timestamp"       "team_name"       "last_name_1"    
##  [4] "first_name_1"    "school_1"        "tshirt_size_1"  
##  [7] "class_year_1"    "major_1"         "email_1"        
## [10] "participation_1" "diet_1"          "last_name_2"    
## [13] "first_name_2"    "school_2"        "tshirt_size_2"  
## [16] "class_year_2"    "major_2"         "email_2"        
## [19] "participation_2" "diet_2"          "last_name_3"    
## [22] "first_name_3"    "school_3"        "tshirt_size_3"  
## [25] "class_year_3"    "major_3"         "email_3"        
## [28] "participation_3" "diet_3"          "last_name_4"    
## [31] "first_name_4"    "school_4"        "tshirt_size_4"  
## [34] "class_year_4"    "major_4"         "email_4"        
## [37] "participation_4" "diet_4"          "last_name_5"    
## [40] "first_name_5"    "school_5"        "tshirt_size_5"  
## [43] "class_year_5"    "major_5"         "email_5"        
## [46] "participation_5" "diet_5"          "photo"

Finally, we use dplyr and tidyr to transform the data from wide to long:

participants <- part_wide %>%
  select(-photo) %>%
  gather(column, entry, last_name_1:diet_5, -timestamp, -team_name) %>%
  mutate(person_in_team = str_extract(column, "[0-9]")) %>%   # str_extract() returns a vector; str_match() returns a matrix
  mutate(column = str_replace(column, "_[0-9]", "")) %>%
  spread(column, entry) %>%
  filter(!is.na(last_name)) %>%
  arrange(team_name, last_name, first_name) %>%
  select(-person_in_team) %>%
  select(timestamp, team_name, first_name, last_name, email, school, 
         class_year, major, participation, diet, tshirt_size)    # reorder

Let’s take a peek:

participants

## # A tibble: 16 x 11
##             timestamp          team_name first_name last_name
##                 <chr>              <chr>      <chr>     <chr>
## 1   4/2/2017 22:20:05      Bae's Theorem   Adrienne    Fuller
## 2   4/2/2017 22:20:05      Bae's Theorem     Sylvia     Hicks
## 3   4/2/2017 22:20:05      Bae's Theorem       Toni   Simpson
## 4   4/2/2017 22:20:05      Bae's Theorem      Vicky     Water
## 5    4/4/2017 1:03:26      Bayes Anatomy   Meredith      Gray
## 6    4/4/2017 1:03:26      Bayes Anatomy      Derek  Shepherd
## 7   4/3/2017 16:14:00         Fake iid's    Carolyn      Byrd
## 8   4/3/2017 16:14:00         Fake iid's     Gordon   Hawkins
## 9   4/3/2017 16:14:00         Fake iid's    Cecilia   Pittman
## 10  4/3/2017 16:14:00         Fake iid's       Paul      Rios
## 11  4/3/2017 16:14:00         Fake iid's       Ryan      Rose
## 12 3/31/2017 23:55:00 Passive Regression     Amanda      Boyd
## 13 3/31/2017 23:55:00 Passive Regression       Rosa       Fox
## 14 3/31/2017 23:55:00 Passive Regression      Lucas  Gonzales
## 15  4/1/2017 20:14:05            The Pit      James   Andrews
## 16  4/1/2017 20:14:05            The Pit        Tom  Lawrence
## # ... with 7 more variables: email <chr>, school <chr>, class_year <chr>,
## #   major <chr>, participation <chr>, diet <chr>, tshirt_size <chr>

We can now easily look for duplicates (sometimes students sign up twice or more times) or use these data to explore the various features of participants.
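As one way to look for duplicates, the sketch below flags email addresses that appear more than once. The roster is made up, and treating email as the identifier for a student is an assumption.

```r
library(dplyr)

# Made-up roster in which one student signed up with two teams
roster <- data.frame(
  team_name  = c("Team A", "Team A", "Team B", "Team B"),
  first_name = c("Ana", "Ben", "Ana", "Cy"),
  email      = c("ana@example.edu", "ben@example.edu",
                 "ana@example.edu", "cy@example.edu"),
  stringsAsFactors = FALSE
)

# Emails that appear more than once
dupes <- roster %>%
  count(email) %>%
  filter(n > 1)

# All sign-up rows belonging to those emails, for manual follow-up
repeat_signups <- semi_join(roster, dupes, by = "email")

repeat_signups
```

Listing the repeated rows rather than dropping them automatically leaves the decision about which team the student belongs to with the organizer.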

Individual sign-ups

If a student wants to participate in DataFest but doesn’t have a team in mind, we ask them to fill out a brief survey where they answer questions about their background as well as how much time they want to commit to DataFest, ranging from “I’m in it to win it” to “I’m more interested in the experience, and am not really sure if I’ll submit a final presentation.”

Sometimes students find a team and register with that team after having filled out this survey. These students should be removed from the list of those looking for teammates, though there is no easy way for them to do so in Google Forms (they can’t go back and remove their response).

However, we can easily do this with an anti_join. Suppose the data frame of survey responses is called looking, and remember that the earlier data frame of students registering with teams was called participants.

looking <- anti_join(looking, participants, by = "email")
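Here is the same idea on a made-up example, to show which rows anti_join() keeps. The names and email addresses are invented.

```r
library(dplyr)

# Made-up survey responses from students looking for teammates
looking_toy <- data.frame(
  first_name = c("Dee", "Eli", "Fay"),
  email      = c("dee@example.edu", "eli@example.edu", "fay@example.edu"),
  stringsAsFactors = FALSE
)

# Made-up team registrations; Eli has since joined a team
participants_toy <- data.frame(
  team_name = "Team A",
  email     = "eli@example.edu",
  stringsAsFactors = FALSE
)

# Keep only rows of looking_toy with no matching email in participants_toy
still_looking <- anti_join(looking_toy, participants_toy, by = "email")

still_looking
```

The match is exact, so if students type their address differently on the two forms (for example, with different capitalization), it may be worth normalizing the email column with tolower() before joining.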

Then, the survey results are made available to the same students who are looking for teammates so that they can match up with others and form a team.

looking %>%
  select(first_name, last_name, participation_level, class_year, major, school, participation_before, email) %>%
  arrange(participation_level, class_year, major, school) %>%
  datatable()

Here we make use of the datatable function in the DT package to display the list of students in a pretty and easily sortable and searchable format.

Consultants and judges

Using a similar approach we can also grab, organize, and report lists of consultants and judges. All relevant code for this can be found in the datafest GitHub repo.

Participant summary

Now that we have our participant data in a tidy format, we can visualize distributions of majors, years, previous participation etc.

For example, we can count how many teams are participating from each school:

participants %>%
  distinct(team_name, .keep_all = TRUE) %>%
  count(school) %>%
  arrange(desc(n))

## # A tibble: 3 x 2
##                    school     n
##                     <chr> <int>
## 1           Faber College     2
## 2 Port Chester University     2
## 3     Harrison University     1

or visualize the distribution of class years per school:

ggplot(data = participants, aes(x = school, fill = class_year)) +
  geom_bar(position = "fill") +
  labs(title = "Schools and class years")

Information guides

We can also use R Markdown to create documents that are mostly text, that introduce the event to the participants, consultants, and judges. Then, summary statistics and visualizations of the participants can easily be included in these guides.

Sample guides for participants and consultants/judges can also be found on the GitHub repo.

And finally, all of these can be published on RPubs. However, note that these documents will be publicly available.

How to Teach R: Common mistakes

Wed, 22 Feb 2017 00:00:00 +0000

Would you like to teach people to use R? If so, I would like to jump-start your efforts.

I’m one half of RStudio’s education team, and I’ve taught thousands of people to use R, usually in face-to-face workshops. Over time, I’ve come to appreciate that teaching R in a short workshop is an unusual challenge that requires an unusual approach: you cannot teach a short workshop in the same way that you would teach a college course, and you should not teach R in the same way you would teach Python, UNIX or C.

In the next few blog posts, I’ll share the pedagogy that I’ve adopted for teaching R workshops. These ideas have made my life easier and my students happier (based on student feedback). I think they can do the same for you.

We’ll begin in this post by identifying common mistakes that ensnare new R teachers. Each of these mistakes seems like a good idea at first glance, but leads to an unsuccessful short workshop, and I’ll tell you why. To make things simple, I’ve recast each mistake as a principle to follow. Let’s examine them one by one:

DO NOT teach R as if it were a programming language. Why not? Because R is a programming language for doing data science. You can be confident that your students want to use R to make graphs, fit models, and impress their colleagues. Show them how to do these empowering things and then teach programming later, as a way to do these things even better. To be honest, if your students only wanted to learn how to program, they would be studying another language.
DO NOT assume that methods that work well in a college classroom will work well in a one-, two-, or half-day workshop. Active learning, peer-led instruction, group projects, flipped classrooms, and other techniques have improved college so much that I wish I could go back and do college over again. But these techniques take more time to convey information than you have in a workshop. They also work best with motivated students who are accustomed to learning. Do you have those? If your workshops are like mine, you have busy individuals who have set aside precious time and money to take your workshop. To be frank, they want to acquire more information than you can provide in a day of active learning or peer-led instruction. I’m not saying that you shouldn’t use these techniques (please try!), but expect to modify them heavily.
DO NOT avoid the lecture. Some teaching gurus will do somersaults to avoid lectures because lectures are too passive for students and too easy to do poorly. Some extremists even extend this notion to slides, claiming that good teachers should not use slides. If you adopt this mindset, you will fail at the one thing that your students expect you to do well: to convey large amounts of information in a short period of time. Not only should you embrace lecturing, i.e., presenting information, you should become an expert at it. Learn to present effectively, and learn to intermix presentations with activities that keep your workshop engaging.
DO NOT assume that you can teach someone else’s workshop out of the box, even if it is your own. A workshop is not like a video that you make once and then replay when needed. A workshop is more like a play that must be cast, costumed, and rehearsed each time you present it in a new venue. If you think you can reproduce a workshop quickly because “it already exists,” you are setting yourself up for failure. If you let your manager think this, you are setting yourself up for stress!
DO NOT let your workshop become a consulting clinic for installation bugs. Workshops make first impressions just like people do. You want to use the first minutes of your workshop to set an energetic tone, to engage your students, and to inspire them — not to hop from student to student debugging installation problems. Do what it takes to avoid this situation. My favorite solution is to provide a classroom RStudio Server for students to use.

But what if you feel that students deserve to leave your workshop with the software successfully installed on their computer? Then you are in good company! My mentor, Hadley Wickham, argues for this persuasively and enthusiastically. But make it happen in a way that does not torpedo your workshop. Hold a real clinic. Pass out instructions in advance and demand that any problems be reported ahead of time. Make successful installation a prerequisite for registering. Be sure that your students know that if they do not have permission to install necessary software on their work laptop, they should bring a different laptop. Be creative and cover the bases.

Whatever you do, remember that the hour immediately before class is less than ideal for installing software. You have other tasks to attend to, and inevitably some students will come late and bring bugs.

I’ll have more to say about each of these topics in the posts that follow. In those posts, I’ll try to lay out a fun, inspiring vision for how an R workshop works; no more “thou shalt nots.” See you there!

Three Tips for Training Excel Users in R

Fri, 10 Feb 2017 00:00:00 +0000

“I’m not a coder” or “I was never good at math” is a frequent refrain I hear when I ask professionals about their data analysis skills. Through popular culture and stereotypes, most people who don’t have a background in programming automatically underestimate their ability to create amazing things with code. However, Data Society has proven that this is a false narrative through our training program – with students in over 20 countries and many government and enterprise clients, we’ve seen so-called “non-coders” proficiently put together automated data cleaning code scripts and analyses within a few weeks. So how do we do it? Well, we’ve singled out three key steps to get someone started on their journey to an amazing skill set and more powerful data analytics:

Start simple. My first time training up a group of instructors to teach R programming was a great experience (especially given that these R trainings will train up under-served adults), but it wasn’t without its challenges. Anyone with a non-programming background can feel intimidated when they get started, and it’s critical to start off by using R as a glorified calculator. Start with basic operations to get your students comfortable with the interface of RStudio (an awesome and intuitive interface for R), then you can move on to more advanced functions.

See? This doesn’t look so complicated.

Show the similarities. Excel and R have many similarities in syntax that make it easier for Excel users to transfer their skills over to R. For example, RStudio has a tabular data view that may look very familiar to spreadsheet users:

This looks like another program I’ve used before…

And this is not just limited to viewing data. There is a lot of syntax from Excel that is easily transferable to R. For example, using if-else statements in Excel looks like this:

And here is what it looks like in R:
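The screenshots from the original post are not reproduced here. As a stand-in, an Excel formula such as =IF(A2 >= 50, "Pass", "Fail") has a close counterpart in R's ifelse(). The scores and the cutoff of 50 below are made up for illustration.

```r
# Made-up scores, standing in for cells A2:A5 of a spreadsheet
scores <- c(72, 45, 50, 38)

# Excel: =IF(A2 >= 50, "Pass", "Fail"), filled down the column
# R: ifelse() applies the same test to every element at once
result <- ifelse(scores >= 50, "Pass", "Fail")

result
```

The argument order is the same in both tools: the test, the value if true, then the value if false.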

The learning curve for R is a fast one, especially for Excel users. Highlighting those similarities puts new users at ease and gives them a way to connect R functionality to the functionality they already use in Excel.

Bring out the wow factor. One of the main advantages of R is its amazing visualizations and advanced analysis that are not even possible in many more basic tools. Our “wow factor” is the collaboration project we did with the White House initiative, The Opportunity Project, which aimed to connect federal agencies with tech companies to build out tools with their open data. In a couple of short weeks, our team combined several disparate open data sources and built an interactive application entirely in R to provide schools and superintendents with a community resource mapper.

Showing students how to eliminate duplicates from data sets with one or two lines of code, or how to quickly manipulate data into a different format, is a wow factor because of the amount of time it can save regular Excel users. The applications are immediately apparent for those who have struggled to go through thousands of rows manually or upload a data set with millions of records.
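As a sketch of what this can look like in base R, the made-up table below contains one exact duplicate row, which a single line removes.

```r
# Made-up records with one exact duplicate row
records <- data.frame(
  id   = c(1, 2, 2, 3),
  city = c("Irvine", "Orange", "Orange", "Durham"),
  stringsAsFactors = FALSE
)

# duplicated() flags every repeat of an earlier row; keep the rest
deduped <- records[!duplicated(records), ]

deduped
```

The same line works unchanged whether the table has four rows or four million, which is where the time savings over manual checking come from.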

R is gaining in popularity, with millions of users worldwide and growing. Not only that, but we’ve seen an increase in demand for data analysis skills across all job sectors. Adding R programming and data analysis to your resume can add $10,000 - $15,000 to your salary. With that type of incentive, both in pay and in time saved, there’s no better time to take the “not” out of “I’m not a coder”.

The Data Society is a data science training platform for professionals. Among other government and corporate clients, Data Society has trained staff at the Department of Commerce and the U.S. Army through their enterprise firm, Data Society Solutions, which provides customized corporate data science training and consulting services. If you’d like to learn more, please email solutions@datasociety.co.

Writing Good R Code and Writing Well

Fri, 02 Dec 2016 00:00:00 +0000

If you are aspiring to write good R code, you may find it helpful to occasionally spend some time reading about writing: reading about writing R code, and reading about writing about R code that you’ve written. (If you write some excellent R code, you will likely have the opportunity to write about it.)

For reading about writing good R code, a place to start might be one of the many R style guides available. Hadley Wickham includes a style guide in his Advanced R book that is short and sweet and based on the only slightly less concise Google’s R Style Guide. Also have a look at Graham Williams’ Sharing R Code - With Style, a hybrid tutorial / style guide that is part of his HandsOnDataScience work in progress. In his elegant introduction, Graham writes:

Data scientists write programs to ingest, manage, wrangle, visualise, analyse and model data in many ways. It is an art to be able to communicate our explorations and understandings through a language, albeit a programming language. Of course our programs must be executable by computers but computers care little about our programs except that they be syntactically correct. Our focus should be on engaging others to read and understand the narratives we present through our programs.

Along these lines, Laurent Gatto offers his ideas on “better R programming in terms of cleaner and elegant syntax …” in his twenty-page tutorial Writing better R code.

If you are just beginning to write R functions, you may find Slawa Rokicki’s post on How to write and debug an R function or Cosma Shalizi’s write-up on Writing R Functions helpful. If you do take the trouble to work through one of these style guides, please do adhere to at least the first rule of veteran programmer Diomidis Spinellis’ 15 Rules for Writing Quality Code.

Finally, I’m for taking Graham Williams seriously: programming is an art of communication, and I believe that practicing this art may also involve occasionally writing engaging narrative about programs. Here are a few short but valuable guides to writing well. William Zinsser is the master of explaining the art of clear and unaffected writing. His 10 Writing Tips may lead you to spend some time with his book. Also worth a look are the writing tips from Morgan Ostrowsky, C. S. Lakin, and the New York Times’ Amanda Christy Brown and Katherine Schulten. Three of these four writing coaches advise would-be writers to read, read widely beyond their area of expertise, and read the masters. With regard to this advice, I’ve found this short video by novelist Yann Martel inspiring.