In a previous post, I showed a very simple example of using the R function tools::CRAN_package_db()
to analyze information about CRAN packages. CRAN_package_db()
extracts the metadata CRAN stores on all of its 12,000-plus packages and arranges it into a “database”: in reality, a complicated data frame in which some columns contain vectors or lists as entries.
The function is simple to run and doesn’t take very long on my MacBook Air.
p_db <- tools::CRAN_package_db()
The following gives some insight into what’s contained in the data frame.
dim(p_db)
## [1] 12635 65
matrix(names(p_db),ncol=2)
## [,1] [,2]
## [1,] "Package" "Collate.windows"
## [2,] "Version" "Contact"
## [3,] "Priority" "Copyright"
## [4,] "Depends" "Date"
## [5,] "Imports" "Description"
## [6,] "LinkingTo" "Encoding"
## [7,] "Suggests" "KeepSource"
## [8,] "Enhances" "Language"
## [9,] "License" "LazyData"
## [10,] "License_is_FOSS" "LazyDataCompression"
## [11,] "License_restricts_use" "LazyLoad"
## [12,] "OS_type" "MailingList"
## [13,] "Archs" "Maintainer"
## [14,] "MD5sum" "Note"
## [15,] "NeedsCompilation" "Packaged"
## [16,] "Additional_repositories" "RdMacros"
## [17,] "Author" "SysDataCompression"
## [18,] "Authors@R" "SystemRequirements"
## [19,] "Biarch" "Title"
## [20,] "BugReports" "Type"
## [21,] "BuildKeepEmpty" "URL"
## [22,] "BuildManual" "VignetteBuilder"
## [23,] "BuildResaveData" "ZipData"
## [24,] "BuildVignettes" "Published"
## [25,] "Built" "Path"
## [26,] "ByteCompile" "X-CRAN-Comment"
## [27,] "Classification/ACM" "Reverse depends"
## [28,] "Classification/ACM-2012" "Reverse imports"
## [29,] "Classification/JEL" "Reverse linking to"
## [30,] "Classification/MSC" "Reverse suggests"
## [31,] "Classification/MSC-2010" "Reverse enhances"
## [32,] "Collate" "MD5sum"
## [33,] "Collate.unix" "Package"
Looking at a few rows and columns gives a feel for how complicated its structure is.
p_db[1:10, c(1,2,4,5)]
## Package Version Depends
## 1 A3 1.0.0 R (>= 2.15.0), xtable, pbapply
## 2 abbyyR 0.5.4 R (>= 3.2.0)
## 3 abc 2.1 R (>= 2.10), abc.data, nnet, quantreg, MASS, locfit
## 4 abc.data 1.0 R (>= 2.10)
## 5 ABC.RAP 0.9.0 R (>= 3.1.0)
## 6 ABCanalysis 1.2.1 R (>= 2.10)
## 7 abcdeFBA 0.4 Rglpk,rgl,corrplot,lattice,R (>= 2.10)
## 8 ABCoptim 0.15.0 <NA>
## 9 ABCp2 1.2 MASS
## 10 abcrf 1.7 R(>= 3.1)
## Imports
## 1 <NA>
## 2 httr, XML, curl, readr, plyr, progress
## 3 <NA>
## 4 <NA>
## 5 graphics, stats, utils
## 6 plotrix
## 7 <NA>
## 8 Rcpp, graphics, stats, utils
## 9 <NA>
## 10 readr, MASS, matrixStats, ranger, parallel, stringr, Rcpp (>=\n0.11.2)
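Even a simple question like “which packages are imported most often?” requires splitting the comma-separated Imports field and stripping version requirements. Here is a base-R sketch, shown on a toy sample of the column (with the real data you would start from p_db$Imports instead):

```r
# Toy sample of the Imports column; with the real data,
# use imports <- p_db$Imports instead.
imports <- c("httr, XML, curl", NA,
             "Rcpp, graphics, stats, utils",
             "readr, MASS, Rcpp (>= 0.11.2)")
# Split the comma-separated fields, drop version requirements
# like "(>= 0.11.2)", and trim the surrounding whitespace.
deps <- unlist(strsplit(imports[!is.na(imports)], ","))
deps <- trimws(gsub("\\(.*\\)", "", deps))
sort(table(deps), decreasing = TRUE)
```

This works for one field, but multiply it across dependency types, author strings, and the rest of the 65 columns and the appeal of a ready-made cleaning tool becomes obvious.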
So, having spent a little time learning how vexing working with this data can be, I was delighted when I discovered Ioannis Kosmidis’ cranly
package during my March “Top 40” review. cranly
is a very impressive package, built along tidy principles, that is helpful for learning about individual packages, analyzing the structure of package and author relationships, and searching for packages.
library(cranly)
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.5
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ─────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
The first really impressive feature is a “one button” clean function that does an amazing job of getting the data in shape to work with. In my preliminary work, I struggled just to get the author data clean. In the approach that I took, getting rid of text like [aut, cre] to get a count of authors took more regular expression work than I wanted to deal with. But clean_CRAN_db
does a good job of cleaning up the whole database. Note that the helper function clean_up_author
has a considerable number of hard-coded text strings that must have taken hours to get right.
package_db <- clean_CRAN_db(p_db)
Once you have the clean data, it is easy to run some pretty interesting analyses. This first example, straight out of the package vignette, builds the network of package relationships based on which packages import which, and then plots a summary for the top 20 most imported packages.
package_network <- build_network(package_db)
package_summaries <- summary(package_network)
plot(package_summaries, according_to = "n_imported_by", top = 20)
There is also a built-in function to compute the importance or relevance of a package using the page rank algorithm.
plot(package_summaries, according_to = "page_rank", top = 20)
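The idea behind page rank carries over directly from web pages to packages: a package is important if it is imported by packages that are themselves important. As a rough sketch of the computation (the toy graph and power iteration below are illustrative only, not cranly’s implementation):

```r
# Power-iteration PageRank on a toy "imports" graph; an entry
# A[i, j] = 1 means package i imports package j.
pkgs <- c("A", "B", "C", "D")
A <- matrix(0, 4, 4, dimnames = list(pkgs, pkgs))
A["A", "C"] <- 1; A["B", "C"] <- 1; A["C", "D"] <- 1
n <- length(pkgs)
M <- A
M[rowSums(M) == 0, ] <- 1 / n   # dangling nodes link everywhere
M <- M / rowSums(M)             # row-stochastic transition matrix
d <- 0.85                       # damping factor
r <- rep(1 / n, n)
for (i in 1:100) r <- (1 - d) / n + d * as.vector(t(M) %*% r)
names(r) <- pkgs
round(r, 3)                     # C and D outrank A and B
```

In this toy graph, C is imported by both A and B, and D inherits C’s importance, so both end up with higher scores than the leaf packages.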
The build_network
function also offers the opportunity to investigate the collaboration of package authors by building a network from the authors’ perspective.
author_network <- build_network(object = package_db, perspective = "author")
Here, we look at J.J. Allaire’s network. exact = FALSE
means that the author name is matched approximately rather than exactly, so variations in how the name is recorded across packages are still picked up.
plot(author_network, author = "JJ Allaire", exact = FALSE)
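A base-R sketch of the difference this makes (illustrative only; cranly’s actual matching logic may differ):

```r
# Author names on CRAN are recorded inconsistently, so exact
# matching can miss variants of the same person.
authors <- c("JJ Allaire", "J.J. Allaire", "Joe Allaire")
authors[authors == "JJ Allaire"]    # exact: a single match
authors[grepl("Allaire", authors)]  # inexact: all three variants
```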
It is also possible to study individual packages. Here, I plot the very simple dependency tree for the time series package xts
. There is a very good argument to be made that the simpler the dependency tree, the more stable and reliable the package.
xts_tree <- build_dependence_tree(package_network, "xts")
plot(xts_tree)
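Base R offers a way to cross-check such a tree: tools::package_dependencies() computes recursive dependencies from the CRAN index (note that this call needs a network connection, since available.packages() fetches the current index):

```r
# Recursive hard dependencies of xts from the current CRAN index.
# Requires a network connection to download the index.
deps <- tools::package_dependencies("xts",
                                    db = available.packages(),
                                    which = c("Depends", "Imports"),
                                    recursive = TRUE)
deps[["xts"]]
```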
As a final example, consider how the package_with()
function might be used to search for Bayesian packages, here by looking for packages with “Bayes” or “MCMC” in the name. I don’t believe that this exhausts the possibilities of cranly
, but it should be clear that the package is a very useful tool for looking into the mysteries of CRAN.
Bayesian_packages <- package_with(package_network, name = c("Bayes", "MCMC"))
plot(package_network, package = Bayesian_packages, legend=FALSE)
You may leave a comment below or discuss the post in the forum at community.rstudio.com.