Persistent Configuration for the R Developer
By Jimmy Briggs
Does your R package work best with some configuration? You probably want it to be easily found by your package.
Does your R package download huge datasets that don’t change much on the provider side? Maybe you want to save the corresponding data somewhere persistent so that things will go faster during the next R session.
Preface: standard locations on the user’s machine
Throughout this post we’ll often refer to standard locations on the user’s machine.
“Applications can actually store user level configuration information, cached data, logs, etc. in the user’s home directory, and there is a standard way to do this [depending on the operating system].”
R packages that are on CRAN cannot write to the home directory without getting confirmation from the user, but they can and should use standard locations.
To find where those are, package developers can use the rappdirs package.
# Using a reference class object
rhub_app <- rappdirs::app_dir("rhub", "r-hub")
rhub_app$cache()
## [1] "C:\\Users\\JIMMY~1.BRI\\AppData\\Local\\r-hub\\rhub\\Cache"
# or functions
rappdirs::user_cache_dir("rhub")
## [1] "C:\\Users\\JIMMY~1.BRI\\AppData\\Local\\rhub\\rhub\\Cache"
On top of these non-R specific standard locations, we’ll also mention the standard homes of R options and environment variables, .Rprofile and .Renviron.
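As a minimal sketch of how a package might use such a standard location, the helper below resolves a cache directory and creates it on first use. The package name "mypkg" and the function name are hypothetical; the `path` argument is only there so the example can run against a temporary directory instead of the real location.

```r
# Hypothetical helper: return the package cache directory, creating it if needed.
# "mypkg" is a placeholder name, not a real package.
mypkg_cache_dir <- function(path = rappdirs::user_cache_dir("mypkg")) {
  if (!dir.exists(path)) {
    # recursive = TRUE because parent folders may not exist yet
    dir.create(path, recursive = TRUE, showWarnings = FALSE)
  }
  normalizePath(path, winslash = "/")
}

# Try it against a temporary directory rather than the real location
demo_dir <- mypkg_cache_dir(file.path(tempdir(), "mypkg-cache"))
dir.exists(demo_dir)
```

Because the default for `path` is evaluated lazily, rappdirs is only needed when no path is supplied.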
User preferences
As written in Android developer guidance and probably every customer service guide ever, “Everyone likes it when you remember their name”. Everyone probably likes it too when the barista at their favorite coffee shop remembers their usual order. As an R package developer, what can you do for your R package to correctly assess user preferences and settings?
Using options
In R, options allow the user to set and examine a variety of global options which affect the way in which R computes and displays its results. For instance, for the usethis package, the usethis.quiet option can control whether usethis is chatty [1]. Users either:
- write options(usethis.quiet = TRUE) at the beginning of a script or directly in the console;
- or write that same line in their .Rprofile, which is loaded before every session and is therefore more persistent.
Users can use a project-level or more global user-level .Rprofile.
The use of a project-level .Rprofile overrides the user-level .Rprofile unless the project-level .Rprofile contains the following lines, as mentioned in the blogdown book:
# in .Rprofile of the project
if (file.exists('~/.Rprofile')) {
base::sys.source('~/.Rprofile', envir = environment())
}
# then set project options
For more startup tweaks, the user could adopt the startup package.
As a package developer, you can retrieve options in your code with getOption(), whose second argument is a fallback for when the option hasn’t been set by the user.
Note that an option can be any R object.
options(blabla.foo = TRUE)
if (isTRUE(getOption("blabla.foo", FALSE))) {
message("foo!")
}
## foo!
options(blabla.bar = mean)
getOption("blabla.bar")(c(1:7))
## [1] 4
The use of options in the .Rprofile startup file is great for workflow packages like usethis, blogdown, etc., but shouldn’t be used for, say, arguments influencing the results of a statistical function.
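Building on getOption(), a package can centralise its preference lookup in one small helper. The sketch below is only an illustration: the function name blabla_opt() and the lookup order (R option, then environment variable, then built-in default) are assumptions, not a convention taken from any particular package.

```r
# Hypothetical lookup used inside a package called "blabla":
# 1. R option blabla.<name>, 2. env var BLABLA_<NAME>, 3. built-in default
blabla_opt <- function(name, default = NULL) {
  opt <- getOption(paste0("blabla.", name))
  if (!is.null(opt)) return(opt)
  env <- Sys.getenv(paste0("BLABLA_", toupper(name)), unset = NA_character_)
  if (!is.na(env)) return(env)  # note: environment variables are always strings
  default
}

options(blabla.foo = TRUE)
blabla_opt("foo", default = FALSE)     # set through options()
blabla_opt("verbose", default = "no")  # falls back to the default if unset
```

Keeping the lookup in one place makes it easier to document which options exist and what their defaults are.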
Using environment variables
Environment variables, found via Sys.getenv() rather than getOption(), are often used for storing secrets (like GITHUB_PAT for the gh package) or the path to secrets on disk (like TWITTER_PAT for rtweet), or things that are not secrets (e.g. the browser to use for chromote).
Similar to using options() in the console or at the top of a script, the user could use Sys.setenv().
Obviously, secrets should not be written at the top of a script that’s public.
To make environment variables persistent they need to be stored in a startup file, .Renviron.
Unlike .Rprofile, .Renviron does not contain R code, but rather key-value pairs that are read via Sys.getenv().
As a package developer, you probably want to at least document how to set persistent variables or provide a link to such documentation; and you could even provide helper functions like rtweet does.
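To make this concrete, here is a minimal sketch of an accessor that reads a token from an environment variable and fails with a helpful message when it is missing. The variable name MYPKG_TOKEN and the function name are hypothetical, and the temporary file only simulates what a real .Renviron would do at startup.

```r
# Hypothetical accessor for a token stored in the MYPKG_TOKEN environment variable
mypkg_token <- function() {
  token <- Sys.getenv("MYPKG_TOKEN", unset = "")
  if (!nzchar(token)) {
    stop(
      "No token found. Add a line such as MYPKG_TOKEN=<your token> ",
      "to your .Renviron and restart R.",
      call. = FALSE
    )
  }
  token
}

# Simulate a persistent .Renviron with a temporary file
renviron <- tempfile(fileext = ".Renviron")
writeLines("MYPKG_TOKEN=not-a-real-token", renviron)
readRenviron(renviron)  # in real use, R reads the user's .Renviron at startup
mypkg_token()
```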
Using credential stores for secrets
Although API keys, say, are often stored in .Renviron, they could also be stored in a standard and more secure location depending on the operating system.
The keyring package lets you interact with such credential stores.
You could either take it on as a dependency, as gh does, or recommend that users of your package use keyring and add a line like
Sys.setenv(SUPERSECRETKEY = keyring::key_get("myservice"))
in their scripts.
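If you would rather not force a keyring dependency on everyone, one possible design is to check the environment variable first and only consult the credential store when keyring is installed. The function below is a hypothetical sketch of that idea; "myservice" and SUPERSECRETKEY are the placeholder names used above.

```r
# Hypothetical lookup: prefer an environment variable, fall back to the
# system credential store if the keyring package is installed.
get_supersecretkey <- function(service = "myservice") {
  key <- Sys.getenv("SUPERSECRETKEY", unset = "")
  if (nzchar(key)) return(key)
  if (requireNamespace("keyring", quietly = TRUE)) {
    # key_get() errors if the key was never stored with keyring::key_set()
    return(keyring::key_get(service))
  }
  stop(
    "Set SUPERSECRETKEY or store the key with keyring::key_set().",
    call. = FALSE
  )
}
```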
Using a config file
The batchtools package expects its users to set up a config file somewhere if they don’t want to use the defaults.
That somewhere can be several locations, as explained in the batchtools::findConfFile() manual page.
Two of the possibilities are rappdirs::user_config_dir("batchtools", expand = FALSE) and rappdirs::site_config_dir("batchtools"), which refer to standard locations that are different depending on the operating system.
The golem package offers its users the possibility to use a config file based on the config package.
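The general pattern of overlaying a user's config file on package defaults can be sketched in base R. This is not how batchtools or golem do it; the file format ("key: value" lines read with read.dcf()), the defaults, and the function names are all assumptions made for illustration.

```r
# Hypothetical defaults for a package called "mypkg"
mypkg_defaults <- list(verbose = "false", workers = "2")

# Read "key: value" pairs from a config file and overlay them on the defaults.
# In real use, `file` could live under rappdirs::user_config_dir("mypkg").
mypkg_config <- function(file) {
  if (!file.exists(file)) return(mypkg_defaults)
  fields <- read.dcf(file)
  if (nrow(fields) == 0) return(mypkg_defaults)
  user <- stats::setNames(as.list(fields[1, ]), colnames(fields))
  utils::modifyList(mypkg_defaults, user)
}

conf_file <- tempfile(fileext = ".dcf")
writeLines("workers: 4", conf_file)
mypkg_config(conf_file)$workers  # user value wins
mypkg_config(conf_file)$verbose  # untouched default
```

Values come back as strings here, so a real implementation would also need to convert and validate them.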
A good default experience
Obviously, on top of letting users set their own preferences, you probably want your package functions to have sensible defaults.
Asking or guessing?
For basic information such as username, email, GitHub username, the whoami package does pretty well.
whoami::whoami()
## username fullname
## "jimmy.briggs" "Jimmy Briggs"
## email_address gh_username
## "jimmy.briggs@oliverwyman.com" "jimbrig"
whoami::email_address()
## [1] "jimmy.briggs@oliverwyman.com"
In particular, for the email address, if the R environment variable EMAIL isn’t set, whoami uses a call to git to find Git’s global configuration.
Similarly, the gert package can find and return Git’s preferences via gert::git_config_global() [2].
In these cases where packages guess something, their guessing is based on the use of standard locations for such information on different operating systems. Unsurprisingly, in the next section, we’ll recommend using such standard locations when caching data.
Not so temporary files
To quote the Android developers guide again, “Persist as much relevant and fresh data as possible.”
A package that exemplifies doing so is getlandsat, which downloads “Landsat 8 data from AWS public data sets” from the web.
The first time the user downloads an image, the result is cached so next time no query needs to be made.
A very nice aspect of getlandsat is that it provides cache management functions:
library("getlandsat")
# list files in cache
lsat_cache_list()
## character(0)
# List info for single files
lsat_cache_details(files = lsat_cache_list()[1])
## <landsat cached files>
## directory: C:\Users\JIMMY~1.BRI\AppData\Local\landsat-pds\landsat-pds\Cache
##
## file: NA
## size: NA mb
lsat_cache_details(files = lsat_cache_list()[2])
## <landsat cached files>
## directory: C:\Users\JIMMY~1.BRI\AppData\Local\landsat-pds\landsat-pds\Cache
##
## file: NA
## size: NA mb
# List info for all files
lsat_cache_details()
## <landsat cached files>
## directory: C:\Users\JIMMY~1.BRI\AppData\Local\landsat-pds\landsat-pds\Cache
# delete files by name in cache
# lsat_cache_delete(files = lsat_cache_list()[1])
# delete all files in cache
# lsat_cache_delete_all()
The getlandsat package uses the rappdirs package we mentioned earlier.
lsat_path <- function() rappdirs::user_cache_dir("landsat-pds")
When using rappdirs, keep caveats in mind.
If you hesitate between e.g. rappdirs::user_cache_dir() and rappdirs::user_data_dir(), use a GitHub code search to see what other packages chose.
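The cache-then-reuse behaviour and the accompanying management functions can be sketched with a few lines of base R. This is not getlandsat's code: the function names are hypothetical, the "download" is simulated by a function that returns a random number, and the cache lives in a temporary directory so the example leaves nothing behind.

```r
# Hypothetical helpers; `cache_dir` would typically default to something like
# rappdirs::user_cache_dir("mypkg") in a real package.
cache_get <- function(key, fetch, cache_dir) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  path <- file.path(cache_dir, paste0(key, ".rds"))
  if (file.exists(path)) return(readRDS(path))  # cache hit: no query
  value <- fetch()                              # cache miss: do the slow work
  saveRDS(value, path)
  value
}
cache_list <- function(cache_dir) list.files(cache_dir, pattern = "\\.rds$")
cache_delete_all <- function(cache_dir) unlink(cache_dir, recursive = TRUE)

cache_dir <- file.path(tempdir(), "mypkg-demo-cache")
calls <- 0
slow_fetch <- function() {
  calls <<- calls + 1  # count how often the "download" really runs
  stats::runif(1)
}
first <- cache_get("scene", slow_fetch, cache_dir)
second <- cache_get("scene", slow_fetch, cache_dir)
identical(first, second)  # the second call read the cached file
cache_list(cache_dir)
cache_delete_all(cache_dir)
```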
rappdirs or not
To use an app directory from within your package you can use rappdirs as mentioned earlier, but also other tools.
- Package developers might also like the hoardr package, which basically creates an R6 object building on rappdirs with a few more methods (directory creation, deletion).
Lots of pkgs already roll their own version of this, some use {rappdirs}. I roll my own in {R.cache} for memoization to file. In {startup}, I use it for optional anacron-like hourly,daily,weekly, ... startup scripts
— Henrik Bengtsson (@henrikbengtsson) February 28, 2020
- Some package authors “roll their own”, like Henrik Bengtsson in R.cache.
Oh là là! R-devel just gained support for OS-agile user-specific #rstats cache/config/data folders:
— Henrik Bengtsson (@henrikbengtsson) February 28, 2020
> tools::R_user_dir("aPkg", "cache")
[1] "/home/alice/.cache/R/aPkg"
> tools::R_user_dir("aPkg", "config")
[1] "/home/alice/.config/R/aPkg"
This is big! https://t.co/sfH87AwGZX
- R-devel “just gained support for OS-agile user-specific #rstats cache/config/data folders”, which is big (but if you use the base R implementation available after R 4.x.y, you’ll need to backport the functionality unless your package depends on R above that version [3]).
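One way to cope with R versions that lack tools::R_user_dir() is to check for the function at run time and fall back to something else. The helper below is a hypothetical sketch of that idea; note that the fallback paths will generally differ from the ones base R would return, so switching between them later would orphan existing files.

```r
# Hypothetical helper: use tools::R_user_dir() when this version of R provides
# it, otherwise fall back to rappdirs (if installed) or a temporary directory.
pkg_user_dir <- function(pkg, which = c("cache", "config", "data")) {
  which <- match.arg(which)
  tools_ns <- asNamespace("tools")
  if (exists("R_user_dir", envir = tools_ns, inherits = FALSE)) {
    r_user_dir <- get("R_user_dir", envir = tools_ns)
    return(r_user_dir(pkg, which))
  }
  if (requireNamespace("rappdirs", quietly = TRUE)) {
    return(switch(which,
      cache  = rappdirs::user_cache_dir(pkg),
      config = rappdirs::user_config_dir(pkg),
      data   = rappdirs::user_data_dir(pkg)
    ))
  }
  file.path(tempdir(), pkg, which)  # last resort: not persistent
}

pkg_user_dir("aPkg", "cache")
```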
More or less temporary solutions
This section presents solutions for caching results very temporarily, or less temporarily.
Caching results within an R session
To cache results within an R session, you could use a temporary directory for data.
For any function call you could use memoise, which supports, well, memoization, a technique best explained with an example.
time <- memoise::memoise(Sys.time)
time()
## [1] "2020-09-16 23:20:45 EDT"
Sys.sleep(1)
time()
## [1] "2020-09-16 23:20:45 EDT"
Only the first call to time() actually calls Sys.time(); after that the result is saved for the entire session unless memoise::forget() is called.
It is great for speeding up code, and for not abusing internet resources, which is why the polite package wraps memoise.
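To see what memoization does under the hood, here is a deliberately simplified base R sketch. It is not how memoise is implemented: it only handles functions of a single scalar argument and uses the argument's character representation as the cache key.

```r
# Simplified illustration of memoization; memoise itself is far more general.
memoise_sketch <- function(f) {
  cache <- new.env(parent = emptyenv())
  function(x) {
    key <- as.character(x)
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, f(x), envir = cache)  # first call: compute and store
    }
    get(key, envir = cache, inherits = FALSE)  # later calls: reuse
  }
}

n_calls <- 0
slow_square <- function(x) {
  n_calls <<- n_calls + 1  # count real evaluations
  x^2
}
fast_square <- memoise_sketch(slow_square)
fast_square(3)
## [1] 9
fast_square(3)
## [1] 9
n_calls  # slow_square() ran only once
## [1] 1
```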
Providing a ready-to-use dataset in a non-CRAN package
If your package depends on a huge dataset that is the same for all users and by definition too big for CRAN, you can use a setup like the one presented by Brooke Anderson and Dirk Eddelbuettel, in which the data is packaged up in a separate package that is not on CRAN. The user installs that package, which saves the data on disk somewhere you can find it easily [4].
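On the code side, the main package then needs to check that the data package is available before using it. The sketch below only illustrates that check; "mypkgdata" is a placeholder name, the "extdata" folder is an assumption about how the data package is laid out, and the setup described by Anderson and Eddelbuettel involves more than this.

```r
# "mypkgdata" is a placeholder for the separate, non-CRAN data package
has_data_pkg <- function(pkg = "mypkgdata") {
  requireNamespace(pkg, quietly = TRUE)
}

load_big_data_dir <- function(pkg = "mypkgdata") {
  if (!has_data_pkg(pkg)) {
    stop(
      "Please install the '", pkg, "' data package first; ",
      "see this package's README for instructions.",
      call. = FALSE
    )
  }
  # system.file() locates files shipped inside the installed data package
  system.file("extdata", package = pkg)
}
```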
Conclusion
In this blog post we presented ways of saving configuration options and data in a not so temporary way in R packages.
We mentioned R startup files (options in .Rprofile and secrets in .Renviron, the startup package); the rappdirs and hoardr packages as well as an exciting similar feature in R devel; the keyring package.
Writing in the user home directory can be viewed as invasive (and can trigger CRAN archival), hence there is a need for a good package design (asking for confirmation; providing cache management functions like getlandsat does) and documentation for transparency.
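Asking for confirmation can be as simple as the following sketch, which refuses to create a directory unless the user agrees in an interactive session. The function name and the choice to never write in non-interactive sessions are assumptions; your package might prefer an explicit opt-in argument or option instead.

```r
# Hypothetical helper: create `path` only after the user has agreed.
confirm_cache_dir <- function(path, ask = interactive()) {
  if (dir.exists(path)) return(TRUE)
  ok <- if (ask) {
    isTRUE(utils::askYesNo(paste0("Create cache directory '", path, "'?")))
  } else {
    FALSE  # never write silently in non-interactive sessions
  }
  if (ok) dir.create(path, recursive = TRUE)
  ok
}

confirm_cache_dir(tempdir())               # already exists, nothing to ask
confirm_cache_dir(tempfile(), ask = FALSE) # no confirmation, no directory
```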
Do you use any form of caching on disk with a default location in one of your packages? [5]
Additional Notes:
My Setup
.Rprofile
if (interactive()) {
suppressMessages(suppressWarnings(require(devtools)))
suppressMessages(suppressWarnings(require(testthat)))
}
# cran mirror
options(repos = c(CRAN='https://cran.rstudio.com/'))
# or
# options(repos = c(CRAN='https://cloud.r-project.org'))
# install packages in parallel via 'Ncpus' argument
# https://www.jumpingrivers.com/blog/speeding-up-package-installation/
# parallel::detectCores() == 8 (8 total cores, 4 physical, 4 hyper-threading, ~6 max)
options(Ncpus = 6L)
# turn on completion of installed package names
utils::rc.settings(ipck = TRUE)
# usethis / devtools
options(
usethis.protocol = "ssh",
usethis.description = list(
`Authors@R` = '
person("Jimmy", "Briggs",
email = "jimmy.briggs@oliverwyman.com",
role = c("aut", "cre"))',
License = "MIT + file LICENSE",
Language = "es"
)
)
# addinit
options(
"addinit" = list(
author = "Jimmy Briggs <jimmy.briggs@oliverwyman.com>",
project = list(
folders = list(
default = c("R", "inst", "man", "data-raw", "data", "tests", "vignettes"),
selected = c("R", "man")
),
packages = list(
default = rownames(utils::installed.packages()),
selected = "shiny"
)
)
)
)
# dev and old libraries
options(
"dev_lib" = "C:/Users/jimmy.briggs/Computer/Program Files/R/win-library/Dev",
"old_lib" = "C:/Users/jimmy.briggs/Computer/Program Files/R/win-library/Old"
)
# map network drives - runs a batch/vbs script file
if (!dir.exists("H:/")) {
cmd_path <- normalizePath("C:\\windows\\OWGLocalLoginScript.vbs")
cmd <- paste0("WScript ", '"', cmd_path, '"')
system(command = cmd, wait = TRUE)
if (dir.exists("H:/") && dir.exists("G:/")) {
usethis::ui_info("Successfully mapped H:/ and G:/ Drives!")
}
rm(cmd_path, cmd)
invisible(gc())
}
.Renviron
# timezone
TZ=UTC
# environment paths
R_HOME="C:\Users\jimmy.briggs\Documents"
R_LIBS_USER="C:\Users\jimmy.briggs\Computer\Program Files\R\win-library\4.0"
R_USERDATA_DIR="C:\Users\jimmy.briggs\AppData\Roaming\R\data\R"
RSTUDIO_MSYS_SSH="C:\Program Files/RStudio/bin/msys-ssh-1000-18"
RSTUDIO_PANDOC="C:/Program Files/RStudio/bin/pandoc"
MYSQL_HOME="C:/Users/jimmy.briggs/sql/"
JAVA_HOME="C:/java/jdk-11"
# Ghostscript
# qpdf
# etc.
# add Rtools and Rterm/R.exe to sys path
PATH="${RTOOLS40_HOME}\usr\bin;C:/Program Files/R/bin/x64/;${PATH}"
# specs
R_ARCH=x64
RSTUDIO_SESSION_PORT=24886
RSTUDIO_USER_IDENTITY=jimmy.briggs
GITHUB_USERNAME=jimbrig
# credentials
GITHUB_PAT=<secret>
TODOIST_API_TOKEN=<secret>
TMETRIC_API_TOKEN=<secret>
GOOGLEDRIVE_API_TOKEN=<secret>
# and more...
Further Reading:
1. Note that in tests usethis suppresses the chatty behavior by the use of withr::local_options(list(usethis.quiet = FALSE)).
2. The gert package uses libgit2, not Git directly.
3. There’s actually an R package called backports which provides backports of functions which have been introduced in one of the base packages in R version 3.0.1 or later, maybe it’ll provide backports for tools::R_user_dir()?
4. If your package has a helper for downloading and saving the dataset locally, and you don’t control the dataset source (contrary to the aforementioned approach), you might want to register several URLs for that content, as explained in the README of the conceptual contenturi package.
5. In file.path(rappdirs::user_data_dir("userdata", "r"), "validated_emails.csv"), C:~1.BRI/validated_emails.csv in my case.