NSE Functions with oshka

Overview

We will implement simplified versions of dplyr and data.table to illustrate how to write programmable NSE functions with oshka. The implementations are intentionally limited in functionality, robustness, and speed for the sake of simplicity.

An Ersatz dplyr

Interface

The interface is as follows:

group_r <- function(x, ...) {...}     # similar to dplyr::group_by
filter_r <- function(x, subset) {...} # similar to dplyr::filter
summarize_r <- function(x, ...) {...} # similar to dplyr::summarise
`%$%` <- function(x, y) {...}         # similar to the magrittr pipe

Our functions mimic the corresponding dplyr ones:

CO2 %$%                              # built-in dataset
  filter_r(grepl("[12]", Plant)) %$%
  group_r(Type, Treatment) %$%
  summarize_r(mean(conc), mean(uptake))
         Type  Treatment mean.conc. mean.uptake.
1      Quebec nonchilled        435     34.19286
2 Mississippi nonchilled        435     26.87143
3      Quebec    chilled        435     31.33571
4 Mississippi    chilled        435     15.07143

Implementation

Most of the implementation is not directly related to oshka NSE, but we will go over summarize_r to highlight how those parts integrate with the rest. summarize_r is just a forwarding function:

summarize_r <- function(x, ...)
  eval(bquote(.(summarize_r_l)(.(x), .(substitute(list(...))))), parent.frame())

We use the eval/bquote pattern to forward NSE arguments. We retrieve summarize_r_l from the current function frame with .(), because there is no guarantee we would find it on the search path starting from the parent frame. In this case it happens to be available, but it would not be if these functions were in a package.

We present summarize_r_l in full for reference, but feel free to skip as we highlight the interesting bits next:

summarize_r_l <- function(x, els) {
  frm <- parent.frame()
  exps.sub <- expand(substitute(els), x, frm)
  if(is.null(exps.sub)) x else {
    # compute groups and splits
    grps <- make_grps(x)        # see appendix
    splits <- lapply(grps, eval, x, frm)
    dat.split <- split(x, splits, drop=TRUE)
    grp.split <- if(!is.null(grps)) lapply(splits, split, splits, drop=TRUE)

    # aggregate
    res.list <- lapply(
      dot_list(exps.sub),       # see appendix
      function(exp) lapply(dat.split, eval, expr=exp, enclos=frm)
    )
    list_to_df(res.list, grp.split, splits)   # see appendix
  }
}

The only oshka specific line is the second one:

  exps.sub <- expand(substitute(els), x, frm)

els is the language captured and forwarded by summarize_r. We run expand on that language with our data x as the environment and the parent frame as the enclosure. We then compute the groups:

    grps <- make_grps(x)        # see appendix
    splits <- lapply(grps, eval, x, frm)

make_grps extracts the grouping expressions generating by group_r. These have already been substituted so we evaluate each one with x as the environment and the parent frame as the enclosure. We use this to split our data into groups:

    dat.split <- split(x, splits, drop=TRUE)

Finally we can evaluate our expanded expressions within each of the groups:

    # aggregate
    res.list <- lapply(
      dot_list(exps.sub),       # see appendix
      function(exp) lapply(dat.split, eval, expr=exp, enclos=frm)
    )
    list_to_df(res.list, grp.split, splits)   # see appendix

dot.list turns exps.sub into a list of expressions. Each expression is then evaluated with each data chunk as the environment and the parent frame as the enclosure. Finally list_to_df turns our lists of vectors into a data frame.

You can see the rest of the implementation in the appendix.

Examples

That single expand line enables a programmable NSE:

f.exp <- quote(grepl("[12]", Plant))
s.exp <- quote(mean(uptake))

CO2 %$%
  filter_r(f.exp & conc > 500) %$%
  group_r(Type, Treatment) %$%
  summarize_r(round(s.exp))
         Type  Treatment round.mean.uptake..
1      Quebec nonchilled                  41
2 Mississippi nonchilled                  33
3      Quebec    chilled                  38
4 Mississippi    chilled                  18

Because %$% uses expand you can even do the following:

f.exp.b <- quote(filter_r(grepl("[12]", Plant) & conc > 500))
g.exp.b <- quote(group_r(Type, Treatment))
s.exp.b <- quote(summarize_r(mean(conc), mean(uptake)))
exp <- quote(f.exp.b %$% g.exp.b %$% s.exp.b)

CO2 %$% exp
         Type  Treatment mean.conc. mean.uptake.
1      Quebec nonchilled      837.5       41.150
2 Mississippi nonchilled      837.5       32.625
3      Quebec    chilled      837.5       38.500
4 Mississippi    chilled      837.5       18.050

An Ersatz data.table

Implementation

We wish to re-use our ersatz dplyr functions to create a data.table-like interface:

as.super_df <- function(x) {
  class(x) <- c("super_df", class(x))
  x
}
"[.super_df" <- function(x, i=NULL, j=NULL, by=NULL) {
  frm <- parent.frame() # as per docs, safer to do this here
  x <- as.data.frame(x)
  x <- eval(bquote(.(filter_r)(     .(x),  .(substitute(i)))), frm)
  x <- eval(bquote(.(group_r_l)(    .(x), .(substitute(by)))), frm)
  x <- eval(bquote(.(summarize_r_l)(.(x),  .(substitute(j)))), frm)
  as.super_df(x)
}

Again, we use the eval/bquote pattern to forward the NSE arguments to our NSE functions filter_r, group_r_l, and summarize_r_l. The pattern is not trivial, but it only took six lines of code to transmogrify our faux-dplyr into a faux-data.table.

Examples

After we add the super_df class to our data we can start using it with data.table semantics, but with programmable NSE:

co2 <- as.super_df(CO2)
co2[f.exp, s.exp, by=Type]
         Type mean.uptake.
1      Quebec     32.76429
2 Mississippi     20.97143
exp.a <- quote(max(conc))
exp.b <- quote(min(conc))

co2[f.exp, list(exp.a, exp.b), by=list(Type, Treatment)][1:3,]
         Type  Treatment max.conc. min.conc.
1      Quebec nonchilled      1000        95
2 Mississippi nonchilled      1000        95
3      Quebec    chilled      1000        95
exp.c <- quote(list(exp.a, exp.b))
exp.d <- quote(list(Type, Treatment))

co2[f.exp, exp.c, by=exp.d][1:3,]
         Type  Treatment max.conc. min.conc.
1      Quebec nonchilled      1000        95
2 Mississippi nonchilled      1000        95
3      Quebec    chilled      1000        95

Despite the forwarding layers the symbols resolve as expected in complex circumstances:

exps <- quote(list(stop("boo"), stop("ya")))  # don't use this
g.exp <- quote(Whatever)                         # nor this

local({
  summarize_r_l <- function(x, y) stop("boom")  # nor this
  max.upt <- quote(max(uptake))                 # use this
  min.upt <- quote(min(uptake))                 # and this
  exps <- list(max.upt, min.upt)

  g.exp <- quote(Treatment)                        # and this

  lapply(exps, function(y) co2[f.exp, y, by=g.exp])
})
[[1]]
   Treatment max.uptake.
1 nonchilled        44.3
2    chilled        42.4

[[2]]
   Treatment min.uptake.
1 nonchilled        10.6
2    chilled         7.7

And we can even nest our dplyr and data.table for an unholy abomination:

exp <- quote(data.frame(upt=uptake) %$% summarize_r(new.upt=upt * 1.2))

local({
  exps <- list(quote(sum(exp$new.upt)), quote(sum(uptake)))
  g.exp <- quote(Treatment)
  lapply(exps, function(y) co2[f.exp, y, by=g.exp])
})
[[1]]
   Treatment      V1
1 nonchilled 1025.88
2    chilled  779.64

[[2]]
   Treatment sum.uptake.
1 nonchilled       854.9
2    chilled       649.7

Appendix

Ersatz dplyr implementation:

## - Summarize -----------------------------------------------------------------

summarize_r <- function(x, ...)
  eval(bquote(.(summarize_r_l)(.(x), .(substitute(list(...))))), parent.frame())
summarize_r_l <- function(x, els) {
  frm <- parent.frame()
  exps.sub <- expand(substitute(els), x, frm)
  if(is.null(exps.sub)) x else {
    # compute groups and splits
    grps <- make_grps(x)        # see appendix
    splits <- lapply(grps, eval, x, frm)
    dat.split <- split(x, splits, drop=TRUE)
    grp.split <- if(!is.null(grps)) lapply(splits, split, splits, drop=TRUE)

    # aggregate
    res.list <- lapply(
      dot_list(exps.sub),       # see appendix
      function(exp) lapply(dat.split, eval, expr=exp, enclos=frm)
    )
    list_to_df(res.list, grp.split, splits)   # see appendix
  }
}
## - Grouping ------------------------------------------------------------------ 

group_r <- function(x, ...)
  eval(bquote(.(group_r_l)(.(x), .(substitute(list(...))))), parent.frame())
group_r_l <- function(x, els) {
  exps.sub <- expand(substitute(els), x, parent.frame())
  if(is.null(exps.sub)) x else {
    if(!is.call(exps.sub) || exps.sub[[1L]] != quote(list))
      exps.sub <- call("list", exps.sub)
    structure(x, .GRP=dot_list(exps.sub, "G"))
} }
## - Filtering -----------------------------------------------------------------

filter_r <- function(x, subset) {
  sub.exp <- expand(substitute(subset), x, parent.frame())
  sub.val <- eval(sub.exp, x, parent.frame())
  as.data.frame(
    if(!is.null(sub.val)) {
      as.data.frame(x)[
        if(is.numeric(sub.val)) sub.val else !is.na(sub.val) & sub.val,
      ]
    } else x
  )
}
## - Pipe ----------------------------------------------------------------------

`%$%` <- function(x, y) {
  x.sub <- expand(substitute(x), parent.frame())
  y.sub <- expand(substitute(y), parent.frame())
  y.list <- if(!is.call(y.sub)) list(y.sub) else as.list(y.sub)
  eval(sub_dat(y.sub, x), parent.frame())
}
## - Helper Funs ---------------------------------------------------------------

# Takes result of `substitute(list(...))` and returns a list of quoted language
# object with nice names.

dot_list <- function(x, pre="V") {
  if(!is.call(x) || x[[1L]] != quote(list)) x <- call("list", x)
  dots <- tail(as.list(x), -1L)

  if(is.null(names(dots))) names(dots) <- character(length(dots))
  for(i in seq_along(dots)[!nzchar(names(dots))])
    names(dots)[i] <- if(
      is.language(dots[[i]]) && nchar(deparse(dots[[i]])[[1]]) < 20
    ) deparse(dots[[i]])[[1]] else sprintf("%s%d", pre, i)
  dots
}
# Used by the `%$%` pipe operator to find the correct point in the RHS to
# substitute the forwarded argument in

sub_dat <- function(z, dat) {
  if(is.call(z)) {
    if(z[[1]] == as.name('%$%')) z[[2]] <- sub_dat(z[[2]], dat)
    else {
      z.list <- as.list(z)
      z <- as.call(c(z.list[1], list(dat), tail(z.list, -1)))
  } }
  z
}
# convert the ".GRP" attribute into usable form

make_grps <- function(x)
  if(is.null(attr(x, ".GRP")) || !length(attr(x, ".GRP")))
    list(rep_len(1, nrow(x))) else attr(x, ".GRP")

# Takes result list and makes into a data.frame by recycling elements so they
# are the same length a longest, and also adds in cols for the group vars

list_to_df <- function(dat, grp, splits) {
  lens <- do.call(pmax, lapply(dat, lengths, integer(length(splits))))
  as.data.frame(
    lapply(c(grp, dat), function(x) unname(unlist(Map(rep_len, x, lens))))
  )
}