oshka
We will implement simplified versions of dplyr
and
data.table
to illustrate how to write programmable NSE
functions with oshka
. The implementations are intentionally
limited in functionality, robustness, and speed for the sake of
simplicity.
dplyr
The interface is as follows:
group_r <- function(x, ...) {...} # similar to dplyr::group_by
filter_r <- function(x, subset) {...} # similar to dplyr::filter
summarize_r <- function(x, ...) {...} # similar to dplyr::summarise
`%$%` <- function(x, y) {...} # similar to the magrittr pipe
Our functions mimic the corresponding dplyr
ones:
CO2 %$% # built-in dataset
filter_r(grepl("[12]", Plant)) %$%
group_r(Type, Treatment) %$%
summarize_r(mean(conc), mean(uptake))
Type Treatment mean.conc. mean.uptake.
1 Quebec nonchilled 435 34.19286
2 Mississippi nonchilled 435 26.87143
3 Quebec chilled 435 31.33571
4 Mississippi chilled 435 15.07143
Most of the implementation is not directly related to
oshka
NSE, but we will go over summarize_r
to
highlight how those parts integrate with the rest.
summarize_r
is just a forwarding function:
summarize_r <- function(x, ...)
eval(bquote(.(summarize_r_l)(.(x), .(substitute(list(...))))), parent.frame())
We use the eval
/bquote
pattern to forward
NSE
arguments. We retrieve summarize_r_l
from the current function frame with .()
, because there is
no guarantee we would find it on the search path starting from the
parent frame. In this case it happens to be available, but it would not
be if these functions were in a package.
We present summarize_r_l
in full for reference, but feel
free to skip as we highlight the interesting bits next:
summarize_r_l <- function(x, els) {
frm <- parent.frame()
exps.sub <- expand(substitute(els), x, frm)
if(is.null(exps.sub)) x else {
# compute groups and splits
grps <- make_grps(x) # see appendix
splits <- lapply(grps, eval, x, frm)
dat.split <- split(x, splits, drop=TRUE)
grp.split <- if(!is.null(grps)) lapply(splits, split, splits, drop=TRUE)
# aggregate
res.list <- lapply(
dot_list(exps.sub), # see appendix
function(exp) lapply(dat.split, eval, expr=exp, enclos=frm)
)
list_to_df(res.list, grp.split, splits) # see appendix
}
}
The only oshka
specific line is the second one:
els
is the language captured and forwarded by
summarize_r
. We run expand
on that language
with our data x
as the environment and the parent frame as
the enclosure. We then compute the groups:
make_grps
extracts the grouping expressions generating
by group_r
. These have already been substituted so we
evaluate each one with x
as the environment and the parent
frame as the enclosure. We use this to split our data into groups:
Finally we can evaluate our expand
ed expressions within
each of the groups:
# aggregate
res.list <- lapply(
dot_list(exps.sub), # see appendix
function(exp) lapply(dat.split, eval, expr=exp, enclos=frm)
)
list_to_df(res.list, grp.split, splits) # see appendix
dot.list
turns exps.sub
into a list of
expressions. Each expression is then evaluated with each data chunk as
the environment and the parent frame as the enclosure. Finally
list_to_df
turns our lists of vectors into a data
frame.
You can see the rest of the implementation in the appendix.
That single expand
line enables a programmable NSE:
f.exp <- quote(grepl("[12]", Plant))
s.exp <- quote(mean(uptake))
CO2 %$%
filter_r(f.exp & conc > 500) %$%
group_r(Type, Treatment) %$%
summarize_r(round(s.exp))
Type Treatment round.mean.uptake..
1 Quebec nonchilled 41
2 Mississippi nonchilled 33
3 Quebec chilled 38
4 Mississippi chilled 18
Because %$%
uses expand
you can even do the
following:
f.exp.b <- quote(filter_r(grepl("[12]", Plant) & conc > 500))
g.exp.b <- quote(group_r(Type, Treatment))
s.exp.b <- quote(summarize_r(mean(conc), mean(uptake)))
exp <- quote(f.exp.b %$% g.exp.b %$% s.exp.b)
CO2 %$% exp
Type Treatment mean.conc. mean.uptake.
1 Quebec nonchilled 837.5 41.150
2 Mississippi nonchilled 837.5 32.625
3 Quebec chilled 837.5 38.500
4 Mississippi chilled 837.5 18.050
data.table
We wish to re-use our ersatz dplyr
functions to create a
data.table
-like interface:
as.super_df <- function(x) {
class(x) <- c("super_df", class(x))
x
}
"[.super_df" <- function(x, i=NULL, j=NULL, by=NULL) {
frm <- parent.frame() # as per docs, safer to do this here
x <- as.data.frame(x)
x <- eval(bquote(.(filter_r)( .(x), .(substitute(i)))), frm)
x <- eval(bquote(.(group_r_l)( .(x), .(substitute(by)))), frm)
x <- eval(bquote(.(summarize_r_l)(.(x), .(substitute(j)))), frm)
as.super_df(x)
}
Again, we use the eval
/bquote
pattern to
forward the NSE arguments to our NSE functions filter_r
,
group_r_l
, and summarize_r_l
. The pattern is
not trivial, but it only took six lines of code to transmogrify our
faux-dplyr
into a faux-data.table
.
After we add the super_df
class to our data we can start
using it with data.table
semantics, but with programmable
NSE:
Type mean.uptake.
1 Quebec 32.76429
2 Mississippi 20.97143
exp.a <- quote(max(conc))
exp.b <- quote(min(conc))
co2[f.exp, list(exp.a, exp.b), by=list(Type, Treatment)][1:3,]
Type Treatment max.conc. min.conc.
1 Quebec nonchilled 1000 95
2 Mississippi nonchilled 1000 95
3 Quebec chilled 1000 95
exp.c <- quote(list(exp.a, exp.b))
exp.d <- quote(list(Type, Treatment))
co2[f.exp, exp.c, by=exp.d][1:3,]
Type Treatment max.conc. min.conc.
1 Quebec nonchilled 1000 95
2 Mississippi nonchilled 1000 95
3 Quebec chilled 1000 95
Despite the forwarding layers the symbols resolve as expected in complex circumstances:
exps <- quote(list(stop("boo"), stop("ya"))) # don't use this
g.exp <- quote(Whatever) # nor this
local({
summarize_r_l <- function(x, y) stop("boom") # nor this
max.upt <- quote(max(uptake)) # use this
min.upt <- quote(min(uptake)) # and this
exps <- list(max.upt, min.upt)
g.exp <- quote(Treatment) # and this
lapply(exps, function(y) co2[f.exp, y, by=g.exp])
})
[[1]]
Treatment max.uptake.
1 nonchilled 44.3
2 chilled 42.4
[[2]]
Treatment min.uptake.
1 nonchilled 10.6
2 chilled 7.7
And we can even nest our dplyr
and
data.table
for an unholy abomination:
exp <- quote(data.frame(upt=uptake) %$% summarize_r(new.upt=upt * 1.2))
local({
exps <- list(quote(sum(exp$new.upt)), quote(sum(uptake)))
g.exp <- quote(Treatment)
lapply(exps, function(y) co2[f.exp, y, by=g.exp])
})
[[1]]
Treatment V1
1 nonchilled 1025.88
2 chilled 779.64
[[2]]
Treatment sum.uptake.
1 nonchilled 854.9
2 chilled 649.7
Ersatz dplyr
implementation:
## - Summarize -----------------------------------------------------------------
summarize_r <- function(x, ...)
eval(bquote(.(summarize_r_l)(.(x), .(substitute(list(...))))), parent.frame())
summarize_r_l <- function(x, els) {
frm <- parent.frame()
exps.sub <- expand(substitute(els), x, frm)
if(is.null(exps.sub)) x else {
# compute groups and splits
grps <- make_grps(x) # see appendix
splits <- lapply(grps, eval, x, frm)
dat.split <- split(x, splits, drop=TRUE)
grp.split <- if(!is.null(grps)) lapply(splits, split, splits, drop=TRUE)
# aggregate
res.list <- lapply(
dot_list(exps.sub), # see appendix
function(exp) lapply(dat.split, eval, expr=exp, enclos=frm)
)
list_to_df(res.list, grp.split, splits) # see appendix
}
}
## - Grouping ------------------------------------------------------------------
group_r <- function(x, ...)
eval(bquote(.(group_r_l)(.(x), .(substitute(list(...))))), parent.frame())
group_r_l <- function(x, els) {
exps.sub <- expand(substitute(els), x, parent.frame())
if(is.null(exps.sub)) x else {
if(!is.call(exps.sub) || exps.sub[[1L]] != quote(list))
exps.sub <- call("list", exps.sub)
structure(x, .GRP=dot_list(exps.sub, "G"))
} }
## - Filtering -----------------------------------------------------------------
filter_r <- function(x, subset) {
sub.exp <- expand(substitute(subset), x, parent.frame())
sub.val <- eval(sub.exp, x, parent.frame())
as.data.frame(
if(!is.null(sub.val)) {
as.data.frame(x)[
if(is.numeric(sub.val)) sub.val else !is.na(sub.val) & sub.val,
]
} else x
)
}
## - Pipe ----------------------------------------------------------------------
`%$%` <- function(x, y) {
x.sub <- expand(substitute(x), parent.frame())
y.sub <- expand(substitute(y), parent.frame())
y.list <- if(!is.call(y.sub)) list(y.sub) else as.list(y.sub)
eval(sub_dat(y.sub, x), parent.frame())
}
## - Helper Funs ---------------------------------------------------------------
# Takes result of `substitute(list(...))` and returns a list of quoted language
# object with nice names.
dot_list <- function(x, pre="V") {
if(!is.call(x) || x[[1L]] != quote(list)) x <- call("list", x)
dots <- tail(as.list(x), -1L)
if(is.null(names(dots))) names(dots) <- character(length(dots))
for(i in seq_along(dots)[!nzchar(names(dots))])
names(dots)[i] <- if(
is.language(dots[[i]]) && nchar(deparse(dots[[i]])[[1]]) < 20
) deparse(dots[[i]])[[1]] else sprintf("%s%d", pre, i)
dots
}
# Used by the `%$%` pipe operator to find the correct point in the RHS to
# substitute the forwarded argument in
sub_dat <- function(z, dat) {
if(is.call(z)) {
if(z[[1]] == as.name('%$%')) z[[2]] <- sub_dat(z[[2]], dat)
else {
z.list <- as.list(z)
z <- as.call(c(z.list[1], list(dat), tail(z.list, -1)))
} }
z
}
# convert the ".GRP" attribute into usable form
make_grps <- function(x)
if(is.null(attr(x, ".GRP")) || !length(attr(x, ".GRP")))
list(rep_len(1, nrow(x))) else attr(x, ".GRP")
# Takes result list and makes into a data.frame by recycling elements so they
# are the same length a longest, and also adds in cols for the group vars
list_to_df <- function(dat, grp, splits) {
lens <- do.call(pmax, lapply(dat, lengths, integer(length(splits))))
as.data.frame(
lapply(c(grp, dat), function(x) unname(unlist(Map(rep_len, x, lens))))
)
}