Don’t forget non-tidyverse solutions

David Hugh-Jones
2 min read · Sep 6, 2019

I’m a tidyverse fan. I am not hugely interested in “tidyverse vs. base R” conflicts. I’ve been using R for quite a while; many tidyverse packages (and RStudio) have made it a lot nicer to work in.

I do feel there is one fair point made on the anti-tidyverse side. Sometimes, being too focused on tidyverse solutions can lead people to forget simple base R alternatives. I’ve done this myself.

Here is an easy example from a common task. Suppose you have some survey data with a column age. In the raw data, “-1” means “didn’t answer”, and other negative values are similarly uninformative. You want to update these values to NA.

# tidyverse style
data <- data %>% mutate(age = ifelse(age < 0, NA, age))

That isn’t so bad, but compare it to base R’s much more elegant:

data$age[data$age < 0] <- NA

The dplyr package famously has no way to do selective updates. And it isn’t going to get one either — doubtless a decision that was carefully made. But if you never learned base R, you are going to find tasks like this needlessly complicated.
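
As a minimal sketch of the selective update, here is a toy data frame (the ids and ages are made up for illustration). It uses only base R; the ifelse() line mirrors what the mutate() call computes, applied to a plain vector so the two results can be compared.

```r
# Toy survey data (hypothetical values); negative ages are "no answer" codes.
raw_age <- c(34, -1, 52, -9, 27)
data <- data.frame(id = 1:5, age = raw_age)

# Selective update: pick out the rows to change and assign NA to just those.
data$age[data$age < 0] <- NA

# The ifelse() form rebuilds the whole column instead.
via_ifelse <- ifelse(raw_age < 0, NA, raw_age)

print(data)
print(identical(data$age, via_ifelse))
```

Both forms end up with the same column; the difference is only in how directly the code says “change these rows and leave the rest alone”.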

Here’s a more complex example from a question I asked on SO. Essentially the answer involves checking for duplicates in df$var along these lines:

df <- df %>%
  group_by(var) %>%
  mutate(duplicated = n() > 1) %>%
  ungroup()

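This is… probably not optimal. Base R has a function, duplicated(), which is almost what we need: on its own it flags only the second and later occurrences of a value, so we also run it from the end with fromLast = TRUE. The result is shorter and conveys intent more clearly: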

df$duplicated <- duplicated(df$var) | duplicated(df$var, fromLast = TRUE)
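
A small sketch of why both passes are needed, on a made-up column (the values are purely illustrative). The last lines cross-check the result against group sizes, which is the quantity the n() > 1 version tests.

```r
# Toy data (hypothetical): "a" appears three times, "b" twice, "c" once.
df <- data.frame(var = c("a", "b", "a", "c", "b", "a"))

# duplicated() alone skips the first occurrence of each repeated value.
later_only <- duplicated(df$var)

# Scanning from both ends flags every member of a repeated group.
df$duplicated <- duplicated(df$var) | duplicated(df$var, fromLast = TRUE)

# Cross-check: look up each row's group size and test whether it exceeds 1.
group_size <- table(df$var)[as.character(df$var)]
by_count <- as.vector(group_size > 1)

print(later_only)
print(df$duplicated)
print(identical(df$duplicated, by_count))
```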

Some other examples where people struggle with complex code, for which there are simpler base R solutions:

How can I create a new variable summing several columns in each row?

df$newvar <- rowSums(df[5:10])
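
For a concrete version, here is a sketch on a tiny data frame. The column names q1 to q3 are assumptions made for the example, and columns are selected by name rather than by position. It also shows how a missing value propagates unless you pass na.rm = TRUE.

```r
# Toy data (hypothetical column names); one answer is missing.
df <- data.frame(
  id = 1:3,
  q1 = c(1, 2, 3),
  q2 = c(4, 5, 6),
  q3 = c(7, NA, 9)
)

score_cols <- c("q1", "q2", "q3")

# Row-wise sum across the chosen columns; an NA in a row makes that sum NA.
df$total <- rowSums(df[score_cols])

# na.rm = TRUE sums whatever values are present instead.
df$total_na_rm <- rowSums(df[score_cols], na.rm = TRUE)

print(df)
```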

Various long and complex transformations, with equally long explanations, for doing something in parallel to several columns:

for (i in 1:5) {
  df[i + 20] <- df[i] * df[i + 3] / df[i + 6] * df[i + 9]
}
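
The same idea on a data frame small enough to check by hand. This is only a sketch: the column names (a1, a2, b1, b2, ratio1, ratio2) are invented for the example, and it builds names with paste0() and uses [[ ]] to work on plain vectors instead of relying on column positions.

```r
# Toy data (hypothetical names): two numerator and two denominator columns.
df <- data.frame(
  a1 = c(2, 4), a2 = c(6, 8),
  b1 = c(1, 2), b2 = c(3, 4)
)

# Apply the same calculation to each matching pair of columns.
for (i in 1:2) {
  numerator <- df[[paste0("a", i)]]
  denominator <- df[[paste0("b", i)]]
  df[[paste0("ratio", i)]] <- numerator / denominator
}

print(df)
```

Indexing by name is a little more typing, but it keeps working if columns are later added or reordered.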

Or this:

library(dplyr)  # rowwise() and mutate() come from dplyr
library(purrr)
data %>%
  rowwise() %>%
  mutate(c = lift_vd(mean)(a, b))

(Doesn’t lift_vd sound like a mildly rude Cleric spell?) All of that, when base R offers:

data$c <- rowMeans(data[, c("a", "b")])
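
A quick sanity check on made-up numbers, comparing rowMeans() with the mean worked out by hand. The values are illustrative only.

```r
# Toy data (hypothetical values) with the two columns used in the text.
data <- data.frame(a = c(1, 2, 3), b = c(3, 6, 9))

# Row-wise mean of the selected columns.
data$c <- rowMeans(data[, c("a", "b")])

# The same thing computed directly, for comparison.
by_hand <- (data$a + data$b) / 2

print(data)
print(all.equal(unname(data$c), by_hand))
```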

In all these cases, people are finding awkward, complex ways to express something that is really quite simple — if you know the right base R functions.

As a last example, I have often struggled with barplots in ggplot2 before realising that a simple barplot(table(x)) would do what I wanted. (This has got easier in recent versions with geom_col().)
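
A minimal sketch of that one-liner, split into two steps so the counts can be inspected. The vector x and its values are invented for the example.

```r
# Toy categorical data (hypothetical values).
x <- c("red", "blue", "red", "green", "blue", "red")

# table() counts how often each value occurs.
counts <- table(x)
print(counts)

# barplot() draws one bar per category, with height equal to the count.
barplot(counts, xlab = "colour", ylab = "count")
```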

The tidyverse is really useful, but it is easy to get stuck inside it and forget the rest of base R. When people learn the tidyverse without a solid basic understanding of base R, then the problem is multiplied.

This doesn’t mean that RStudio or the tidyverse guys are doing anything bad. It has no political implications. It’s just a fact about how people are learning R today — a side effect of having a lot of newcomers to the language (which is a good thing!)

Base R is huge and intimidating (and not always beautifully engineered). But there is a subset of it that is pretty useful. Perhaps we should be trying harder to introduce people to that subset.
