case_when() is a code smell and across() is a code smell too
I’ve written before about the risks of using dplyr without knowing base R. I now feel I can be more specific:
- dplyr::case_when() is a code smell.
- So is dplyr::across().
Evidence
Uuuurgh! To be clear, that isn’t a StackOverflow question — it’s an answer.
Or this horror:
var <- quo(carb)
mtcars %>%
mutate(cg = case_when(!!var <= 2 ~ "low",
!!var > 2 ~ "high"))
This poor sod is trying to do (with var <- "carb"):
mtcars$cg <- ifelse(mtcars[[var]] <= 2, "low", "high")
Or this, when all that’s needed is car::recode() or dplyr::recode().
Or this:
library(tidyverse)
df <- as_tibble(iris)
df_new <- df %>%
  mutate(across(1:4, ~ case_when(. < 7 ~ . * 10, . >= 7 ~ . * 5)))
which, translated out of the language of the Old Ones, means:
df_new <- df
# ifelse() wants vectors, not data frames, so apply it column by column
df_new[1:4] <- lapply(df_new[1:4], function(x) ifelse(x < 7, x * 10, x * 5))
These examples were not especially cherry-picked. They all come from the top search results for [r] case_when on SO. You can find plenty more like this.
It’s not just the questions
The problem here is not just that case_when() is used by newbies who don't understand how to use cut() or recode(). That would be fine. Beginner-friendly functions and paradigms are good.
The problem is that less-newbies then give them terrible advice, and unreadable, unmaintainable code propagates through the R universe.
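For readers who have not met them, here is a minimal sketch of what the cut() and recode() mentioned above buy you, using only base R. The named lookup vector is a stand-in for recode(), not a call to car or dplyr, and the column choices (mtcars$carb, mtcars$gear) and labels are assumptions made for illustration.

```r
# Binning a numeric column into labelled ranges with cut().
# The intervals are (-Inf, 2] and (2, Inf), so carb <= 2 is "low".
carb_group <- cut(mtcars$carb,
                  breaks = c(-Inf, 2, Inf),
                  labels = c("low", "high"))

# Recoding discrete values with a named lookup vector (base R only).
# The labels here are invented for the example.
gear_names <- c("3" = "three", "4" = "four", "5" = "five")
gear_label <- unname(gear_names[as.character(mtcars$gear)])

table(carb_group)
table(gear_label)
```

Neither call needs a formula, a quosure, or a condition per branch, which is roughly the point being made about reaching for case_when() first.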
The source of the smell
There are two underlying issues.
Problem 1: base R has no nice way of not repeating yourself when you change part of a data frame
Look again at this example.
df_new[1:4] <- lapply(df_new[1:4], function(x) ifelse(x < 7, x * 10, x * 5))
df_new gets repeated twice, even in this short form. What if it's a more informative, longer variable name? What if you have a condition on the rows?
df_new[df_new$foo < 12, 1:4] <- lapply(
  df_new[df_new$foo < 12, 1:4],
  function(x) ifelse(x < 7, x * 10, x * 5)
)
We are now up to a paragraph of code for a single operation, and good luck changing 12 to 11 without forgetting one instance. (Thank God for RStudio’s multiple cursors, but that’s a sticking plaster. Or there’s with() … but it does not seem that with() is newbie-friendly - perhaps it's too abstract?)
Of course you wouldn’t write this, but that’s the point — why do you need workarounds for something simple?
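One such workaround, sketched here under assumptions, is to store the row and column selectors once and reuse them. In this sketch iris stands in for df_new, and Petal.Width < 1 stands in for the hypothetical foo < 12 condition; neither choice comes from the examples above.

```r
df_new <- iris

# Store each selector once, so a later change only has to be made in one place.
rows <- df_new$Petal.Width < 1   # stand-in for the hypothetical `foo < 12`
cols <- 1:4

# ifelse() works on vectors, so it is applied to each selected column in turn.
df_new[rows, cols] <- lapply(
  df_new[rows, cols],
  function(x) ifelse(x < 7, x * 10, x * 5)
)

head(df_new)
```

It works, but it costs two extra variables and the data frame name still appears three times, which is the complaint.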
Problem 2: dplyr has no nice way of changing part of a data frame at all
To solve this problem we got dplyr, which, to be clear, is a fantastic piece of software, and foundational to the modern R landscape. dplyr did two magical things. First, group_by() let you find per-group summary statistics with a single command. Second, mutate() let you change columns without repeating yourself:
df %>% mutate(my_column = my_column * 10)
That is a big improvement over df_new$my_column <- df_new$my_column * 10.
However, dplyr has made a hard pass on mutating anything but the whole data frame. In fact this is where case_when() came from. (See the numerous closed requests for mutate_where() marked as duplicates of that issue.) That is probably to do with the need to make dplyr work with databases. But for the ordinary R user, it causes problems.
Changing part of a data frame is a very common use case. For base R it’s easy, but ugly:
df$foo[df$foo < 10] <- df$foo[df$foo < 10] + 1
For the tidyverse, users learn to think in terms of ifelse() (or if_else()) and case_when():
df %>% mutate(foo = ifelse(foo < 10, foo + 1, foo))
# and for anything more complex:
df %>% mutate(foo = case_when(
  foo < 10 ~ foo + 1,
  foo >= 10 & foo < 15 ~ foo + 2,
  TRUE ~ foo
))
and whoops, we’re into ugly land.
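For comparison, the three-branch update above can be sketched in base R as a lookup rather than a chain of conditions. This uses findInterval(), which is a swapped-in technique and not something the code above does; the vector foo is made-up data.

```r
foo <- c(3, 9, 10, 14, 15, 20)   # made-up values for illustration

# findInterval() returns 0 for foo < 10, 1 for 10 <= foo < 15, 2 for foo >= 15.
bracket <- findInterval(foo, c(10, 15)) + 1

# One increment per bracket: +1, +2, and unchanged.
increment <- c(1, 2, 0)[bracket]

foo_new <- foo + increment
foo_new
# 4 10 12 16 15 20
```

Whether that is prettier is debatable; it does at least keep the thresholds and the increments in one place each.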
Across? Yes, it does make me across
case_when() code is often ugly, but at least it's comprehensible. Now we get into mutating multiple columns, and oh boy. Here, the goal is just to do something like:
df[rows, columns] <- do_something_to(df[rows, columns])
In base R it’s repetitive, but you can see what’s going on:
df_new[df_new$foo < 10, 1:4] <- df_new[df_new$foo < 10, 1:4] + 1
In dplyr it's:
df_new %>% mutate(across(1:4, ~ifelse(.$foo < 10, .x + 1, .x)))
At least, I think that’s what it is. I honestly am not sure whether that works or not, because I can’t remember the syntax, which includes the across() function, the tidyverse formula-as-function interface, and use of . to refer to the whole dataframe.
Oops, no, that didn’t work. I needed to do
df_new %>% mutate(across(1:4, ~ifelse(foo < 10, .x + 1, .x)))
Given that I struggle to do something this simple, how do you think beginners cope with anything more complex? Now read on, and enjoy code such as:
test %>%
mutate(across(c(bob, sally, rita),
~ case_when(. > baseline ~ baseline,
. <= baseline ~ .)))
which features two different abuses of the formula operator, or
df %>%
mutate(map2_df(across(ends_with('_nom'), .names = '{col}_val'),
across(ends_with('_wt')), ~as.integer(.x <= .y)))
which, do you understand? I don’t.
Again, those are not cherry-picked. They are three of the top seven search results on SO for [r] case_when across, and the other four were not using across() the function. And they're not from questions, they're from answers. What about when people start writing functions to do this stuff? Oh boy, that'll be fun!!!
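For contrast with the across() examples, the plain df[rows, columns] <- do_something_to(df[rows, columns]) goal from earlier in this section can be sketched as a small base R helper. update_block() is a hypothetical name, not a function from any package, and the toy data frame is invented for the example.

```r
# Hypothetical helper: apply `f` to the block df[rows, cols] and write it back.
# `f` receives a data frame and must return something of the same shape.
update_block <- function(df, rows, cols, f, ...) {
  df[rows, cols] <- f(df[rows, cols, drop = FALSE], ...)
  df
}

# Toy data, invented for illustration.
df <- data.frame(foo = c(5, 12, 8), a = 1:3, b = 4:6)

# Add 1 to columns a and b, but only in rows where foo < 10.
out <- update_block(df, df$foo < 10, c("a", "b"), function(block) block + 1)
out
```

The row and column selectors are each written once, and the function being applied is an ordinary function of a data frame, with no formula shorthand involved.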
Solutions
I don’t know if there are easy solutions to all this. across() is often genuinely useful and so is case_when(). They arose in the context of dplyr's way of naming columns - an intuitive idiom that has proven its worth. But they are also powerful footguns for the newbie, leading to code that is repetitive and ugly, or worse, incomprehensible.
I think we have to go to the root. It’s time for base R to have a version of magrittr’s double-headed pipe, %<>%. The double-headed or "assignment" pipe is one of R programming's best-kept secrets. It allows you to write:
df %<>% mutate(foo = foo + 1)
instead of
df <- df %>% mutate(foo = foo + 1)
Of course, this pays off double when the left hand side is more complex:
iris[iris$Species=="virginica", "Sepal.Length"] %<>% `*`(2)
(Yeah, I just used * as a function; yes, I know it’s not beginner-friendly.) At one swoop this gets rid of a ton of duplication.
R has already introduced the base R pipe |>, with a nice-looking syntax and fast implementation. So, why not <|>? This could just be a syntax transformation, where
a <|> foo(1)
becomes:
a <- foo(a, 1)
That does a lot from the point of view of base R. It leaves some issues unsolved. In particular, it doesn’t work inside mutate():
df %>% mutate(
a <|> foo() # nope
)
But it does allow
df$a <|> foo()
or:
df[df$a < 10, 1:4] <|> foo()
which might be attractive enough to bring people back from across() hell.
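The rewrite described above, where a <|> foo(1) becomes a <- foo(a, 1), can be approximated today with a user-defined infix operator. This is only a sketch of the idea: the name %<|>% is hypothetical, it is not how R-core or magrittr would implement anything, and it evaluates the left-hand side expression twice, just as the hand-written form does.

```r
# Hypothetical operator: rewrite `lhs %<|>% f(args)` as `lhs <- f(lhs, args)`
# and evaluate that assignment in the caller's environment.
`%<|>%` <- function(lhs, rhs) {
  target <- substitute(lhs)    # the unevaluated left-hand side, e.g. v[v < 10]
  fcall  <- substitute(rhs)    # the unevaluated call, e.g. round(1)
  if (!is.call(fcall)) stop("right-hand side must be a function call")

  # Insert the target as the first argument of the call.
  new_call <- as.call(c(list(fcall[[1]], target), as.list(fcall)[-1]))

  # Build `target <- new_call` and run it where the operator was used.
  invisible(eval(call("<-", target, new_call), parent.frame()))
}

x <- 3.14159
x %<|>% round(1)        # same as x <- round(x, 1)

v <- c(5, 12, 8)
v[v < 10] %<|>% `+`(1)  # same as v[v < 10] <- v[v < 10] + 1
```

Like the proposed <|>, this does nothing useful inside mutate(), because it assigns in the calling environment rather than to a column.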
More radically, maybe it’s time to bring data.table style addressing into base R? In data.table you can just do:
ans <- flights[origin == "JFK" & month == 6]
and it knows that origin and month are columns of flights.
Meanwhile
I am now fully speculating on what R-core might and/or should do, without any ability to contribute myself. I really hope for an assignment pipe. In the meantime, just as the community recognized sapply() as a code smell, and we got vapply() as a solution, we should recognize that case_when() and across() are code smells. We should warn new R users about misusing them. And the very smart people working on the tidyverse and on base R should think about the problem.