Rookie R mistakes
Here are some simple mistakes inexperienced R programmers make.
Thinking that c() creates a vector
When you read R code, you see stuff like c(1, 2, 3) a lot. So, obviously that’s how you make a vector, right? Then you write stuff like
if (c(2) + c(2) == c(4)) ...
This isn’t necessary. c() just concatenates vectors. In R, all basic data types are vectors already. 1:10 is a vector, and so is plain old 1. You can just write
if (2 + 2 == 4) ...
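You can check this at the console. The following is a minimal sketch; the variable name x is made up for illustration.

```r
# A bare number is already a vector of length 1.
x <- 2
is.vector(x)          # TRUE
length(x)             # 1
identical(c(2), 2)    # TRUE: wrapping a single value in c() changes nothing

# c() earns its keep when it joins several vectors into one.
c(1:3, 10, c(20, 30))
```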
Not putting in spaces
R is punctuation-heavy. If you write code like this:
data[1:max(top,length(var)),]<-vec[vec>=6]
then you will be in pain when you reread it. Use spaces to make your code legible:
data[1:max(top, length(var)), ] <- vec[vec >= 6]
Writing for loops instead of using vectors
You don’t need to do this:
for (i in 1:length(foo)) {
foo[i] <- foo[i] * 2
}
Just do this:
foo <- foo * 2
Everything is a vector in R.
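Here is a small sketch showing that the two versions give the same answer. The values in foo are made up. It also shows seq_along(), which is a safer way to get indices than 1:length() on the occasions when you really do need a loop.

```r
foo <- c(1, 5, 10)

# Loop version, as above.
doubled_loop <- foo
for (i in seq_along(doubled_loop)) {
  doubled_loop[i] <- doubled_loop[i] * 2
}

# Vectorised version: arithmetic applies to every element at once.
doubled_vec <- foo * 2
identical(doubled_loop, doubled_vec)   # TRUE

# Comparisons and most maths functions are vectorised too.
foo > 4        # FALSE TRUE TRUE
sqrt(foo)

# If you do need indices, seq_along() is safer than 1:length():
empty <- numeric(0)
1:length(empty)     # 1 0, so a loop over it would run twice
seq_along(empty)    # integer(0), so a loop over it would not run at all
```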
Using lapply etc. instead of for loops
More advanced beginners have learned that “for loops are bad” and that there’s this useful function lapply which you can use instead of a for loop. They put these facts together and write code like:
lapply(foo, function (x) cat("This element is ", x, "\n"))
Then they get a printout of [[1]] NULL, [[2]] NULL, etc., and wonder what they did wrong.
Yes, for loops can be bad, and yes, lapply is helpful, but lapply doesn’t replace for loops. In fact, lapply itself uses a for loop under the hood, in its C implementation! If you can rewrite something just using vectors, that’s great. If you can’t, then a for loop is fine: it’s easy to understand. In addition, for loops are good for side effects:
for (x in foo) cat("This element is ", x, "\n")
whereas lapply and friends are best used for the value they return:
l <- list(a = 1, b = 2:3, c = 4:6)
means_of_l <- lapply(l, mean) # mean of each list element
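The NULLs in the printout come from cat(), which prints as a side effect and returns NULL. The sketch below shows that, and then shows vapply(), one of lapply’s “friends”, which returns a plain vector rather than a list. The names foo, res and means_vec are made up for illustration.

```r
foo <- c("a", "b")

# cat() prints as a side effect and returns NULL, so lapply() collects NULLs.
res <- lapply(foo, function(x) cat("This element is", x, "\n"))
str(res)   # a list of 2 NULLs

# When you want values back, lapply() gives a list ...
l <- list(a = 1, b = 2:3, c = 4:6)
means_of_l <- lapply(l, mean)

# ... and vapply() gives a plain vector, checking that each result
# is a single number.
means_vec <- vapply(l, mean, numeric(1))
means_vec   # a = 1, b = 2.5, c = 5
```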
Learning the tidyverse without understanding basic R
The tidyverse is awesome, but it doesn’t do everything. Here’s a simple base R operation that is complex with dplyr: changing a single column in a data frame, conditional on its existing value.
data$col[data$col < 0] <- NA
In dplyr this would be
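library(dplyr)  # provides %>% and mutate()
# assign the result back, or data itself is left unchanged
data <- data %>% mutate(
  col = ifelse(col < 0, NA, col)
)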
which is longer and harder to understand. There are other cases too.
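If the base R line looks cryptic, it helps to run it on a tiny example. This is a sketch with a made-up data frame; only the column name col comes from the text above.

```r
# A made-up data frame, purely for illustration.
data <- data.frame(id = 1:4, col = c(3, -1, 0, -7))

# Read it inside-out: pick the elements of col that are negative,
# then assign NA to just those elements.
data$col[data$col < 0] <- NA
data$col   # 3 NA 0 NA
```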
In addition, if you don’t understand base R syntax, you will get lost quickly when you read someone else’s code. Here’s the skinny:
data$col # column 'col' in data frame 'data'
data[["col"]] # the same, but you can use a variable
data[[var]] # uses the value of var to pick a column
data[1:3, 2:4] # rows 1-3 and columns 2-4
data[1:3, ] # rows 1-3 and all the columns
data[ , 2:4] # all the rows and columns 2-4
data[data$col < 5, ] # all the rows where 'col' is less than 5
data[data$col < 5, "col2"] # column 'col2' from rows where col < 5
This syntax is sometimes ugly, but it’s compact and powerful. If you want the full details, read the documentation in ?Extract.
There are some power user tips here.
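To make the subsetting forms above concrete, here is a sketch run on a made-up data frame. It also shows one thing that often trips people up: picking a single column with [ gives you back a vector, not a data frame, unless you ask for drop = FALSE.

```r
# A made-up data frame, purely for illustration.
data <- data.frame(
  col  = c(2, 7, 4, 9),
  col2 = c("w", "x", "y", "z"),
  stringsAsFactors = FALSE
)
var <- "col2"

data[[var]]                  # same as data$col2
data[data$col < 5, ]         # rows 1 and 3, still a data frame
data[data$col < 5, "col2"]   # "w" "y": a single column comes back as a vector

# Add drop = FALSE to keep a one-column data frame instead.
data[data$col < 5, "col2", drop = FALSE]
```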
Similarly, you should know the difference between a list and an (atomic) vector, and the basic R data types like logical, numeric, integer and character. Read the basic R manual, or at least the early chapters. It’s not as well-written as more modern documentation, but it does teach you the basics. When you are ready, and need more depth, read Hadley Wickham’s Advanced R.
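As a quick taste of the list versus atomic vector difference, here is a minimal sketch. The names v and l are made up for illustration.

```r
# An atomic vector holds one type; mixing types silently coerces.
v <- c(1, "a", TRUE)
typeof(v)    # "character"
v            # "1" "a" "TRUE"

# A list can hold anything, including other vectors.
l <- list(1, "a", TRUE, 1:3)
sapply(l, typeof)   # "double" "character" "logical" "integer"

# The basic types mentioned above:
typeof(TRUE)    # "logical"
typeof(1L)      # "integer"
typeof(1)       # "double"; class(1) reports "numeric"
typeof("a")     # "character"

# [ keeps the container, [[ pulls out the element.
l[4]     # a list of length 1
l[[4]]   # the integer vector 1:3
```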
Not reading vignettes
R documentation is usually very exact, but not very beginner-friendly. It says what each function does, but doesn’t give you an overview of how to do a given task. I’d been using R for 10 years before I realised what vignettes even were. This made my journey unnecessarily hard.
Vignettes are broad overviews of a particular R package. You can find them by browsing the help. Start with e.g. ?dplyr::filter and go to the bottom of the page. You’ll see a footer that names the package and its version and ends with an “Index” link. Click “Index” to see the package help index. If there’s a “User guides, package vignettes and other documentation” link there, then you’re in luck. Click it and you’ll see a list of vignettes. Tidyverse vignettes are particularly well-written.
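You can also get at vignettes from the console with vignette() and browseVignettes(), both in the utils package that ships with R. The sketch below uses the grid package only because it comes with R itself; swap in whatever package you are learning. Whether any vignettes show up depends on how your copy of R and its packages were built.

```r
# List the vignettes that ship with one installed package.
v <- vignette(package = "grid")
print(v$results[, c("Item", "Title"), drop = FALSE])

# Opening a vignette launches a viewer, so only do it in interactive use.
if (interactive()) {
  vignette("grid", package = "grid")   # open one vignette by name
  browseVignettes("grid")              # the same list as a clickable web page
}
```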
Not using debug
When something goes wrong, you’re likely to get a cryptic error message. Many people give up at this stage. What the hell does “object of type 'closure' is not subsettable” mean? “$ operator is invalid for atomic vectors”? Time to post a woebegone message on Stack Overflow.
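For what it’s worth, both of those messages are easy to reproduce. The sketch below does so deliberately; err_msg is a made-up helper for this example only, which runs an expression and hands back the error message instead of stopping.

```r
# Helper for this example only: return the error message an expression raises.
err_msg <- function(expr) {
  tryCatch(expr, error = function(e) conditionMessage(e))
}

# "closure" is R's word for a function. This error usually means you
# indexed a function, often because a variable name clashes with one.
err_msg(mean[1])

# $ works on lists and data frames, not on atomic vectors.
x <- c(a = 1, b = 2)
err_msg(x$a)

# For a named atomic vector, use [[ or [ instead.
x[["a"]]   # 1
```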
R has a great interactive debugger and you can use it to see what is going wrong. Here’s an example:
> factorial(foo)
Error in x + 1 : non-numeric argument to binary operator
What does that mean? Who said anything about x? Let’s fire up the debugger:
> debug(factorial)
> factorial(foo)
debugging in: factorial(foo)
debug: gamma(x + 1)
Browse[2]>
The line after debug: shows you the next line of code. Aha! That’s where it says x+1.
You can also look at the body of the factorial function, to see where x was. Just type in factorial with no brackets at the command line:
Browse[2]> factorial
function (x)
gamma(x + 1)
<bytecode: 0x7fd3882b3e68>
<environment: namespace:base>
OK, so x was the argument you passed in! x was really foo in disguise all along. (Insert Scooby Doo meme here.) You can also evaluate statements in the debugger:
Browse[2]> foo
[1] "a"
OK, now we know what went wrong. Type n to evaluate the next line of code:
Browse[2]> n
Error in x + 1 : non-numeric argument to binary operator
Indeed, the error gets thrown and you are dumped back to the main command line. Now you can fix it:
> foo <- 10
> undebug(factorial) # do this or you'll go back into the debugger
> factorial(foo)
[1] 3628800
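The same workflow applies to your own functions. Here is a minimal sketch using a made-up function called half. It also shows debugonce(), which flags a function for debugging for a single call, so you don’t have to remember undebug().

```r
# A made-up function standing in for your own code.
half <- function(x) {
  x / 2
}

debug(half)
isdebugged(half)     # TRUE: the next call would stop in the debugger
undebug(half)
isdebugged(half)     # FALSE

# The debugger needs a console, so only trigger it in interactive use.
if (interactive()) {
  debugonce(half)    # debug the next call only
  try(half("a"))     # steps through half(), then reports the error
}

# At the console, calling traceback() right after an error shows the
# chain of calls that led to it.
```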
That’s all
I’m sure there are other errors, but I hope these were useful. I think I made all of these at some point. Learn from my mistakes.