Notes on Advanced R
by Lawrence Liu
- R's biggest challenges: most R users are not programmers, which means:
  - Much of the R code is not elegant, fast or easy to understand.
  - R users' knowledge of software engineering is patchy. They don't use source code control or automated testing.
  - Metaprogramming is a double-edged sword.
  - Inconsistency is rife across contributed packages, and even within base R.
  - R is not a particularly fast programming language, and poorly written R code can be terribly slow.
- What I can get from this book:
  - Be familiar with the fundamentals of R.
  - Understand what functional programming means and why it is a useful tool for data analysis.
  - Appreciate the double-edged sword of metaprogramming.
  - Have a good intuition for which operations in R are slow or use a lot of memory.
  - Be comfortable reading and understanding the majority of R code.

I should bear these goals in mind, check them often, and ask myself whether I am really learning these things from this book.
- Meta-techniques:
  - Reading source code.
  - Adopting a scientific mindset.
- The recommended reading looks like a set of good resources; I must read them when I am free.
- R has no 0-dimensional, or scalar, types. The most basic type is the atomic vector.
- Vectors come in two flavors: atomic vectors and lists. They have three common properties:
  - Type, `typeof()`, what it is.
  - Length, `length()`, how many elements it contains.
  - Attributes, `attributes()`, additional arbitrary metadata.
- Use `is.atomic(x) || is.list(x)` to test if an object is actually a vector.
- `c()` is short for combine.
- R considers `db_var <- c(1, 2, 3)` a double vector; if we want to create an integer vector, we should use `int_var <- c(1L, 2L, 3L)`.
- `is.numeric()` is a general test for the "numberliness" of a vector and returns `TRUE` for both integer and double vectors.
- Types from least to most flexible: logical, integer, double, character.
- You will usually get a warning message if the coercion might lose information. Warning in such cases is a good habit to adopt when I develop my own R package.
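The points above about types, length, attributes and coercion can be checked with a short sketch. The variable names `mixed` and `coerced` are only illustrative and do not come from the book.

```r
db_var <- c(1, 2, 3)
int_var <- c(1L, 2L, 3L)

typeof(db_var)      # "double"
typeof(int_var)     # "integer"
length(db_var)      # 3
attributes(db_var)  # NULL, because no attributes have been set

# is.numeric() is TRUE for both integer and double vectors
is.numeric(db_var)
is.numeric(int_var)

# Combining different types coerces to the most flexible one
mixed <- c(TRUE, 1L, 2.5, "a")
typeof(mixed)       # "character"

# A lossy coercion produces NA and raises a warning
coerced <- as.integer(c("1", "b"))
```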
- `is.recursive()` and `is.atomic()`
- When combining a vector and a list, the result is quite different between `c()` and `list()`.
- A list is actually a kind of vector. If we use `is.vector()` to test it, it will return `TRUE`.
- We should use `unlist()` to convert a list to an atomic vector. `as.vector()` doesn't work.
- All objects can have arbitrary additional attributes. Attributes can be thought of as a named list (with unique names).
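A small sketch of the list behavior noted above, with illustrative variable names (`combined`, `nested`, `flat`) that are not from the book:

```r
x <- c(1, 2)
y <- list("a", TRUE)

combined <- c(x, y)     # elements of x become list elements: one flat list
nested   <- list(x, y)  # a list with two elements: the vector x and the list y

length(combined)  # 4
length(nested)    # 2

is.vector(y)      # TRUE: a list is a vector too

flat <- unlist(list(1, 2, 3))     # an atomic (double) vector
typeof(flat)
typeof(as.vector(list(1, 2, 3)))  # still "list"
```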
- By default, most attributes are lost when modifying a vector. The only attributes not lost are the three most important: names (`names()`), dimensions (`dim()`) and class (`class()`).
- `class()` is a special attribute.
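As a minimal illustration of attributes being dropped, here is what happens to a custom attribute and to names when a vector is subset. The attribute name `my_attribute` is just an example.

```r
y <- c(a = 1, b = 2, c = 3)
attr(y, "my_attribute") <- "This is a vector"

attributes(y)     # names and my_attribute
attributes(y[1])  # only names is kept after subsetting
```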
- A factor is a special integer vector; what makes it special are its `class` and `levels` attributes.
- Factors with different levels can't simply be combined. In the R version these notes were written against, combining factors with `c()` coerces them to an integer vector (since R 4.1.0, `c()` instead returns a factor with the union of the levels).
- While factors look (and often behave) like character vectors, they are actually integers.
- Some string methods (like `gsub()` and `grepl()`) will coerce factors to strings, while others (like `nchar()`) will throw an error, and still others (like `c()` before R 4.1.0) will use the underlying integer values. So we should explicitly convert factors to character vectors.
- Matrices and arrays are atomic vectors with a `dim()` attribute.
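The notes on factors and matrices can be sketched as follows; the objects `f` and `m` are illustrative only.

```r
f <- factor(c("a", "b", "a"))
typeof(f)      # "integer"
class(f)       # "factor"
levels(f)      # "a" "b"
as.integer(f)  # the underlying codes: 1 2 1

# Convert explicitly before doing string work; nchar(f) itself is an error
nchar(as.character(f))

# A matrix is an atomic vector with a dim attribute
m <- 1:6
dim(m) <- c(2, 3)
is.matrix(m)   # TRUE
```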
- Subsetting with nothing returns the whole vector: `x[]` returns all elements of `x`.
- `outer()` is a very helpful function.
- You can also subset higher-dimensional data structures with an integer matrix (or, if named, a character matrix). I didn't know this before!
- We can use matrix subsetting and list subsetting to subset a data frame. One thing worth noting is that if I select one single column, matrix subsetting simplifies by default, while list subsetting does not.
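A short sketch of the two subsetting notes above: selecting with an integer matrix, and the difference between list and matrix subsetting on a data frame. The data are made up for illustration.

```r
vals <- outer(1:5, 1:5, FUN = "paste", sep = ",")

# Each row of the index matrix gives one (row, column) position
select <- matrix(c(1, 1,
                   3, 1,
                   2, 4), ncol = 2, byrow = TRUE)
vals[select]    # "1,1" "3,1" "2,4"

df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])
str(df["x"])    # list subsetting: still a data frame
str(df[, "x"])  # matrix subsetting: simplified to a vector
```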
- S3 objects are made up of atomic vectors, arrays and lists, so you can always pull apart an S3 object using the techniques described above and the knowledge you gain from `str()`.
- Three subsetting operators: `[`, `[[` and `$`. `$` is a useful shorthand for `[[` combined with character subsetting.
- `[[` can return only a single value, so we must use it with either a single positive integer or a string.
- If we do supply a vector to `[[`, it indexes recursively.
- S3 and S4 objects can override the standard behavior of `[` and `[[`.
- Simplifying subsetting and preserving subsetting are different, and it's very important to understand those differences.
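The three operators can be compared on a small nested list (an illustrative object, not from the book):

```r
x <- list(a = 1, b = list(c = 2, d = "text"))

x["a"]    # a list of length 1
x[["a"]]  # the number 1 itself
x$a       # same as x[["a"]]

# Supplying a vector to [[ indexes recursively
x[[c("b", "d")]]  # same as x[["b"]][["d"]]
```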
- The behavior of simplifying subsetting varies slightly between different data types, as described below:
  - Atomic vector: removes names.
  - List: returns the object inside the list, not a single-element list.
  - Factor: drops any unused levels. By default, `drop = FALSE`.
  - Matrix or array: if any of the dimensions has length 1, drops that dimension. By default, `drop = TRUE`.
  - Data frame: if the output is a single column, returns a vector instead of a data frame. By default, `drop = TRUE`.
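The cases listed above can be tried one by one. This is only a sketch with made-up objects.

```r
# Atomic vector: [[ removes names
v <- c(a = 1, b = 2)
v[1]
v[[1]]

# Factor: drop = TRUE removes unused levels
z <- factor(c("a", "b"))
levels(z[1])               # "a" "b"
levels(z[1, drop = TRUE])  # "a"

# Matrix: the length-1 dimension is dropped unless drop = FALSE
m <- matrix(1:4, nrow = 2)
dim(m[1, , drop = FALSE])  # 1 2
dim(m[1, ])                # NULL

# Data frame: a single column becomes a vector unless drop = FALSE
df <- data.frame(a = 1:2, b = 1:2)
class(df[, "a", drop = FALSE])
class(df[, "a"])
```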
- `$` is a shorthand operator, where `x$y` is equivalent to `x[["y", exact = FALSE]]`. So `$` and `[[` differ slightly: `$` does partial matching, while `[[` matches names exactly by default.
- All subsetting operators can be combined with assignment to modify selected values of the input vector. Some things worth noting:
  - The length of the LHS needs to match the RHS.
  - There is no checking for duplicated indices.
  - In subassignment, we can't combine integer indices with `NA`, but we can combine logical indices with `NA` (they are treated as `FALSE`).
  - If we use a logical index to subset, its length should match the length of the vector; otherwise, R will recycle the logical index.
  - Subsetting with nothing can be useful in conjunction with assignment because it will preserve the original class and structure.
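A brief sketch of partial matching with `$` and of the subassignment rules above, using made-up objects:

```r
x <- list(abc = 1)
x$a       # 1: $ partially matches "abc"
x[["a"]]  # NULL: [[ matches exactly by default

v <- 1:5
v[c(1, 1)] <- 2:3            # duplicated indices are not checked; the last value wins
v[c(TRUE, FALSE, NA)] <- 1L  # the logical index is recycled; NA is treated as FALSE
v

df <- data.frame(a = 1:2, b = 3:4)
df[] <- lapply(df, as.character)  # subsetting with nothing keeps the data frame
class(df)
```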
- `<<-`: this assignment operator searches the parent environments for the variable named on the LHS and modifies it there.
- Difference between `<-` and `=`: still unknown.
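One common use of `<<-` is keeping state in a closure. The function `make_counter()` below is a hypothetical example, not something taken from the book.

```r
make_counter <- function() {
  count <- 0
  function() {
    # <<- finds count in the enclosing environment and modifies it there
    count <<- count + 1
    count
  }
}

counter <- make_counter()
counter()  # 1
counter()  # 2
```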
- `exists()`: tests if an R object exists.
- `apropos()` and `find()`: look for objects matching the given name.
- `dir()`, `list.files()`, `list.dirs()`: list the files and directories under a directory.
- `basename()`, `dirname()`, `tools::file_ext()`: return the base name, directory name and extension of a file path.
- `file.path()`: connects different parts of a file path with the platform's path separator.
- `file.exists()`: this family of functions provides a low-level interface to the system's file system, and using them is quite straightforward.
- `download.file()`: downloads a file from the Internet.
- Package `downloader` makes it possible to download from https links in R.
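The file helpers listed above can be tried safely in the session's temporary directory. The file name `example_notes.txt` is made up for this sketch.

```r
path <- file.path(tempdir(), "example_notes.txt")  # hypothetical file name
writeLines("hello", path)

file.exists(path)      # TRUE
basename(path)         # "example_notes.txt"
dirname(path)          # the temporary directory
tools::file_ext(path)  # "txt"
list.files(tempdir(), pattern = "example_notes")

unlink(path)           # clean up
```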
- First of all, no style is strictly better than the others; what matters is keeping a consistent style. When collaborating with others, we should discuss and agree on a common style.
- If R files need to be run in sequence, prefix them with numbers
- Place spaces around all infix operators, except `:`, `::` and `:::`. The same rule applies when using `=` in function calls. Always put a space after a comma, and never before.
- Place a space before left parentheses, except in a function call.
- Do not place spaces around code in parentheses or square brackets.
- An opening curly brace should never go on its own line and should always be followed by a new line. A closing curly brace should always go on its own line, unless it's followed by `else`.
- It's OK to leave very short statements on the same line.
- Strive to limit your code to 80 characters per line. If you find yourself running out of room, this is a good indication that you should encapsulate some of the work in a separate function.
- Each line of a comment should begin with the comment symbol and a single space. Comments should explain the why, not the what. And why is that?
- Use commented lines of `-` and `=` to break up your file into easily readable chunks.
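A small snippet that tries to follow the style rules above. The function `trimmed_mean()` and the scenario in the comment are invented purely for illustration.

```r
# Summary statistics -----------------------------------------------------

# Trim the extremes because occasional spikes would distort a plain mean.
trimmed_mean <- function(x, trim = 0.1) {
  if (length(x) == 0) {
    return(NA_real_)
  } else {
    mean(x, trim = trim)
  }
}

average <- trimmed_mean(c(1, 2, 3, 100), trim = 0.25)
```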
- The three components of a function are its body, arguments and environment. We can use `body()`, `formals()` and `environment()` to extract the three components.
- There is one exception to the above rule. Primitive functions, like `sum()`, call C code directly with `.Primitive()` and contain no R code. Their three components are all `NULL`.
- Primitive functions are only found in the `base` package.
- `is.function()` and `is.primitive()` check whether their argument is a (primitive) function. Similarly, every time I name a variable, I can use `exists()` to check if I already gave another variable the same name.
- Scoping is the set of rules that govern how R looks up the value of a symbol.
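Going back to the three components of a function, they can be inspected directly. The function `f` here is a throwaway example.

```r
f <- function(x, y = 2) {
  x + y
}

formals(f)      # the argument list: x and y = 2
body(f)         # the code inside the function
environment(f)  # the environment where f was created

# Primitive functions have none of the three components
is.primitive(sum)
formals(sum)
body(sum)
environment(sum)
```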
- R has two types of scoping: lexical scoping and dynamic scoping.
- There are four basic principles behind R's implementation of lexical scoping:
  - name masking
  - functions vs. variables
  - a fresh start
  - dynamic lookup
- If you are using a name in a context where it's obvious that you want a function (e.g. `f(3)`), R will ignore objects that are not functions while it is searching.
- Every time a function is called, a new environment is created to host execution.
- R uses dynamic lookup: it looks for values when the function is run, not when it's created. This means that the output of a function can be different depending on objects outside its environment.
- However, it's never possible to make a function completely self-contained because you must always rely on functions defined in base R or other packages.
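The scoping principles above can be seen in a few toy functions (the names `f`, `g`, `n` and `o` are arbitrary):

```r
# Name masking and dynamic lookup
x <- 10
f <- function() {
  y <- 1   # y is local to f
  c(x, y)  # x is looked up outside f, at the time f runs
}
f()        # 10 1
x <- 20
f()        # 20 1

# A fresh start: each call gets a new environment, so a never persists
g <- function() {
  if (!exists("a", inherits = FALSE)) {
    a <- 1
  } else {
    a <- a + 1
  }
  a
}
g()
g()        # 1 both times

# Functions vs. variables: in n(n), R skips the non-function n when
# looking for something to call
n <- function(x) x / 2
o <- function() {
  n <- 10
  n(n)
}
o()        # 5
```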
- All standard operators in R are actually functions.
- Everything that exists is an object. Everything that happens is a function call.
- When we pass `+` to `sapply()`, we can write it either as the backquoted name `` `+` `` or as the string `"+"`.
- The formal arguments are a property of the function, whereas the actual or calling arguments can vary each time you call the function.
- When calling a function you can specify arguments by position, by complete name, or by partial name. Arguments are matched first by exact name, then by prefix matching, and finally by position.
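A quick sketch of operators as functions and of the argument matching order. The function `f` and its argument names are invented to make the matching visible.

```r
`+`(1, 2)            # the same call as 1 + 2
sapply(1:3, `+`, 3)  # pass the function object itself
sapply(1:3, "+", 3)  # pass its name; sapply() looks the function up

f <- function(abcdef, bcde1, bcde2) {
  list(a = abcdef, b1 = bcde1, b2 = bcde2)
}
str(f(1, 2, 3))           # all by position
str(f(2, 3, abcdef = 1))  # exact name first, the rest by position
str(f(2, 3, a = 1))       # a unique prefix also matches abcdef
```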
- `do.call()` constructs and executes a function call from a name or a function and a list of arguments to be passed to it.
- Since arguments in R are evaluated lazily (more on that below), the default value can be defined in terms of other arguments. E.g., `f <- function(a = 1, b = a * 2) c(a, b)`.
- Default arguments can even be defined in terms of variables created within the function. This is used frequently in base R functions. E.g., `h <- function(a = 1, b = d) { d <- (a + 1) ^ 2; c(a, b) }`.
- `missing()`: used to test whether a value was specified as an argument to a function.
- By default, R function arguments are lazy - they're only evaluated if they're actually used.
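The following sketch exercises `do.call()`, lazily evaluated defaults and `missing()`. The function `i` is an extra illustrative helper.

```r
# Same as mean(c(1, 2, NA), na.rm = TRUE)
do.call(mean, list(c(1, 2, NA), na.rm = TRUE))

# A default defined in terms of another argument
f <- function(a = 1, b = a * 2) c(a, b)
f()    # 1 2
f(10)  # 10 20

# A default defined in terms of a variable created inside the function;
# this works because b is only evaluated when c(a, b) needs it
h <- function(a = 1, b = d) {
  d <- (a + 1) ^ 2
  c(a, b)
}
h()    # 1 4

i <- function(a, b) c(missing(a), missing(b))
i(b = 2)  # TRUE FALSE
```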
- If you want to ensure that an argument is evaluated you can use `force()`.
- Lazy evaluation is very tricky when using `sapply()` or a loop to create closures.
- Default arguments are evaluated inside the function. This means that if the expression depends on the current environment the results will differ depending on whether you use the default value or explicitly provide one.
- The argument `...` will match any arguments not otherwise matched.
- Infix and prefix functions, and how to create our own infix function.
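A minimal sketch of `force()`, `...` and a user-defined infix function. The names `add`, `count_args` and `%+%` are chosen for this example only.

```r
# force() evaluates x when add() is called, not when the closure first uses it
add <- function(x) {
  force(x)
  function(y) x + y
}
adders <- lapply(1:3, add)
adders[[1]](10)  # 11

# ... collects any arguments not otherwise matched
count_args <- function(...) length(list(...))
count_args(1, "a", TRUE)  # 3

# User-defined infix functions start and end with %
`%+%` <- function(a, b) paste(a, b)
"new" %+% "string"  # "new string"
```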
- Some functions can be used to alter the value of an object; they are called replacement functions.
- Replacement functions actually create a modified copy. `pryr::address()` can be used to find the memory address of the underlying object.
- Built-in functions that are implemented using `.Primitive()` will modify in place. This explains why changing the name of a very large object can be memory-consuming.
- Functions can return `invisible` values, which are not printed by default when you call the function. E.g., the most common function that returns invisibly is `<-`.
- What `library()` does is load a package and modify the search path.
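A short sketch of a replacement function and of an invisible return value. The replacement function `second<-` and the function `f` are illustrative; base R's `withVisible()` is used only to show the visibility flag.

```r
# A replacement function is named `xxx<-` and must return the modified object
`second<-` <- function(x, value) {
  x[2] <- value
  x
}
x <- 1:5
second(x) <- 10L
x        # 1 10 3 4 5

f <- function() invisible(42)
f()      # prints nothing
(f())    # wrapping in parentheses forces printing
withVisible(f())$visible  # FALSE
```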
- The job of an environment is to associate, or bind, a set of names to a set of values. You can think of an environment as a bag of names. Each name points to an object stored elsewhere in memory.
- If an object has no names pointing to it, it gets automatically deleted by the garbage collector (`gc`).
- Every environment has a parent except the empty environment. But given an environment, we have no way to find its children.
- Generally an environment is similar to a list, with four important exceptions:
  - Every object in an environment has a unique name.
  - The objects in an environment are not ordered.
  - An environment has a parent.
  - Environments have reference semantics.
- Four special environments: `globalenv()`, `baseenv()`, `emptyenv()`, `environment()`.
- The default parent provided by `new.env()` is the environment from which it is called.
- The easiest way to modify the bindings in an environment is to treat it like a list.
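The list-like interface and the reference semantics of environments can be sketched like this. The helper `modify()` is hypothetical.

```r
e <- new.env()
parent.env(e)  # the environment new.env() was called from

# Bindings can be modified much like list elements
e$a <- 1
e[["b"]] <- 2
assign("c", 3, envir = e)
ls(e)

# Reference semantics: the function changes e itself, not a copy
modify <- function(env) {
  env$a <- 100
  invisible(NULL)
}
modify(e)
e$a  # 100
```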
- By default, `ls()` only shows names that don't begin with `.`.
- To compare environments, you must use `identical()`, not `==`.
- The enclosing environment is the environment where the function was created. Every function has one and only one enclosing environment. We can find the enclosing environment of a function by calling `environment()` with a function as its first argument.
- The enclosing environment determines how the function finds values; the binding environments determine how we find the function.
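A closing sketch for the last few notes: hidden names in `ls()`, comparing environments, and looking up a function's enclosing environment. The names `e`, `make_f` and `f` are illustrative.

```r
e <- new.env()
e$.hidden <- 1
e$visible <- 2
ls(e)                    # "visible"
ls(e, all.names = TRUE)  # also lists .hidden

# Compare environments with identical(); == on environments is an error
identical(globalenv(), emptyenv())  # FALSE

# The enclosing environment of f is the execution environment of make_f()
make_f <- function() {
  function() NULL
}
f <- make_f()
environment(f)
identical(environment(f), globalenv())  # FALSE
```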