Before getting into actual R code, we’ll start with a few notes about how to use it most effectively. Bad coding habits can make your R code difficult to read and understand, so hopefully these tips will ensure you have good habits right from the start.
2.1 Use RStudio’s “projects” feature
Every project you do in R should be set up in its own folder, through RStudio’s “projects” feature. To start a fresh project, go to File -> New project and create a new folder. Reopen the project in RStudio whenever you want to work with it.
When sending your work to other people, you can send them the whole folder and know that they’ll have access to all the required files and scripts.
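One of the main benefits is that this sets R’s “working directory” to the project folder by default. Any files you load or save can just be referenced relative to that folder, so instead of typing out the full path to a file on your machine, you can just use the file name (or its path within the project folder).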
This will also work for anyone you send the project to, so you don’t have to worry that the file is in a different location on their machine.
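As a rough illustration of the difference between the two kinds of path, the sketch below uses a temporary folder as a stand-in for a project folder. The folder name `my_project` and the file name `survey_data.csv` are made up for the example, and the `setwd()` call only imitates what RStudio does for you when you open a project.

```r
# Stand-in for a project folder; in practice RStudio sets the working
# directory for you when you open the project (folder name is hypothetical)
project_dir <- file.path(tempdir(), "my_project")
dir.create(project_dir, showWarnings = FALSE)
old_wd <- setwd(project_dir)

# Create a small example file so the snippet runs on its own
write.csv(data.frame(ID = 1:2, AnxietyTotal = c(6, 8)),
          "survey_data.csv", row.names = FALSE)

# Absolute path: tied to where the folder lives on this particular machine
from_absolute <- read.csv(file.path(project_dir, "survey_data.csv"))

# Relative path: works wherever the project folder ends up
from_relative <- read.csv("survey_data.csv")

# Both calls read the same file
identical(from_absolute, from_relative)

setwd(old_wd)  # restore the previous working directory
```

In a real project you wouldn’t call `setwd()` yourself; opening the project in RStudio is what makes the short relative path work.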
2.2 Using scripts
It’s best to put every step of your data cleaning and analysis in a script that you save, rather than making temporary changes in the console.
Ideally, this will mean that you (or anyone else) can run the script from top to bottom, and get the same results every time, i.e. they’re reproducible.
2.2.1 Script layout
Most R scripts I write have the same basic layout:
- Loading the libraries I’m using
- Loading the data
- Changing or analysing the data
- Saving the results or the recoded data file
For a larger project, it’s good to create multiple different scripts for each stage, e.g. one file to recode the data, one to run the analyses.
When saving the recoded data, it’s best to save it as a different file - you keep the raw data, and you can recreate the recoded data exactly by rerunning your script.
R won’t overwrite your data files when you change your data, unless you specifically ask it to. When you load a file into R, it lives in R’s ‘short-term memory’, and doesn’t maintain any connection to the file on disk. It’s only when you explicitly save to a file that those changes become permanent.
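The layout described above might look something like the following sketch. It is only an illustration: the file names are hypothetical, the built-in `iris` data is written to a temporary file to stand in for a raw data file, and the recoding step is invented.

```r
# 1. Load the libraries (only a package that ships with R is used here)
library(stats)

# 2. Load the data
# iris is written to a temporary file to stand in for a raw data file
raw_path <- file.path(tempdir(), "raw_data.csv")          # hypothetical name
recoded_path <- file.path(tempdir(), "recoded_data.csv")  # hypothetical name
write.csv(iris, raw_path, row.names = FALSE)
flowers <- read.csv(raw_path)

# 3. Change or analyse the data
# This only changes the copy in memory; the raw file on disk is untouched
flowers$LongPetal <- flowers$Petal.Length > 4
petal_model <- lm(Petal.Length ~ Sepal.Length, data = flowers)

# 4. Save the recoded data as a different file, keeping the raw data as it was
write.csv(flowers, recoded_path, row.names = FALSE)

# The raw file still has only the original columns
names(read.csv(raw_path))
```

Because the recoded data goes to its own file, rerunning the script from the top always starts from the same raw data.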
2.2.2 A tip for better reproducibility
By default, RStudio will save your current “workspace” when you quit. While convenient, this can mean that you make one-off changes to your data and forget to save that command in your script. Starting with a fresh session every time you open RStudio means you’ll learn to keep every step of your analysis in your script, and you’ll know that you can get back to where you were by rerunning the script.
To disable the default setting, go to Tools -> Global Options... and, in the General tab:
- Uncheck “Restore .RData”
- Set “Save workspace on exit” to “Never”
[Figure: RStudio’s “save workspace” settings]
2.3 Long or wide data
R works better with long data, whereas SPSS generally works better with wide data. Roughly speaking, long data means:
- There is one “observation” of each participant/subject per row - e.g. one survey, one session
- All the measurements of the same type are in one column - so all the K6 scores, from multiple surveys, would be in a single `K6` column, with an additional `Time` or `Survey` column that identifies which survey they come from.
Example long data:
ID | Survey | Drinking | AnxietyTotal |
---|---|---|---|
1 | 1 | No | 6 |
1 | 2 | Yes | 8 |
2 | 1 | No | 5 |
2 | 2 | No | 7 |
Example wide data:
ID | Drinking_T1 | Drinking_T2 | AnxietyTotal_T1 | AnxietyTotal_T2 |
---|---|---|---|---|
1 | No | Yes | 6 | 8 |
2 | No | No | 5 | 7 |
2.3.1 Going from wide to long
If your data is currently in wide format, you may have to reshape it before working with it in R. The new `pivot_longer` and `pivot_wider` functions from the `tidyr` package are good for reshaping data like this. To go from the wide data above to a long dataset:
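    library(tidyr)  # provides pivot_longer() and the %>% pipe

    # Split each column name at "_": the first part names the output column,
    # the second part goes into a new Time column
    wide %>%
        pivot_longer(
            cols = Drinking_T1:AnxietyTotal_T2,
            names_to = c(".value", "Time"),
            names_sep = "_"
        )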
This can get more complicated if the columns haven’t been named consistently, or have multiple pieces of info stored in the name (e.g. `t1_male_mean`).
At the time of writing, the `pivot` functions were still in beta; they have since been released as part of `tidyr` 1.0.0.
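To try the reshaping on the example tables from this section, the sketch below builds the wide table by hand, converts it to long format, and converts it back with `pivot_wider`. The second half shows one way to handle column names that carry several pieces of information; the `summary_wide` table and its values are invented for illustration.

```r
library(tidyr)

# The wide example table from this section
wide <- data.frame(
    ID = c(1, 2),
    Drinking_T1 = c("No", "No"),
    Drinking_T2 = c("Yes", "No"),
    AnxietyTotal_T1 = c(6, 5),
    AnxietyTotal_T2 = c(8, 7)
)

# Wide to long: one row per participant per time point
long <- pivot_longer(
    wide,
    cols = Drinking_T1:AnxietyTotal_T2,
    names_to = c(".value", "Time"),
    names_sep = "_"
)

# Long back to wide: column names are rebuilt as <measure>_<Time>
wide_again <- pivot_wider(
    long,
    names_from = Time,
    values_from = c(Drinking, AnxietyTotal),
    names_sep = "_"
)

# Hypothetical table where each column name holds three pieces of information
summary_wide <- data.frame(
    Site = c("A", "B"),
    t1_male_mean = c(5.1, 4.8),
    t1_female_mean = c(5.6, 5.0),
    t2_male_mean = c(6.0, 5.2),
    t2_female_mean = c(6.3, 5.9)
)

# The first two pieces become Time and Sex columns,
# the last piece ("mean") becomes the name of the value column
summary_long <- pivot_longer(
    summary_wide,
    cols = -Site,
    names_to = c("Time", "Sex", ".value"),
    names_sep = "_"
)
```

This only works cleanly because every column name follows the same pattern; inconsistently named columns usually need renaming first.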
2.4 Writing readable code
There are two very good reasons to try to write your code in a clear, understandable way:
- Other people might need to use your code.
- You might need to use your code, a few weeks/months/years after you’ve written it.
It’s possible to write R code that “works” perfectly, and produces all the results and output you want, but proves very difficult to make changes to when you have to come back to it (because a reviewer asked for one additional analysis, etc.)
2.4.1 Basic formatting tips
You can improve the readability of your code a lot by following a few simple rules:
- Put spaces between and around variable names and operators (`=`, `+`, `-`, `*`, `/`)
- Break up long lines of code
- Use meaningful variable names composed of 2 or 3 words (avoid abbreviations unless they’re very common and you use them very consistently)
These rules can mean the difference between a cramped one-liner full of cryptic abbreviations and something like this:

    male_difference = lm(
        DepressionScore ~ Group + GroupTimeInteraction,
        data = interview_data,
        subset = BaselineSex == "Male"
    )
R will treat both pieces of code exactly the same, but for any humans reading, the nicer layout and meaningful names make it much easier to understand what’s happening, and spot any errors in syntax or intent.
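For contrast, here is a sketch of what a hard-to-read version of the same model could look like, next to the readable one. The data is randomly generated so that both versions can be run, and the abbreviated names (`d`, `ds`, `g`, `gti`, `bs`, `md`) are invented for the example.

```r
# Invented data so that both versions can be run (values are random)
set.seed(1)
interview_data <- data.frame(
    DepressionScore = rnorm(40, mean = 10, sd = 3),
    Group = rep(c("Control", "Treatment"), times = 20),
    GroupTimeInteraction = rep(c(0, 0, 0, 1), times = 10),
    BaselineSex = rep(c("Male", "Female"), each = 20)
)

# Hard to read: no spacing, one long line, cryptic names
d <- setNames(interview_data, c("ds", "g", "gti", "bs"))
md=lm(ds~g+gti,data=d,subset=bs=="Male")

# Easier to read: spacing, line breaks, meaningful names
male_difference = lm(
    DepressionScore ~ Group + GroupTimeInteraction,
    data = interview_data,
    subset = BaselineSex == "Male"
)

# Both versions fit the same model to the same rows
all.equal(unname(coef(md)), unname(coef(male_difference)))
```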
2.4.2 Keeping a consistent style
Try to follow a consistent style for naming things, e.g. using `snake_case` for all your variable names in your R code, and `TitleCase` for the columns in your data. Either style is probably better than lowercase with no spacing `allmashedtogether`.
It doesn’t particularly matter what that style is, as long as you’re consistent. There is a suggested style guide for the tidyverse, but I don’t follow it 100%; I just try to be consistent within my code.
2.4.3 Writing comments
One of the best things you can do to make R code readable and understandable is write comments - R ignores everything on a line after a `#`, so you can write whatever you want and it won’t affect the way your code runs.
Comments that explain why something was done are great.
Comments that explain what is being done are less useful. People who already understand R code should be able to tell what is happening just by looking at your code (especially if you’re following the other advice about writing readable code), so these comments can be redundant.
The exception to this is when you’ve run into something that was tricky to get working, and you need to explain it so other people don’t run into the same issue.
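The three kinds of comment described above might look like this. The data, the missing-data code of 99, and the comments themselves are all invented for illustration.

```r
anxiety_total <- c(6, 8, NA, 5, 99)  # hypothetical scores

# "Why" comment (useful): in this invented example 99 was used as a
# missing-data code, so it has to be removed before calculating anything
anxiety_total[which(anxiety_total == 99)] <- NA

# "What" comment (redundant): calculate the mean of anxiety_total
mean_anxiety <- mean(anxiety_total, na.rm = TRUE)

# "Tricky" comment (useful): as.numeric() on a factor returns the internal
# level codes, not the original numbers, so convert to character first
survey_factor <- factor(c("10", "20", "10"))
survey_number <- as.numeric(as.character(survey_factor))
```

The second comment adds nothing that the line of code doesn’t already say, while the first and third record things a reader couldn’t work out from the code alone.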
2.5 Don’t panic: dealing with SPSS withdrawal
2.5.1 RStudio has a data viewer
As you get used to R, you should find that you get more comfortable using the console to check on your data. You can often see a lot of the information you need by printing the first few rows of a dataset to the console. The `head()` function prints the first 6 rows of a table by default, and you can select the columns that are most relevant to what you’re working on if there are too many:

    ##   Species Petal.Length
    ## 1  setosa          1.4
    ## 2  setosa          1.4
    ## 3  setosa          1.3
    ## 4  setosa          1.5
    ## 5  setosa          1.4
    ## 6  setosa          1.7
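Output like the above comes from R’s built-in `iris` dataset. The exact call isn’t shown here, so the following is just one way to get it, using base R column selection:

```r
# Select the two columns of interest, then keep the first 6 rows
first_rows <- head(iris[, c("Species", "Petal.Length")])
first_rows

# head() also takes the number of rows to show as its second argument
head(iris[, c("Species", "Petal.Length")], 3)
```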
However, you can also use RStudio’s built-in data viewer to get a more familiar, spreadsheet style view of your data. In the Environment pane in the top-right, you can click on the name of any data you have loaded to bring up a spreadsheet view:
[Figure: Data viewer example]
This also supports basic sorting and filtering so you can explore the data more easily (you’ll still need to write code using functions like `arrange()` or `filter()` if you want to actually make changes to the data though).
2.5.2 R can read SPSS files
The `haven` package can read (and write) SPSS data files, so you can read in existing data with its `read_sav()` function.
R doesn’t deal with “value labels” in the same way as SPSS, and `haven` tries to keep information about the SPSS value labels available. However, it’s best to just convert everything to R’s way of dealing with categorical variables, i.e. factors, using `haven`’s `as_factor()` function.
The `levels = "both"` option of `as_factor()` puts both the numeric value and the text label into the factor labels, like `"[0] No"`, `"[1] Yes"`. As you get more comfortable with R you may want to just use `levels = "labels"` so you just get the text labels like `"No"`, `"Yes"`.
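The reading and converting steps could look something like the sketch below. So that it runs without an existing SPSS file, it first writes a tiny labelled dataset to a temporary `.sav` file; the file name and the data are made up, and with real data you would start at the `read_sav()` line.

```r
library(haven)

# Build a tiny labelled dataset and write it out,
# standing in for an existing SPSS file (hypothetical data and file name)
survey <- data.frame(ID = c(1, 2, 3))
survey$Drinking <- labelled(c(0, 1, 0), labels = c(No = 0, Yes = 1))
sav_path <- file.path(tempdir(), "example_survey.sav")
write_sav(survey, sav_path)

# Read the SPSS file; labelled columns keep their value labels
spss_data <- read_sav(sav_path)

# Convert labelled columns to factors, keeping both value and label
both <- as_factor(spss_data, levels = "both")
levels(both$Drinking)

# Or keep just the text labels
labels_only <- as_factor(spss_data, levels = "labels")
levels(labels_only$Drinking)
```

Calling `as_factor()` on the whole dataset converts only the labelled columns, so plain numeric columns like `ID` are left alone.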
You may need to convert your data from wide to long, since SPSS prefers wide and R prefers long.
The `haven` package can also read SAS and Stata data, and there are packages like `readxl` for Excel files. It’s generally easy to read your data into R from any format designed for tables of data.
2.6 Here be dragons: the bad parts of R
There are a few tools in R that tend to create more problems than they solve. Unfortunately beginners often end up using them (sometimes because bad tutorials recommend them). My personal list of tools to avoid includes:
- `attach()`: This makes a copy of all the individual variables from a dataset available by name, so you can access them with just `var_name` instead of `dataset$var_name`. The problem is:
    - You end up with a lot of variables in your environment that are hard to keep track of.
    - The variables can get out of sync with each other in ways that wouldn’t be possible if they were kept in the dataset.
- `get()` and `assign()`: These allow you to look up and create variable names using strings, so instead of looking up `model3`, you can programmatically create the variable name like `get(paste0("model", model_num))`.
    - Again, you can end up with a lot of variables in your environment.
    - People often use these when they want to run 100 different versions of a model (there are sometimes good reasons to do this). Instead of creating 100 different variables called `model1`, `model2`, …, `model100`, it’s usually possible to save these in a single `list` or dataframe. Saving them all in a list means it’s much easier to process and work with the results.
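As a small sketch of the list approach, the example below fits the same model to each species in the built-in `iris` data and keeps the results in one named list. The choice of model and the variable names are just for illustration.

```r
# One model per species, all stored in a single named list
species_names <- levels(iris$Species)

models <- lapply(species_names, function(species) {
    lm(Petal.Length ~ Sepal.Length, data = iris[iris$Species == species, ])
})
names(models) <- species_names

# Look up one model by name instead of using get(paste0(...))
setosa_model <- models[["setosa"]]

# Process every model in one step, e.g. pull out each slope
slopes <- sapply(models, function(model) coef(model)[["Sepal.Length"]])
slopes
```

The same pattern scales to 100 models without adding 100 variables to your environment.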