About Data
data.RmdWe’ll mostly use functions from the tidyverse when
working with data; for more on the tidyverse, see the
following resources:
About data sets
A good data file …
- Is rectangular
- Has observations in rows and variables in columns
- Can have extra information (metadata) in rows above the data; we can have R skip these
- Has no information in formatting (ie, color) or position (ie, sorting shouldn’t lose information)
- Has minimal computation or duplication
- Has well-labeled variables…
Unfortunately computers and humans have different ideas about what a good label is.
- Good computer labels: Alphanumeric and underscore only, starting
with a letter. Units often get in the way. (eg,
BloodPressureorblood_pressure) - Good human labels: spaces, parentheses allowed, units should be specified. “Blood Pressure (mmHg)”
Consider using two rows for labels, with human labels (to be skipped when reading data in) above computer labels.
Reading in data
We’ll most often read in data from either csv or Excel sheets. The
functions read_csv and read_excel are used for
this.
Both take the filename as the first parameter, and have additional
parameters na to specify what values should be considered
missing, and skip to skip a certain number of rows at the
top. For an Excel sheet, there’s also the optional sheet
parameter to specify which sheet to read; by default the first sheet is
read.
It’s also recommended to use the here function to
specify the location of the file within your project; this allows you to
move your script or Quarto file to subfolders within the project without
changing the code.
For example, the following code reads in a csv file
test.csv from the main folder, considers a single period
. and the code NA to be missing values, and
skips the first two lines. It also saves the result in the variable
testdata.
testdata <- read_csv(here("test.csv"), na = c(".", "NA"), skip = 2)Here’s similar for reading an Excel file, that specifically chooses
to read from the sheet named mydata. If not specified, the
first sheet is chosen.
testdata <- read_excel(here("test.csv"), sheet = "mydata", na = c(".", "NA"), skip = 2)For further examples in this document, I’ll use the
mtcars2 data set, which is included in the
umncvmstats package.
Checking that it was read properly
After first reading in a data file, use the skim
function to be sure it read in properly. This shows the number of rows
and columns that were read, and then for each type of variable
(character/factor/numeric, etc.), it shows some basic summary
information about each variable.
This function is in umncvmstats and uses the
skimr package behind the scenes; further control is
possible by using that package directly.
mtcars2 |> skim()Summary of mtcars2:
32 rows, 12 columns
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| model | 0 | 1 | 7 | 19 | 0 | 32 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| cyl | 0 | 1 | FALSE | 3 | 8: 14, 4: 11, 6: 7 |
| vs | 0 | 1 | FALSE | 2 | V-s: 18, str: 14 |
| am | 0 | 1 | FALSE | 2 | aut: 19, man: 13 |
| gear | 0 | 1 | FALSE | 3 | 3: 15, 4: 12, 5: 5 |
| carb | 0 | 1 | FALSE | 6 | 2: 10, 4: 10, 1: 7, 3: 3 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| mpg | 0 | 1 | 20.09 | 6.03 | 10.40 | 15.43 | 19.20 | 22.80 | 33.90 | ▃▇▅▁▂ |
| disp | 0 | 1 | 230.72 | 123.94 | 71.10 | 120.83 | 196.30 | 326.00 | 472.00 | ▇▃▃▃▂ |
| hp | 0 | 1 | 146.69 | 68.56 | 52.00 | 96.50 | 123.00 | 180.00 | 335.00 | ▇▇▆▃▁ |
| drat | 0 | 1 | 3.60 | 0.53 | 2.76 | 3.08 | 3.70 | 3.92 | 4.93 | ▇▃▇▅▁ |
| wt | 0 | 1 | 3.22 | 0.98 | 1.51 | 2.58 | 3.33 | 3.61 | 5.42 | ▃▃▇▁▂ |
| qsec | 0 | 1 | 17.85 | 1.79 | 14.50 | 16.89 | 17.71 | 18.90 | 22.90 | ▃▇▇▂▁ |
In R, a categorical variable is called a factor; text
variables should usually be factors unless they are simply used for
identification, like the model variable, here.
The most common error is that a variable that should be numeric is read in as text, due to an unspecified missing value code. Be sure to check that all variables are of the expected type, and that the values as summarized make sense.
Creating factors
The mtcars2 data set is based on the mtcars
data set, which is included in your basic R installation. In the
original mtcars data set, though, all variables are
numeric; the categorical variables are all coded as 0/1.
Compare the skim summary of the mtcars data
set with mtcars2:
mtcars |> skim()Summary of mtcars:
32 rows, 11 columns
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| mpg | 0 | 1 | 20.09 | 6.03 | 10.40 | 15.43 | 19.20 | 22.80 | 33.90 | ▃▇▅▁▂ |
| cyl | 0 | 1 | 6.19 | 1.79 | 4.00 | 4.00 | 6.00 | 8.00 | 8.00 | ▆▁▃▁▇ |
| disp | 0 | 1 | 230.72 | 123.94 | 71.10 | 120.83 | 196.30 | 326.00 | 472.00 | ▇▃▃▃▂ |
| hp | 0 | 1 | 146.69 | 68.56 | 52.00 | 96.50 | 123.00 | 180.00 | 335.00 | ▇▇▆▃▁ |
| drat | 0 | 1 | 3.60 | 0.53 | 2.76 | 3.08 | 3.70 | 3.92 | 4.93 | ▇▃▇▅▁ |
| wt | 0 | 1 | 3.22 | 0.98 | 1.51 | 2.58 | 3.33 | 3.61 | 5.42 | ▃▃▇▁▂ |
| qsec | 0 | 1 | 17.85 | 1.79 | 14.50 | 16.89 | 17.71 | 18.90 | 22.90 | ▃▇▇▂▁ |
| vs | 0 | 1 | 0.44 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
| am | 0 | 1 | 0.41 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
| gear | 0 | 1 | 3.69 | 0.74 | 3.00 | 3.00 | 4.00 | 4.00 | 5.00 | ▇▁▆▁▂ |
| carb | 0 | 1 | 2.81 | 1.62 | 1.00 | 2.00 | 2.00 | 4.00 | 8.00 | ▇▂▅▁▁ |
To convert these variables into categorical variables, we use the
functions mutate, as_factor, and optionally
fct_recode; we’ll see more about them later, for now, let’s
just see a couple examples of how we use them to convert these
variables.
-
mutatefunction creates new variables (possibly overwriting old ones). Withinmutatewe usenewname = [operation]to perform an operation and assign the result to the variablenewname. Multiple variables can be created within one call tomutateby using commas. -
as_factorconverts a variable to a categorical variable. You may also see the original version,factor; this version is from thetidyverseand is preferred for improved consistency. -
fct_recodechanges the coding of the values.
As an example, the following code:
- will save the result into a new variable
my_mtcars - pipes the
mtcarsdataset into themutatefunction, then within mutate,- creates a new variable
cyl(which will overwrite the previouscyl) by runningas_factoron thecylvariable - creates a new variable
vs(which again, overwrites the previousvs) by first runningas_factoronvsand then recoding it so that0becomesV-shapedand1becomesStraight. Quotation marks are recommended, though in certain cases are not absolutely necessary.
- creates a new variable
We then skim the new result.
my_mtcars <- mtcars |>
mutate(cyl = as_factor(cyl),
vs = as_factor(vs) |> fct_recode("V-shaped"="0", "Straight"="1"))
skim(my_mtcars)Summary of my_mtcars:
32 rows, 11 columns
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| cyl | 0 | 1 | FALSE | 3 | 8: 14, 4: 11, 6: 7 |
| vs | 0 | 1 | FALSE | 2 | V-s: 18, Str: 14 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| mpg | 0 | 1 | 20.09 | 6.03 | 10.40 | 15.43 | 19.20 | 22.80 | 33.90 | ▃▇▅▁▂ |
| disp | 0 | 1 | 230.72 | 123.94 | 71.10 | 120.83 | 196.30 | 326.00 | 472.00 | ▇▃▃▃▂ |
| hp | 0 | 1 | 146.69 | 68.56 | 52.00 | 96.50 | 123.00 | 180.00 | 335.00 | ▇▇▆▃▁ |
| drat | 0 | 1 | 3.60 | 0.53 | 2.76 | 3.08 | 3.70 | 3.92 | 4.93 | ▇▃▇▅▁ |
| wt | 0 | 1 | 3.22 | 0.98 | 1.51 | 2.58 | 3.33 | 3.61 | 5.42 | ▃▃▇▁▂ |
| qsec | 0 | 1 | 17.85 | 1.79 | 14.50 | 16.89 | 17.71 | 18.90 | 22.90 | ▃▇▇▂▁ |
| am | 0 | 1 | 0.41 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
| gear | 0 | 1 | 3.69 | 0.74 | 3.00 | 3.00 | 4.00 | 4.00 | 5.00 | ▇▁▆▁▂ |
| carb | 0 | 1 | 2.81 | 1.62 | 1.00 | 2.00 | 2.00 | 4.00 | 8.00 | ▇▂▅▁▁ |
Summarizing data
Descriptive statistics for all variables
To get basic descriptive statistics, use the
descriptive_statistics function, which shows
mean/SD/median/range for continuous variables and count/percent for
categorical variables (or continuous variables with just a few
values).
Here I’m using select (to be seen later) to remove the
model variable before computing these statistics.
This is a version of tbl_summary from the
gtsummary package; further control is possible by using
that package directly.
mtcars2 |> select(-model) |> descriptive_statistics()| Characteristic | N = 32 |
|---|---|
| mpg | |
| Mean ± SD | 20.09 ± 6.03 |
| Median (Min, Max) | 19.20 (10.40, 33.90) |
| cyl, n/N (%) | |
| 4 | 11/32 (34%) |
| 6 | 7/32 (22%) |
| 8 | 14/32 (44%) |
| disp | |
| Mean ± SD | 230.72 ± 123.94 |
| Median (Min, Max) | 196.30 (71.10, 472.00) |
| hp | |
| Mean ± SD | 146.69 ± 68.56 |
| Median (Min, Max) | 123.00 (52.00, 335.00) |
| drat | |
| Mean ± SD | 3.60 ± 0.53 |
| Median (Min, Max) | 3.70 (2.76, 4.93) |
| wt | |
| Mean ± SD | 3.22 ± 0.98 |
| Median (Min, Max) | 3.33 (1.51, 5.42) |
| qsec | |
| Mean ± SD | 17.85 ± 1.79 |
| Median (Min, Max) | 17.71 (14.50, 22.90) |
| vs, n/N (%) | |
| V-shaped | 18/32 (56%) |
| straight | 14/32 (44%) |
| am, n/N (%) | |
| automatic | 19/32 (59%) |
| manual | 13/32 (41%) |
| gear, n/N (%) | |
| 3 | 15/32 (47%) |
| 4 | 12/32 (38%) |
| 5 | 5/32 (16%) |
| carb, n/N (%) | |
| 1 | 7/32 (22%) |
| 2 | 10/32 (31%) |
| 3 | 3/32 (9.4%) |
| 4 | 10/32 (31%) |
| 6 | 1/32 (3.1%) |
| 8 | 1/32 (3.1%) |
One can also compute these statistics on subsets of the data, by
splitting it by another variable first, using the by
parameter. This code gets descriptive statistics for each
am group (automatic and manual) separately.
[Output not shown]
mtcars2 |> select(-model) |> descriptive_statistics(by=am)This output is formatted, making it great for understanding and display, but hard to work with further. To work with further, we’d need to compute these values ourselves.
Summarizing categorical variables
Use count to count categorical variables; count
combinations of multiple variables by separating with a comma.
mtcars2 |> count(vs)
#> # A tibble: 2 × 2
#> vs n
#> <fct> <int>
#> 1 V-shaped 18
#> 2 straight 14
mtcars2 |> count(vs, am)
#> # A tibble: 4 × 3
#> vs am n
#> <fct> <fct> <int>
#> 1 V-shaped automatic 12
#> 2 V-shaped manual 6
#> 3 straight automatic 7
#> 4 straight manual 7We can then use mutate to make a new variable with the
percent; see more on mutate later. This adds columns with
the overall percent for each desired combination of variables. Note that
within each computation, the percents sum to 100%.
mtcars2 |> count(vs) |> mutate(percent = n / sum(n))
#> # A tibble: 2 × 3
#> vs n percent
#> <fct> <int> <dbl>
#> 1 V-shaped 18 0.562
#> 2 straight 14 0.438
mtcars2 |> count(vs, am) |> mutate(percent = n / sum(n))
#> # A tibble: 4 × 4
#> vs am n percent
#> <fct> <fct> <int> <dbl>
#> 1 V-shaped automatic 12 0.375
#> 2 V-shaped manual 6 0.188
#> 3 straight automatic 7 0.219
#> 4 straight manual 7 0.219If, however, we want to compute the percent of one variable within
categories of another variable, we can use the .by
parameter, which does the computation separately for each value of the
specified variable. Now the percents within each value of
am (automatic or manual) sum to 100%.
mtcars2 |> count(vs, am) |> mutate(percent = n / sum(n), .by=am)
#> # A tibble: 4 × 4
#> vs am n percent
#> <fct> <fct> <int> <dbl>
#> 1 V-shaped automatic 12 0.632
#> 2 V-shaped manual 6 0.462
#> 3 straight automatic 7 0.368
#> 4 straight manual 7 0.538Another option is to use xtabs to get a
“cross-tabulation” of multiple variables. This uses the “formula”
syntax, which we’ll see more of when we do inference. It also stores the
result in a matrix, which is easier to view but harder to work with. The
row and columns sums can be added using addmargins.
xtabs(~vs + am, data=mtcars2)
#> am
#> vs automatic manual
#> V-shaped 12 6
#> straight 7 7
xtabs(~vs + am, data=mtcars2) |> addmargins()
#> am
#> vs automatic manual Sum
#> V-shaped 12 6 18
#> straight 7 7 14
#> Sum 19 13 32Summarizing continuous variables
To get summary statistics for a continuous variable, use
summarize. Within summarize, use
newname = [operation] to perform an operation and assign
the result to the variable. To do this separately for each value of a
categorical variable, use the .by parameter. Note the
. in front of .by; that is used here to make
conflicts with your desired name less likely.
mtcars2 |> summarize(mean_mpg=mean(mpg), sd_mpg=sd(mpg))
#> # A tibble: 1 × 2
#> mean_mpg sd_mpg
#> <dbl> <dbl>
#> 1 20.1 6.03
mtcars2 |> summarize(mean_mpg=mean(mpg), sd_mpg=sd(mpg), .by=am)
#> # A tibble: 2 × 3
#> am mean_mpg sd_mpg
#> <fct> <dbl> <dbl>
#> 1 manual 24.4 6.17
#> 2 automatic 17.1 3.83See list of useful summary functions below.
Working with data
Selecting rows or columns
- Selecting columns (
select)- Select
modelandmpgcolumns:mtcars2 |> select(model, mpg) - Select all but
modelandmpgcolumns:mtcars2 |> select(-model, -mpg)
- Select
- Selecting rows (
filter)- Select rows that have manual transmissions:
mtcars2 |> filter(am=="manual") - Select rows that have mpg > 25:
mtcars2 |> filter(mpg > 25) - See “Comparison” below for more options
- Select rows that have manual transmissions:
Sorting
Use arrange to sort
- Sort by increasing mpg:
mtcars2 |> arrange(mpg) - Sort by decreasing mpg:
mtcars2 |> arrange(desc(mpg))
Creating new variables
We use the mutate function to add a variable.
-
mutatefunction creates new variables (possibly overwriting old ones). Withinmutatewe usenewname = [operation]to perform an operation and assign the result to the variable
Best practice is that variables use only letters, numbers, and the
underscore (_), and start with a letter.
This data set has fuel economy in miles per gallon
(mpg); suppose we instead want to use the UK standard of
liters/100 kilometers.
mtcars2 <- mtcars2 |>
mutate(Lp100km = 3.78541 / 1.60934 * 100 / mpg)
mtcars2 |> select(mpg, Lp100km) |> skim()Summary of select(mtcars2, mpg, Lp10…:
32 rows, 2 columns
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| mpg | 0 | 1 | 20.09 | 6.03 | 10.40 | 15.43 | 19.20 | 22.80 | 33.90 | ▃▇▅▁▂ |
| Lp100km | 0 | 1 | 12.76 | 3.86 | 6.94 | 10.32 | 12.25 | 15.25 | 22.62 | ▅▇▆▁▁ |
In addition to arithmetic, there are a number of useful function for creating new variables, as seen below.
Useful functions
For summarizing
Useful summary functions include:
-
mean: mean -
sd: standard deviation -
var: variance -
median: median -
quantile: quantile (or percentile); use theprobsparameter to specify which. For example,quantile(x, probs=0.25)gets the 25th percentile of the variablex. -
min: minimum value -
max: maximum value -
IQR: interquartile range -
n(): to get the number of observations (including when grouped using.by)
R is very careful with missing values and will not compute these if
any of the values are missing; this is a good thing as it forces you to
specifically handle it. If you want to remove any missing values before
computing these statistics, set the parameter na.rm = TRUE
for any of these functions.
For comparing
- less than/greater to:
<,<=,>,>=, - equal to
==, not equal to!=(note the double equals to test equality) -
%in%: tests if a value is in a vector, for example- For example, for a variable
v <- c("a", "b", "c"), the operationv %in% c("a", "b")returns the vectorc(TRUE, TRUE, FALSE).
- For example, for a variable
- Boolean operators: or
|, and&, not! - to test if a value is missing, use
is.na; using==NAdoes not work. This is also vectorized; tryis.na(c(1,2,NA,4))
For creating new variables
- arithmetic functions (
+,-,*,/) - log functions; natural log is default, but base 10 and base 2 also
can be used:
log,log10,log2 - For creating new variables based on conditions:
- For a binary condition:
if_else - For several possible conditions:
case_when - To break a numeric variable into categories:
cut
- For a binary condition: