About Data
data.Rmd
We’ll mostly use functions from the tidyverse
when
working with data; for more on the tidyverse
, see the
following resources:
About data sets
A good data file …
- Is rectangular
- Has observations in rows and variables in columns
- Can have extra information (metadata) in rows above the data; we can have R skip these
- Has no information in formatting (ie, color) or position (ie, sorting shouldn’t lose information)
- Has minimal computation or duplication
- Has well-labeled variables…
Unfortunately computers and humans have different ideas about what a good label is.
- Good computer labels: Alphanumeric and underscore only, starting
with a letter. Units often get in the way. (eg,
BloodPressure
orblood_pressure
) - Good human labels: spaces, parentheses allowed, units should be specified. “Blood Pressure (mmHg)”
Consider using two rows for labels, with human labels (to be skipped when reading data in) above computer labels.
Reading in data
We’ll most often read in data from either csv or Excel sheets. The
functions read_csv
and read_excel
are used for
this.
Both take the filename as the first parameter, and have additional
parameters na
to specify what values should be considered
missing, and skip
to skip a certain number of rows at the
top. For an Excel sheet, there’s also the optional sheet
parameter to specify which sheet to read; by default the first sheet is
read.
It’s also recommended to use the here
function to
specify the location of the file within your project; this allows you to
move your script or Quarto file to subfolders within the project without
changing the code.
For example, the following code reads in a csv file
test.csv
from the main folder, considers a single period
.
and the code NA
to be missing values, and
skips the first two lines. It also saves the result in the variable
testdata
.
testdata <- read_csv(here("test.csv"), na = c(".", "NA"), skip = 2)
Here’s similar for reading an Excel file, that specifically chooses
to read from the sheet named mydata
. If not specified, the
first sheet is chosen.
testdata <- read_excel(here("test.csv"), sheet = "mydata", na = c(".", "NA"), skip = 2)
For further examples in this document, I’ll use the
mtcars2
data set, which is included in the
umncvmstats
package.
Checking that it was read properly
After first reading in a data file, use the skim
function to be sure it read in properly. This shows the number of rows
and columns that were read, and then for each type of variable
(character/factor/numeric, etc.), it shows some basic summary
information about each variable.
This function is in umncvmstats
and uses the
skimr
package behind the scenes; further control is
possible by using that package directly.
mtcars2 |> skim()
Summary of mtcars2:
32 rows, 12 columns
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
model | 0 | 1 | 7 | 19 | 0 | 32 | 0 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
cyl | 0 | 1 | FALSE | 3 | 8: 14, 4: 11, 6: 7 |
vs | 0 | 1 | FALSE | 2 | V-s: 18, str: 14 |
am | 0 | 1 | FALSE | 2 | aut: 19, man: 13 |
gear | 0 | 1 | FALSE | 3 | 3: 15, 4: 12, 5: 5 |
carb | 0 | 1 | FALSE | 6 | 2: 10, 4: 10, 1: 7, 3: 3 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
mpg | 0 | 1 | 20.09 | 6.03 | 10.40 | 15.43 | 19.20 | 22.80 | 33.90 | ▃▇▅▁▂ |
disp | 0 | 1 | 230.72 | 123.94 | 71.10 | 120.83 | 196.30 | 326.00 | 472.00 | ▇▃▃▃▂ |
hp | 0 | 1 | 146.69 | 68.56 | 52.00 | 96.50 | 123.00 | 180.00 | 335.00 | ▇▇▆▃▁ |
drat | 0 | 1 | 3.60 | 0.53 | 2.76 | 3.08 | 3.70 | 3.92 | 4.93 | ▇▃▇▅▁ |
wt | 0 | 1 | 3.22 | 0.98 | 1.51 | 2.58 | 3.33 | 3.61 | 5.42 | ▃▃▇▁▂ |
qsec | 0 | 1 | 17.85 | 1.79 | 14.50 | 16.89 | 17.71 | 18.90 | 22.90 | ▃▇▇▂▁ |
In R, a categorical variable is called a factor
; text
variables should usually be factors unless they are simply used for
identification, like the model
variable, here.
The most common error is that a variable that should be numeric is read in as text, due to an unspecified missing value code. Be sure to check that all variables are of the expected type, and that the values as summarized make sense.
Creating factors
The mtcars2
data set is based on the mtcars
data set, which is included in your basic R installation. In the
original mtcars
data set, though, all variables are
numeric; the categorical variables are all coded as 0/1.
Compare the skim
summary of the mtcars
data
set with mtcars2
:
mtcars |> skim()
Summary of mtcars:
32 rows, 11 columns
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
mpg | 0 | 1 | 20.09 | 6.03 | 10.40 | 15.43 | 19.20 | 22.80 | 33.90 | ▃▇▅▁▂ |
cyl | 0 | 1 | 6.19 | 1.79 | 4.00 | 4.00 | 6.00 | 8.00 | 8.00 | ▆▁▃▁▇ |
disp | 0 | 1 | 230.72 | 123.94 | 71.10 | 120.83 | 196.30 | 326.00 | 472.00 | ▇▃▃▃▂ |
hp | 0 | 1 | 146.69 | 68.56 | 52.00 | 96.50 | 123.00 | 180.00 | 335.00 | ▇▇▆▃▁ |
drat | 0 | 1 | 3.60 | 0.53 | 2.76 | 3.08 | 3.70 | 3.92 | 4.93 | ▇▃▇▅▁ |
wt | 0 | 1 | 3.22 | 0.98 | 1.51 | 2.58 | 3.33 | 3.61 | 5.42 | ▃▃▇▁▂ |
qsec | 0 | 1 | 17.85 | 1.79 | 14.50 | 16.89 | 17.71 | 18.90 | 22.90 | ▃▇▇▂▁ |
vs | 0 | 1 | 0.44 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
am | 0 | 1 | 0.41 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
gear | 0 | 1 | 3.69 | 0.74 | 3.00 | 3.00 | 4.00 | 4.00 | 5.00 | ▇▁▆▁▂ |
carb | 0 | 1 | 2.81 | 1.62 | 1.00 | 2.00 | 2.00 | 4.00 | 8.00 | ▇▂▅▁▁ |
To convert these variables into categorical variables, we use the
functions mutate
, as_factor
, and optionally
fct_recode
; we’ll see more about them later, for now, let’s
just see a couple examples of how we use them to convert these
variables.
-
mutate
function creates new variables (possibly overwriting old ones). Withinmutate
we usenewname = [operation]
to perform an operation and assign the result to the variablenewname
. Multiple variables can be created within one call tomutate
by using commas. -
as_factor
converts a variable to a categorical variable. You may also see the original version,factor
; this version is from thetidyverse
and is preferred for improved consistency. -
fct_recode
changes the coding of the values.
As an example, the following code:
- will save the result into a new variable
my_mtcars
- pipes the
mtcars
dataset into themutate
function, then within mutate,- creates a new variable
cyl
(which will overwrite the previouscyl
) by runningas_factor
on thecyl
variable - creates a new variable
vs
(which again, overwrites the previousvs
) by first runningas_factor
onvs
and then recoding it so that0
becomesV-shaped
and1
becomesStraight
. Quotation marks are recommended, though in certain cases are not absolutely necessary.
- creates a new variable
We then skim
the new result.
my_mtcars <- mtcars |>
mutate(cyl = as_factor(cyl),
vs = as_factor(vs) |> fct_recode("V-shaped"="0", "Straight"="1"))
skim(my_mtcars)
Summary of my_mtcars:
32 rows, 11 columns
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
cyl | 0 | 1 | FALSE | 3 | 8: 14, 4: 11, 6: 7 |
vs | 0 | 1 | FALSE | 2 | V-s: 18, Str: 14 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
mpg | 0 | 1 | 20.09 | 6.03 | 10.40 | 15.43 | 19.20 | 22.80 | 33.90 | ▃▇▅▁▂ |
disp | 0 | 1 | 230.72 | 123.94 | 71.10 | 120.83 | 196.30 | 326.00 | 472.00 | ▇▃▃▃▂ |
hp | 0 | 1 | 146.69 | 68.56 | 52.00 | 96.50 | 123.00 | 180.00 | 335.00 | ▇▇▆▃▁ |
drat | 0 | 1 | 3.60 | 0.53 | 2.76 | 3.08 | 3.70 | 3.92 | 4.93 | ▇▃▇▅▁ |
wt | 0 | 1 | 3.22 | 0.98 | 1.51 | 2.58 | 3.33 | 3.61 | 5.42 | ▃▃▇▁▂ |
qsec | 0 | 1 | 17.85 | 1.79 | 14.50 | 16.89 | 17.71 | 18.90 | 22.90 | ▃▇▇▂▁ |
am | 0 | 1 | 0.41 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
gear | 0 | 1 | 3.69 | 0.74 | 3.00 | 3.00 | 4.00 | 4.00 | 5.00 | ▇▁▆▁▂ |
carb | 0 | 1 | 2.81 | 1.62 | 1.00 | 2.00 | 2.00 | 4.00 | 8.00 | ▇▂▅▁▁ |
Summarizing data
Descriptive statistics for all variables
To get basic descriptive statistics, use the
descriptive_statistics
function, which shows
mean/SD/median/range for continuous variables and count/percent for
categorical variables (or continuous variables with just a few
values).
Here I’m using select
(to be seen later) to remove the
model
variable before computing these statistics.
This is a version of tbl_summary
from the
gtsummary
package; further control is possible by using
that package directly.
mtcars2 |> select(-model) |> descriptive_statistics()
Characteristic | N = 32 |
---|---|
mpg | |
Mean ± SD | 20.09 ± 6.03 |
Median (Min, Max) | 19.20 (10.40, 33.90) |
cyl, n/N (%) | |
4 | 11/32 (34%) |
6 | 7/32 (22%) |
8 | 14/32 (44%) |
disp | |
Mean ± SD | 230.72 ± 123.94 |
Median (Min, Max) | 196.30 (71.10, 472.00) |
hp | |
Mean ± SD | 146.69 ± 68.56 |
Median (Min, Max) | 123.00 (52.00, 335.00) |
drat | |
Mean ± SD | 3.60 ± 0.53 |
Median (Min, Max) | 3.70 (2.76, 4.93) |
wt | |
Mean ± SD | 3.22 ± 0.98 |
Median (Min, Max) | 3.33 (1.51, 5.42) |
qsec | |
Mean ± SD | 17.85 ± 1.79 |
Median (Min, Max) | 17.71 (14.50, 22.90) |
vs, n/N (%) | |
V-shaped | 18/32 (56%) |
straight | 14/32 (44%) |
am, n/N (%) | |
automatic | 19/32 (59%) |
manual | 13/32 (41%) |
gear, n/N (%) | |
3 | 15/32 (47%) |
4 | 12/32 (38%) |
5 | 5/32 (16%) |
carb, n/N (%) | |
1 | 7/32 (22%) |
2 | 10/32 (31%) |
3 | 3/32 (9.4%) |
4 | 10/32 (31%) |
6 | 1/32 (3.1%) |
8 | 1/32 (3.1%) |
One can also compute these statistics on subsets of the data, by
splitting it by another variable first, using the by
parameter. This code gets descriptive statistics for each
am
group (automatic and manual) separately.
[Output not shown]
mtcars2 |> select(-model) |> descriptive_statistics(by=am)
This output is formatted, making it great for understanding and display, but hard to work with further. To work with further, we’d need to compute these values ourselves.
Summarizing categorical variables
Use count
to count categorical variables; count
combinations of multiple variables by separating with a comma.
mtcars2 |> count(vs)
#> # A tibble: 2 × 2
#> vs n
#> <fct> <int>
#> 1 V-shaped 18
#> 2 straight 14
mtcars2 |> count(vs, am)
#> # A tibble: 4 × 3
#> vs am n
#> <fct> <fct> <int>
#> 1 V-shaped automatic 12
#> 2 V-shaped manual 6
#> 3 straight automatic 7
#> 4 straight manual 7
We can then use mutate
to make a new variable with the
percent; see more on mutate
later. This adds columns with
the overall percent for each desired combination of variables. Note that
within each computation, the percents sum to 100%.
mtcars2 |> count(vs) |> mutate(percent = n / sum(n))
#> # A tibble: 2 × 3
#> vs n percent
#> <fct> <int> <dbl>
#> 1 V-shaped 18 0.562
#> 2 straight 14 0.438
mtcars2 |> count(vs, am) |> mutate(percent = n / sum(n))
#> # A tibble: 4 × 4
#> vs am n percent
#> <fct> <fct> <int> <dbl>
#> 1 V-shaped automatic 12 0.375
#> 2 V-shaped manual 6 0.188
#> 3 straight automatic 7 0.219
#> 4 straight manual 7 0.219
If, however, we want to compute the percent of one variable within
categories of another variable, we can use the .by
parameter, which does the computation separately for each value of the
specified variable. Now the percents within each value of
am
(automatic or manual) sum to 100%.
mtcars2 |> count(vs, am) |> mutate(percent = n / sum(n), .by=am)
#> # A tibble: 4 × 4
#> vs am n percent
#> <fct> <fct> <int> <dbl>
#> 1 V-shaped automatic 12 0.632
#> 2 V-shaped manual 6 0.462
#> 3 straight automatic 7 0.368
#> 4 straight manual 7 0.538
Another option is to use xtabs
to get a
“cross-tabulation” of multiple variables. This uses the “formula”
syntax, which we’ll see more of when we do inference. It also stores the
result in a matrix, which is easier to view but harder to work with. The
row and columns sums can be added using addmargins
.
xtabs(~vs + am, data=mtcars2)
#> am
#> vs automatic manual
#> V-shaped 12 6
#> straight 7 7
xtabs(~vs + am, data=mtcars2) |> addmargins()
#> am
#> vs automatic manual Sum
#> V-shaped 12 6 18
#> straight 7 7 14
#> Sum 19 13 32
Summarizing continuous variables
To get summary statistics for a continuous variable, use
summarize
. Within summarize
, use
newname = [operation]
to perform an operation and assign
the result to the variable. To do this separately for each value of a
categorical variable, use the .by
parameter. Note the
.
in front of .by
; that is used here to make
conflicts with your desired name less likely.
mtcars2 |> summarize(mean_mpg=mean(mpg), sd_mpg=sd(mpg))
#> # A tibble: 1 × 2
#> mean_mpg sd_mpg
#> <dbl> <dbl>
#> 1 20.1 6.03
mtcars2 |> summarize(mean_mpg=mean(mpg), sd_mpg=sd(mpg), .by=am)
#> # A tibble: 2 × 3
#> am mean_mpg sd_mpg
#> <fct> <dbl> <dbl>
#> 1 manual 24.4 6.17
#> 2 automatic 17.1 3.83
See list of useful summary functions below.
Working with data
Selecting rows or columns
- Selecting columns (
select
)- Select
model
andmpg
columns:mtcars2 |> select(model, mpg)
- Select all but
model
andmpg
columns:mtcars2 |> select(-model, -mpg)
- Select
- Selecting rows (
filter
)- Select rows that have manual transmissions:
mtcars2 |> filter(am=="manual")
- Select rows that have mpg > 25:
mtcars2 |> filter(mpg > 25)
- See “Comparison” below for more options
- Select rows that have manual transmissions:
Sorting
Use arrange
to sort
- Sort by increasing mpg:
mtcars2 |> arrange(mpg)
- Sort by decreasing mpg:
mtcars2 |> arrange(desc(mpg))
Creating new variables
We use the mutate
function to add a variable.
-
mutate
function creates new variables (possibly overwriting old ones). Withinmutate
we usenewname = [operation]
to perform an operation and assign the result to the variable
Best practice is that variables use only letters, numbers, and the
underscore (_
), and start with a letter.
This data set has fuel economy in miles per gallon
(mpg
); suppose we instead want to use the UK standard of
liters/100 kilometers.
mtcars2 <- mtcars2 |>
mutate(Lp100km = 3.78541 / 1.60934 * 100 / mpg)
mtcars2 |> select(mpg, Lp100km) |> skim()
Summary of select(mtcars2, mpg, Lp10…:
32 rows, 2 columns
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
mpg | 0 | 1 | 20.09 | 6.03 | 10.40 | 15.43 | 19.20 | 22.80 | 33.90 | ▃▇▅▁▂ |
Lp100km | 0 | 1 | 12.76 | 3.86 | 6.94 | 10.32 | 12.25 | 15.25 | 22.62 | ▅▇▆▁▁ |
In addition to arithmetic, there are a number of useful function for creating new variables, as seen below.
Useful functions
For summarizing
Useful summary functions include:
-
mean
: mean -
sd
: standard deviation -
var
: variance -
median
: median -
quantile
: quantile (or percentile); use theprobs
parameter to specify which. For example,quantile(x, probs=0.25)
gets the 25th percentile of the variablex
. -
min
: minimum value -
max
: maximum value -
IQR
: interquartile range -
n()
: to get the number of observations (including when grouped using.by
)
R is very careful with missing values and will not compute these if
any of the values are missing; this is a good thing as it forces you to
specifically handle it. If you want to remove any missing values before
computing these statistics, set the parameter na.rm = TRUE
for any of these functions.
For comparing
- less than/greater to:
<
,<=
,>
,>=
, - equal to
==
, not equal to!=
(note the double equals to test equality) -
%in%
: tests if a value is in a vector, for example- For example, for a variable
v <- c("a", "b", "c")
, the operationv %in% c("a", "b")
returns the vectorc(TRUE, TRUE, FALSE)
.
- For example, for a variable
- Boolean operators: or
|
, and&
, not!
- to test if a value is missing, use
is.na
; using==NA
does not work. This is also vectorized; tryis.na(c(1,2,NA,4))
For creating new variables
- arithmetic functions (
+
,-
,*
,/
) - log functions; natural log is default, but base 10 and base 2 also
can be used:
log
,log10
,log2
- For creating new variables based on conditions:
- For a binary condition:
if_else
- For several possible conditions:
case_when
- To break a numeric variable into categories:
cut
- For a binary condition: