Summary • umncvmstats

This document attempts to list all the features of R and this package that are described in the other vignettes. The intent is that anything you’d need to know how to do for this class is listed here.

Getting Started

Various basic R ideas are covered in the Getting Started page, including how to:

Open your class “Project”
Use the panes in Rstudio
Use the assignment operator <- to save the result of an operation
Use functions, that is, be able to call a function with one or more (possibly named) parameters
Use the pipe |> to send a value to a function
Use the c function to “combine” values together
Get unstuck by using [Control-C]) to cancel the current input line
Load packages using the library function
Create scripts and Quarto files and send commands from them to the console
Use chunk options in Quarto to control the output (eg, #| message: false)
Render Quarto files and open the result in your browser

About Data

The About Data page has details on how to read in data, compute descriptive summary statistics, working with data sets, and various functions for summarizing, comparing, creating new variables, and working with factors.

Reading in data
- Best practices for creating a data set in a spreadsheet
- Reading in a data set using read_csv and read_excel, with optional parameters na, skip, and sheet
- skim to check that it was read properly
- mutate, as_factor, and optionally fct_recode to create factor variables
Descriptive statistics
- descriptive_statistics to get a summary of descriptive statistics for all variables; use the parameter by to split by another variable first.
- count and mutate (optionally with .by parameter) to get counts and percents for categorical data
- summarize (optionally with .by parameter) to get summary statistics for continuous data
Working with data
- select to select by column
- filter to select by row (see comparing functions)
- arrange (and desc) to sort
- mutate to make new variables
Functions for summarizing
- count and mutate, to get counts and percents
- summarize for continuous variables, with mean, sd, var, median, quantile (with probs parameter), min, max, IQR, and n()
- Use na.rm = TRUE to remove missing values first
Functions for comparing
- <, <=, >, >=, ==, !=
- %in%
- |, &, !
- is.na
Functions for creating new variables
- arithmetic functions (+, -, *, /)
- log, log10, log2
- if_else, case_when, cut
Functions for factors
- as_factor
- fct_recode
- fct_relevel
- fct_infreq, fct_reorder
- droplevels

About Graphics

Controlling output

Combining plots using + and /
Using #| fig-width and #| fig-height to control figure size in Quarto

Basics of ggplot2

The ggplot2 library, which uses a “grammar of graphics” to specify the aspects of a plot. The following pseudo-code plots data from a data set data_set, and maps the variable x_var to the x aesthetic, y_var to the y aesthetic (more as needed), and then adds a geometric object (XXX); these could be points, lines, or bars. You can then optionally facet the plot, change the scales, change the labels, and more.

ggplot(data_set, 
       mapping=aes(x=x_var, y=y_var, fill=fill_var, color=color_var)) +
  geom_XXX() +
  facet_XXX() +
  scale_XXX() +
  labs(...)

Scatterplots

geom_point() to add points
Color points by another variable by mapping it to the color aesthetic
Use stat_smooth(), with optional parameters method="lm" and se=FALSE to add a fitted line
Use scale_x_log10() or scale_y_log10 to put the x or y axes on the log scale

Bar plots

only need an x mapping; the y will be the count of the x variable
geom_bar() to have one bar per value
geom_bars() to have multiple bars per value, with variable to color by specified by mapping a variable to the fill aesthetic

Histograms and Density plots

also only need an x mapping; the y will be computed appropriately
Use geom_histogram to make a histogram; use parameters binwidth and boundary to control the bins
Use geom_density to make a density plot; use the color aesthetic to do separately by another variable

Box plots

geom_boxplot(), usually has a continuous y and a categorical x
- Flip x and y to plot horizontally
- If only a single continuous variable, say var, use x=var, y=0 to plot horizontally, and add hide_y_axis()
add points on top of the boxplot by
- first turn off outliers using geom_boxplot(outlier.shape = NA)
- then add swarmed points with geom_beeswarm; use the spacing parameter to control the swarm; using the parameters pch=21 and fill="white" also help to make the swarm more apparent
Use scale_x_log10() or scale_y_log10 to put the x or y axes on the log scale

Logistic Regression plots

Use geom_beeswarm with a continuous x variable and a binary y variable
Use scale_y_binary to make the y-axis on 0-1 scale
Use geom_smooth_logistic to add a logistic smooth

Facetting

facet_wrap(~byvar) for facetting by one variable
facet_wrap(~byvar, labeller = label_both) to add variable names
facet_grid(var1 ~ var2) for facetting by two variables

Labelling

Change the label for an aesthetic (or add a title) using labs.

labs(x = "x variable", y = "y variable", title = "Plot title")

Statistical Inference

These functions all use a “formula” notation, like this: function(response ~ explanatory, data=dataset).

Inference for Proportions
- one_proportion_inference
- two_proportion_inference
- pairwise_proportion_inference
- paired_proportion_inference
- independence_test
Inference for Means
- one_t_inference
- two_t_inference
- pairwise_t_inference
- paired_t_inference
- These functions can all handle log-transformed responses, with a backtransform parameter to specify whether output is on the log or original scale.

The functions about models (those starting with model_) apply to linear and logistic models. They also have a backtransform parameter to specify whether output is on the log or original scale (for linear models with log-transformed response) or on the logistic or probability scale (for logistic models).

Fitting models:
- linear models: lm(y ~ x, data = dataset)
- logistic models: glm(y ~ x, data = dataset, family=binomial)
- For multiple predictors, use ~ x1 + x2 for an additive model or x1 * x2 to include interactions
Inference about Models
- correlation_inference
- model_anova
- model_glance
- model_coefs
- model_means, pairwise_model_means
- model_slopes, pairwise_model_slopes
  - For means and slopes, use | in the formula to specify groupings and at to specify specific values to obtain the means or slopes at.

Additional Options

Several additional options for controlling the output are available.

as_gt to use gt formatting options
tab_compact to change font size and spacing
set_digits to control rounding (except for p-values)
fmt_pvalue to control rounding of p-values
as_tibble to get the underlying result as a data set
Using + and | to run multiple tests at the same time
combine_tests to combine results together in a single table