1  Introduction

1.1 The Scientific Method in Practice

Answering questions about the natural world using the scientific method requires that we draw on many years of accumulated knowledge and experience. This workflow unpacks into roughly the following sequence of steps:

  1. Look around you at the world, be curious about it, and ask questions to figure out an explanation for the pattern or phenomenon that piqued your interest.
  2. Create an unambiguous statement of the question you want to answer, think about what is causing the pattern or phenomenon you observed, and how you might go about measuring the response (the thing you observed initially).
  3. Translate this question into a testable hypothesis. This is the statement that you can test using the data you will collect.
  4. Design an experiment or sampling campaign to collect data that will allow you to test this hypothesis. Clearly understand what the data you’ll collect will look like, both for the response and the explanatory variables. For example, do you have a categorical or continuous predictor, is the response continuous, binary, ordinal, etc.? For this, you should have a firm grasp of the various kinds of Data Classes and Structures in R.
  5. Think deeply about any confounding influences that might affect your data, and specify exactly what additional data you will have to collect to isolate the hypothesised influence in your analysis. You need to fully understand all the ways that factors not considered in your hypothesis might affect your study’s outcome. Omissions cannot be rectified after the fact without repeating the entire experiment or sampling work. It requires knowledge and experience to avoid confounding influences ruining your work.
  6. Depending on your experiment’s design (4) and the nature of the data you’ll obtain (4, 5), choose the appropriate statistical methods to analyse them. You should be able to develop a good idea of what statistical methods you’ll use even before the experiment has been done! Decide on the parametric test, or, should the statistical god with the dice not provide data that favour your expectations, also decide upfront on a non-parametric equivalent. It is important not to choose your statistical method only after you’ve seen the data, shopping around for the test that gives the answer you want. That is called p-hacking, and it is almost a cardinal sin in science.
  7. Do the experiment or go out into the world to sample, and collect the data. Have fun; this is why we do science, after all!
  8. Go have a few drinks after a hard day’s work and celebrate your success.
  9. Analyse your newly-collected data. This will include exploratory data analyses (see Exploring With Summaries and Descriptions and Exploring With Figures), and then the application of the statistical methods you chose in step 6.
  10. Communicate your results in tables and figures.

This textbook deals with many of these steps (except for 1, 5, 7, and 8). This knowledge is codified in the form of the statistical method, which provides a systematic framework for collecting,1 analysing, and interpreting data. In this chapter, I will introduce the fundamental concepts of inferential statistics, which allow us to make inferences about populations based on sample data. I will also provide an overview of the types of statistical methods used in inferential statistics, and discuss the importance of understanding the assumptions underlying these methods.

1 Yes, statistics also informs us about how to collect data.

1.2 Inferential Statistics

This book covers the suite of inferential statistics available to biologists. These methods are the cornerstone of hypothesis-driven scientific research. Inferential statistics provide the tools needed to generalise from a sample to a population or to make predictions about future observations. In doing so, we can draw general conclusions or test hypotheses about populations or processes.

Inferential statistics build upon basic exploratory data analysis (EDA), which often includes a substantial use of descriptive (or summary) statistics. Descriptive statistics describe and summarise the characteristics of a dataset, such as its central tendency, variability, and distribution. While descriptive statistics offer a snapshot of the data, they do not allow us to draw conclusions about the population from which the data were sampled.

Descriptive and inferential statistics work hand in hand, with the former laying the groundwork for more advanced analyses. Inferential statistics allow us not only to draw conclusions from our data but also to quantify the uncertainty associated with these inferences. This uncertainty arises because we are analysing only a sample and wish to generalise our insights to the entire population from which the sample was drawn. Inferential statistics offer a systematic framework for making these inferences and assessing the strength of the evidence supporting a hypothesis.

The type of statistical approach we choose depends heavily on the biological processes that generate our data. The confident application of inferential statistics is grounded in an understanding of both biological theory and the data’s characteristics. A key element in choosing the right approach is recognising that the probability distribution of your data is closely linked to the natural processes that produce the observed outcomes. Biological data can be influenced by many factors, such as genetics, environmental conditions, and random variation, all of which shape the underlying distribution. For example:

  1. Plant Height (Normal Distribution): The heights of individual plants in a population typically follow a normal distribution. This distribution arises from the combined effects of genetic factors and environmental conditions that influence plant growth, such as soil quality, light, and water availability.

  2. Litter Size in Mammals (Poisson Distribution): In many mammal species, the number of offspring per litter may follow a Poisson distribution, which is common for count data. This distribution reflects the biological processes involved in reproduction, where most females have an average litter size, and larger litters are progressively rarer.

1.3 Parametric or Nonparametric: Understanding Your Data’s Distribution

Inferential statistics can be broadly categorised into parametric and nonparametric methods. The choice between them hinges on understanding the distribution of our data and the assumptions underlying each method. Parametric statistics traditionally rely on specific assumptions about the underlying probability distribution of the population from which the sample data are drawn. The two key assumptions are normality, where the data follow a normal (Gaussian) distribution, and homoscedasticity, which requires equal variances across groups or levels of predictors.

However, parametric methods don’t always require normally distributed data. The core requirement is that the data follow a known probability distribution, which must be specified in advance. Many biological datasets don’t follow a normal distribution but can still be analysed using parametric methods. This flexibility is evident in Generalised Linear Models (GLMs), which extend the parametric framework to accommodate a wider range of response variables.

GLMs can handle various distributions, such as Poisson for count data or binomial for binary outcomes. They use a link function to relate the mean of the response to the predictors, adhering to parametric principles while offering flexibility for non-normal data. This makes GLMs well-suited for ecological and biological datasets, where non-normal data are common. Many statistical methods have been extended in this way to other probability distributions. Examples include Generalised Additive Models (GAMs), which are semi-parametric methods, Generalised Non-Linear Models (GNLMs), which fit non-linear models to non-normally distributed responses, and Generalised Linear and Non-Linear Mixed-Effects Models (GLMMs and GNLMMs), which handle hierarchical data.

When data don’t conform to any known distribution, nonparametric statistics offer an alternative. They make fewer assumptions about the data’s distribution and are more robust when dealing with non-standard or unknown distributions. This makes them suitable for biological processes not easily captured by parametric models.

Choosing between parametric and nonparametric methods involves several considerations. First, it’s important to assess whether your data meet the assumptions of parametric tests. Specific tests for this purpose will be discussed in Section X. Sample size is another crucial factor. Parametric methods can often tolerate moderate violations of the normality assumption due to the Central Limit Theorem (CLT), especially with large sample sizes. The CLT states that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, even if the underlying population is not perfectly normal.
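To get a feel for the CLT, here is a minimal simulation sketch in base R; the exponential ‘population’, sample sizes, and number of replicates are arbitrary choices for illustration only. The histograms of sample means become increasingly symmetric and narrow as the sample size grows.

# Illustration of the CLT: means of samples drawn from a skewed population
set.seed(13)
pop <- rexp(1e5, rate = 1)                      # strongly right-skewed 'population'
means_n5  <- replicate(1000, mean(sample(pop, 5)))
means_n50 <- replicate(1000, mean(sample(pop, 50)))
par(mfrow = c(1, 2))
hist(means_n5,  main = "Means of samples of n = 5",  xlab = "Sample mean")
hist(means_n50, main = "Means of samples of n = 50", xlab = "Sample mean")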

When assumptions are met, parametric tests are more powerful. However, they are also more sensitive to violations of their assumptions. Therefore, we must consider the nature of our data and the processes that generated them when choosing a statistical approach.

1.4 The Statistical Toolbox

I broadly categorise parametric and nonparametric methods into four main types, each serving different research applications2:

2 This categorisation reflects my teaching approach, based on the order in which I think topics need to be covered, rather than a strict classification by statisticians. It is intended to provide a high-level overview of the types of statistical methods used in inferential statistics.

  1. Hypothesis Tests: These parametric and non-parametric techniques assess whether sample data provide evidence for or against a specific claim (hypothesis) about population parameters such as their means, medians, proportions, variances, or correlations between variables. Common hypothesis tests include:

    • Comparisons of group means or medians for a continuous variable (e.g., t-tests, ANOVA, Mann-Whitney U test)
    • Comparisons of group proportions for a categorical variable (e.g., chi-square test, Fisher’s exact test)
    • Assessments of the relationship between two continuous or ordinal variables (e.g., Pearson’s correlation, Spearman’s rank correlation)
  2. Regression Analysis: Regression with its parametric and non-parametric offerings lets us analyse the relationship between a response variable and one or more predictor variables. Regression models estimate coefficients representing the predictor effects, allow for prediction of the response, and enable hypothesis tests on the predictors. Common regression models include:

    • Linear regression for continuous response variables
    • Logistic regression for binary response variables
    • Generalised linear models (GLMs) for non-normal response variables
    • Various non-linear regressions for complex relationships, such as generalised additive models (GAMs)
  3. Survival Analysis: Methods like the Kaplan-Meier estimator and Cox proportional hazards model analyse time-to-event data, where the interest lies in modelling the waiting times until certain events occur. I do not cover survival analysis in this book or any of my modules.

  4. Multivariate Analysis: This includes an assortment of methods to analyse multiple response and predictor variables simultaneously. Dimension reduction methods, such as canonical correlation analysis (CCA) and non-metric multidimensional scaling (nMDS), help simplify complex datasets by identifying key patterns and relationships. Classification, including cluster analysis, is used to group similar observations together based on their characteristics. Multivariate approaches make fewer assumptions about the data’s distribution, and there are techniques to deal with parametric and non-parametric data types (often without discrimination). Although these methods are not covered in this textbook, they are taught in my Quantitative Ecology module, which will eventually be developed into its own textbook.

I will cover the parametric methods first, in Part A, followed by non-parametric methods in Part B. Part C of the book will look at semi-parametric methods, which combine aspects of both parametric and non-parametric statistics.

A. Hypotheses About the Means of Groups

The simplest form of comparison is to test whether the sample means of two or more groups differ.3 Although this seems quite unimaginative, comparisons of the measures of central tendency are very common statistical tests in biology. Because this concept is so simple to understand, it serves as a good starting point for learning about hypothesis testing and the interpretation of the statistics which tell us about the strength of the evidence for or against our hypotheses.

3 When it comes to central tendency, the mean is the parameter that is being compared by parametric statistics. Non-parametric statistics, on the other hand, consider the median as the statistic of central tendency.

You might have hypotheses that require you to compare the means of the outcomes of different experimental treatments, differences in the number of sea urchins among populations of kelp, or the number of species within replicate samples taken from different vegetation types. Look at some of the following examples to see if any of them resonate with your own research question, and then use this as a guide to find the appropriate statistical test in this book.

One-Sample t-Test (Section X.X.X)

Example: Is the mean height of a sample of Protea sp. grown in a specific experimental landscape (given below) different from the known (established a priori) average height of the same species (163.3 ± 15.5 cm) in the general population?

   Height
1     150
2     152
3     148
8     150
9     149
10    148

The example requires that you have one normally-distributed continuous outcome variable with independent observations and that you want to compare its mean value against a known population mean established a priori.

In this case, you’ll want to use the R function t.test(). Since this function can accommodate data with equal or unequal variances4 via the var.equal argument, you only need to ensure that the data are normally distributed. The test can be one-sided or two-sided. If the normality assumption is not met, consider a non-parametric alternative, such as the Wilcoxon signed-rank test.

4 A t-test for equal variances is typically called the Student’s t-test, while a t-test for unequal variances is called Welch’s t-test. By default, the t.test() function in R performs Welch’s t-test, which is more robust to unequal variances.
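A minimal sketch of this analysis in R, using only the heights shown above (the full dataset would contain more rows) and the a priori population mean of 163.3 cm:

# One-sample t-test: does the sample mean differ from 163.3 cm?
heights <- c(150, 152, 148, 150, 149, 148)   # only the rows displayed above
t.test(heights, mu = 163.3)                  # two-sided by default
# Non-parametric alternative if normality is doubtful:
wilcox.test(heights, mu = 163.3)             # Wilcoxon signed-rank test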

Two-Sample t-Test (Section X.X.X)

Example: Is the average number of leopard cubs born per female leopard in the Overberg region different from that in the Cederberg region? The dataset is:

      Region Cubs_Per_Female
1   Overberg               2
2   Overberg               3
3   Overberg               2
18 Cederberg               3
19 Cederberg               2
20 Cederberg               1

This requires that we obtain two samples of continuous, normally-distributed measurements. In other words, our experiment or sampling campaign will include two groups (sometimes two treatments, other times a treatment and a control) and we collect a sample of measurements of the response in both of them. This is again catered for by the t.test() function, and, as before, we don’t have to fuss too much about the variances as equal and unequal variances can be accommodated. If the normality assumption is not met, consider a non-parametric alternative such as the Mann-Whitney U test.

A variant of the two-sample t-test is the paired t-test, which is used when the two samples are related (not independent); for example, the same individuals are measured before and after applying a treatment.
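A sketch of the two-sample and paired variants in R; the two vectors are illustrative stand-ins for the full Overberg and Cederberg samples shown above:

# Two-sample (Welch's) t-test on cubs per female in the two regions
overberg  <- c(2, 3, 2, 2, 3, 1)             # placeholder values
cederberg <- c(3, 2, 1, 2, 2, 3)             # placeholder values
t.test(overberg, cederberg)                  # var.equal = FALSE by default
# Non-parametric alternative:
wilcox.test(overberg, cederberg)             # Mann-Whitney U test
# Paired variant (same individuals measured before and after a treatment):
# t.test(before, after, paired = TRUE)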

Analysis of Variance (ANOVA) for >2 Samples (Section X.X.X)

Example: Is the chirp rate of bladder grasshoppers different between the four seasons?

   Season Chirp Rate
1  Spring       17.7
2  Spring       13.9
3  Spring       15.7
58 Winter       10.2
59 Winter        4.0
60 Winter       10.6
Chirp Rate Data for Bladder Grasshoppers Across Four Seasons

We have three or more samples of continuous, normally-distributed observations. These data must also have more-or-less equal variances, so the homoscedasticity assumption is important. The aov() function in R is used to perform the ANOVA, which can be one-way, two-way, a repeated measures ANOVA, or an ANCOVA.5 If the normality or homoscedasticity assumptions are not met, consider non-parametric alternatives, such as the Kruskal-Wallis test, or try transforming the data.

5 A repeated measures ANOVA is used when the same subjects are measured at different time points or under different conditions. A two-way ANOVA is used when there are two independent variables (there are also higher-order ANOVAs but they become more of a pain to interpret and require cumbersome experimental designs). An ANCOVA is used when you want to compare the means of groups while controlling for the effect of a continuous covariate. There are many kinds of ANOVA designs and each relates to specific experimental designs well beyond the scope of this book. Tony Underwood provides a pedantic overview of ANOVA designs in his book Experiments in Ecology (Underwood 1997) if you really want to go there.
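A sketch of the one-way case in R, assuming the chirp data are held in a data frame called chirps with columns Season and Chirp_Rate (both names are my own placeholders):

# One-way ANOVA: does mean chirp rate differ among seasons?
fit <- aov(Chirp_Rate ~ Season, data = chirps)
summary(fit)
TukeyHSD(fit)                                # post-hoc pairwise comparisons
# Non-parametric alternative if assumptions fail:
kruskal.test(Chirp_Rate ~ Season, data = chirps)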

Analysis of Covariance (ANCOVA)* (Section X.X.X)

Example: We have a set of data about African penguins and we want to determine if there are differences between male and female penguins in terms of their mean foraging time, and if that difference is influenced by their diving depth. The dataset is as follows:

Sex     Foraging time (hr)  Diving depth (m)
Male                   1.2                10
Male                   1.5                15
Male                   1.8                20
Female                 2.0                25
Male                   2.3                30
Female                 2.5                35
Female                 2.8                40
Female                 3.0                45
Male                   3.3                50
Male                   3.5                55
Foraging time and diving depth of African penguins.

In this example, we are interested in the mean foraging time of male and female penguins while controlling for their diving depth. An ANCOVA focuses on the differences in group means (the categorical variable), while the continuous covariate (diving depth) is explicitly controlled for to remove its effect from the dependent variable. This reduces the error variance and so gives a more accurate comparison of the group means. The assumptions of normality and homoscedasticity apply. The aov() function accommodates both the categorical and continuous predictors.
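A sketch of the ANCOVA in R, assuming the penguin data sit in a data frame called penguins with columns sex, foraging_hr, and depth_m (placeholder names):

# ANCOVA: foraging time by sex, controlling for diving depth
fit <- aov(foraging_hr ~ sex + depth_m, data = penguins)
summary(fit)
# Check the homogeneity-of-slopes assumption by testing the interaction:
summary(aov(foraging_hr ~ sex * depth_m, data = penguins))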

Multivariate Analysis of Variance (MANOVA)

MANOVAs are similar to ANOVAs, except that here you have multiple continuous, normally-distributed dependent variables measured on independent observations. This is useful when you want to compare the means of multiple groups across multiple dependent variables. For example, you might want to compare the average foraging time together with diving depth of African penguins in three colonies (two in South Africa and one in Namibia) around the coast. The manova() function in R is used to perform a MANOVA, and there are similar variants to those we have seen for ANOVA.
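A sketch in R, assuming a data frame penguins with the response columns foraging_hr and depth_m and a grouping column colony (placeholder names):

# MANOVA: do foraging time and diving depth jointly differ among colonies?
fit <- manova(cbind(foraging_hr, depth_m) ~ colony, data = penguins)
summary(fit, test = "Pillai")                # Pillai's trace is fairly robust
summary.aov(fit)                             # follow-up univariate ANOVAs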

B. Hypotheses About the Proportions of Groups

You can compare the proportions of groups using tests for proportions when the outcome variable is binary (e.g., success/failure, presence/absence, up/down, day/night). These tests are used to determine if the proportion of successes differs between groups. Use the following tests to compare group proportions:

One-Sample Test for Proportions

Example: Is the proportion of African penguins foraging in a specific colony different from the known proportion of the same species in the general population? The data might look like this:

  • Sample data: 55 of the 100 penguins observed were foraging in a specific colony
  • The known proportion of penguins foraging in the general population is 60%

In this scenario, we are comparing the proportion of a single sample (the proportion of foraging African penguins in a specific colony) to a known population proportion. The data must consist of a binary outcome variable (e.g., foraging vs. not foraging) and the observations must be independent. The prop.test() function in R is used to perform this test, which can be either one-sided or two-sided. If the requirement of independent observations is not met, consider non-parametric alternatives, such as the sign test.
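A sketch using the counts given above; binom.test() is an exact small-sample alternative mentioned here for completeness:

# One-sample test of a proportion: 55 of 100 penguins foraging vs. p = 0.6
prop.test(x = 55, n = 100, p = 0.6)          # two-sided by default
binom.test(x = 55, n = 100, p = 0.6)         # exact binomial test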

Two-Sample Test for Proportions

Example: Is the proportion of endangered sea turtles successfully reaching the ocean different between two beaches? Here are data:

Beach    Successes  Total observed
Beach A         75             100
Beach B         65             120
Number of Sea Turtles Reaching the Ocean on Two Beaches

Here we compare the proportions from two independent samples (e.g., the proportion of sea turtles successfully reaching the ocean on Beach A versus Beach B). As before, the data yield a binary outcome (e.g., reached the ocean vs. did not reach the ocean) for each group, and the observations within each group are independent. The prop.test() function is used here as well, and it offers one-sided or two-sided options. If the sample sizes are small or expected frequencies are low, consider using Fisher’s exact test instead of the proportion test. If the assumption of independent observations within groups is violated, you may need to consider methods that account for dependency in the data, such as Generalised Estimating Equations (GEE) or mixed-effects models.
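A sketch using the beach counts above:

# Two-sample test of proportions: turtles reaching the ocean on two beaches
prop.test(x = c(75, 65), n = c(100, 120))
# Fisher's exact test on the corresponding 2x2 table (successes, failures):
fisher.test(matrix(c(75, 25, 65, 55), nrow = 2, byrow = TRUE))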

Chi-square Test for Count Data

Example: Is there an association between vegetation type and the presence of leopards in different areas of Kruger National Park? A hypothetical dataset:

           Presence  Absence
Grassland        20       30
Woodland         25       40
Shrubland        35       15
Contingency Table of Vegetation Type and Leopard Presence

Here we examine the relationship between two categorical variables (vegetation type and leopard presence) within Kruger National Park. The data are organised into a contingency table, where each cell represents the count or frequency of observations for a specific combination of categories. The chi-square test of independence is used to determine if there’s a significant association between the variables.

As with other categorical tests, the data yield discrete outcomes (e.g., grassland, woodland, or shrubland for vegetation type; present or absent for leopard presence). The observations should be independent, meaning the presence of a leopard in one area should not influence its presence in another.

The chisq.test() function in R is commonly used for this analysis. This test compares the observed frequencies in each cell of the contingency table to the frequencies that would be expected if there were no association between vegetation type and leopard presence.

If the sample size is large and the expected frequencies in each cell are adequate (typically > 5), the chi-square test is appropriate. However, if the sample size is small or if there are cells with low expected frequencies, consider using Fisher’s exact test instead.

If the assumption of independence is violated (e.g., if the data include multiple observations from the same leopard individuals or territories), you may need to consider more advanced methods that account for dependency in the data, such as log-linear models or Generalised Estimating Equations (GEE).
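A sketch using the contingency table above:

# Chi-square test of independence: vegetation type vs. leopard presence
veg <- matrix(c(20, 30,
                25, 40,
                35, 15),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("Grassland", "Woodland", "Shrubland"),
                              c("Presence", "Absence")))
chisq.test(veg)
chisq.test(veg)$expected                     # check expected counts are > 5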

Fisher’s Exact Test

Example: Is there a significant association between the presence of certain plant species and the occurrence of rare fynbos endemic insects in the Cape Floristic Region? Here are the data:

Present Absent
Plant A 2 8
Plant B 3 7
Contingency Table of Plant Species and Insect Occurrence

Fisher’s Exact Test is used when we have two categorical variables and want to determine if there’s a significant association between them, particularly when sample sizes are small or when we have sparse data in some categories. This test is especially useful in ecological studies where rare species or events are being investigated.

In this example we examine the relationship between the presence of specific plant species and the occurrence of rare fynbos endemic insects. The data are organised into a 2x2 contingency table, where each cell represents the count of observations for a combination of presence/absence of the plant species and the insect species.

The test calculates the exact probability of observing the given set of cell frequencies under the null hypothesis of no association. It does not rely on approximations and is more accurate than the chi-square test for small samples. Use the fisher.test() function to perform this analysis. Like other categorical tests, the observations should be independent, meaning the presence of an insect in one area should not influence its presence in another.

Fisher’s Exact Test is particularly appropriate when:

  • The total sample size is less than 1000
  • The expected frequency in any cell of the contingency table is less than 5
  • You’re dealing with rare events or species

If the sample size becomes very large, Fisher’s Exact Test can become computationally intensive, and the chi-square test may be more practical.

If the assumption of independence is violated (e.g., if the data include multiple observations from the same locations over time), you may need to consider more advanced methods that account for dependency in the data, such as mixed-effects models or Generalised Estimating Equations (GEE).
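A sketch using the 2x2 table above:

# Fisher's exact test: plant species vs. insect occurrence
plants <- matrix(c(2, 8,
                   3, 7),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Plant A", "Plant B"),
                                 c("Present", "Absent")))
fisher.test(plants)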

C. Hypotheses About the Strength of Association

Example: Is there a relationship between the foraging time and diving depth of African penguins?

Foraging time (hr) Diving depth (m)
1.2 10
1.5 15
1.8 20
2.0 25
2.3 30
2.5 35
2.8 40
3.0 45
3.3 50
3.5 55
Foraging time and diving depth of African penguin.

You’ll want to use a Pearson’s correlation to determine if there is a linear relationship between two continuous variables, both of them normally distributed and homoscedastic. A correlation analysis does not presume causation and does not provide a predictive model, both of which are the domain of regression. The strength of the relationship is quantified by the correlation coefficient, Pearson’s r, which ranges from -1 to 1. Use the cor.test(..., method = "pearson") function in R to perform this analysis.

Non-parametric alternatives such as the Spearman’s rank correlation or Kendall’s tau correlation (see ‘II. Non-Parametric Methods’) are available and implemented with the same R function.
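A sketch using the penguin data above:

# Correlation between foraging time and diving depth
foraging <- c(1.2, 1.5, 1.8, 2.0, 2.3, 2.5, 2.8, 3.0, 3.3, 3.5)
depth    <- c(10, 15, 20, 25, 30, 35, 40, 45, 50, 55)
cor.test(foraging, depth, method = "pearson")
# Non-parametric alternatives:
cor.test(foraging, depth, method = "spearman")
cor.test(foraging, depth, method = "kendall")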

D. Modelling and Predicting Causal Relationships

The relationship between one or a few predictors and an outcome can be represented by a function, which is a model that reconstructs part of the ‘reality’ of the observed phenomenon. Regression analysis helps you understand how changes in the continuous predictor variable(s) drive changes in a continuous outcome variable. The model quantifies the strength of the associations and makes predictions for new data points. You may use regression models for hypothesis testing and for identifying which predictor variables have the most substantial impact on the outcome.

Simple Linear Regression

Example: The same dataset of foraging time and diving depth of African penguins can be used to model the relationship between these two variables. Does diving depth depend on foraging time?

What is different now is that we are interested in predicting the diving depth (response) of penguins based on their foraging time (predictor). Assuming there is a linear response, we can use a simple linear regression model to quantify the relationship between these two continuous variables. The model provides an equation that describes how the diving depth changes as the foraging time increases. The assumptions of normality and homoscedasticity apply to the residuals, and are assessed after the model has been fitted.

This calls for a simple linear regression model and you can fit it using the lm() function in R. The model can also be specified as a generalised linear model (GLM) with glm(..., family = gaussian).

If assumptions fail, apply data transformations (e.g., log, square root), robust regression (rlm() in MASS package), or consider non-linear models.
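A sketch using the same penguin data as in the correlation example:

# Simple linear regression: diving depth as a function of foraging time
penguins <- data.frame(
  foraging = c(1.2, 1.5, 1.8, 2.0, 2.3, 2.5, 2.8, 3.0, 3.3, 3.5),
  depth    = c(10, 15, 20, 25, 30, 35, 40, 45, 50, 55)
)
fit <- lm(depth ~ foraging, data = penguins)
summary(fit)                                 # coefficients, R-squared, etc.
par(mfrow = c(2, 2)); plot(fit)              # residual diagnostics
# Equivalent GLM specification:
glm(depth ~ foraging, data = penguins, family = gaussian)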

Polynomial Regression

I’ll not provide an example here. It suffices to say that a polynomial regression is effectively a simple linear regression that allows for a curvilinear relationship between the predictor and the outcome. To accomplish this, the model includes polynomial terms (e.g., quadratic, cubic, which are simply powers of the predictor) to capture the non-linear patterns in the data. The model can be fit using the lm() function in R.

Assess the relationship between x and y by making a scatterplot of the data and eyeballing a best-fit curve through the scatter of points. Is the line curvy or bendy? Do you know in advance whether a more complicated (mechanistic) model describes the response? If the answer is ‘yes’ to the first and ‘no’ to the second question, then a polynomial regression might be just the thing for you.

Multiple Linear Regression (MLR)

Example: I’ve added a second predictor to the dataset of foraging time and diving depth of African penguins. Does diving depth depend on the penguins’ body mass index (BMI) and foraging time?

BMI Foraging time (hr) Diving depth (m)
1.2 1.2 10
1.5 1.5 15
1.8 1.8 20
2.0 2.0 25
2.3 2.3 30
2.5 2.5 35
2.8 2.8 40
3.0 3.0 45
3.3 3.3 50
3.5 3.5 55
BMI, foraging time, and diving depth of African penguins.

The only difference between this example and the simple linear regression is that we now have two predictors (foraging time and BMI) instead of one. The predictors can be continuous (as in the example) and/or categorical. If you are more concerned with the means of the categorical variables, consider an ANCOVA as an alternative option. The multiple linear regression model can be extended to include interaction terms between predictors. You can quantify the relationship between both predictors and the outcome simultaneously, and ask which of the two best predicts the response. The same assumptions apply as in the simple linear regression and we hope for a linear relationship between x_1 and x_2 vs. y. Other considerations are provided in the chapter on MLR.

The R functions lm() and glm(..., family = gaussian) accommodate situations such as these where we have multiple predictors.
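A sketch extending the simple linear regression above; it assumes the penguins data frame now also contains a bmi column (a placeholder name):

# Multiple linear regression with two predictors
fit <- lm(depth ~ foraging + bmi, data = penguins)
summary(fit)
# With an interaction between the two predictors:
summary(lm(depth ~ foraging * bmi, data = penguins))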

Generalised Linear Models (GLM)

GLMs are a class of regression models that extend the simple linear regression framework to accommodate various types of response distributions. As such, they can accommodate data that violate the assumptions of normality and homoscedasticity, as well as situations where the response variable is not continuous.

Use GLMs to model count data (e.g., number of occurrences), binary outcomes (e.g., success/failure), and other non-continuous response variables that cannot be adequately represented by a normal distribution. Unlike linear models, which assume a normal error distribution, GLMs specify the distribution of the response variable using a probability distribution from the exponential family, such as the Gaussian (normal), binomial, Poisson, or negative binomial distributions.

GLMs incorporate a link function that relates the linear predictor (a linear combination of the predictor variables) to the expected value of the response variable. This link function can take various forms, including the identity (linear), logit (for binary data), probit, or other transformations, depending on the nature of the response variable and the desired relationship between the predictors and the outcome.

The glm() function is a staple for fitting GLMs. It is designed to handle the exponential family distributions and will allow you to specify the appropriate distribution and link function for your data and research question. A few common types of GLMs are presented next.

Logistic Regression (Chapter 6)

You’ll encounter binomial data in experiments or processes with binary outcomes, such as presence/absence, success/failure, or alive/dead. To model this type of data, you will want to use logistic regression. Logistic regression estimates the log-odds of the outcome as a linear combination of the predictor variables. The logistic function is then used to convert these log-odds into probabilities, which range from 0 to 1, so it is suitable for predicting the likelihood of the binary outcomes.

  • Use When: You have a binary outcome variable and want to model the relationship between predictors and the probability of the outcome.
  • Data Requirements: Binary outcome, continuous or categorical predictors.
  • Assumptions: Linear relationship between the log-odds of the outcome and predictors.
  • Diagnostics: Check for influential observations, multicollinearity, and overall model fit.
  • If Assumptions Fail: Consider interactions, alternative link functions (probit, complementary log-log) in glm(), non-linear logistic regression, or zero-inflated models when there are excess zeros.
  • R Function: glm(..., family = binomial)
  • Model Selection: Stepwise regression, regularisation techniques, information criteria (AIC, BIC).
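A minimal sketch, assuming a data frame dat with a binary column presence and a continuous predictor depth (placeholder names):

# Logistic regression: probability of presence as a function of depth
fit <- glm(presence ~ depth, data = dat, family = binomial)
summary(fit)
exp(coef(fit))                               # odds ratios
head(predict(fit, type = "response"))        # fitted probabilities
# Alternative link function:
glm(presence ~ depth, data = dat, family = binomial(link = "probit"))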

Poisson Regression (Chapter 6)

Typical examples of count data include the number of offspring, parasites, or seeds. Poisson regression is used to model the relationship between predictors and the count outcome. The model assumes that the count data follow a Poisson distribution, where the mean and variance are equal. Poisson regression is suitable for data with a single count outcome.

  • Use When: You have count data and want to model the relationship between predictors and the count outcome.
  • Data Requirements: Count outcome, continuous or categorical predictors.
  • Assumptions: Equidispersion (variance equals the mean).
  • Diagnostics: Check for overdispersion, excess zeros, and overall model fit.
  • If Assumptions Fail: Negative binomial regression (glm.nb() in the MASS package, overdispersion), zero-inflated models (zeroinfl() in the pscl package, excess zeros).
  • R Function: glm(..., family = poisson)
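A minimal sketch, assuming a data frame dat with a count column counts and a predictor treatment (placeholder names):

# Poisson regression for count data
fit <- glm(counts ~ treatment, data = dat, family = poisson)
summary(fit)
# Rough overdispersion check: a ratio well above 1 suggests a negative
# binomial model (MASS::glm.nb) instead
deviance(fit) / df.residual(fit)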

Negative Binomial Regression

Negative binomial regression is an extension of Poisson regression that accommodates overdispersion, where the variance exceeds the mean. It is used when the count data exhibit more variability than expected under a Poisson distribution. The model assumes that the count data follow a negative binomial distribution, which has an additional parameter to account for overdispersion. Biological and ecological processes such as species abundance, parasite counts, and gene expression often exhibit overdispersion.

  • Use When: You have count data with overdispersion and want to model the relationship between predictors and the count outcome.
  • Data Requirements: Count outcome, continuous or categorical predictors.
  • Assumptions: Overdispersion (variance exceeds the mean).
  • Diagnostics: Check for overdispersion, excess zeros, and overall model fit.
  • R Function: glm.nb() in MASS package

Gamma Regression

Gamma regression is for modelling continuous, positive outcomes that exhibit a right-skewed distribution and possibly also a non-constant variance (heteroscedasticity). The gamma distribution is well suited for continuous measurements where the variability increases as the mean increases. You might encounter this kind of distribution in growth rates, enzyme activity levels, species abundance data, and other phenomena or processes characterised by positive, skewed data.

  • Use When: You have a continuous, positive outcome and want to model the relationship between predictors and the outcome.
  • Data Requirements: Continuous, positive outcome, continuous or categorical predictors.
  • Assumptions: Outcome values are positive, potentially non-constant variance.
  • Diagnostics: Check for overall model fit, influential observations, and residual analysis.
  • R Function: glm(..., family = Gamma)

Beta Regression

Beta regression is a statistical technique appropriate when the response variable is a continuous proportion or rate bounded between 0 and 1. These types of data might, for example, arise in ecology where one might study the proportions of time animals spend exhibiting different behaviours, the relative abundances of species in a community, or the proportions of habitat patches comprising a landscape. Proportional data inherently exhibit heteroscedasticity (non-constant variance).

  • Use When: You have a proportional outcome (0 < y < 1) and want to model the relationship between predictors and the outcome.
  • Data Requirements: Proportional outcome (0 < y < 1), continuous or categorical predictors.
  • Assumptions: Outcome values within (0,1), potentially non-constant variance.
  • Diagnostics: Check for overall model fit, influential observations, and residual analysis.
  • If Assumptions Fail: Transformations, consider alternative link functions, or zero/one-inflated beta regression.
  • R Function: betareg() in the betareg package

Modelling Non-Linear Relationships

We use non-linear models when the relationship between predictor variables and the outcome variable is not linear. This non-linearity arises from the predictor variables themselves being non-linearly related to the outcome or from the model’s parameters (coefficients) appearing non-linearly in the functional form. The visualised response curve is typically curved, rather than a straight line. These models are often derived from theoretical understanding or prior knowledge about the underlying mechanisms governing the relationship between the predictors and the outcome variables.

Non-Linear Least Squares (NLS) Regression (Chapter 7)

  • Use When: The relationship between the predictors and the outcome is non-linear.
  • Data Requirements: Continuous outcome, continuous predictors.
  • Assumptions: Appropriate functional form, normality, and homoscedasticity of residuals.
  • Diagnostics: Check residual plots, normality of residuals, and leverage/influence points.
  • R Function: nls() (for non-linear regression models with user-specified functions)

Generalised Non-Linear Models (GNLMs)

GNLMs are an extension of generalised linear models (GLMs) that allow for non-linear relationships between the predictors and the outcome variable. GNLMs are used when the relationship between the predictors and the outcome is non-linear, and the outcome variable follows a non-normal distribution. GNLMs are particularly useful for count data, binary outcomes, and other non-continuous response variables that exhibit non-linear relationships with the predictors.

Linear and Non-Linear Hierarchical Models (Mixed-Effects Models)

Hierarchical models are used when data are structured hierarchically, such as when multiple observations are nested within higher-level units (e.g., plants within fields, sheep within rangelands). These models account for the correlation between observations within the same group and allow for the estimation of both fixed effects (population-level parameters) and random effects (group-level parameters). Hierarchical models are also known as multilevel models or mixed-effects models.

Linear Mixed-Effects Models (LMMs) (Section X.X.X)

  • Use When: You have nested or hierarchical data structures and the relationship between the predictors and the outcome is linear.
  • Data Requirements: Continuous outcome, continuous predictors, potentially with nested or hierarchical data structures.
  • Assumptions: Normality, homoscedasticity of residuals, correct specification of random effects structure.
  • If Assumptions Fail: Consider transformations, robust regression, or non-linear mixed-effects models.
  • Diagnostics: Check residual plots, normality of residuals, and leverage/influence points, assess random effects structure.
  • R Function: lmer() in the lme4 package (for linear mixed-effects models)

Non-Linear Mixed-Effects Models (NLMMs) (Chapter 7)

  • Use When: You have nested or hierarchical data structures and the relationship between the predictors and the outcome is non-linear.
  • Data Requirements: Continuous outcome, continuous predictors, potentially with nested or hierarchical data structures.
  • Assumptions: Appropriate functional form, normality, and homoscedasticity of residuals, correct specification of random effects structure.
  • If Assumptions Fail: Generalised non-linear mixed models (GNLMMs) and generalised additive mixed models (GAMMs) can be used when the assumptions of non-linear mixed models (NLMMs) are violated. Else, consult a statistician.
  • Diagnostics: Check residual plots, normality of residuals, and leverage/influence points, assess random effects structure.
  • R Function: nlme() in the nlme package (for non-linear mixed-effects models with user-specified functions)

Generalised Linear and Non-Linear Mixed-Effects Models (GLMMs and GNLMMs)

GLMMs and GNLMMs combine the flexibility of regression model generalisation (i.e. by accommodating non-Gaussian distribution families) with the ability to account for nested or hierarchical data structures. GLMMs are used when the outcome variable is not normally distributed (a different, known distribution) and the data are structured hierarchically. GLMMs include both fixed effects (population-level parameters) and random effects (group-level parameters) and can accommodate a wide range of outcome distributions, including binary, count, and continuous outcomes.

  • Use When: You have non-normally distributed outcome data and nested or hierarchical data structures.
  • Data Requirements: Non-normally distributed outcome (e.g., binary or count), continuous or categorical predictors, potentially with nested or hierarchical data structures.
  • Assumptions: Linear relationship between the outcome and the predictors on the link scale, correct specification of random effects structure.
  • Diagnostics: Check residual plots, normality of residuals, and leverage/influence points, assess random effects structure.
  • R Function: glmer() in the lme4 package

Other Regression Models

Zero-Inflated Models

  • Use When: You have count data with an excess of zeros and want to model the zero-inflation separately from the count process.
  • Data Requirements: Count outcome, continuous or categorical predictors.
  • Assumptions: Correct specification of zero-inflation and count processes, no omitted variables.
  • Diagnostics: Check zero-inflation and count process, overall model fit.
  • R Function: zeroinfl() in the pscl package

Survival Analysis

  • Data Requirements: Time-to-event outcome, continuous or categorical predictors.
  • Assumptions: Proportional hazards, non-informative censoring.
  • Diagnostics: Check proportional hazards assumption, influential observations, and overall model fit.
  • R Function: survival::coxph()

Time Series Analysis

  • Data Requirements: Time-ordered data, potentially with autocorrelation.
  • Assumptions: Stationarity, no autocorrelation in residuals.
  • Diagnostics: Check autocorrelation, stationarity, and overall model fit.
  • R Function: arima(), auto.arima() in the forecast package

Structural Equation Modelling (SEM)

  • Data Requirements: Continuous outcome, continuous or categorical predictors.
  • Assumptions: Correct specification of the structural model, no omitted variables, no measurement error.
  • Diagnostics: Check model fit, parameter estimates, and overall model validity.
  • R Function: sem() in the lavaan package

Bayesian Regression

  • Data Requirements: Continuous outcome, continuous or categorical predictors.
  • Assumptions: Correct specification of priors, likelihood, and model structure.
  • Diagnostics: Check for convergence, posterior predictive checks, and overall model fit.
  • R Function: brms::brm()

1.5 II. Non-Parametric Methods (Distribution-Free)

Non-parametric statistics are statistical methods that do not rely on assumptions about the specific form or parameters of the population distribution. They are also referred to as distribution-free methods. These methods often use ranks or other order statistics of the data rather than the actual data values themselves.

A. Hypotheses About Groups

One-Sample Tests for Medians

Use a one-sample test to compare the median of a single sample to a known population median. It is an alternative to one-sample t-tests when the data do not meet the assumptions of parametric tests.

  • Wilcoxon signed-rank test
  • Sign test

Two-Sample Tests for Medians (Section X.X.X)

Use these tests to compare the medians of two or more independent or related samples when the assumptions of the equivalent parametric tests are violated.

  • Mann-Whitney U test (two independent groups)
  • Wilcoxon rank-sum test (two independent groups)
  • Kruskal-Wallis test (multiple groups)
  • Friedman test (related samples)

B. Hypotheses About Proportions

  • Chi-Square Test for Independence: Comparing proportions of two groups

C. Correlation Analysis for Tests of Association

Use non-parametric correlation to assess the strength and direction of a relationship between two continuous (or ordinal) variables when the assumptions of parametric correlation tests cannot be met.

Spearman’s Rank Correlation (Chapter 2)

A non-parametric measure of the strength and direction of association between two variables.

Kendall’s Tau Correlation (Chapter 2)

A non-parametric measure of the strength and direction of association between two variables.

D. Regression Analysis

Quantile Regression (Section X.X.X)

Models different quantiles of the response distribution.

Robust Regression (Section X.X.X)

Less sensitive to outliers than ordinary least squares regression.

Kernel Density Estimation

KDE is a non-parametric method for visualising the distribution of a continuous variable. Unlike histograms, which bin data into discrete intervals, KDE creates a smooth curve that represents the estimated probability density function (PDF) of the underlying data. It does this by placing a kernel function (often a symmetric curve like a Gaussian or Epanechnikov) at each data point and summing up the contributions of these kernels across the entire range of the variable. The bandwidth of the kernel controls the smoothness of the resulting density estimate. Wider bandwidths lead to smoother curves but may obscure finer details, while narrower bandwidths reveal more local fluctuations but can be noisy. KDE is useful when the underlying distribution of the data is unknown or non-standard and it offers a convenient way to visualise and understand the shape and spread of the data without being constrained by parametric assumptions.
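A minimal base-R sketch; the toy sample and bandwidth multipliers are arbitrary choices for illustration:

# Kernel density estimates with different bandwidths
set.seed(1)
x <- c(rnorm(100, mean = 5), rnorm(50, mean = 9))   # a bimodal toy sample
plot(density(x), main = "Kernel density estimates")
lines(density(x, adjust = 0.5), lty = 2)            # narrower: more detail
lines(density(x, adjust = 2), lty = 3)              # wider: smoother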

Local Regression (LOESS)

LOESS (Locally Estimated Scatterplot Smoothing) is a non-parametric regression technique that produces a smooth curve through a set of data points by fitting simple models to localised subsets of the data. It achieves this by weighting the data points in each subset, with higher weights assigned to points closer to the point being estimated. The model used for local fitting is typically a low-degree polynomial, although other choices are possible.

LOESS is primarily used for data exploration and visualisation. It is best known for smoothing scatterplots and revealing underlying trends or patterns in the data. It is advantageous because it doesn’t assume any particular functional form for the relationship between the predictors and the response variable, so it adapts to various data shapes. However, LOESS does not provide a single, easily interpretable equation for the entire dataset, making it less suitable for making predictions or drawing global inferences. It can also be computationally demanding with large datasets, as it fits separate models in the vicinity of locally selected points.
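A minimal sketch on toy data; span controls the size of the local neighbourhood (smaller values give a wigglier fit):

# LOESS smoothing of a noisy non-linear relationship
set.seed(1)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.3)
fit <- loess(y ~ x, span = 0.3)
plot(x, y)
lines(x, predict(fit), lwd = 2)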

Penalised Regression

Penalised regression (also known as regularisation) is used to enhance the performance of regression models. This might be desirable when dealing with high-dimensional data or when the predictor variables are highly collinear. It introduces a penalty to the regression objective function which discourages the model from having overly complex or large coefficients. This effectively prevents overfitting. Common types of penalised regression include Ridge regression (L2 regularisation), which adds the sum of the squared coefficients as a penalty term, and Lasso regression (L1 regularisation), which adds the sum of the absolute values of the coefficients. The penalty terms encourage simpler models by shrinking some coefficients towards zero, with Lasso potentially setting some coefficients exactly to zero, thus performing variable selection. The balance between fitting the data well and maintaining model simplicity helps in improving the model’s generalisation to new data. Penalised regression methods can achieve a trade-off between bias and variance and result in more robust and interpretable models.
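A minimal sketch with the glmnet package (assumed to be installed); the simulated predictor matrix and response are placeholders:

# Ridge (alpha = 0) and lasso (alpha = 1) with cross-validated penalties
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 20), nrow = 100)     # 20 partly redundant predictors
y <- X[, 1] - 0.5 * X[, 2] + rnorm(100)
ridge <- cv.glmnet(X, y, alpha = 0)          # L2 penalty: shrinks coefficients
lasso <- cv.glmnet(X, y, alpha = 1)          # L1 penalty: can zero them out
coef(lasso, s = "lambda.min")                # coefficients at the chosen penalty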

1.6 III. Semi-Parametric Methods

Semi-parametric methods combine parametric and non-parametric techniques to provide a balance between flexibility and efficiency. These methods are useful when the assumptions of fully parametric tests are violated but a fully non-parametric approach would discard too much structure. Because they retain some parametric structure, semi-parametric methods are often more powerful than purely non-parametric tests, while still making fewer distributional assumptions than fully parametric methods. They are particularly useful when the sample size is small or when the data are skewed or have outliers.

Generalised Additive Models (GAMs) (Chapter 11)

  • Use When: You have non-linear relationships between predictors and outcome.
  • R Function: gam() in the mgcv package; also gamm4() in the gamm4 package
  • Data Requirements: Continuous, binary, or categorical outcome, continuous or categorical predictors, potentially with nested or hierarchical data structures.
  • Advantages: Flexible modelling of non-linear relationships using smoothing functions, can handle mixed-effects structures.
  • Limitations: Interpretation can be challenging, potential overfitting.
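A minimal sketch with the mgcv package, fitting a smooth term to toy data:

# GAM: a data-driven smooth relationship between x and y
library(mgcv)
set.seed(1)
dat <- data.frame(x = runif(200, 0, 10))
dat$y <- sin(dat$x) + rnorm(200, sd = 0.3)
fit <- gam(y ~ s(x), data = dat)
summary(fit)
plot(fit)                                    # visualise the fitted smooth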

Generalised Estimating Equations (GEEs)

  • Use When: You have correlated data and non-normally distributed outcomes.
  • R Function: geeglm() in the geepack package; also functions in the gee package
  • Data Requirements: Correlated data, non-normal outcomes, continuous or categorical predictors.
  • Advantages: Robust to misspecification of the correlation structure, can handle non-normal outcomes, flexible in handling missing data.
  • Limitations: Assumes correct specification of the correlation structure, may be less efficient than mixed-effects models.

Semi-Parametric Survival Models

  • Use When: You have time-to-event data and want to model the hazard function.
  • R Function: coxph() in the survival package
  • Data Requirements: Time-to-event data, censoring, continuous or categorical predictors.
  • Assumptions: Proportional hazards assumption, independence of censoring.
  • Diagnostics: Check proportional hazards assumption, influential observations, and goodness of fit.

Spline Regression

  • Use When: You have non-linear relationships between predictors and outcome.
  • R Function: lm() with splines, gam() in the mgcv package
  • Data Requirements: Continuous outcome, continuous predictors.
  • Assumptions: Linearity within each spline, potentially non-constant variance.
  • Diagnostics: Check for overall model fit, influential observations, and residual analysis.
  • If Assumptions Fail: Transformations, consider alternative link functions, or penalised regression.

1.7 IV. Machine Learning Methods

Machine learning methods are a set of algorithms that can learn patterns from data without being explicitly programmed. These methods are particularly useful for prediction, classification, and clustering tasks. Machine learning models can handle complex relationships in the data and are often more flexible than traditional statistical models. However, they can be more computationally intensive and may require more data to train effectively.

Random Forests

A machine learning method that uses an ensemble of decision trees to predict an outcome.

Support Vector Machines

A machine learning method that finds the optimal hyperplane to separate two classes of data.

Ensemble Methods

A machine learning technique that combines the predictions of multiple models to improve accuracy.

Neural Networks

A machine learning method that uses interconnected nodes to model complex relationships in data.

Deep Learning

A subset of machine learning that uses neural networks with multiple layers to model complex relationships in data.

1.8 V. Miscellaneous Methods

Bootstrapping

A resampling method for estimating the sampling distribution of a statistic.

Permutation Tests

A non-parametric method for testing hypotheses by randomly permuting the data.

Monte Carlo Simulation

A method for estimating the distribution of a statistic by generating random samples from a known distribution.

Bayesian Methods

A statistical approach that uses Bayes’ theorem to update prior beliefs based on observed data.

Dimensionality Reduction

Also called multivariate analyses. A set of techniques for reducing the number of variables in a dataset while preserving important information.

Clustering

A set of unsupervised learning techniques for grouping similar data points together.

Feature Selection

A process for identifying the most important variables in a dataset for predicting an outcome.

Regularisation

See penalised regression. A technique for preventing overfitting by adding a penalty term to the model coefficients.

Cross-Validation

A method for estimating the performance of a model by splitting the data into training and test sets.

Hyperparameter Tuning

The process of selecting the optimal values for the parameters of a machine learning model.

Model Evaluation

The process of assessing the performance of a model using metrics such as accuracy, precision, recall, and F1 score.

Model Interpretation

The process of understanding how a model makes predictions by examining the relationship between the input variables and the output.

Model Deployment

The process of putting a trained model into production so that it can be used to make predictions on new data.

Model Monitoring

The process of tracking the performance of a deployed model over time to ensure that it continues to make accurate predictions.

Model Explainability

The process of explaining how a model makes predictions in a way that is understandable to humans.

Model Fairness

The process of ensuring that a model does not discriminate against certain groups of people based on sensitive attributes.

Model Robustness

The process of ensuring that a model performs well on new data that is different from the training data.