This vignette provides an overview of the primary function of the phylosamp package: how to estimate the false discovery rate given a sample size. In the examples provided, we use the default `assumption`

argument (multiple transmissions and multiple links, `mtml`

), though alternative assumptions can also be specified.

The most basic function of the package is `truediscoveryrate()`

, which calculates the probability that an identified link represents a true transmission event. This calculation relies on the following parameters:

`library(phylosamp)`

`truediscoveryrate(eta=0.99, chi=0.95, rho=0.75, M=100, R=1)`

`## Calculating true discovery rate assuming multiple-transmission and multiple-linkage`

`## [1] 0.2334906`

In other words, given a sample size of 100 infections (representing 75% of the total population), a linkage criteria with a specificity of 99% for identifying infections linked by transmission and a specificity of 95%, fewer than 25% of identified pairs will represent true transmission events. Increasing the specificity to 99.5% has a significant impact on our ability to distinguish linked and unlinked pairs:

`truediscoveryrate(eta=0.99, chi=0.995, rho=0.75, M=100, R=1)`

`## Calculating true discovery rate assuming multiple-transmission and multiple-linkage`

`## [1] 0.7528517`

The other core functions are designed to calculate the expected number of true transmission pairs identified in the sample (`true_pairs`

) and the total number of linkages one can expect to identify given the sensitivity and specificity of the linkage criteria and a particular sample size and proportion (`exp_links`

).

`true_pairs(eta=0.99, rho=0.75, M=100, R=1)`

`## Calculating expected number of links assuming multiple-transmission and multiple-linkage`

`## [1] 74.25`

`exp_links(eta=0.99, chi=0.95, rho=0.75, M=100, R=1)`

`## Calculating expected number of links assuming multiple-transmission and multiple-linkage`

`## [1] 318`

It is important to recognize that \(R\) in these functions represents the average \(R\) in the sampled population (alternatively denoted \(R_\text{pop}\)). Because any sampling frame contains a finite number of cases, there will always be more cases than infection events (at minimum, all infectees in a transmission chain plus a single index case), so \(R_\text{pop}\leq1\). For outbreaks with a single introduction, \(R_\text{pop}\) is approximately equal to 1; sampling frames containing cases from separate introduction events will have lower values of \(R_\text{pop}\).