An R package for analysing population-genetic data represented as the site frequency spectrum (SFS).
The site frequency spectrum (SFS; also known as the allele frequency spectrum) is the histogram over allele counts in a population. More formally, for a population consisting of
The SFS can be generalized to higher dimensions when more than one population is considered, in which case it may be called the joint SFS. In this case each entry
The SFS provides a rich summary of polymorphism data. Departures of the shape of the SFS from its expectation under a neutral, constant-size model are informative about demography and/or selection. Most common summary statistics for polymorphism data, both within and between populations, can be computed directly from the SFS.
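To make "computed directly from the SFS" concrete, here is a minimal base R sketch of three classical estimators of $\theta$. It does not use the sfsr API; the function name `theta_from_sfs` and the plain-vector layout of the SFS are assumptions made for illustration only. The estimates are totals over all sites, not per-site values.

```r
# Sketch only: base R, not the sfsr API.
# `sfs` is assumed to be an unfolded 1-D SFS: a numeric vector of length n + 1
# whose k-th element counts sites with k - 1 derived alleles among n chromosomes.
theta_from_sfs <- function(sfs) {
  n  <- length(sfs) - 1           # number of sampled chromosomes
  i  <- seq_len(n - 1)            # derived-allele counts of segregating sites
  xi <- sfs[i + 1]                # drop the monomorphic classes (0 and n)
  c(
    watterson = sum(xi) / sum(1 / i),                  # S / a_1
    pi        = sum(i * (n - i) * xi) / choose(n, 2),  # mean pairwise differences
    singleton = xi[1]                                  # derived singleton count
  )
}

# Under the standard neutral model E[xi_i] = theta / i, so an SFS with exactly
# that shape (expected counts, hence non-integer) gives the same value for all
# three estimators.
n <- 10
theta <- 5
neutral_sfs <- c(1000, theta / seq_len(n - 1), 0)
est <- theta_from_sfs(neutral_sfs)
```

Neutrality statistics such as Tajima's $D$ are built from contrasts between estimators like these (for $D$, the difference between the pairwise and Watterson estimates), scaled by an estimate of the standard deviation of that difference.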
Although the SFS can be easily tabulated from discrete genotypes, it may be advantageous to treat it as a parameter to be estimated from data under some model of uncertainty and technical error. Such methods are implemented in software like ANGSD, but I found few resources for manipulating and analyzing the resulting SFS. Hence the sfsr package.
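For the simple case mentioned above, tabulating the SFS from discrete genotypes, a base R sketch might look like the following. The matrix `geno` and the function `tabulate_sfs` are hypothetical names, not part of sfsr, and the sketch assumes diploid samples, polarized alleles and no missing data.

```r
# Sketch only: base R, not the sfsr API.
# `geno` is a hypothetical sites x individuals matrix of derived-allele
# dosages (0, 1 or 2) for diploid samples with no missing data.
tabulate_sfs <- function(geno) {
  n <- 2 * ncol(geno)                 # number of sampled chromosomes
  derived <- rowSums(geno)            # derived-allele count at each site
  # tabulate() counts positive integers, so shift counts 0..n to bins 1..(n + 1)
  sfs <- tabulate(derived + 1, nbins = n + 1)
  names(sfs) <- 0:n
  sfs
}

geno <- rbind(
  c(0, 0, 0),   # monomorphic, ancestral
  c(1, 0, 0),   # singleton
  c(0, 1, 0),   # singleton
  c(2, 1, 0),   # three derived copies
  c(2, 2, 2)    # fixed, derived
)
sfs <- tabulate_sfs(geno)
```

The result has one entry per possible derived-allele count from 0 to 6, and its entries sum to the number of sites.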
At the moment the package supports only unfolded SFS -- those for which alleles can be polarized without ambiguity as ancestral vs derived. This happens to be the case for problems I work on, and gets around the data-dependency of the "minor allele" designation.
- File I/O
  - read ANGSD-style SFS from disk (read_sfs())
- Manipulation and transformation of SFS
  - marginalization (margnialize_sfs())
  - changes of ancestral-vs-derived polarity (repolarize_sfs())
- Common within-population summary statistics
  - Estimators of $\theta$
    - Tajima's pairwise, aka $\pi$ (theta_pi())
    - Watterson's (theta_w())
    - from singleton count (theta_zeta())
  - Neutrality statistics
    - Tajima's $D$ (tajimaD())
    - Fu and Li's $D$ (fuliD())
    - Fu and Li's $F$ (fuliF())
    - Achaz's $Y$, robust to sequencing errors (achazY())
- Between-population summary statistics
  - fixed vs polymorphic sites (fixpoly())
  - average divergence (d_xy())
  - Weir and Cockerham's $F_{st}$ (f_st())
- Hudson-Kreitman-Aguade test (hka_test())
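As an illustration of what the manipulation features listed above mean for a joint SFS, the sketch below marginalizes and repolarizes a small two-population spectrum in base R. It shows the concepts only and makes no claim about how sfsr implements them; the matrix `joint` and its values are made up for the example.

```r
# Sketch only: base R, not the sfsr API.
# `joint` is a hypothetical unfolded 2-D SFS: entry [i, j] counts sites with
# i - 1 derived alleles in population 1 and j - 1 in population 2.
joint <- matrix(
  c(50, 4, 1,
     6, 3, 2,
     2, 1, 8),
  nrow = 3, byrow = TRUE,
  dimnames = list(pop1 = 0:2, pop2 = 0:2)
)

# Marginalizing over a population sums out its dimension.
sfs_pop1 <- rowSums(joint)
sfs_pop2 <- colSums(joint)

# Swapping the ancestral and derived labels maps a derived count k to n - k,
# which reverses every dimension of the array.
repolarized <- joint[rev(seq_len(nrow(joint))), rev(seq_len(ncol(joint)))]
dimnames(repolarized) <- dimnames(joint)
```

Both operations preserve the total number of sites, which is a convenient sanity check after any transformation.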
Because bootstrap resampling from the SFS is a straightforward way to estimate uncertainty, an SFS as represented by sfsr may carry bootstrap replicates as an attribute. In this case most summary statistics are automatically calculated across the bootstraps for convenient estimation of standard errors.
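The idea of bootstrapping from the SFS can be sketched in base R as a multinomial redraw of sites over frequency classes. This is not the sfsr mechanism for storing or using replicates; the example SFS, the helper `watterson_theta` and the number of replicates are assumptions for illustration. Note that a multinomial draw treats sites as independent, so it ignores linkage.

```r
# Sketch only: base R, not the sfsr API.
set.seed(1)
sfs <- c(900, 40, 22, 13, 10, 8, 7)   # hypothetical unfolded SFS, n = 6 chromosomes
n_boot <- 200

# Each column is one replicate: sites are redrawn with replacement, which is a
# multinomial draw over the frequency classes.
boots <- rmultinom(n_boot, size = sum(sfs), prob = sfs / sum(sfs))

# Watterson's estimator (total, not per site) for a single SFS vector
watterson_theta <- function(x) {
  n <- length(x) - 1
  i <- seq_len(n - 1)
  sum(x[i + 1]) / sum(1 / i)
}

theta_hat  <- watterson_theta(sfs)             # point estimate
theta_boot <- apply(boots, 2, watterson_theta) # one estimate per replicate
se_theta   <- sd(theta_boot)                   # bootstrap standard error
```

The same pattern applies to any statistic that is a function of the SFS: compute it on each replicate and summarize the spread.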