On Rare Reads

By Ryan Kelly

A recent Twitter thread got me thinking more about rare reads in metabarcoding studies. I have, at points in the past, claimed that 1 read is just as valid as 10,000 reads… but under what circumstances (if any) is this true?

The answer will depend upon the purpose of the analysis, but it’s first worth thinking about where reads come from.

Where Reads Come From

Read abundance is itself an artifact of the processes of extraction, PCR, and sequencing, probably to a much greater extent than it is a result of biology. For a given taxon/ASV/OTU, there are many ways (process-wise) to arrive at 1 or 5 or 10 reads: perhaps the reads arise from a truly rare taxon (low biological abundance), or from a high-abundance taxon that is poorly amplified by the primer set (i.e., a low amplification rate relative to the others in the sample). Perhaps it’s chance, where variability in the extraction/PCR/sequencing process resulted in an improbably low number of reads for that taxon/ASV/OTU in that particular sequencing run, or it could be due to low sequencing depth for the sample (although I’d usually assume people think about this first). And so forth. The point is: raw read abundance means very little on its own. And therefore, low read abundance means little on its own.
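
To make this concrete, here’s a toy simulation (all numbers invented, not from any real dataset) in which two very different process routes — a genuinely rare taxon, and an abundant-but-poorly-amplified one — both end up with a handful of reads:

```python
# Toy simulation: two different process routes that both plausibly end in a
# handful of reads for a focal taxon. All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(42)
TOTAL_READS = 100_000  # sequencing depth for the sample
N_CYCLES = 35          # PCR cycles

def expected_reads(prop_dna, amp_efficiency, rest_of_pool):
    """Expected reads for a focal taxon, given its share of template DNA,
    its per-cycle amplification efficiency, and the post-PCR amplicon mass
    of everything else in the sample (arbitrary units)."""
    focal_amplicons = prop_dna * (1 + amp_efficiency) ** N_CYCLES
    return TOTAL_READS * focal_amplicons / (focal_amplicons + rest_of_pool)

# The rest of the community: abundant template amplifying at 90% efficiency.
background = (1 + 0.90) ** N_CYCLES

routes = {
    "truly rare, well-amplified": expected_reads(1e-5, 0.90, background),
    "abundant, poorly amplified": expected_reads(1e-1, 0.55, background),
}
for label, mu in routes.items():
    print(f"{label}: {rng.poisson(mu)} reads (expected {mu:.1f})")
```

Both routes produce single-digit read counts; nothing in the output distinguishes rarity from poor amplification.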

Stochasticity

Reads are the result of a long chain of analytical steps, and each of these steps is subject to a great deal of sampling variation. So raw read abundance is a random variable, the stochastic outcome of these complex observation processes. Sequence the exact same DNA extract 10 times, and you’ll get 10 different abundances for a given sequence variant. These won’t be 10 *totally* different abundances (I hope), but they will vary enough that we should not expect any one realization of a sequenced sample to be “truth”. Rather, it is one random draw from a very large number of likely outcomes.
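
A quick sketch of what I mean, with a made-up amplicon pool: “sequencing” the same pool ten times by multinomial subsampling gives ten different read counts for the same variant.

```python
# Hypothetical sketch: resample the same amplicon pool ten times and watch
# one variant's read count bounce around. Pool composition is invented.
import numpy as np

rng = np.random.default_rng(1)
# A focal variant at 0.05% of the pool, plus 999 background variants.
pool_proportions = np.array([0.0005] + [0.9995 / 999] * 999)
READ_DEPTH = 50_000

focal_counts = [rng.multinomial(READ_DEPTH, pool_proportions)[0] for _ in range(10)]
print(focal_counts)  # ten different answers from the exact same extract
```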

This applies with particular force to rare reads, which, by definition, occur in samples only stochastically, as a result of sampling variance. For example: if we estimate that a PCR reaction results in something like 10^13 amplicons, but we only get 10^7 reads out of the sequencer, then only one in a million amplicons got turned into a sequenced read. This is a vanishingly small sample, and most rare things won’t be sampled; most rare ASVs won’t occur in the sequencing data. If you sequenced the same DNA extraction again as a technical replicate, all of the same *common* amplicons would show up in the replicate, but many of the *rare* amplicons wouldn’t, and you would instead see different rare amplicons… due solely to chance.
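
Here’s a toy version of that thought experiment (pool sizes invented): subsample 10^7 reads from roughly 10^13 amplicons twice, and compare which rare variants appear in each “technical replicate”.

```python
# Hypothetical sketch: a 1-in-a-million subsample of the amplicon pool,
# taken twice. Common variants always appear; rare ones turn over by chance.
import numpy as np

rng = np.random.default_rng(7)
# Invented pool: 50 common variants plus 5,000 rare ones.
common = np.full(50, 1.9e11)   # ~10^13 amplicons among the common variants
rare = np.full(5000, 1e6)      # ~1 expected read each at this sampling rate
pool = np.concatenate([common, rare])
p = pool / pool.sum()

rep1 = rng.multinomial(10_000_000, p)
rep2 = rng.multinomial(10_000_000, p)

rare1, rare2 = rep1[50:] > 0, rep2[50:] > 0
print("all common variants in both replicates:",
      bool((rep1[:50] > 0).all() and (rep2[:50] > 0).all()))
print("rare variants in rep1:", rare1.sum(),
      "| in rep2:", rare2.sum(),
      "| in both:", (rare1 & rare2).sum())
```

Every common variant shows up in both replicates; a third or so of the rare variants drop out of each, and which ones drop out differs between the two.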

So, as a baseline matter, we should see rare things as more obviously stochastic than common things.

Rare Reads and Richness

Because we know that read abundance doesn’t scale predictably with biomass, we often treat metabarcoding data as somewhat-glorified presence/absence data: we turn detections into 1s and 0s, deciding that a taxon (or ASV/OTU) is either detected or it is not. Often the first analytical stop here is an estimate of richness: how many unique entities did I detect in my samples?
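
The presence/absence step itself is trivial; here’s a minimal pandas sketch (sample and ASV names are made up):

```python
# Minimal sketch: collapse a read-count table to presence/absence and
# count richness per sample. All names and counts are invented.
import pandas as pd

counts = pd.DataFrame(
    {"ASV_1": [1042, 0, 337], "ASV_2": [3, 0, 0], "ASV_3": [0, 12, 1]},
    index=["sample_A", "sample_B", "sample_C"],
)
presence = (counts > 0).astype(int)  # 1 = detected, 0 = not detected
richness = presence.sum(axis=1)      # unique ASVs detected per sample
print(richness)
```

Note that the 3-read detection and the 1,042-read detection count identically toward richness, which is exactly why the rare-read question matters here.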

Even putting aside the fact that rare ASVs are likely to be stochastic (and so any given sequenced sample surely didn’t approach the total richness present), here we run headlong into another difficult problem: most metabarcoding ASVs are rare, and so rare ASVs contain most of the richness information. But PCR error is also a real thing (see below), and we want to make sure we are only including real biological sequences in our analysis, excluding artifacts of the PCR process. One common way to handle this is simply to throw out all rare ASVs, but (1) this is likely to grossly mischaracterize the true richness in the dataset, given that most (real) ASVs are likely to be rare, and (2) it’s not obvious how to pick a threshold of rarity for separating the wheat from the chaff.
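
To see how touchy that threshold choice is, here’s a toy example: simulated heavy-tailed ASV counts (a Zipf draw standing in for a real dataset in which most ASVs are rare), with retained richness falling off sharply as the minimum-read cutoff rises.

```python
# Toy illustration: richness estimates are very sensitive to the
# minimum-read cutoff when most ASVs are rare. Distribution is invented.
import numpy as np

rng = np.random.default_rng(3)
asv_counts = rng.zipf(a=1.7, size=2000)  # heavy-tailed: mostly 1s and 2s

for threshold in (1, 2, 5, 10):
    kept = int((asv_counts >= threshold).sum())
    print(f"min reads >= {threshold:>2}: {kept} ASVs retained")
```

Under these made-up numbers, moving the cutoff from 1 read to 10 discards the large majority of ASVs, and there’s no obvious place to stop.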

We’ve tried occupancy modeling to help with this in the past (sequencing replicate PCR reactions is useful for distinguishing real reads from spurious ones; see here and here, e.g.), but where a given ASV occurs only once in the entire dataset (or across multiple datasets), one might have to rely on clustering or other distance- or tree-based techniques to identify likely artifacts. For coding genes like COI, looking at the amino-acid translation can help identify artifacts and pseudogenes as well. But there’s no real magic wand as far as I’m aware. The point is to have some analytical reasoning behind the decisions you make here; show your work, and know why you are doing what you are doing.
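
As a cartoon of the replicate logic (a crude stand-in for a full occupancy model, with invented read counts): keep an ASV in a sample only if it shows up in at least two of three replicate PCR reactions.

```python
# Hypothetical sketch of a replicate-consistency filter. Real occupancy
# models estimate detection probabilities; this is just the cartoon version.
import numpy as np

# Invented reads: rows = 3 PCR replicates of one sample, columns = 5 ASVs.
replicate_reads = np.array([
    [120, 3, 0, 8, 1],
    [ 98, 0, 1, 6, 0],
    [143, 2, 0, 7, 0],
])

detections = (replicate_reads > 0).sum(axis=0)  # replicates detecting each ASV
keep = detections >= 2                          # require 2-of-3 consistency
print(keep)  # the two single-replicate singletons fail the filter
```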

Rare Reads and Community Ecology

Other analyses don’t treat metabarcoding data as presence/absence, but rather as an assay of community composition. The subsequent analysis is then often Bray-Curtis dissimilarity or something equivalent, asking how different any two samples are from one another. Assuming that metabarcoding data have been appropriately transformed in some way, I think this is probably a statistically acceptable thing to do, even knowing that read abundance doesn’t equate to species abundance; we aren’t asking how different the two realities of our sample communities are, but rather how different the results of our assays of the two communities are.
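
For concreteness, a minimal Bray-Curtis calculation on two made-up read vectors, using scipy:

```python
# Minimal sketch: Bray-Curtis dissimilarity between two samples' read
# vectors. Counts are invented; 0 = identical assays, 1 = no overlap.
from scipy.spatial.distance import braycurtis

sample_1 = [500, 300, 10, 0]  # reads for four ASVs in sample 1
sample_2 = [450, 100,  0, 5]  # reads for the same four ASVs in sample 2

print(braycurtis(sample_1, sample_2))  # ~0.19 for these numbers
```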

Here, rare reads mean something a bit different. They tend to get down-weighted by Bray-Curtis and other dissimilarity/distance metrics, in the sense that these metrics are more influenced by high-abundance ASVs than low-abundance ones. But of course, if you’ve transformed the data (see here and here) such that every ASV occurs on a 0-1 scale (or done some similar transformation), all ASVs will be weighted equally in the dissimilarity calculation. And because most ASVs are rare, this will mean rare ASVs collectively have a disproportionately large influence on the analytical outcome.
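
A toy demonstration of that weighting shift (one dominant ASV and two rare ones, numbers invented): scaling each ASV to a 0-1 range across samples makes the rare ASVs drive the Bray-Curtis result.

```python
# Toy demonstration: rescaling each ASV to 0-1 across samples hands the
# rare ASVs a much bigger share of the Bray-Curtis signal.
import numpy as np
from scipy.spatial.distance import braycurtis

# Invented read table: rows = 2 samples, columns = 3 ASVs.
reads = np.array([
    [9000, 4, 0],
    [8800, 0, 5],
], dtype=float)

scaled = reads / reads.max(axis=0)  # each ASV now on a 0-1 scale

print("raw:   ", round(braycurtis(reads[0], reads[1]), 3))   # ~0.01
print("scaled:", round(braycurtis(scaled[0], scaled[1]), 3)) # ~0.51
```

With raw counts the samples look nearly identical (the dominant ASV swamps everything); after rescaling, the two rare ASVs make the same pair of samples look quite different.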

One way of handling these issues is to collapse ASVs into annotated taxa before undertaking community-level analysis. After all, we often care about the community of taxa, rather than the (larger) community of unique sequence variants. But again, there’s no magic wand, and the best that one can hope for is transparent reasoning in handling difficult issues.
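
In pandas, that collapse can be a simple group-and-sum; the taxonomy mapping here is invented for illustration:

```python
# Sketch: sum ASV read counts within annotated taxa before community-level
# analysis. ASV names, taxa, and counts are all made up.
import pandas as pd

reads = pd.DataFrame(
    {"ASV_1": [120, 80], "ASV_2": [3, 7], "ASV_3": [15, 0]},
    index=["sample_A", "sample_B"],
)
taxonomy = {"ASV_1": "Engraulis mordax", "ASV_2": "Engraulis mordax",
            "ASV_3": "Sardinops sagax"}

by_taxon = reads.T.groupby(taxonomy).sum().T  # collapse ASVs into taxa
print(by_taxon)
```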

Rare Reads and Absolute Abundance

Finally, if one is trying to estimate actual biological abundances from sequencing reads, we have to assess rare reads yet again. This analysis treats read abundance as the result of a set of biological and analytical processes to be modeled, and finds the most likely biological abundance, given the observed reads and some fitted model parameters. Given the amount of variance at each step in the analytical process leading to sequence data, rare reads simply provide very little information to the model. Again, there are many ways to arrive at 1 or 5 or 10 reads, and here the model has no way of figuring out which of these ways is most likely. For example, suppose we assume observed read abundance for a given taxon is (for simplicity) a Poisson process, with expected value (proportional_amplifiable_DNA)*(1 + amplification_efficiency)^N_PCR_cycles. There are lots of places for variability to arise, in terms of both process variance (uncertainty in proportional_amplifiable_DNA) and observation variance (uncertainty in amplification_efficiency, plus the sampling variability of the Poisson process itself, which is actually far less variability than we normally see among technical replicates). With large read abundances, it’s easy to fit this model. With small read abundances, it’s impossible without more information to inform the model terms.
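
Here’s a sketch of that identifiability problem under the toy Poisson model above (the grid and parameter ranges are mine, not fitted to anything): count how many (proportional_amplifiable_DNA, amplification_efficiency) combinations explain the observed reads nearly as well as the best combination does.

```python
# Sketch: with reads ~ Poisson(mu) and
# mu = prop_dna * (1 + efficiency)**n_cycles, a low read count is consistent
# with a huge swath of parameter space; a high count is far more constraining.
import numpy as np
from scipy.stats import poisson

n_cycles = 35
prop_dna = np.geomspace(1e-12, 1e-6, 300)  # proportional_amplifiable_DNA
efficiency = np.linspace(0.5, 1.0, 300)    # amplification_efficiency
P, E = np.meshgrid(prop_dna, efficiency)
mu = P * (1 + E) ** n_cycles

for observed in (5, 10_000):
    likelihood = poisson.pmf(observed, mu)
    near_best = likelihood > 0.5 * likelihood.max()  # within 2x of the peak
    print(f"{observed:>6} reads: {int(near_best.sum())} "
          "parameter combinations near the likelihood peak")
```

For 5 observed reads, thousands of parameter combinations are essentially indistinguishable; for 10,000 reads, only a narrow ridge survives. And even then, a single count only pins down the product of the two terms, which is the “more information” problem described above.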

PCR Errors

Apart from all of these analytical techniques and questions, there is the overriding concern that rare reads could be artifacts of PCR: that is, not real sequences that exist in the world, but rather mistakes of the process. There is a large literature on this phenomenon, and in general, we need some way to distinguish real sequences from spurious ones. I mentioned occupancy modeling above, and clustering as well. DADA2 and other common tools do some of this work for you, but at the end of the day, you have to decide why you believe a sequence read is real vs. spurious, and I hope I’ve made the case that rarity alone isn’t a sufficient reason.

So, then, under what circumstances is 1 read just as valid as 10,000? I think most of the time: any time you don’t have a very definite reason to believe the 1 read is a spurious artifact of the observation process. But the question of rare reads still bears thinking about, because the cumulative influence of rare reads on an analytical outcome can be quite large.