One Apple, One Orange

(or, “Why does my metabarcoding dataset look different from my visual surveys/nets/traps/etc?”)

by Ryan Kelly (with thanks to Ole Shelton, Erin D’Agnese, Eily Allan, and Zack Gold)

As eDNA datasets have become more common, a frequent and reasonable step for many applications is to compare some eDNA (mainly metabarcoding, sometimes qPCR or ddPCR) results to those of some other, more traditional sampling method. Examples include nets of all kinds, acoustic signals, microscopy, traps, visual surveys, and so on. All are some flavor of “compare and contrast eDNA results and X”, with the idea that if X is an accepted sampling method in ecology or a similar field — and if the eDNA results are similar enough to the results of X — then eDNA methods will simply slot into existing survey methods, statistics, and projects as off-the-shelf technological upgrades.

But of course, this rarely works so cleanly. Inevitably, we all end up trying to compare one apple and one orange — eDNA data and some other data, collected on the same day in the same place — and perhaps squinting and standing on one leg to show how, in the right light, they might be correlated. But this is madness. (And it’s madness we’ve engaged in ourselves; there’s no shame here).

Perhaps the most fundamental way in which eDNA data are different from traditional methods, as a whole, is that the PCR process is exponential. This leads to really different-looking data resulting from metabarcoding studies, in particular. And — principally because of amplification bias in primer-template matching — the number of sequenced reads from a given species can bear no relationship to the starting (or proportional) biomass of that species in the sampled environment. So eDNA data might be a complete mismatch to, say, visual surveys, or the reads might be 5 orders of magnitude different from the visual counts. And that’s not because eDNA is wrong or inherently flawed, but because small biases get exponentiated and can dominate the underlying signal. Put differently: “normal” ecological surveys happen on the scale you are interested in; eDNA surveys are different because we never observe the eDNA directly, but only an exponentiated, crazy version of it.
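To make that concrete, here is a minimal sketch in Python (with entirely made-up numbers): two species start at identical concentrations but amplify at slightly different per-cycle efficiencies, and that small difference, raised to the 35th power, ends up dominating the final read proportions.

```python
# A minimal sketch (made-up numbers): two species start at equal
# concentrations, but slightly different per-cycle amplification
# efficiencies get raised to the 35th power.

cycles = 35
start = {"species_A": 1.0, "species_B": 1.0}          # equal starting DNA
efficiency = {"species_A": 0.95, "species_B": 0.80}   # per-cycle amplification efficiency

amplicons = {sp: start[sp] * (1 + efficiency[sp]) ** cycles for sp in start}
total = sum(amplicons.values())

for sp, amount in amplicons.items():
    print(f"{sp}: {amount:.3e} amplicons, {100 * amount / total:.1f}% of reads")
# species_A ends up with roughly 94% of the reads (about a 16-fold excess),
# despite identical starting biomass: amplification bias, exponentiated.
```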

Another way eDNA is different is that those same amplification biases aren’t predictable in the way that the biases of visual surveys (or whatever) are. Primers designed for mammals might amplify most mammals (but not all)… just because no grey whale showed up in your dataset doesn’t mean no grey whale was present in the sample. Which is frustrating. And different from most other sampling methods.

Which brings me back to the comparison that many of us would like to do: one apple, one orange. Completely reasonable, and generally doomed to failure, given that there’s no reason to *expect* metabarcoding results to look like visual surveys or net tows — or culture-counts, in the case of microbiome studies. The processes leading to the two datasets are just really different.

The solution lies in understanding this process, and building that into the expectations for your data. Say we have done 35 PCR cycles in a metabarcoding study with primers amplifying mammals. The number of grey whale reads we expect is

s * (b * (1 + a)^35) * uncertainty,

where `b’ is the proportion of the amplifiable DNA present, `a’ is the amplification efficiency (somewhere between 0 and 1, and not predictable in silico) of the primer-template match, and `s’ is a scaling parameter that tells us what fraction of the amplicons put onto the sequencing run were actually sequenced. If we know grey whale has an a = 0.75, and we observe some number of reads on our sequencing run, we can estimate `b’, the proportion of amplifiable whale DNA present.
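As a back-of-the-envelope illustration, here is what that inversion looks like in code. Every number here (the observed reads, the value of `s’, the efficiency) is invented for the example; the point is just the arithmetic of un-exponentiating the observation.

```python
# A back-of-the-envelope sketch (all numbers hypothetical): invert
# reads ~ s * b * (1 + a)^cycles to estimate b, the proportion of
# amplifiable grey-whale DNA in the sample, ignoring the error term.

cycles = 35
a = 0.75                 # assumed amplification efficiency for grey whale
s = 0.04                 # assumed fraction of the amplicon pool actually sequenced
observed_reads = 12_000  # grey-whale reads observed on the run (made up)

b_hat = observed_reads / (s * (1 + a) ** cycles)
print(f"Estimated proportion of amplifiable grey-whale DNA: {b_hat:.2e}")
# With these invented numbers, b_hat comes out around 9e-4; roughly 0.1%
# of the amplifiable DNA in the sample was grey whale.
```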

But what about that uncertainty (error) term? That captures all of the variability that happens sample-to-sample and replicate-to-replicate. It comes from pipetting error, and random sampling error, and so forth. A dataset with technical replicates makes it possible to parse some of these out into separate terms, but ultimately, this is all noise in your data, obscuring the relationship between the amount of a species and the number of reads that come out of the sequencer.
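If it helps to see that noise term in action, here is a toy simulation of three technical replicates from the same sample, with a single multiplicative lognormal error standing in for all of those replicate-level noise sources lumped together. The parameter values are invented.

```python
# A toy simulation (invented values) of replicate-to-replicate noise:
# one sample run as three technical replicates, with lognormal error
# standing in for pipetting, subsampling, and similar sources.
import numpy as np

rng = np.random.default_rng(42)
cycles, a, s, b = 35, 0.75, 0.04, 1e-3
expected_reads = s * b * (1 + a) ** cycles     # deterministic part of the model

noise_sd = 0.5                                 # log-scale SD; bigger means noisier replicates
replicate_reads = expected_reads * rng.lognormal(mean=0.0, sigma=noise_sd, size=3)
print(np.round(replicate_reads))
# Identical input DNA, yet the replicates can easily differ by a factor of
# two or more; that spread is the "uncertainty" term in the equation above.
```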

Crucially, it’s clear that the number of reads recovered for a species *does indeed* have a relationship to how much of that thing there is in the world. It’s just that the relationship isn’t linear. And why would we expect it to be, with an exponent of 35 (or 40, etc) in the process?

So all isn’t lost for our apple and our orange. We *do* expect visual counts and nets, etc., to correlate with the `b’ term in the equation above. And where `a’ is high — that is, the primer set does a good job of amplifying the target species, relative to the others in the pool — we expect that correlation to be quite strong. Where `a’ is low, by contrast, the error term swamps any signal in the data, and the number of reads is uncorrelated to observations from other methods.
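Here is a small simulation of that point (parameters invented): one species the primers amplify well and one they amplify poorly, sitting in a background community, with true biomass varying across samples and a finite number of reads per sample.

```python
# A toy simulation of why reads track biomass well only when `a' is high.
import numpy as np

rng = np.random.default_rng(1)
n_samples, cycles, read_depth = 50, 35, 50_000

# Amplification efficiencies: one well-matched species, one poorly matched,
# plus a background "everything else" pool.
a = {"good_match": 0.85, "poor_match": 0.45, "background": 0.80}

# True biomass proportions vary across samples (the signal we hope to recover).
biomass = {sp: rng.uniform(0.001, 0.02, n_samples) for sp in ("good_match", "poor_match")}
biomass["background"] = 1 - biomass["good_match"] - biomass["poor_match"]

# PCR: exponentiate each species, then draw a finite number of reads per sample.
amplicons = np.column_stack([biomass[sp] * (1 + a[sp]) ** cycles for sp in a])
proportions = amplicons / amplicons.sum(axis=1, keepdims=True)
reads = np.vstack([rng.multinomial(read_depth, p) for p in proportions])

for i, sp in enumerate(["good_match", "poor_match"]):
    r = np.corrcoef(biomass[sp], reads[:, i])[0, 1]
    print(f"{sp}: correlation between true biomass and read count = {r:.2f}")
# The well-amplified species tracks its biomass closely (r near 1); the poorly
# amplified one barely registers in the reads, so the correlation is weak and noisy.
```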

A somewhat principled cheat for the equation above is to normalize read counts within a taxon… to create an index of eDNA abundance for that particular taxon. (We did this in Kelly et al. 2019, first controlling for different read-depth across samples, and then scaling those proportions within a taxon to create the eDNA index; see also McLaren et al 2019). The index says “all other elements being equal, changes in reads over time — within a taxon, but not between taxa — will track changes in biomass.” You don’t need to know any of the terms in the equation. This approach assumes that `a’ is constant no matter the biological context — which isn’t precisely true, and also, the relative ranks of the efficiencies of the species present matter a lot: what if a new species entered the community that had a huge `a’ and swamped everything out? — and glosses over the finite read-depth that creates compositional datasets. But it seems to work pretty well in practice… most especially when a species is amplified well by the primer set in hand. And all of those species at the top of your data matrix — the ones with very high read abundances — are, nearly by definition, amplifying very well. (The terrors of compositional data will surely be a later discussion point).
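For concreteness, here is roughly what that index calculation looks like on a toy read-count table (the numbers are invented): reads become within-sample proportions, which controls for read depth, and each taxon’s proportions are then divided by that taxon’s maximum across samples.

```python
# A sketch of the eDNA index described above, on made-up read counts:
# step 1 converts reads to within-sample proportions (read-depth control);
# step 2 scales each taxon by its own maximum proportion across samples.
import pandas as pd

# rows = samples, columns = taxa (invented read counts)
reads = pd.DataFrame(
    {"grey_whale": [120, 40, 0, 300], "harbor_seal": [8000, 9500, 4000, 7000]},
    index=["site1", "site2", "site3", "site4"],
)

proportions = reads.div(reads.sum(axis=1), axis=0)              # step 1
edna_index = proportions.div(proportions.max(axis=0), axis=1)   # step 2

print(edna_index.round(2))
# Each column now runs from 0 to 1. Changes *within* a column can be read as
# changes in that taxon's biomass; values are not comparable *across* columns.
```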

And whatever the non-eDNA dataset is that you’re using for comparison? That also has some underlying process leading from biological entity in the world to the data on your screen. We just tend not to think about that. Every way of seeing the world has different biases; every data collection is imperfect. So building in information about those imperfections is also a good idea, in terms of a fair comparison. For example, is it likely your visual survey observed all of the individuals that were really present?
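Even a crude correction illustrates the point. Suppose (hypothetically) you think the survey detects each individual with probability 0.6; then the raw count needs its own adjustment before the comparison is fair.

```python
# A tiny illustration (hypothetical numbers): the traditional dataset has an
# observation process too. If a visual survey detects each animal with
# probability p_detect, the raw count understates what was actually there.
observed_count = 14      # animals counted on the survey (made up)
p_detect = 0.6           # assumed per-individual detection probability (made up)

estimated_true_count = observed_count / p_detect
print(f"Rough estimate of animals actually present: {estimated_true_count:.0f}")
# About 23 animals, not 14; the "orange" needs its own model before comparison.
```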

In sum, the straight-up apple-to-orange comparison won’t work well. But that’s not to say eDNA and other kinds of data can’t be meaningfully compared; it just takes a bit more analysis before things line up the way you hope they will. This, in more math-heavy language, is what Shelton et al. 2016 was all about, and I’ve been spending the past five years trying to get that into my head as my default view of the world.