So far, we have been using the true counts of our real (and imagined) populations. Ordinarily we only have a small proportion, a sample, of this data. We will now investigate how much it matters if we only have a small sample of data to work with rather than the whole picture.
Usually we are not in the happy situation of having a massive dataset and only wanting to look at part of it. In this context, it turns out that some diversity measures are very sensitive to the amount of data that has been collected, and so if you want to compare two populations to identify the more diverse, you may need to subsample the larger dataset to the size of the smaller one to make a fair comparison.
There are several ways of subsampling a population like the one we
have here. The simplest way to approximate the process
is to use rmultinom()
to generate a small dataset with
appropriate probabilities as we did in Project 1 to create
rand.pop
.
Write and document a function (like the
sample_by_species
, etc. functions that we provided
for you in your data package) to subsample a dataset to a given size
called sample_by_individuals()
(perhaps using this
rmultinom
function) – it should take a vector of population
counts and a sample size as its arguments and return a vector of sample
counts. Note this is slightly different from the
sample_...()
functions that we provided that take the whole
bci_2010
dataset as an argument (a matrix, not a
vector).
Incidentally, can you work out why this is only an approximation? Can you think of a better solution? This is not easy (or we can’t see how to do it easily anyway!), so don’t worry if you just use the suggested method. If you can work out what’s wrong, put an explanation into the documentation for the function.
Now create a new demo. For each dataset that has the full number of
individuals (i.e. not quadrat.pop
,
quadrat10.pop
or rand50.pop
), plot the
diversity profile of the full dataset, and then plot the diversity
profile of a subsample of 10 individuals, then 100, 1000, 10000, and so
on until you reach the full population size. See what effect sample size
has on diversity for different values of q
. At this point
you have two choices:
Either (a more qualitative, graphical project) look at plots of the full dataset, and then multiple repeats of, say, 100 or 1000 individuals and see how variable your answers are with small samples, investigating issues arising from not collecting enough data.
Or alternatively (a more analytical project) you
could choose to run analyses on the different populations and subsamples
of them, and then repeat the same analyses on the exact same populations
and subsamples using our rdiversity
package and compare the
results numerically (as in Project 3, remember that you can test whether
two numbers are approximately equal to each other using
if (abs(x-y) < 0.000001)
) to ensure they give the same
results.