We are now going to look at two special cases, where the equation we used before needs to be adapted – Shannon entropy, which is \(\log(D_1)\), and the Berger-Parker index, which is \(\frac{1}{D_\infty}\), and adapt the code to cope with these.
\[D_1 = \prod_{i \in \{1 \dots N_S, p_i \neq 0\}} p_i^{-p_i}\]
\[D_\infty = \frac{1}{\max_i p_i}\]
Shannon entropy is an extremely well-known mathematical measure of the information content of a set of data. It is used here as a measure of “surprise” – how surprising the species of the next individual observed in a sequence is likely to be. Where there is only 1 species, each observation is the same, and there is no surprise. However, when there are lots of species that are all equally abundant (we are ignoring how well mixed the population is), every encounter is likely to surprise us.
The Berger-Parker index simply tells us how dominant the most abundant species is. It ignores everything else. The lower the index, the less dominant the most abundant species is, and so the more other species are also observed.
You may want to copy your existing function code from Project 2 as a starting point for this exercise. If you do, remember that you need different function names in this exercise so that they don’t conflict.
You will be writing a function to calculate diversity, and then two
functions to calculate species richness and Simpson index from it as you
did in Project 2. Then you need to write two new very simple functions
which calculate the Shannon entropy and the Berger-Parker index, using
the generic function you created above and not by
calculating them from scratch (they will also need the small amount of
additional calculation – Shannon entropy is actually \(\log(D_1)\) and
Berger-Parker is \(\frac{1}{D_\infty}\) – as detailed
above in the overview). You will probably find that the functions as
written do not work! You may choose to test what the results should be
again by comparing the outputs from the generic function at \(q=1\) and \(q=\infty\) (infinity, written as
Inf
in R) to matching results from the
rdiversity
package.
You should see that the generic function from Project 2 produces very different answers for a value of \(q = 1\) and a value near 1 (say \(q = 0.99\) or \(q = 1.01\)), and similarly for \(q = \infty\) and a reasonably large number (say \(q = 20\) or \(q = 30\)). In fact, you should actually get errors for exactly 1 and \(\infty\), though you may in fact just produce the wrong answers (that are inconsistent with “nearby” numbers). In the former case, this is because we calculate \(\frac{1}{0}\) in the equation (which is not a number), and in the latter we have a rounding error and end up calculating \(0^0\) instead of \(0.000001^{0.000001}\) which turn out to be very different numbers.
You will need to rewrite the general function using:
if (some_test) {
# do something
} else if (some_other_test) {
# do something different
} else {
# a third thing
}
So that it checks for these two specific values of \(q\), and instead use the following two equations:
\[D_1 = \exp(-\sum(p \times
\log(p)))\] which is to say exp(-sum(p * log(p)))
in
R, where p
is the vector of non-zero species
proportions.
\[D_\infty = \frac{1}{\max(p)}\]
which is just 1/max(p)
in R. You should to end up writing
the updated general function and the four specific functions that call
this new updated function to calculate the four measures of diversity we
have discussed so far.
Run the two new functions again to make sure there are no errors, and also make sure that the generic function now gives the same (or very similar) results for 1 and 1.01, and for 20 and \(\infty\).
NB It is just possible that you may find a problem
with the code still not giving you the correct answer when \(q=1\). You can test for \(q\) being infinity by just by comparing the
value to Inf
, but this doesn’t work reliably for other
numbers – what goes wrong is a rounding error that is very common in
computer programming. Whole numbers cannot be correctly represented as
decimals without small errors creeping in – so 1 may actually appear as
1.000000000001 – and testing whether it is then equal to 1 will fail.
Instead you should test whether it is sufficiently close to 1 for your
purposes. Fortunately, we can do this by saying
if (abs(q-1) < 0.000001)
, say, instead of
if (q == 1)
, which will check whether you are within
0.000001 of 1 (i.e. 0.999999 to 1.000001). I’ve picked
0.000001, but you can pick another value if it seems better. R actually
provides a number for this itself called the “machine epsilon”,
.Machine$double.eps
, which is about \(10^{-16}\), but it may be too small for our
purposes. People often use sqrt(.Machine$double.eps)
as a
compromise (about \(10^{-8}\)). I
honestly don’t know why except that it looks quite technical, but feel
free to use it if you like!
Now write a demo that loads in this new functionality and then calculates the four diversity measures we have discussed for each dataset.
The demo should then call the general function directly (not the 4 specific functions) and just calculate the general diversity measures for the datasets provided with \(q = 0\), 1, 2, and \(\infty\) (remember, these correspond to the 4 individual diversity measures we have investigated so far). Do you see a progression in the values you observe within each set of population counts as \(q\) increases? Comment on this in the report.