Odds and ends: statistics
Hello everybody.
As promised, I’m continuing to post the last batch of new content from Implementing QuantLib. In this post, I take a look at statistics classes.
In other news, look out for a quick bug-fix QuantLib release in the
next day or two. As a matter of fact, it might already be out if
you’re not reading this the day I posted, so you might want to head
over to the QuantLib site and
check; the new version number would be 1.10.1. The reason I’ve making
the bug fix is that Boost 1.65 was released last week, and it turns
out that
it causes a name clash
in one of our tests. The lesson to learn from this? Don’t use
multiple using namespace
statements in the same file. You never
know what you might catch.
Subscribe to my Substack to receive my posts in your inbox, or follow me on Twitter or LinkedIn if you want to be notified of new posts, or subscribe via RSS if you’re the tech type: the buttons for all that are in the footer. Also, I’m available for training, both online and (when possible) on-site: visit my Training page for more information.
Statistics
The statistics classes were written mostly to collect samples from Monte Carlo simulations (you’ll remember them from a past series of posts). The full capabilities of the library are implemented as a series of decorators, each one adding a layer of methods, instead of a monolith; as far as I can guess, that’s because you can choose one of two classes at the bottom of the whole thing. A layered design gives you the possibility to build more advanced capabilities just once, based on the common interface of the bottom layer.
The first class you can choose, shown below, is called
IncrementalStatistics
, and has the interface you would more or
less expect: it can return the number of samples, their combined
weight in case they were weighed, and a number of statistics results.
Samples can be added one by one or as a sequence delimited by two iterators. The shtick of this class is that it doesn’t store the data it gets passed, but instead it updates the statistics on the fly; the idea was to save the memory that would be otherwise used for the storage, and in the year 2000 (when a computer might have 128 or 256 MB of RAM) it was a bigger concern than it is now. The implementation used to be homegrown; nowadays it’s written in terms of the Boost accumulator library.
The second class, GeneralStatistics
, implements the same
interface and adds a few other methods, made possible by the fact that
it stores (and thus can return) the passed data; for instance, it can
return percentiles or sort its data. It also provides a template
expectationValue
method that can be used for bespoke
calculations; if you’re interested, there’s more on that in the aside
at the end of this section.
Next, the outer layers. The first ones add statistics associated with
risk, like expected shortfall or value at risk; the problem being that
you usually need the whole set of samples for those. In the case of
incremental statistics, therefore, we have to forgo the exact
calculation and look for approximations. One possibility is to take
the mean and variance of the samples, suppose they come from a
Gaussian distribution with the same moments, and get analytic results
based on this assumption; that’s what the
GenericGaussianStatistics
class does.
As I mentioned, and as you can see, it’s implemented as a decorator; it
takes the class to decorate as a template parameter, inherits from it
so that it still has all the methods of its base class, and adds the
new methods. The library provides a default instantiation,
GaussianStatistics
, in which the template parameter is
GeneralStatistics
. Yes, I would have expected the incremental
version, too, but there’s a reason for this; bear with me for a
minute.
When the base class stores the full set of samples, we can write a
decorator that calculates the actual risk measures; that would be the
GenericRiskStatistics
class. As for the Gaussian statistics, I
won’t discuss the implementation (you can look them up in the
library).
As you can see, the layers can be combined; the default
instantiation provided by the library, RiskStatistics
, takes
GaussianStatistics
as its base and thus provides both Gaussian
and actual measures. This was also the reason why
GeneralStatistics
was used as the base for the latter.
On top of it all, it’s possible to have other decorators; the library
provides a few, but I won’t show their code here. One is
SequenceStatistics
, that can be used when a sample is an array
instead of a single number, uses internally a vector of instances of
scalar statistics classes, and also adds the calculation of the
correlation and covariance between the elements of the samples; it is
used, for instance, in the LIBOR market model, where each
sample usually collects cash flows at different times. Other two are
ConvergenceStatistics
and DiscrepancyStatistics
; they
provide information on the properties of the sequence of samples,
aren’t used anywhere else in the library, but at least we had the
decency of writing unit tests for both of them.
Aside: extreme expectations.
Looking back at the GeneralStatistics
class, I’m not sure
if I should be proud or ashamed of it, because—oh boy, I really
went to town with generalization there.
It might have been the mathematics. It started with a straightforward implementation; but looking at the formulas for the mean, the variance, and even some more complex one defined in other layers, I saw that they all could be written (give or take some later adjustments) as
\[\frac{\sum_{x_i \in \mathcal{R}} f(x_i) w_i}{ \sum_{x_i \in \mathcal{R}} w_i},\]that is, the expected value of \( f(x) \) over some range
\( \mathcal{R} \). For the mean, \( f(x) \) would be the
identity and \( \mathcal{R} \) would be the full domain of the
samples; for the variance, \( f(x) \) would be
\( \left(x - \bar{x} \right)^2 \) over the same range; and so
on. The result was a template expectationValue
method that
would take the function \( f \) and the range
\( \mathcal{R} \) and return the corresponding result and the
number of samples in the range; most other methods are implemented
by calling it with the relevant inputs. If you’re a bit confused
at first about the mean being implemented as
then I can’t blame you. By the way, I must have been learning
about functional programming at the time; the range is passed as a
function that takes a sample and return true
or
false
depending on whether it’s in the range, and
everywhere
above is one of a few small predefined helper
functions I added to write this kind of code. It’s all fun and
games until someone writes \( \left(x - \bar{x} \right)^2 \) as
Self-snark aside: the above is very general and allows client code
to create new calculations, but probably caters a bit too much to
the math-inclined and can make for cryptic code, so I’m not sure
that I stroke the right balance here. Replacing binds with
lambdas in C++11 might certainly help; we’ll see how this code
turns out when we start using them. On the other hand,
performance shouldn’t be a problem: expectationValue
is a
template, and so are the functions like compose
and
everywhere
above, so the compiler can see their
implementation and can probably inline them. In that case, the
result would be the simpler loop we might have written in a direct
implementation of the mean or variance formulas.