Odds and ends: statistics — Implementing QuantLib

Hello everybody.

As promised, I’m continuing to post the last batch of new content from Implementing QuantLib. In this post, I take a look at statistics classes.

In other news, look out for a quick bug-fix QuantLib release in the next day or two. As a matter of fact, it might already be out if you’re not reading this the day I posted, so you might want to head over to the QuantLib site and check; the new version number would be 1.10.1. The reason I’ve making the bug fix is that Boost 1.65 was released last week, and it turns out that it causes a name clash in one of our tests. The lesson to learn from this? Don’t use multiple using namespace statements in the same file. You never know what you might catch.

Subscribe to my Substack to receive my posts in your inbox, or follow me on Twitter or LinkedIn if you want to be notified of new posts, or subscribe via RSS if you’re the tech type: the buttons for all that are in the footer. Also, I’m available for training, both online and (when possible) on-site: visit my Training page for more information.

Statistics

The statistics classes were written mostly to collect samples from Monte Carlo simulations (you’ll remember them from a past series of posts). The full capabilities of the library are implemented as a series of decorators, each one adding a layer of methods, instead of a monolith; as far as I can guess, that’s because you can choose one of two classes at the bottom of the whole thing. A layered design gives you the possibility to build more advanced capabilities just once, based on the common interface of the bottom layer.

The first class you can choose, shown below, is called IncrementalStatistics, and has the interface you would more or less expect: it can return the number of samples, their combined weight in case they were weighed, and a number of statistics results.

    class IncrementalStatistics {
      public:
        typedef Real value_type;
        IncrementalStatistics();

        Size samples() const;
        Real weightSum() const;
        Real mean() const;
        Real variance() const;
        Real standardDeviation() const;
        Real errorEstimate() const;
        // skewness, kurtosis, min, max...

        void add(Real value, Real weight = 1.0);
        template <class DataIterator>
        void addSequence(DataIterator begin, DataIterator end);
    };

Samples can be added one by one or as a sequence delimited by two iterators. The shtick of this class is that it doesn’t store the data it gets passed, but instead it updates the statistics on the fly; the idea was to save the memory that would be otherwise used for the storage, and in the year 2000 (when a computer might have 128 or 256 MB of RAM) it was a bigger concern than it is now. The implementation used to be homegrown; nowadays it’s written in terms of the Boost accumulator library.

The second class, GeneralStatistics, implements the same interface and adds a few other methods, made possible by the fact that it stores (and thus can return) the passed data; for instance, it can return percentiles or sort its data. It also provides a template expectationValue method that can be used for bespoke calculations; if you’re interested, there’s more on that in the aside at the end of this section.

    class GeneralStatistics {
      public:
        // ... same as IncrementalStatistics ...

        const std::vector<std::pair<Real,Real> >& data() const;

        template <class Func, class Predicate>
        std::pair<Real,Size>
        expectationValue(const Func& f,
                         const Predicate& inRange) const;

        Real percentile(Real y) const;
        // ... other inspectors ...

        void sort() const;
        void reserve(Size n) const;
    };

Next, the outer layers. The first ones add statistics associated with risk, like expected shortfall or value at risk; the problem being that you usually need the whole set of samples for those. In the case of incremental statistics, therefore, we have to forgo the exact calculation and look for approximations. One possibility is to take the mean and variance of the samples, suppose they come from a Gaussian distribution with the same moments, and get analytic results based on this assumption; that’s what the GenericGaussianStatistics class does.

    template<class S>
    class GenericGaussianStatistics : public S {
      public:
        typedef typename S::value_type value_type;
        GenericGaussianStatistics() {}
        GenericGaussianStatistics(const S& s) : S(s) {}

        Real gaussianDownsideVariance() const;
        Real gaussianDownsideDeviation() const;
        Real gaussianRegret(Real target) const;
        Real gaussianPercentile(Real percentile) const;
        Real gaussianValueAtRisk(Real percentile) const;
        Real gaussianExpectedShortfall(Real percentile) const;
        // ... other measures ...
    };

    typedef GenericGaussianStatistics<GeneralStatistics>
                                           GaussianStatistics;

As I mentioned, and as you can see, it’s implemented as a decorator; it takes the class to decorate as a template parameter, inherits from it so that it still has all the methods of its base class, and adds the new methods. The library provides a default instantiation, GaussianStatistics, in which the template parameter is GeneralStatistics. Yes, I would have expected the incremental version, too, but there’s a reason for this; bear with me for a minute.

When the base class stores the full set of samples, we can write a decorator that calculates the actual risk measures; that would be the GenericRiskStatistics class. As for the Gaussian statistics, I won’t discuss the implementation (you can look them up in the library).

    template <class S>
    class GenericRiskStatistics : public S {
      public:
        typedef typename S::value_type value_type;

        Real downsideVariance() const;
        Real downsideDeviation() const;
        Real regret(Real target) const;
        Real valueAtRisk(Real percentile) const;
        Real expectedShortfall(Real percentile) const;
        // ... other measures ...
    };

    typedef GenericRiskStatistics<GaussianStatistics>
                                               RiskStatistics;

As you can see, the layers can be combined; the default instantiation provided by the library, RiskStatistics, takes GaussianStatistics as its base and thus provides both Gaussian and actual measures. This was also the reason why GeneralStatistics was used as the base for the latter.

On top of it all, it’s possible to have other decorators; the library provides a few, but I won’t show their code here. One is SequenceStatistics, that can be used when a sample is an array instead of a single number, uses internally a vector of instances of scalar statistics classes, and also adds the calculation of the correlation and covariance between the elements of the samples; it is used, for instance, in the LIBOR market model, where each sample usually collects cash flows at different times. Other two are ConvergenceStatistics and DiscrepancyStatistics; they provide information on the properties of the sequence of samples, aren’t used anywhere else in the library, but at least we had the decency of writing unit tests for both of them.

Aside: extreme expectations.

Looking back at the GeneralStatistics class, I’m not sure if I should be proud or ashamed of it, because—oh boy, I really went to town with generalization there.

It might have been the mathematics. It started with a straightforward implementation; but looking at the formulas for the mean, the variance, and even some more complex one defined in other layers, I saw that they all could be written (give or take some later adjustments) as

\[\frac{\sum_{x_i \in \mathcal{R}} f(x_i) w_i}{ \sum_{x_i \in \mathcal{R}} w_i},\]

that is, the expected value of \( f(x) \) over some range \( \mathcal{R} \). For the mean, \( f(x) \) would be the identity and \( \mathcal{R} \) would be the full domain of the samples; for the variance, \( f(x) \) would be \( \left(x - \bar{x} \right)^2 \) over the same range; and so on. The result was a template expectationValue method that would take the function \( f \) and the range \( \mathcal{R} \) and return the corresponding result and the number of samples in the range; most other methods are implemented by calling it with the relevant inputs. If you’re a bit confused at first about the mean being implemented as

    return expectationValue(identity<Real>(),
                            everywhere()).first;

then I can’t blame you. By the way, I must have been learning about functional programming at the time; the range is passed as a function that takes a sample and return true or false depending on whether it’s in the range, and everywhere above is one of a few small predefined helper functions I added to write this kind of code. It’s all fun and games until someone writes \( \left(x - \bar{x} \right)^2 \) as

    compose(square<Real>(),
            std::bind2nd(std::minus<Real>(), mean()))

Self-snark aside: the above is very general and allows client code to create new calculations, but probably caters a bit too much to the math-inclined and can make for cryptic code, so I’m not sure that I stroke the right balance here. Replacing binds with lambdas in C++11 might certainly help; we’ll see how this code turns out when we start using them. On the other hand, performance shouldn’t be a problem: expectationValue is a template, and so are the functions like compose and everywhere above, so the compiler can see their implementation and can probably inline them. In that case, the result would be the simpler loop we might have written in a direct implementation of the mean or variance formulas.