Portfolio optimisation: lessons learnt

Over the past few months I have been busy doing a mixture of blockchain consulting and quantitative finance work for a couple of companies in South East Asia. In particular, I have had the opportunity to investigate the interesting problem of portfolio management for cryptoassets – it was not my first experience with portfolio optimisation, having implemented efficient frontier portfolios at a roboadvisor startup, but this time I took the opportunity to do a deep dive into the subject.

Though I am not an expert, I do believe I am at the stage where I can do a kind of write-up, recounting some of the things I’ve learnt along the way. A lot of my experience has been baked into PyPortfolioOpt, a comprehensive portfolio optimisation package which I recently revamped, but this post will act as an informal guidebook for those interested in pursuing the subject.

1. Be careful with in-sample vs out-of-sample testing

For most people, ‘portfolio optimisation’ basically refers to mean-variance optimisation (MVO) in the style of Markowitz. MVO will tell you the mathematically optimal portfolio allocation, which sounds great, but the caveat is that it requires the (future) expected returns and the covariance matrix as inputs. For those of us who aren’t clairvoyant, these quantities are inaccessible, so we must make do with estimates. The quality of these estimates will be discussed in the next section, but for now I would like to highlight a common mistake, which is backtesting with a forward-looking bias.

The standard way (in textbooks or pedagogical materials) to estimate the expected returns is to take the mean annual return over the past few years. However, it is surprisingly easy to accidentally let future data worm its way into the estimate, resulting in vastly overinflated performance. One of the early mistakes I made was to optimise an equity portfolio using data from 2006-2010, then to test it over 2006-2015. The problem here is the overlap – the portfolio tested in 2006 has effectively had access to data up to 2010, so performance in the 2006-2010 portion of the backtest will look very good.

To be fair, this is often quite an obvious mistake to diagnose, because your portfolios will outperform the benchmark by a ridiculous margin and you will realise that something is wrong. But there are also more subtle ways that future data can creep in (e.g. survivorship bias), in which case it may not be so clear.
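To avoid the overlap in the example above, the train and test windows should simply not intersect – a minimal pandas sketch, assuming a hypothetical prices.csv of daily prices indexed by date:

import pandas as pd

# prices: hypothetical DataFrame of daily asset prices with a DatetimeIndex
prices = pd.read_csv("prices.csv", index_col=0, parse_dates=True)

# Estimate expected returns / covariance only on the training window...
train = prices.loc["2006":"2010"]

# ...and evaluate the resulting weights strictly after it
test = prices.loc["2011":"2015"]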

2. Forget about expected returns

Despite their strong theoretical guarantees, efficient frontier portfolios often have poor real-life performance owing to estimation errors in the inputs.

Anyone who has ever tried to pick stocks will know how absurd it is to expect that a stock’s mean return over the past few years will be a good indicator of its future returns. Such simple relationships have almost certainly been arbitraged away – I therefore contend that mean historical returns are almost pure noise. The problem is that MVO has no way of distinguishing signal from noise, so what often happens is that the optimiser amplifies the noise in the input. This is why there is a ‘running joke’ in the portfolio optimisation literature that a mean-variance optimiser is really just an “error maximiser” [1].

In practice, I have found that standard MVO can perform at least in line with the benchmark for equities, but for other asset classes (particularly those which are primarily driven by speculation, like cryptocurrency), the mean historical return is a useless estimator of future return. However, in the next section we will see a simple way round this.

3. You can do better than 1/N

A significant body of research suggests that, in light of the aforementioned failures of MVO, we should instead use 1/N portfolios [2], which often beat efficient frontier optimisation significantly in out-of-sample testing.

However, an interesting paper by Kritzman et al. (2010) finds that there is nothing special about 1/N diversification: it is just that expected returns are such a poor estimator that any optimisation scheme relying on them is likely to go astray.

The easiest way to avoid this problem is to not provide the expected returns to the optimiser, and just optimise on the sample covariance matrix instead. Effectively we are saying that although previous returns won’t predict future returns, previous risks might predict future risks. This is intuitively a lot more reasonable – the sample covariance matrix really seems like it should contain a lot of information. Empirical results support this, showing that minimum variance portfolios outperform both standard MVO and 1/N diversification.
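As a rough illustration, the simplest minimum variance portfolio (constrained only so that the weights sum to one) even has a closed form, $w \propto \Sigma^{-1}\mathbf{1}$ – a minimal numpy sketch on synthetic data; real use would add constraints such as no shorting:

import numpy as np
import pandas as pd

# Synthetic daily returns purely for illustration – in practice use real price data
rng = np.random.default_rng(0)
returns = pd.DataFrame(rng.normal(0, 0.01, size=(500, 4)), columns=["A", "B", "C", "D"])

S = returns.cov().values            # sample covariance matrix
ones = np.ones(S.shape[0])
w = np.linalg.solve(S, ones)        # proportional to Sigma^{-1} 1
w /= w.sum()                        # normalise so the weights sum to one
min_var_weights = pd.Series(w, index=returns.columns)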

In my own work I have found that the standard minimum variance portfolio is a very good starting point, from which you can try a lot of new things:

  • Shrinkage estimators on the covariance matrix (see the sketch after this list)
  • Exponential weighting
  • Different historical windows
  • Additional cost terms in the objective function (e.g. a small-weights penalty)
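For instance, the first of these is easy to experiment with via scikit-learn’s Ledoit–Wolf estimator – a sketch, continuing with the hypothetical returns DataFrame from above:

from sklearn.covariance import LedoitWolf

# Blend the sample covariance with a structured target to reduce estimation error
lw = LedoitWolf().fit(returns.values)
S_shrunk = lw.covariance_                 # shrunk covariance matrix (same shape as the sample cov)
print("shrinkage intensity:", lw.shrinkage_)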

4. Don’t ignore rebalance costs

When it comes to investment strategy or algorithmic trading, most people leave the “realities” like commission, slippage, and latency to the last stage. However, because efficient frontier portfolios generally have high turnover, this can be a dangerous mistake.

In the context of portfolio management, the two main realities that must be accounted for in all backtests are commission and slippage. Latency is not very important because of the longer time horizons involved. Commissions are not hard to analyse: just multiply your broker’s percentage commission by the turnover. Slippage, on the other hand, is a little more subtle (though it is only really an issue in less liquid markets). There are a number of variables that affect slippage, but the important ones are transaction volume and urgency (refer to models such as that of Almgren for more).
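As a back-of-the-envelope sketch of the commission term (all numbers hypothetical):

import numpy as np

old_weights = np.array([0.25, 0.25, 0.25, 0.25])    # hypothetical current allocation
new_weights = np.array([0.40, 0.10, 0.30, 0.20])    # hypothetical target allocation
commission_rate = 0.001                              # e.g. 10bp per trade

turnover = np.abs(new_weights - old_weights).sum()   # fraction of the portfolio traded
cost = commission_rate * turnover                    # drag as a fraction of portfolio value
print(f"turnover: {turnover:.0%}, commission cost: {cost:.3%}")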

Specifically, in the case of cryptoasset portfolios, one major problem is the difference in liquidity between bitcoin and a small altcoin like OmiseGo. A portfolio that shifts positions too much among the altcoins may look like it performs well in backtests, but when traded in real life the 0.5% slippage will be a killer.

Once all transaction costs have been included, you may find that it can be difficult to outperform a buy-and-hold market-cap weighted benchmark. Incidentally, this is another failure with 1/N portfolios: for reasonably volatile assets, there is high turnover at each rebalance and thus a large transaction cost.

5. Overfitting

One thing that I respect a lot about machine learning education is that it is fundamentally honest about one of the major issues in the subject: overfitting. Essentially, it is possible to encourage most classifiers to wiggle their decision boundary to accommodate all of the training examples at the cost of generalisation ability. This is done by adjusting the hyperparameters (which do things like specify learning rates) and seeing which ones perform better. Hyperparameter tuning is not a bad thing, but it must be done responsibly, lest we simply choose the parameters that best fit the particular train/test setup.

When it comes to standard portfolio optimisation, one doesn’t really have to worry about this because there aren’t any hyperparameters. However, once you start adding different terms to the cost function (each scaled by some parameter), things get a little more complicated. For example, let us consider the simple case of minimum variance optimisation with an L2 regularisation term.
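Concretely, the objective is something along these lines (a sketch – the exact penalty scaling and constraints may differ):

$$\min_{w} \; w^T \Sigma w + \gamma \sum_i w_i^2 \quad \text{subject to} \quad \sum_i w_i = 1$$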

The performance of this portfolio varies greatly depending on the choice of $\gamma$. A natural instinct is to run the optimisation for ten different values of $\gamma$ and keep whichever performs best, but unless this is done carefully, we are probably just fitting to the noise in the particular train/test split.

Remember that, ceteris paribus, a simpler model should be preferred. Black-Litterman, or optimisation with an exotic objective involving multiple parameters, may seem to improve performance on the backtest, but the proof will always be in the out-of-sample pudding. Approach portfolio optimisation like you would any financial machine learning task: with a healthy dose of skepticism.

Conclusion

I still have a lot to learn about portfolio optimisation, particularly with respect to the optimisation of higher moments (skew and kurtosis) and things like copulas, but the lessons highlighted in this post definitely still apply. If you found this interesting, check out:

PyPortfolioOpt

References

  1. Michaud, R. (1989). “The Markowitz Optimization Enigma: Is Optimization Optimal?” Financial Analysts Journal 45(1), 31–42. 

  2. DeMiguel, V., Garlappi, L. & Uppal, R. (2009). “Optimal Versus Naive Diversification: How Inefficient is the 1/N Portfolio Strategy?” Review of Financial Studies 22(5), 1915–1953. doi:10.1093/rfs/hhm075.


Exponential Covariance

For the past few months, I have been doing a lot of research into portfolio optimisation, whose main task can be summarised as follows:

Is there a way of combining a set of risky assets to produce superior risk-adjusted returns compared to a market-cap weighted benchmark?

The answer of Markowitz (1952) is in the affirmative, with some major caveats. Given the expected returns and the covariance matrix (which encodes asset volatilities and correlations), one can find the combination of asset weights which maximises the Sharpe ratio. But because we don’t know the expected returns or future covariance matrix a priori, we commonly replace these with the mean historical return and sample covariance matrix. The problem is that these are very noisy estimators (especially the mean return), so much so that a significant body of research suggests that a naive diversification strategy (giving each asset equal weight) outperforms most weighting schemes. However, work by Kritzman, Page and Turkington (2010) affirms the intuition that there must be some information in the sample covariance matrix; accordingly, they observe that minimum variance portfolios can beat 1/N diversification.

In my own research, I have found this to be true. Minimum variance optimisation and its variants (no pun intended) can significantly outperform both 1/N diversification and a market-cap weighted benchmark. The nice thing about minimum variance optimisation is that success largely depends on how well you can estimate the covariance matrix, which is easier than estimating future returns. The most common methods are

  • Sample covariance – standard, unbiased and efficient, but it is known to have high estimation error which is particularly dangerous in the context of a quadratic optimiser.
  • Shrinkage estimators – pioneered by Ledoit and Wolf, shrinkage estimators attempt to reduce the estimation error by blending the sample covariance matrix with a highly structured estimator.
  • Robust covariance estimates – estimators that are robust to recording errors, such as Rousseeuw’s Minimum Covariance Determinant.

I have been experimenting with a new alternative, which I call the exponential covariance matrix (to be specific, the exponentially-weighted sample covariance matrix). In this post, I will give a brief outline of the motivation and conceptual aspects of the exponential covariance. However, because I am currently using this professionally, I will be intentionally vague regarding implementation details.

Motivation

In many technical indicators, we see the use of an exponential moving average (EMA) rather than the simple moving average (SMA). The EMA captures the intuition that recent prices are (exponentially) more relevant than older prices. If we let $p_0$ denote today’s price, $p_1$ yesterday’s price, and $p_n$ the price $n$ days ago, the exponentially weighted mean is given by:
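$$\text{EMA} = \alpha p_0 + \alpha(1-\alpha)p_1 + \alpha(1-\alpha)^2 p_2 + \ldots = \sum_{i=0}^{\infty} \alpha(1-\alpha)^i p_i$$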

$\alpha$ parameterises the decay rate ($0 < \alpha < 1$): higher $\alpha$ gives more weight to recent observations, while lower $\alpha$ causes the exponential mean to tend towards the arithmetic mean. Additionally, because $1/\alpha = 1 + (1-\alpha) + (1-\alpha)^2 + \ldots$, the weights $\alpha(1-\alpha)^i$ sum to one.

In practice, we do not compute the infinite sum above. Rather, observing that the weights rapidly become negligible, we limit the calculation to some window. This window is not to be confused with the span of the EMA, which is another way of specifying the decay rate – a good explanation can be found on the pandas documentation.
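In pandas, for instance, a minimal sketch might look like this (with a synthetic price series purely for illustration):

import numpy as np
import pandas as pd

# Synthetic daily price series, just to have something to run the methods on
prices = pd.Series(100 + np.random.randn(500).cumsum(),
                   index=pd.date_range("2017-01-01", periods=500))

sma = prices.rolling(window=20).mean()   # simple moving average over a 20-day window
ema = prices.ewm(span=20).mean()         # exponential moving average with a 20-day span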

The EMA is useful because it ‘reacts’ to recent data much better than the SMA owing to the exponential weighting scheme, while still preserving the memory of the timeseries.

Covariance

Covariance, like correlation, measures how two random variables X and Y move together. Let us think about how we would define this metric. For variables that have high covariance, we would expect that when $x$ is high, so is $y$. When $y$ is low, $x$ should be too. To capture this idea, we can proceed as follows.

For each pair $(x_i, y_i)$ in the population, we measure how far away $x_i$ and $y_i$ are from their respective means, then multiply these distances together. If these differences have the same sign, i.e $x_i$ is greater than $\bar{x}$ and $y_i$ is greater than $\bar{y}$, there is a positive contribution to the covariance. If $x_i$ is less than $\bar{x}$ but $y_i$ is greater than $\bar{y}$, there is a negative contribution. We can then sum over all observations, and divide by the number of observations to get some kind of ‘average co-variation’. In fact, this is the exact definition of the population covariance:
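$$\text{Cov}(X, Y) = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$$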

Please note that in practice you would always use the sample covariance instead, which has a factor of $1/(N-1)$ rather than $1/N$, but this is not something we will worry about here.

Covariance of asset returns

One would think that the easiest approach is to take two price series (e.g. stock prices for AAPL and GOOG), then compute the daily percentage change or log returns, before feeding these into a covariance calculation. This does work, and is the standard approach. But in my view, this is carelessly throwing away a good deal of information, because:

Covariance does not preserve the order of observations.

You will get the same covariance whether you provide $(x_2, y_2), (x_{17}, y_{17}), (x_8, y_8), \ldots$ or $(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots$.

However, in the case of time series, the order of the returns is of fundamental importance. Thus we need some way of incorporating the sequential nature of the data into the definition of covariance. Fortunately, it is simple to apply our intuition of the EMA to come up with a similar metric for covariance.

We will rewrite our previous definition as follows:
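$$\text{Cov}(X, Y) = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$$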

Rather than letting $(x_i, y_i)$ be any observations from the dataset, let us preserve the order by saying that $(x_i, y_i)$ denotes the returns of asset X and Y $i$ days ago. Thus $(x_1 - \bar{x})(y_1 - \bar{y})$ specifically refers to the co-variation of the returns yesterday.

The next step should now be clear. We simply give each co-variation term an exponential weight as follows:
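$$\text{Cov}_{\text{exp}}(X, Y) = \sum_{i=1}^{N} \alpha(1-\alpha)^{i-1} (x_i - \bar{x})(y_i - \bar{y})$$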

Or more simply:
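$$\text{Cov}_{\text{exp}}(X, Y) = \text{EMA}\left[(x_i - \bar{x})(y_i - \bar{y})\right]$$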

And we are done! This simple procedure is all that is required to incorporate the temporal nature of asset returns into the covariance matrix.
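To make the idea concrete, here is a minimal numpy sketch of one possible implementation, following the weighting above (my production implementation differs in the details):

import numpy as np
import pandas as pd

def exp_cov(returns: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """Exponentially weighted covariance matrix; row 0 of `returns` is the oldest observation."""
    X = returns.values - returns.values.mean(axis=0)       # de-mean with the simple mean, for simplicity
    n = len(X)
    weights = alpha * (1 - alpha) ** np.arange(n)[::-1]     # most recent row gets the largest weight
    weights /= weights.sum()                                # normalise the truncated weights
    S = (X * weights[:, None]).T @ X                        # weighted sum of co-variation terms
    return pd.DataFrame(S, index=returns.columns, columns=returns.columns)

For comparison, pandas’ built-in DataFrame.ewm(...).cov() implements a closely related weighting scheme.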

Conclusion

This post has presented a modification of the covariance matrix especially suited to time series like asset returns. It is simple to extend this to an exponential covariance matrix and use it in portfolio optimisation – it is reasonable to suggest that this matrix will be positive definite if the sample covariance matrix is positive definite.

I have used the exponential covariance to great effect in portfolio optimisation on real assets. Backtested results have affirmed that the exponential covariance matrix strongly outperforms both the sample covariance and shrinkage estimators when applied to minimum variance portfolios.

At some stage in future, I will consider implementing this in my portfolio optimisation package PyPortfolioOpt, but for the time being this post will have to suffice. Please drop me a note if you’d like to use the ideas herein for any further research or software.

Update: as of 20/9/18, exponential covariance has been added to PyPortfolioOpt!


Stormy Seas for Proof of Work

In this post we will be examining one of the main problems with Proof of Work (PoW) – not the energy inefficiency (as it is debatable how much of a problem this really is), but something more fundamental with the consensus process. In the past couple of months we have seen a number of cryptocurrencies fall victim to 51% attacks. Verge, Bitcoin Gold, ZenCash, and Electroneum are just a few coins that have been targeted, resulting in a total equivalent theft of $5 million (not to mention the subsequent loss in market value of the coins).

51% attacks are a basic problem in distributed ledger technology, covered in any crypto 101 course. Essentially, each individual node in a decentralised network is responsible for validating transactions and optionally submitting a block of these transactions to the blockchain – doing so requires the node to solve hash puzzles (this is why it is called Proof of Work). The beauty of this system is that each node has a say in what happens, proportional to the amount of hash power they contribute – thus the system is a democracy of sorts. However, a natural corollary of this is that any node or group of nodes that achieves a majority of the hash power can ‘outvote’ the rest of the network, allowing them to conduct a 51% attack.

Standard theory dictates that if there are enough independent nodes on a distributed ledger, we can reap the benefits of democracy while knowing that it would be immensely costly for a malicious party to achieve 51% of the hash power. This may be true for cryptocurrencies with many active nodes (like Bitcoin and Ethereum), but with the proliferation of multitudinous altcoins, we may not be able to say the same for more obscure tokens. It is thus the case that 51% attacks have gone from being a textbook problem to something very real, actively destroying both reputation and value in the crypto space.

In this post, we will explore how an attacker might go about conducting a 51% attack, examine the features that make a coin most vulnerable, and comment on prevention and mitigation. An obvious disclaimer: nothing in this post should be construed as a recommendation or a practical guide on conducting a 51% attack.

Malicious actors

It is no secret that cryptocurrency has attracted a large number of malicious parties, ranging from exchange-hackers to phishing scammers. Part of the reason for this is the underlying anonymity/pseudonymity of the system. If someone were to steal one million USD, they would likely require a network of offshore bank accounts to get away with it. However, if you were to steal 200BTC, all you’d have to do is convert it to a privacy coin like Monero on any one of the exchanges that doesn’t require KYC, and law enforcement would have a very hard time catching up with you.

In this post we are dealing with a narrow subset of malicious actors: those who play by the rules. There are multiple ways a 51% attacker can ‘legally’ (in terms of the blockchain protocol) use their majority hash power to behave maliciously, one example being the denial of service to users in a network. But by far the most direct way of benefiting at the cost of others is to double spend, which is when the same coins are sent to multiple parties in the network.

How a 51% attacker might double spend

Once an attacker has identified a suitable target (more on this later), this is a rough sketch of how they might profit. Again, this is purely educational and hypothetical – by no means is it a practical guide.

I will refer to the target coin as TCOIN. Many of these steps also require an anonymous “exit coin” – I will use Monero (XMR) as an example.

  1. Acquire some TCOIN anonymously, e.g via an offline swap of fiat for XMR then XMR for TCOIN.
  2. Set up an account with exchange A and exchange B. Clearly these exchanges should have minimal KYC.
  3. Send 1000 TCOIN from your TCOIN address to that of exchange A, then immediately cash it out to XMR.
  4. Acquire 51% of the TCOIN network’s hash power, then make a new TCOIN transfer to exchange B. This transaction should be included on a chain that orphans the block containing your deposit to exchange A, so although exchange A thinks it has received your TCOIN (which you have already cashed out to XMR), in reality it is exchange B that has received the TCOIN.
  5. On exchange B, convert TCOIN to XMR and send it to your monero wallet.

In general terms, this describes how the double spend lets you manufacture 1000 TCOIN from thin air (at the expense of the first exchange). An optional additional step is to first short TCOIN, because we have seen that 51% attacks severely reduce public trust in the coin’s development team and tend to lead to a sudden price drop.

Thus far this has all been theoretical – we have presented a textbook situation of how a double spend might work in theory. In the next section, we will understand the worrying practical reality of the situation.

Vulnerability of different coins

One would think that a coin’s market cap and its overall hash rate should be strongly correlated – that is, the demand for a coin should track the ‘strength’ of the network validation. However, empirically this is not the case. Because of the rise of ill-informed speculation, we see that some coins with a high market cap (a few hundred million USD) have a surprisingly low hash rate. This is the clear yet disturbing message of crypto51.app, a simple site which displays the cost of becoming a 51% attacker (over 1 hour) for different PoW cryptocurrencies.

I submit that there are four main features an attacker would look for in an ideal target coin, apart from the obvious factor of having a smaller hash rate (and thus a lower attack cost).

  • High market cap (as a proxy for interest/liquidity in the coin)
  • Low transaction time (i.e. few required block confirmations)
  • Support on KYC-free exchanges
  • A common hashing algorithm, or at least one that is supported on a mining marketplace like NiceHash. The advantage of this for an attacker is that they can just rent hash power anonymously rather than having to acquire hardware.

Using Python’s requests and BeautifulSoup, I scraped the data from crypto51 in order to do my own analysis (available as a Jupyter notebook) of which coins were most vulnerable:

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Download the crypto51 table and parse it into a DataFrame
r = requests.get("https://www.crypto51.app/")
soup = BeautifulSoup(r.text, "lxml")

table = soup.find("table")
headers = [th.text.strip() for th in table.find("thead").find_all("th")]

data = []
for row in table.find("tbody").find_all("tr"):
    data.append([td.text.strip() for td in row.find_all("td") if td])

df = pd.DataFrame(data, columns=headers)

I then processed the data and generated a plot of the attack cost versus the market cap, coloured by NiceHash-ability (all log-transformed). This is very easy to do using pandas plotting:

df.plot.scatter("Log Market Cap", "Log Attack Cost", 
                c="Log NiceHash", colormap="plasma")

In a graph like this, a target coin should be as close to the lower-right quadrant as possible (high market cap but low attack cost). The lighter the datapoint, the easier it is to attack via NiceHash, which may or may not be important to an attacker.

A linear trendline fits the data with $R^2 = 0.65$, and because a straight line on log-log axes corresponds to a power law relationship, we can calculate the coefficients as follows:
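A sketch of that calculation with numpy, assuming the (omitted) processing step created the log-transformed columns using natural logs:

import numpy as np

# Fit log(attack cost) = m * log(market cap) + c on the processed data
m, c = np.polyfit(df["Log Market Cap"], df["Log Attack Cost"], 1)

# A straight line on log-log axes corresponds to a power law:
# attack_cost ≈ exp(c) * market_cap ** m
print(f"attack_cost ≈ {np.exp(c):.3g} * market_cap^{m:.2f}")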

Actually, it is not the general relationship that matters but rather the specific outliers – altcoins below the trendline are those with especially low attack costs relative to their market cap.

The top portion of the graph is self-explanatory. In the bottom-left quadrant are coins that are easy to attack, but also have such small market caps that they are likely not worth attacking (e.g. because of very poor exchange support). I thus determine the danger zone to be coins with slightly higher market caps (on the order of USD 10 million to 100 million). However, I happened to notice that some of the recently attacked altcoins (listed in the introduction) formed a narrow band slightly above my danger zone. Perhaps this is a liquidity sweet-spot that makes it easier for attackers to exit onto another exchange, or perhaps it’s a coincidence.

In any case, if I were holding something like BTCP (Bitcoin Private), I’d be a little bit worried. It has a market cap of more than \$200 million, and reasonably liquid trading pairs, but costs less than \$500 an hour to attack. If the attacker were unwilling to set up hardware, they might prefer to target something like EMC2 (Einsteinium), which has a higher NiceHash-ability.

Now, in a perfectly efficient market, one would imagine that the potential financial benefit from conducting a 51% attack should never outweigh the financial cost of doing so. However, recent attacks show that this is clearly not the case (especially because one can now short the target coin prior to the attack), so there is a clear inefficiency in the crypto markets.

What can we do about it?

An obvious solution is to ditch PoW and go with Proof of Stake (PoS). Much easier said than done. Arguably the most mature PoS development effort is that of Ethereum, in the form of Casper. Yet despite the undeniable talent of the dev team, solving the nothing-at-stake problem and ironing out the wrinkles in the implementation is not proving to be a straightforward task. Delegated PoS may be a stepping stone, but the concessions with regard to decentralisation may be a bit off-putting. So assuming that PoW is still the de-facto consensus mechanism, what can the ecosystem do to reduce 51% attacks? Here are some ideas.

Speculators/investors should consider the hash rate of any PoW coins they are looking to invest in, and be aware that a low hash rate makes that blockchain vulnerable to a 51% attack – certainly not good for their investment.

New PoW altcoins should not use the same hashing algorithm as big coins, even if it’s easier to implement – a large miner could simply point their hash power at the new blockchain and immediately become a majority. Dev teams for these projects should implement programmatic checks for 51% attacks, since a quick response can be critical, and should maintain fiat reserves that can quickly be used to add hashing power to the network if a potential attack is detected.

Exchanges should be wary of what tokens they list, and should increase the required block confirmations for transactions, making it more costly for attackers to rewrite the recent history of a blockchain. Yes, this would make it slightly more inconvenient for users, but 51% attacks can directly lead to material losses.

Conclusion

Proof of Work is a genius solution to the distributed consensus problem, and I often have to remind myself just how amazing the original bitcoin protocol is. But I believe that in PoW, 51% attacks will always be possible – they are a direct consequence of its democratic design. For large decentralised networks like those of Bitcoin or Ethereum this is not a problem, but owing to the speculative interest in cryptocurrencies, the demand for some altcoins has separated entirely from the security of their networks, resulting in a kind of arbitrage which allows malicious actors to profit by conducting a double-spending 51% attack.

This post’s analysis of coin vulnerability has not been perfect: the main flaw is our use of market cap as a proxy for the profitability of a 51% attack. In reality, the profitability is a function of block confirmation times and exchange liquidity. Market cap is arguably correlated with the two, but it is not as direct.

I don’t claim to have all the solutions to the problem of 51% attacks, but it is evident that no single party can solve the issue alone. Until the different players (miners, users, exchanges, investors) in the ecosystem contribute their part, coins with low hash rates will continue to be low hanging fruit for the aspiring 51% attacker; I’d wager that we’ll see a few more big attacks before the year ends.