Evolving cellular automata to solve problems, part 2

We will be picking up where the previous post left off. As a brief summary, we are attempting to replicate the results of Evolving Cellular Automata with Genetic Algorithms (Mitchell, Crutchfield and Das 1996), dealing with the density classification task for 1D binary cellular automata (CAs). To put it simply, we are trying to design a ruleset such that the final configuration of a cellular automaton after M iterations is either all 1s or all 0s depending on which class was more common in the initial state. The caveat is that each cell in the universe can only make its decision based on the three neighbours to the left and right.

We examined the failures of the majority ruleset, an ‘obvious’ solution in which each cell updates to the local majority among its neighbourhood, and thus see the need for something more sophisticated. At the same time, we saw that the space of possible solutions is in the order of $10^{38}$, meaning that any kind of brute force method will not be feasible. This invites the use of genetic algorithms (GAs), with which we could evolve a CA to perform this task without ever explicitly devising a ruleset.

Genetic algorithms in the context of CAs

A complete treatment of genetic algorithms is not in the scope of this post (the wikipedia page is a good start), but I will attempt to cover the salient points:

  • We start with a population of different possible rulesets. This is a very important point – the individuals in the population are not the cellular automata, they are what we are trying to optimise, which in this case is the ruleset.
  • Rulesets in the population have different fitnesses, corresponding to how well they solve the $\rho_c = 1/2$ task. We will define the fitness F to be the proportion of random ICs on which the ruleset succeeds (becomes all 1 or all 0 correctly). So if we test a ruleset on 100 random ICs and 35 are classified correctly, $F = 0.35$
  • In each generation, we compute the fitness for each ruleset in the population. The fittest rulesets (the elites) are cloned to the next generation, but we also crossover (i.e breed) these rulesets to form offspring, which are then mutated before joining the next generation.
  • Over many generations, we should see the fitness increase, as the traits of the favourable rulesets are passed onto the next generation.

The GA acts as a more efficient way to traverse the huge space of possible rulesets in search of one that can suitably solve the density classification task – it does not guarantee that the global optimum will be reached.

Implementation

For the most part, the functions used in this post are the same as in part 1, though took care to make obvious optimisations because the evolution of the CAs will involve repeated calls of the same functions.

The initial population consists of random rulesets. To generate a single random ruleset, we zip the the rule inputs (7-bit binary strings) with random 1s or 0s which are obtained from the uniform_random_binary function we defined. For convenience, we can uniquely refer to a given ruleset with a hexadecimal ID, which is made by converting the binary string of all rule outputs (assuming ordered inputs) into hexadecimal.

def random_ruleset(n_rules):
    rule_values = uniform_random_binary(n_rules)
    hex_id = hex(int(rule_values, 2))
    return hex_id, dict(zip(rule_keys(n_rules), rule_values))

To generate the initial population of 100 (or more generally, n_ic) ICs, it is not as simple as running random_ruleset 100 times because we want exactly half of the ICs to have a majority of 1s.

def generate_random_ICs(universe_size, n_ic=100):
    ics = []
    cutoff = int(0.5 * universe_size) + 1
    # Exactly half of the initial conditions will have p < 0.5
    for _ in range(n_ic // 2):
        minority = uniform_random_binary(universe_size, 0, cutoff)
        majority = uniform_random_binary(universe_size, cutoff, 
                                         universe_size + 1)
        ics.append(minority).append(majority)
    return ics

Prior to implementing crossover and mutation, we will define a helper function that converts a hex ID into a ruleset, by converting the hexadecimal to binary and zipping this with a list of 7-bit rule inputs:

def hex_to_ruleset(hex_id, n_rules=128):
    rule_values = bin(int(hex_id, 16))[2:].zfill(n_rules)
    return dict(zip(rule_keys(n_rules), rule_values))

As discussed in the paper, crossover involves pointwise swaps between two parent rulesets, and mutation involves randomly switching bits.

def crossover_and_mutate(hex1, hex2, n_rules=128, 
                         n_crossover_points=1, n_mutations=2):
    bin1 = list(bin(int(hex1, 16))[2:].zfill(n_rules))
    bin2 = list(bin(int(hex2, 16))[2:].zfill(n_rules))
    # Crossover: get indices of unequal, then
    # randomly select an index to swap
    unequal_indices = [i for i, digit in enumerate(bin1) if 
                       bin2[i] != digit]
    if len(unequal_indices) > n_crossover_points:
        crossover_points = random.sample(unequal_indices, 
                                         n_crossover_points)
        for c in crossover_points:
            bin1[c] = bin2[c]
    # Mutation
    for _ in range(n_mutations):
        idx = random.randrange(n_rules)
        # Flip bit
        bin1[idx] = str(int(not bool(bin1[idx])))
    return hex(int("".join(bin1),2))

Metaparameters

As with most optimisation methods, genetic algorithms have a number of metaparameters that can drastically change the resulting behaviour, for example the number of generations, the mutation rate, the size of the initial population of chromosomes, etc.

GEN = 50
N_CHROMOSOMES = 100  # number of rulesets in each generation
N_CROSSOVER_POINTS = 1
N_MUTATIONS = 2

In addition, we shouldn’t forget the CA parameters:

N = 59  # universe size
MAX_ITER = 100  # max number of iterations to perform the task
N_IC = 100  # number of ICs each ruleset is tested on
N_RULES = 128  # number of rules in each ruleset

These are mostly similar to the choices of Mitchell et al., though I have opted for fewer generations and a much smaller universe (they use 100 and 149 respectively), which should be faster to compute and serve as a proof-of-concept.

It is important to note that genetic algorithms can be quite sensitive to initial conditions, which is why we will parameterise each run by a random seed.

Evolution

We are now ready to implement the actual genetic algorithm. The first thing to do is to set the random seed and generate the initial population. We will use a dictionary, with the keys being hex IDs and values being the rulesets.

random.seed(0)
population = {}
for _ in range(N_CHROMOSOMES):
    id, ruleset = random_ruleset(N_RULES)
    population[id] = ruleset

In each generation, we will first compute the fitness of each ruleset on 100 random ICs:

generation_ics = generate_random_ICs(N, N_IC)
fitness_log = []

for hex_id, chromosome in population.items():
    # Evaluate performance on random ICs.
    fitness = 0
    for config in generation_ics:
        # Count the majority
        majority = config.count("1") > len(config) // 2
        # Iterate automaton
        hist = iterate_automata(config, chromosome, MAX_ITER, 
                                verbose=False)
        # If the CA has classified correctly, increment fitness
        if hist[-1] == str(int(majority)) * len(config):
            fitness += 1
    print("hex_id", hex_id, "  Fitness", fitness)
    fitness_log.append((hex_id, fitness))

Next, we determine who are the elites (the top 20 rulesets):

# Find top 20 in population
fitness_log.sort(key=lambda x: x[1], reverse=True)
elites = [i[0] for i in fitness_log[:20]]
print("Generation best:", fitness_log[0])
print("Least fit elite:", fitness_log[20])

As suggested by the paper, we will copy these elites directly to the next generation, then form the rest of the population by breeding these elites with replacement.

# Copy the top 20 to the next generation
new_population = {}
for i in elites:
    new_population[i] = hex_to_ruleset(i)

# Crossover and mutate random pairs of elites
for _ in range(N_CHROMOSOMES - 20):
    parent1, parent2 = random.sample(elites, 2) 
    child_hex = crossover_and_mutate(parent1, parent2, 
                                     N_CROSSOVER_POINTS, 
                                     N_MUTATIONS)
    new_population[child_hex] = hex_to_ruleset(child_hex)

population = new_population

We then repeat the above for many generations, and hopefully should see the fitnesses increase.

Initial observations

Judging by the success of the previous post, I was expecting the script to work immediately and yield interesting results. Alas, programming is rarely so simple. The genetic algorithm was slow (I’m used to the 5-10 seconds it takes to train most sklearn models), but what is worse, it was not able to solve the task. After 50 generations, the GA has not managed to beat random guessing:

Gen 49 
==========
hex_id 0x455d652de3a5534054ed51a12093d25   Fitness 50
hex_id 0x455d652de3a5534054ec51a12093d27   Fitness 50
hex_id 0x455d652de3a5534054ed51212093d25   Fitness 50
hex_id 0x455d652de3a55b40146d51a12093d27   Fitness 50
...
Generation best: ('0x455d652de3a5534054ed51a12093d25', 50)
Least fit elite: ('0x455d652de3a55b40146d51a12093d277', 50)

A closer analysis revealed that many of the final rulesets corresponded to the naïve strategy of mapping any possible input to a 1, which results in the ruleset being correct about 50% of the time.

It should be noted that Mitchell et al. trained their genetic algorithm across 300 different runs, so it could be the case that this was one of the unlucky runs in which the GA fails to learn. But testing multiple random seeds yielded no further improvements.

A possible fix

I hypothesised that the initial population contained too many rulesets that had a clear majority of 1s or 0s, disproportionately mapping their input to one of the two regardless of the IC. Remember that this was by design, we chose to sample rulesets and ICs from a distribution uniform over the density of 1s rather than the unbiased distribution which is binomial over the density of 1s. In light of the poor empirical results, I modified the random_ruleset function to use the unbiased distribution instead:

def random_ruleset2(n_rules):
    random_binary = bin(random.randint(0, 2**n_rules - 1))
    rule_values = str(random_binary)[2:].zfill(n_rules)
    hex_id = hex(int(rule_values, 2))
    return hex_id, dict(zip(rule_keys(n_rules), rule_values))

A brief interjection: because it seemed that I would have to test multiple parameters, I refactored the GA script into a function:

evolve_automata(59, n_crossover_points=1, n_mutations=6,       
                random_seed=0, n_ic=100, n_gen=50, 
                ruleset_generator=2)

Using this new ruleset generator, we are much less likely to have naïve outputs of all 1s or all 0s. My guess was that although this would result in the GA starting with poorer fitnesses, they would not get ‘stuck’ with naïve random-guessing strategies. However, testing revealed that fitness still converged to 50%, just at a much slower rate.

To combat this, I reasoned that the simplest way to escape local optima is to pump up the mutation rate in the hope that elites will be found that beat random guessing, and these will pass on their traits to the next generation.

Finally, we achieved some progress. I noticed that the combination of a high mutation rate and an unbiased random ruleset meant that the GA didn’t always get stuck with a 50% fitness. For N_MUTATIONS = 6, tested over 20 different seeds, mean elite fitnesses in the last 10 generations were:

[83.5, 52.2, 92.6, 51.2, 51.1, 50.0, 51.5, 51.5, 51.7, 79.2, 
 83.1, 51.7, 73.5, 57.2, 51.6, 51.2, 52.1, 50.0, 78.7, 82.9]

Eight of the 20 runs resulted in fitnesses beyond the 50% level, however only one of the 20 beat 90%. This falls very far short of Mitchell et al’s claim to have at least 90% fitness on all 300 of the runs, but it is certainly progress.

Interestingly, increasing N_MUTATIONS to 12 produced worse results, with the maximum mean elite fitness being 80.2 and only six of the 20 runs resulting significantly beating 50%. N_MUTATIONS=18 does even worse, with only one of the 20 beating 50%. I think this is because having many mutations reduces the impact of crossover (which is only at a single point), so we don’t get the benefits of ‘meiosis’.

Analysis of evolved CAs

We will examine one of the rulesets which achieved a fitness of above 90%. For some different initial configurations, this is what we observe:

Remember that purple is alive and white is dead; I have displayed the actual majority class above each subplot. Thus the fifth subplot represents an error, because despite the majority of 1s, the final state is all 0s.

As in the paper, the ruleset works by quickly reaching a steady state of all 0s unless large blocks of 1s are found – it is not really the sophisticated emergent behaviour that I was hoping to see. The only exception is the first subplot, in which some semblance of the particle strategy is achieved.

Conclusion

Though we did achieve some interesting results, I don’t think that this exploration overall was a success. I could not replicate the results of Mitchell et al., even using a simple universe of 59 cells rather than 149. The default setup used in the paper kept getting stuck at 50% fitness, and it was only after I deviated from their suggestions that I managed to do better.

I have become somewhat disillusioned regarding the scientific value of the paper (though I still find it fascinating). The sophisticated particle strategy discovered by Mitchell et al. is remarkable, yet they seem to downplay the fact that it was observed only in a few of the 300 runs of the genetic algorithm. This is hardly reproducible, and really feels to me like grabbing at patterns in the random darkness.

Additionally, the evolution of CAs with genetic algorithms is a very slow process (I had to leave my code running over a few nights), and the results are very inconsistent. I can only imagine how long it would have taken for Mitchell et al. to conduct 300 runs with a universe size of 149 (before the turn of the millennium!). This severely limits the practical applications, and the prospect that evolved CAs could tackle more complex tasks like image processing (as suggested in the paper’s conclusion) seems chimerical.

Nevertheless, I am happy to have attempted to replicate the results, because it has been a while since I’ve played around with genetic algorithms. It was also some extra practice with python’s standard language features, of which I make extensive use. Perhaps some time in future I may realise that my implementation had a flaw and author a part 3, but for now I’m content to stay in this local optimum.

The code for this post, along with plots and data, is available as a jupyter notebook on GitHub.


Evolving cellular automata to solve problems, part 1

Recently I finished reading Complexity: A Guided Tour, by Melanie Mitchell, which reminded me a lot of Gödel, Escher Bach (indeed, the book is dedicated to Douglas Hofstadter). It has reminded me that emergence is an incredibly fascinating concept – simple individual units somehow coming together to result in complex behaviour that cannot really be explained in terms of the components. One of the first examples presented is the army ant: even in groups of 100, these ants are completely useless and will walk in circles until they die of exhaustion. But in a colony, they show remarkable sophistication and a “superintelligence”. I would highly recommend the article Army Ants: a Collective Intelligence, by Nigel Franks, and this short David Attenborough clip for more information. Apart from ants, Mitchell also makes consistent reference to cellular automata, the study of which really led to the growth of complexity science as a distinct field.

Cellular automata (CAs) have managed to divert the attention of many great scientists – Stephen Wolfram considers them the foundation of a “new kind of science”. I am clearly not one of these great thinkers, because I have never really understood what the fuss is all about. Although I have written about Conway’s Game of Life (a special case of CAs) in a previous post), I admit that I was more interested in the implementation than in the CAs.

But Complexity has given me a fascinating new perspective. In Chapter 11, Mitchell introduces the idea that CAs can be designed (or should I say evolved) to solve computational tasks. Although it is widely known that the Game of Life is Turing-complete (arguably more interesting than Mitchell’s idea), this always felt far too abstract for me to appreciate – the proof relies on a demonstration that some logic gates can be emulated, rather than actually constructing a nontrivial program. However, in Evolving Cellular Automata with Genetic Algorithms (the journal article on which Chapter 11 is based), Mitchell and coauthors tangibly demonstrate that genetic algorithms can be used to evolve a CA to perform a simple but nontrivial task.

The paper is very interesting, but does not come with source code (in their defense, it was written in 1996 and they refer to the internet as the “World Wide Web”). I think that their work deserves a modern recreation and analysis, with the benefit of intuitive code in python and more processing power.

Because this is quite a large topic, I will split it into parts. Today’s post will set up the problem, dealing with notation and terminology, as well as discussing the nature of the CAs considered. We will leave it to a future post to actually apply genetic algorithms to evolve CAs.

Contents

A quick introduction to cellular automata

A cellular automaton can be thought of as a set of units (cells) whose states change based on simple rules. In the Game of Life, the universe is a 2D grid of binary cells (either alive or dead), with the update rules as follows:

  • A living cell (black) will die if it has fewer than two live neighbours or more than three live neighbours – roughly corresponding to underpopulation and overpopulation respectively
  • A dead cell (white) will come alive if it has exactly three living neighbours (captures the idea of reproduction).

Note that to begin applying the rules, some initial condition must be specified – in practice, small changes in this starting point can drastically affect the resulting behaviour.

For our purposes we will be dealing with an arguably simpler model: a 1D binary CA. In this case, the universe consists of a row of N cells, each of which can either be dead (0) or alive (1), and is updated by looking at the radius R of surrounding cells. So if $R=3$, any update rules must map a 7-bit input (the 3 cells on either side, and the cell itself) to a 1-bit output (whether the cell will be dead or alive). A given rule might look something like this, bearing in mind that the output bit determines the new state of the middle cell:

A ruleset is the set of mappings between all possible input configurations and their output states. We may as well begin to speak in specific terms referring to $R=3$, in which case a ruleset is a mapping between all possible 7-bit binary strings to a binary output:

Input Output
0000001 1
0000010 0
0100101 0
0100110 1
1111110 1
1111111 0

If we assume that the input is ordered, then as a convenience of notation we can say that a ruleset is equal to the binary string formed by concatenating the output bits – in the above example, this is $10…01…10$.

It is very important to be comfortable with this concept: we are representing rulesets with binary strings of length 128, because there are 128 possible 7-bit inputs. Later on, we will also be using binary numbers to represent configurations of cells, so please remember to distinguish the two.

Because we are dealing with long 1D universes, animation is probably not the visualisation tool of choice (though it works very nicely for 2D). Instead, we will use a space-time diagram, which essentially stacks the generations on top of each other (with time going down the page). Here’s an example, with a simple rule that switches input bits:

Computation

I am going to conveniently sidestep the issue of properly defining “computation”. As far as this post is concerned, computation is what a computer does – a CA “computes” if it solves some task for a variable input. I would refer you to Complexity if you’d like a meaningful discussion.

The paper’s main objective is the density classification task, in which the CA must decide whether the ‘density’ of 1s in the initial configuration is greater than some cutoff proportion. Because density is typically denoted by $\rho$, this is more precisely referred to as the $\rho_c = 1/2$ task.

To put it simply, the CA must decide whether the initial setup had more living or dead cells (1s or 0s). If there were more 1s, then after some number of iterations, we want the CA to display all 1s. Otherwise, if there were more 0s, we want the CA to display all 0s. An example mapping is shown below:

You may be questioning the difficulty of that task. After all, it’s as easy as iterating over the universe and counting the number of 1s, then determining the output based on this value! But remember the catch: each cell only has access to a fixed neighbourhood of three cells on either side. There is nowhere we can store the count: each cell can only make a decision based on its neighbours and change its state accordingly.

The majority ruleset

Before we bother with genetic algorithms, it is worth checking how a simple solution performs. For this, we will use the majority ruleset, in which each cell will just change its state to match the majority of its 7-cell neighbourhood.

Although it is conceptually straightforward, it is at this stage that we need to consider the details of the implementation in python. We will be using strings of binary digits to store configurations; they are efficient and easy to work with. At a high level, we need to do the following:

  • generate the majority ruleset (of course we are not going to hard code all 128 rules)
  • generate a random initial configuration (i.e a random binary number)
  • iterate the initial configuration according to the majority ruleset
  • plot the space-time diagram

Here are some of the global constants that we need for now, though we will introduce many more when we get onto genetic algorithms:

R = 3  # size of neighbourhood
N_RULES = 2 ** (2 * R + 1)  #  number of rules in a ruleset
N = 79  # number of cells in universe

The number of rules is $2^7 = 128$ (the number of possible 7-bit inputs). Incidentally, this means that the number of possible rulesets is $2^{128}$ – a clear indication that we will not be able to find a solution by brute force.

Generating the majority ruleset

We are going to treat rulesets as python dictionaries. This is quite natural, since a ruleset is essentially a bijective mapping between all possible 7-bit inputs (keys) and the output bits (values). We will generate the keys and values separately.

def rule_keys(n_rules, radius=3):
    rule_input = []
    for i in range(n_rules):
        binary = str(bin(i))[2:].zfill(2*radius + 1)
        rule_input.append(binary)
    return rule_input

This could also be refactored into a one-line generator expression:

rule_keys = (str(bin(i))[2:].zfill(7) for i in range(N_RULES))

But out of respect for Guido I have kept it as a function with a loop inside (which is slightly faster anyway). The [2:] is used to slice away the ugly 0b that python prepends to a binary number, and zfill pads the string with leading zeroes.

To complete the majority ruleset, we just need to know if there are more 1s or 0s in the input. We will use a bit of python trickery to make this more concise:

majority_ruleset = {}
for rule_input in rule_keys(N_RULES):
    # 1 if the majority of the neighbourhood is 1.
    out = int((sum(d == "1" for d in rule_input) > R))
    majority_ruleset.update({rule_input: str(out)})

The statement (sum(d == "1" for d in rule_input) > R is True if there are more than 3 living cells in the input (remember that R = 3), and False otherwise. We then cast this to an int (1 or 0) then a str (“1” or “0”), and update the ruleset with this key-value pair. This is an excerpt from the resulting ruleset:

 {
     '1010010': '0',
     '1010011': '1',
     '1010100': '0',
     '1010101': '1',
     '1010110': '1',
     '1010111': '1',
     '1011000': '0',
     '1011001': '1',
 }

Generating a random initial configuration

One might think that the best way to generate a random configuration is just to combine N random bits where N is the size of the universe. However, consider the density of 1s in such a configuration – it is a (literal) textbook example of a binomial distribution with $\rho \sim B(N, 0.5)$.

Notice the strong peak at 0.5, meaning that most of the initial configurations will have a roughly equal split of 0s and 1s. This is not ideal for us to explore the density classification task, because we do want to have enough easy cases (i.e clear majorities).

As such, we will adopt a slightly different strategy:

  1. Randomly choose a density of 1s that we want in the configuration (sampled from a uniform distribution).
  2. Form an ordered universe with that number of 1s, and the rest 0s
  3. Shuffle the universe

A small snippet that does this is as follows:

n_ones = random.randrange(0, N)
binary = "1" * n_ones + "0" * (N - n_ones)
result = "".join(random.sample(binary, N))

We could use random.shuffle() in the third line, but that works in-place (and requires a list), so the chosen solution is cleaner. In the end, I decided to generalise the above slightly, letting you change the range of the uniform sample and specify the length of the output string:

def uniform_random_binary(length, lower=0, upper=None):
    # The defualt upper bound is the length
    if upper is None:
        upper = length
    n_ones = random.randrange(lower, upper)
    binary = "1" * n_ones + "0" * (length - n_ones)
    return "".join(random.sample(binary, length))

Iterating an automaton

To remain general, our iteration function should take in a configuration, a ruleset, and a radius. One iteration consists of applying the appropriate rule to each bit in the universe’s configuration, which is done by getting a cell’s neighbours as a 7-bit binary string, then using the ruleset dictionary to find the output bit.

def single_iteration(config, ruleset, radius):
    next_iteration = ""
    # Support wrapping by extending either end
    extended_config = config[-radius:] + config + config[:radius]
    for i in range(radius, len(config) + radius):
        neighbours = extended_config[i-radius:i+radius+1]
        next_iteration += ruleset[neighbours]
    return next_iteration

The complexity in the above implementation arises because we need a circular universe, so we must extend the configuration to allow for ‘wrapping’.

To iterate multiple times, we repeatedly apply the single_iteration() function:

def iterate_automata(initial, ruleset, n_iter, radius=3,
                     verbose=False):
    config = initial
    history = [config]
    for _ in range(n_iter):
        config = single_iteration(config, ruleset, radius)
        if verbose:
            print(config)
        history.append(config)
    return history

This function also keeps track of the CAs history, which is necessary if we want to use a space-time diagram for visualisation.

Space-time diagrams

If we run iterate_automata() with verbose=True, it’ll print the universe at each time step, so in a sense that is a space-time diagram. However, it is fair to ask for something more stylish, for which we will use matplotlib’s imshow:

def plot_spacetime_diagram(binary_history, filename=None):
    arr = np.array([[int(digit) for digit in row]
                     for row in binary_history])
    plt.figure()
    plt.imshow(arr, cmap="Purples_r")
    if filename is None:
        plt.show()
    else:
        plt.savefig(filename)

Tying it together

Now that we have implemented the pieces, testing the majority ruleset is very simple:

random.seed(0)
initial_config = uniform_random_binary(N)
majority_history = iterate_automata(initial_config,
                                    majority_ruleset,
                                    n_iter=N, 
                                    radius=R)
plot_spacetime_diagram(majority_history)

Looking at the space-time diagrams for different random seeds shows that in general, the majority ruleset does not consistently solve the $\rho_c = 1/2$ task:

The above results are very similar to what Mitchell et al. found (though they use $N = 149$), which gives us a good indication that our implementation thus far has been accurate.

Conclusion

In this post, we have laid the framework for using cellular automata to solve a computational task, namely, classifying the density of living cells in the initial conditions.

We have examined the failings of a ‘local’ solution that naively treats a cell’s immediate neighbourhood as a proxy for the entire universe, and thus see the need for emergent computation, in which individual cells somehow learn to appreciate the rest of the universe and come up with a global solution.

In the next post, we will discuss how genetic algorithms can be applied to evolve the cellular automata to solve the density classification task, rather than having to supply a hand-crafted rule, or do a brute force search over all possible rulesets (which we have shown to be infeasible). Stay tuned.


Classifying financial time series using Discrete Fourier Transforms

Introduction

A financial time series represents the collective decisions of many individual traders; it is fair to hypothesise that the nature of these decisions differs based on the underlying asset. For example, a company with a higher market cap may be more liquid, and subject to larger individual buy/sell orders including institutional investment. Thus, there is a case to be made that information such as the market cap of a company should be ‘encoded’ into its price movements. While these characteristics may be difficult to pinpoint on a chart, in principle it may be possible for a machine learning algorithm to find statistical relationships between the time series and the market cap of the company. This post will investigate that claim.

Concretely, a raw dataset will be constructed, consisting of 100-day price charts which belong to companies that are respectively in the top and bottom 1000 tickers in the Russell 3000 ordered by market cap. We will then run feature extraction (which is the core of this post), before applying a standard XGBoost classifier.

This post thus represents an intuitive (if naive) first approximation for classifying whether a given price chart belongs to a stock with high or low market cap. The hypothesis is that this problem is learnable, so the goal is to show that a machine learning model can classify a time series with greater than 50% accuracy. Though accuracy often has flaws as a metric, in this case it is sufficient because the classes will be almost perfectly balanced. But as a matter of good practice, we will also present precision.

A discussion on classifying time series

Because much of real life data is temporal in nature, classifying and clustering time series has widespread importance. Some example applications:

  • speech recognition
  • medical data such as ECGs
  • cyber-kinetic prosthetics: classifying electrical activity of muscles in terms of what action it corresponds to
  • meteorological data
  • seismic data

Some general approaches to classifying time series are as follows (a thorough survey can be found in Rani & Sikka 2012):

  • directly clustering with an algorithm like k-Nearest Neighbours. What is important here is a relevant distance metric between two different time series. Euclidean distance may not be sufficient as it has trouble dealing with timeseries that have been shifted slightly (either vertically or horizontally).
  • Hidden Markov Models, which model the timeseries as a Markov Process, whose transition matrix may be used as input to a classifier.
  • LSTM Neural Networks, which are fed in a time series and contain a sequential ‘memory’. The output of a cell can be run through the sigmoid function to convert it into a binary class value.
  • Extracting summary statistics, then passing these results into a standard classifier.

This post will explore the last method. On the surface, it seems to be the most straightforward approach, but the magic is in choosing which aspects of the time series should be used. There are existing libraries that can extract a multitude of features, e.g tsfresh, but although it is very likely that there is predictive power somewhere in here, it would be nice to find a simpler (and perhaps faster) approach.

Thus, I have decided to take the Discrete Fourier Transform (DFT) of the time series, and to use the largest terms as features. Such a method is really under the domain of spectral analysis, a well established subfield of time series analysis. However, we will adopt a rather informal approach. It should be noted that the Fourier Transform certainly shouldn’t be used to forecast the future of a financial time series because of the strong assumptions of periodicity. But it should be able to extract the main (sinusoidal) signals from a time series, which we can then put into a classifier.

Preparing the raw data

The data used in this project will be:

  • keystats.csv: parsed yahoo finance key statistics from which the latest market cap of securities will be ascertained. In order to reproduce something like this dataset (though only for the S&P500), please refer to my repo on GitHub
  • My stock_prices database, which contains daily prices for many tickers. For more information about this database, I’ve written a post on how to create your own price database.

After extracting the data from the csv files and MySQL database, I was left with a pandas dataframe sorted_tickers_by_mcap, containing tickers and their market caps, sorted by market cap in ascending order.

print("Number of tickers with mcap data available:",
       len(sorted_tickers_by_mcap))
print(sorted_tickers_by_mcap.head())
print(sorted_tickers_by_mcap.tail())
Number of tickers with mcap data available: 3382
Ticker
heat      67830.0
aryx     103730.0
gmet     336260.0
coco    1090000.0
dvox    1140000.0
Name: Market Cap, dtype: float64
Ticker
xom      3.632700e+11
msft     4.304200e+11
goog     4.942700e+11
googl    4.944500e+11
aapl     5.413300e+11
Name: Market Cap, dtype: float64

I then extracted the top and bottom 1000 tickers (removing the tickers for which data was not available), resulting in two lists of tickers: top_mcap and bottom_mcap:

Available top market cap: 831
Available bottom market cap: 862

These tickers are then sorted by market cap, and 100-day windows are made from the last 750 datapoints (roughly corresponds to 3 market-years). 3 years was chosen as the cutoff because market cap values from more than three years ago are likely to be significantly different from the latest values, as companies grow or shrink over time. A solution would be to use the keystats dataset to find the market cap for a company in a given year, but I have opted for a simpler workaround in this post.

Here is the function I wrote to create the time series snapshots from the raw price data.

def create_snapshots(ticker, n_days=750, window_size=100, 
                     window_shift=50):
    """
    Returns list of time series of length windowsize, shifted
    by window_shift, for the last n_days.
    """
    df = pd.read_sql(f"SELECT adj_close_price FROM daily_price "
                    f"WHERE ticker_id={ticker_id[ticker]}", conn)
    df = df.iloc[-n_days:]
    snapshots = []
    for i in range(0, n_days, window_shift):
        window = df.iloc[i:i +window_size]
        window = window['adj_close_price'].values
        if len(window) == window_size:
            snapshots.append(window)
    return snapshots

These snapshots are then stacked into a numpy array as follows:

dataset = []

for i, ticker in enumerate(top_mcap):
    try:
        ticker_snapshots = create_snapshots(ticker)
    except Exception as e:
        print(str(e))
        continue
    ticker_snapshots = np.c_[np.array(ticker_snapshots),
                        np.ones(len(ticker_snapshots))]
    dataset.append(ticker_snapshots)

for i, ticker in enumerate(bottom_mcap):
    try:
        ticker_snapshots = create_snapshots(ticker)
    except Exception as e:
        print(str(e))
        continue
    ticker_snapshots = np.c_[np.array(ticker_snapshots), 
                        np.zeros(len(ticker_snapshots))]
    dataset.append(ticker_snapshots)

dataset = np.vstack([a for a in dataset if a.shape[1] == 101])

The first 10 rows of dataset are presented below. Note how the last column is always 1 – this is because the first half of the dataset consists of the top market cap tickers (classified as 1).

array([[35.639999, 35.610001, 39.220001, ..., 37.189999, 37.060001,
         1.      ],
       [40.369999, 40.369999, 40.419998, ..., 36.189999, 35.889999,
         1.      ],
       [37.810001, 37.849998, 38.419998, ..., 34.68    , 34.959999,
         1.      ],
       ...,
       [41.369999, 40.02    , 39.860001, ..., 35.279999, 35.27    ,
         1.      ],
       [38.5     , 38.669998, 38.599998, ..., 41.169998, 41.27    ,
         1.      ],
       [35.360001, 35.439999, 36.34    , ..., 42.740002, 43.349998,
         1.      ]])

We have now finished preparing the raw data for analysis. However, we still need to extract the features and preprocess the data for machine learning specifically. This will be covered in the next section.

Methodology

It is a remarkable fact of mathematics that periodic continuous functions (at least the well-behaved ones) can be decomposed into sines and cosines. Although price data probably isn’t periodic, if we treat the whole time series as one period, we can apply a Discrete Fourier Transform (DFT) to extract the main ‘signals’ in a time series. My motivation for using a DFT is that it can be used to de-noise a time series by ignoring the smaller terms, and that the coefficients of the resulting terms can be fed into a classifier. Additionally, there exist very efficient implementations in numpy, such as np.fft.fft(). One potential problem is that the results of the DFT are actually complex numbers:

array([ 3.86450001e+02+0.j        , -7.77637336e+00+2.22326054j,
       -3.85445493e+00+2.69405956j, -2.39862764e+00+4.72442769j,
       -1.07054807e+00+1.46787384j,  1.49997000e-01+0.j        ,
       -1.07054807e+00-1.46787384j, -2.39862764e+00-4.72442769j,
       -3.85445493e+00-2.69405956j, -7.77637336e+00-2.22326054j])

This begs the question as to how we can input the DFT terms into a classifier. We will implement and compare two possible methods.

  • Taking the modulus (np.abs) of each term and using that as a feature
  • Splitting up the real and imaginary parts to use as separate features.

Clearly, we would expect the latter to have more predictive power, but at the cost of using twice as many terms.

There is another decision we have to make, namely, how many terms of the DFT we want to use. As it is impossible to know a priori how much information the Fourier terms hold, we’ll examine the first 5, 15, and 30 terms.

So in total, we have 6 datasets to test. Now, in order to avoid the data mining bias, we will apply a Bonferroni correction. The idea is that each additional test increases our chance of finding a spurious yet statistically significant relationship, so to counteract this, the significance level must be decreased to $\frac{\alpha}{m}$, where $m$ is the number of tests. So if the original significance level was $\alpha = 0.05$, the Bonferroni correction would imply a new significance of $\frac{\alpha}{6} = 0.00833$.

Feature extraction and data preprocessing

Adopting standard scikit-learn notation, X will denote the feature array and y the target vector. Using the dataset that we prepared before:

X = dataset[:,:-1]
y = dataset[:, -1]

Prior to putting our data through a DFT, I am first going to standardise the time series (scale everything to the range [0,1]) to ensure that we don’t actually learn some boring correlation between the share price of a company and its market cap. To maximise the efficiency of this computation, bearing in mind that this computation needs to be done for more than 20k rows, it is important to (ab)use numpy’s broadcasting:

numerator = X - X.min(axis=1).reshape((len(X), 1))
denominator = X.max(axis=1) - X.min(axis=1)
denominator = denominator.reshape((len(X),1))
X_scaled = numerator/denominator

We will now define functions to take the Fourier Transform, then slice for the first k terms, before either taking the modulus or splitting into real and imaginary parts.

def generate_modulus_dataset(X, k):
    return np.abs(np.fft.fft(X)[:, :k])

def generate_complex_dataset(X, k):
    fourier = np.fft.fft(X)[:, :k]
    return np.column_stack((fourier.real, fourier.imag))

With these methods ready, creating all of the datasets is simple:

modulus_5 = generate_modulus_dataset(X_scaled, 5)
modulus_15 = generate_modulus_dataset(X_scaled, 15)
modulus_30 = generate_modulus_dataset(X_scaled, 30)
complex_5 = generate_complex_dataset(X_scaled, 5)
complex_15 = generate_complex_dataset(X_scaled, 15)
complex_30 = generate_complex_dataset(X_scaled, 30)

Machine Learning

We are finally ready to apply machine learning using python’s wonderfully intuitive sklearn. I have chosen to use an XGBoost classifier - it trains quickly and normally produces good results with the default parameters. It should be noted that no hyperparameter optimisation will be done: the point of this experiment is to see if this problem is learnable (rather than trying to achieve excellent results). I have written the classification script as a function which can be easily applied to each of the datasets.

from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import precision_score, accuracy_score


def classify(features, target, return_acc=False):
    X_train, X_test, y_train, y_test = train_test_split(
                                            features, target)
    clf = XGBClassifier()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    if return_acc:
        return acc
    print(f"Accuracy:\n {100* acc:.2f}%")
    print(f"Precision:\n {100* prec:.2f}%")

It is then a simple matter of applying the function to any of our datasets:

classify(modulus_5, y)

Running the above resulted in a cross-validated accuracy of 58%. Given balanced classes, this is clearly a significant difference, so in principle the experiment could be stopped here and we can conclude that classifying stock charts based on market cap is a learnable problem. This result was especially surprising to me because this particular example is just using the moduli of the first 5 terms of the Fourier Transform, which really ignores a lot of the price chart. To see what I’m talking about, this is what a time series made of the absolute values of the first 5 Fourier terms looks like:

Although you can see that some of the main trends are captured, it is clearly a crude approximation that ignores a lot of variation.

Despite having achieved our initial goal of demonstrating the learnability of this problem, seeing as we have already prepared the datasets it may be interesting to do further analysis.

Comparison with a benchmark

What happens if we try to naively classify based on the whole time series, i.e using each of the 100 days’ prices as a feature?

%%time
classify(X_scaled, y)
Accuracy:
 65.34%
Precision:
 64.48%
CPU times: user 18.6 s, sys: 135 ms, total: 18.7 s
Wall time: 19.2 s

So actually the naive benchmark has a much better accuracy. But note the relatively long compute time of 18.7s. The question is whether any of our other datasets can reach comparable accuracies more efficiently.

%%time
classify(complex_5, y)
Accuracy:
 60.96%
Precision:
 60.17%
CPU times: user 1.96 s, sys: 18.5 ms, total: 1.97 s
Wall time: 2.04 s

It makes sense that the complex dataset has a higher accuracy than the modulus dataset for the same number of Fourier terms: bear in mind that it includes twice as many features (since we have split the real and imaginary part).

%%time
classify(modulus_15, y)
Accuracy:
 59.45%
Precision:
 57.72%
CPU times: user 3.77 s, sys: 62.8 ms, total: 3.83 s
Wall time: 4.33 s

This is an interesting result. 15 modulus terms (15 features) underperform 5 complex terms (i.e 10 features). This tells us that the actual complex number holds a lot of information.

%%time
classify(complex_15, y)
Accuracy:
 64.97%
Precision:
 63.75%
CPU times: user 3.02 s, sys: 26.1 ms, total: 3.04 s
Wall time: 3.09 s

Using 15 complex terms (i.e 30 total features), comparable accuracy to the benchmark has been achieved, in about a sixth of the time!

%%time
classify(modulus_30, y)
Accuracy:
 59.74%
Precision:
 57.98%
CPU times: user 3.16 s, sys: 35.1 ms, total: 3.19 s
Wall time: 3.33 s
%%time
classify(complex_30, y)
Accuracy:
 66.63%
Precision:
 65.24%
CPU times: user 11.9 s, sys: 136 ms, total: 12.1 s
Wall time: 12.6 s

We can observe the diminishing returns: using more Fourier terms increases accuracy only slightly. However, it is a pretty remarkable result that using 30 complex terms (i.e 60 features), we can beat the benchmark while also being 1.5x faster.

To better visualise how accuracy varies with the number of terms, we can generate multiple datasets and run our classification method on them:

scores = []
for i in range(0, 100, 5):
    print(i)
    dataset = generate_modulus_dataset(X_scaled, i)
    acc = classify(dataset, y, return_acc=True)
    scores.append(acc)

Doing the same for the complex datasets and plotting the results yields:

It is interesting to note that the modulus datasets never beat the benchmark: it seems that taking the modulus of the complex terms throws away too much information. The complex datasets are most predictive at around 40 terms, but the plateau and gradual decrease after that is a clear sign that overfitting occurs.

Conclusion

In this post, we have seen that 100-day time series of prices can be classified according to market cap, by extracting the main signals from a time series with a Discrete Fourier Transform to use as features in a classifier. The DFT is not only able to outperform the naive solution of using each day’s price as a feature, but can also do so with a large speedup. The implication of this is that the benchmark overfits to noise, and thus this is a practical example of how reducing the information available to a classifier can actually improve performance.

I think this is an important lesson for anyone applying machine learning to a real-world problem: throwing data at a classifier or deep learning model is not a solid approach: it is much better to genuinely try to extract important features, and discard those which only add noise.

Future work on this topic could involve comparing the classification performance of the DFT method to a classifier that uses different summary statistics, or perhaps even a clustering methodology like k-Nearest Neighbours. But for now I am satisfied that this simple and intuitive method is sufficient to do significantly better than random guessing.