I had a suspicion that there was more data beyond the history that I got from Strava (see previous post). And indeed my hunch was confirmed by downloading my history from Garmin Connect.

My history now goes back to 2013.

I’ve been itching to do some analytics on my running data. Today seemed like a good time to actually do it.

I’ll be generating some plots using the `{strava}` package developed by Marcus Volz.

The GPS files in my Strava archive are in `.fit` format (some compressed, others not), which needed to be converted into `.gpx` format before I could consume them in R. A quick Bash script using `gpsbabel` sorted that out.

```
#!/bin/bash
# Decompress gzipped FIT files and convert to GPX.
for f in *.fit.gz
do
  GPX=$(basename "$f" .fit.gz).gpx
  gzip -dc "$f" | gpsbabel -i garmin_fit -o gpx -f - -F "$GPX"
done
# Convert uncompressed FIT files to GPX.
for f in *.fit
do
  GPX=$(basename "$f" .fit).gpx
  gpsbabel -i garmin_fit -o gpx -f "$f" -F "$GPX"
done
```

Start off by plotting thumbnails of each individual route.

Let’s see how those routes compound over time.

Obviously I’ve spent a lot of time running on the Berea and in Durban North. You can also see some of the Comrades Marathon, Hillcrest Marathon, Chatsworth Ultra and Ballito Marathon routes.

How were those runs distributed in time?

What times of day?

Times on week days are bimodal, probably because of a shift in behaviour since I have started working mostly from home (runs are now later because I don’t need to rush off to work).

Finally, runs as packed circles, with distance mapped to circle area and speed as fill colour.
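For what it’s worth, the packing step can be sketched with the `{packcircles}` package (synthetic distances here, and the speed fill is omitted):

```r
library(packcircles)

set.seed(7)
distance <- rexp(50, 1 / 10) + 3   # synthetic run distances (km)

# Map distance to circle *area* (not radius), then pack the circles.
layout <- circleProgressiveLayout(distance, sizetype = "area")
head(layout)  # x, y and radius for each run
```

The resulting `layout` data frame can be fed straight into a ggplot2 polygon layer to draw the circles.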

This has been seriously fun. Should have done it a lot sooner.

The package is available here. It is still very much a work in progress: the API only exposes two endpoints, but both of them are wrapped in the package.

Install using `{devtools}`.

```
devtools::install_github("datawookie/racently")
library(racently)
```

The `search_athlete()` function makes it possible to search for an athlete by name, using regular expressions for pattern matching.

Suppose I wanted to find athletes with the surname “Mann”.

```
search_athlete("Mann$")
```

```
id slug name gender nationality twitter parkrun strava
1 016d0a83-1fc4-4613-81cb-ad3f09abac5e <NA> Garry Mann M <NA> <NA> NA NA
2 311aeaa2-155f-4566-a9fd-cf30afd67b16 <NA> Alec Mann M ZA <NA> NA NA
3 3207bed9-3dd4-42e7-aee5-c0a0347295c8 <NA> Michael Mann M <NA> <NA> NA NA
4 3c22a177-d03e-41c7-be31-a2bc77b2c61c <NA> Kevin Mann M ZA <NA> NA NA
5 52d681ea-8229-47ce-99ab-a4425b4dbcf0 <NA> Thomas Mann M <NA> <NA> NA NA
6 7ebe5ce2-6067-48d5-9c68-5f8167d6ab6e <NA> Julie Mann F <NA> <NA> NA NA
7 8437fcc7-3a38-4ecb-8708-82d148ea3368 <NA> Angie Mann F <NA> <NA> NA NA
8 94d8f1e6-4850-4708-8f70-3210b45d3bde <NA> Helen Mann F <NA> <NA> NA NA
9 bdd0716a-cbda-4c2e-8ef1-d5baab63ecfd <NA> Patrick Mann M <NA> <NA> NA NA
10 bebc042f-ad26-40f3-99d2-61a65e0296a2 <NA> Zeldi Mann F <NA> <NA> NA NA
11 d4cf93a6-69b4-40c4-949e-8444da403e17 <NA> Peter Mann M ZA <NA> NA NA
12 d676cb99-f272-4e9a-be04-fb6ec0aa7245 <NA> Steve Mann M ZA <NA> NA NA
13 d9897020-6e62-47c4-8481-ecbd19ddad12 <NA> Nicola Mann F <NA> <NA> NA NA
14 d9cc374d-521a-48bb-86dd-3ec6f05ef59f stuart-mann Stuart Mann M <NA> runningmann100 NA NA
15 df571960-0451-4cba-8483-17ccb4704680 <NA> Kim Mann F <NA> <NA> NA NA
16 e3c1dcc1-c783-4629-891b-f2e5a21daa87 <NA> Joanne Mann F <NA> <NA> NA NA
17 e4417318-66ef-42e3-b9f7-6a241b550e74 <NA> Jack Mann M <NA> <NA> NA NA
18 f2e9074c-ca75-4a99-b89d-07d377836c21 <NA> Roy Mann M ZA <NA> NA NA
19 f753bf50-4637-4cfb-89f6-efde8563ceff <NA> Luke Mann M ZA <NA> NA NA
```

We see that the somewhat legendary @runningmann100, Stuart Mann, is among the list of results.

The `athlete()` function retrieves the results for a specific athlete, specified by `id` (from the results above).

```
athlete("d9cc374d-521a-48bb-86dd-3ec6f05ef59f")
```

```
$name
[1] "Stuart Mann"
$gender
[1] "M"
$results
date race distance time club license
1 2019-06-09 Comrades 86.8 km 09:53:23 Fourways Road Runners NA
2 2019-02-09 Klerksdorp 42.2 km 03:54:09 NA NA
3 2018-11-24 Josiah Gumede 42.2 km 04:20:21 Fourways Road Runners 3388
4 2018-10-28 Sapphire Coast 42.2 km 04:27:23 Fourways Road Runners 3388
5 2018-09-02 Vaal River City 42.2 km 04:21:54 Fourways Road Runners 3388
6 2018-06-10 Comrades 90.2 km 10:15:52 Fourways Road Runners 3388
7 2018-05-20 RAC 10.0 km 01:14:21 Fourways Road Runners 3388
8 2018-03-25 Umgeni Marathon 42.2 km 04:20:04 Fourways Road Runners 3388
9 2018-03-10 Kosmos 3-in-1 42.2 km 04:18:51 NA NA
10 2018-02-25 Maritzburg 42.2 km 04:31:38 Fourways Road Runners 3388
11 2017-12-03 Heroes Marathon 42.2 km 04:28:06 Fourways Road Runners 3388
12 2017-11-05 Soweto Marathon 42.2 km 04:41:24 Fourways Road Runners 3388
13 2017-09-17 Cape Town Marathon 42.2 km 04:25:06 Fourways Road Runners 3388
14 2017-06-04 Comrades 86.7 km 10:40:59 Fourways Road Runners 3388
15 2017-04-01 Arthur Cresswell Memorial 52.0 km 05:27:19 Fourways Road Runners 3388
16 2017-03-26 Gaterite Challenge 42.2 km 04:14:54 Fourways Road Runners 3388
17 2016-05-29 Comrades 89.2 km 10:19:06 Fourways Road Runners 9120
18 2016-05-01 Deloitte Challenge 42.2 km 04:15:53 Team Vitality NA
19 2016-04-24 Chatsworth Freedom 52.0 km 05:36:06 Fourways Road Runners 9120
20 2015-08-30 Mandela Day 42.2 km 04:48:16 Fourways Road Runners NA
21 2015-01-25 Johnson Crane Hire 42.2 km 04:28:35 Fourways Road Runners NA
22 2009-08-09 Mtunzini Bush 16.0 km 01:31:07 Fourways Road Runners 9120
23 2009-05-24 Comrades 89.2 km 08:45:43 Fourways Road Runners 9120
24 2008-11-28 Sani Stagger 42.2 km 05:25:19 Fourways Road Runners NA
```

The results include name, gender and the details of all of the relevant results in the system. As Stuart has pointed out, there are many results missing from his profile (he ticks off a marathon pretty much every weekend), but hey, like I said, this is a work in progress.

We’ll be working hard to add more race results over the coming months (and cleaning up some of the existing data). There are also plans to expose more functionality on the API, which in turn will filter through to the R package.

A few months ago @DanielCunnama suggested that I add the ability to create running groups in @racently. This sounded like a good idea. It also sounded like a bit of work and TBH I just did not have the time. So I made a counter-suggestion: how about an API so that he could effectively aggregate the data in any way he wanted? He seemed happy with the idea, so it immediately went onto my backlog. And there it stayed. But @DanielCunnama is a persistent guy (perhaps this is why he’s a class runner!) and he pinged me relentlessly about this… until Sunday when I relented and created the API.

And now I’m happy that I did, because it gives me an opportunity to write up a quick post about how these data can be accessed from R.

I’m going to use Gerda Steyn as an example. I hope she doesn’t mind.

Now there are a couple of things I should point out:

- This profile is far from complete. Gerda has run a *lot* more races than that. These are just the ones that we currently have in our database. We’re adding more races all the time, but it’s a long and arduous process.
- The result for the 2019 Comrades Marathon was when she *won* the race!

A view like this can be created for any runner on the system. Most runners in South Africa should have a profile (unless they have explicitly requested that we remove it!).

Supposing that you wanted to do some analytics on the data. You’d want to pull the data into R or Python. You could scrape the site, but the API makes it a lot easier to access the data.

Load up some helpful packages.

```
library(glue)
library(dplyr)
library(purrr)
library(httr)
```

Set up the URL for the API endpoint and the key for Gerda’s profile.

```
URL = "https://www.racently.com/api/athlete/{key}/"
key = "7ef6fbc8-4169-4a98-934e-ff5fa79ba103"
```

Send a GET request and extract the results from the response object, parsing the JSON into an R list.

```
response <- glue(URL) %>% GET() %>% content()
```

Extract some basic information from the response.

```
response$url
## [1] "https://www.racently.com/api/athlete/7ef6fbc8-4169-4a98-934e-ff5fa79ba103/"
response$name
## [1] "Gerda Steyn"
response$gender
## [1] "F"
```

Now get the race results. This requires a little more work because of the way that the JSON is structured: an array of licenses, each of which has a nested array of race result objects.

```
response$license %>%
  map_dfr(function(license) {
    license$result %>%
      map_dfr(as_tibble) %>%
      mutate(
        club = license$club,
        number = license$number,
        date = as.Date(date)
      )
  }) %>%
  arrange(desc(date))
## date race distance time club number
## 1 2019-06-09 Comrades 86.8 km 05:58:53 Nedbank NA
## 2 2018-06-10 Comrades 90.2 km 06:15:34 Nedbank 8300
## 3 2018-05-20 RAC 10.0 km 00:35:38 Nedbank 8300
## 4 2018-05-01 Wally Hayward 10.0 km 00:35:35 Nedbank 8300
## 5 2017-06-04 Comrades 86.7 km 06:45:45 Nedbank NA
## 6 2016-05-29 Comrades 89.2 km 07:08:23 Nedbank NA
```

For good measure, let’s throw in the results for @DanielCunnama.

```
## date race distance time club number
## 1 2019-09-29 Grape Run 21.1 km 01:27:49 Harfield Harriers 4900
## 2 2019-06-09 Comrades 86.8 km 07:16:21 Harfield Harriers 4900
## 3 2019-02-17 Cape Peninsula 42.2 km 03:08:47 Harfield Harriers 4900
## 4 2019-01-26 Red Hill Marathon 36.0 km 02:52:55 Harfield Harriers 4900
## 5 2019-01-13 Bay to Bay 30.0 km 02:15:55 Harfield Harriers 7935
## 6 2018-11-10 Winelands 42.2 km 02:58:56 Harfield Harriers 7935
## 7 2018-10-14 The Gun Run 21.1 km 01:22:30 Harfield Harriers 7935
## 8 2018-10-07 Grape Run 21.1 km 01:36:46 Harfield Harriers 8358
## 9 2018-09-23 Cape Town Marathon 42.2 km 03:11:52 Harfield Harriers 7935
## 10 2018-09-09 Ommiedraai 10.0 km 00:37:46 Harfield Harriers 11167
## 11 2018-06-10 Comrades 90.2 km 07:19:25 Harfield Harriers 7935
## 12 2018-02-18 Cape Peninsula 42.2 km 03:08:27 Harfield Harriers 7935
## 13 2018-01-14 Bay to Bay 30.0 km 02:11:50 Harfield Harriers 7935
## 14 2017-10-01 Grape Run 21.1 km 01:27:18 Harfield Harriers 7088
## 15 2017-09-17 Cape Town Marathon 42.2 km 02:57:55 Harfield Harriers 7088
## 16 2017-06-04 Comrades 86.7 km 07:46:18 Harfield Harriers 7088
## 17 2016-10-16 The Gun Run 21.1 km 01:19:09 Harfield Harriers NA
## 18 2016-09-10 Mont-Aux-Sources 50.0 km 05:42:23 Harfield Harriers NA
## 19 2016-05-29 Comrades 89.2 km 07:22:53 Harfield Harriers NA
## 20 2016-02-21 Cape Peninsula 42.2 km 03:17:12 Harfield Harriers NA
```

Let’s digress for a moment to look at a bubble plot showing the number of races on @racently broken down by runner. There are some really prolific runners.

We’ve currently got just under one million individual race results across over a thousand races. If you have the time and inclination then there’s definitely some interesting science to be done using these results. I’d be very interested in collaborating, so just shout if you are interested.

Feel free to grab some data via the API. At the moment you’ll need to search for an athlete on the main website in order to find their API key. I’ll implement some search functionality in the API when I get a chance.

Finally, here’s a talk I gave about @racently at the Bulgaria Web Summit (2017) in Sofia, Bulgaria. A great conference, incidentally. Well worth making the trip to Bulgaria.

- an overall view of the splits across the entire field and
- a detailed view for individual runners (relative to the rest of the field).

My working solution for visualising the global splits data is a ridgeline plot created with the {ggridges} package.

The density curve for each of the splits gives the distribution of the runners in time at that point. Quartiles are displayed as vertical lines.

It’s immediately apparent how the field spreads out between the first mat at the base of Cowie’s Hill (Pinetown) and the finish line in Pietermaritzburg. Whereas the distribution is fairly smooth early in the race, structure starts to emerge as you get closer to the finish, showing runners who are aiming for specific finishing times (under 9, 10, 11 or 12 hours).

I experimented with various options for displaying the splits of specific runners. It’s simple enough to just show their individual splits, but I wanted to juxtapose this information against the rest of the field. This is what I came up with. I’m calling it a “splits plot” for the moment.

On the x-axis are the split times for a specific “focus” runner, while on the y-axis are the split times for the rest of the field. Points are plotted and linked by (partially transparent) lines for every finisher. The diagonal dashed line indicates runners who had the same splits as the focus runner, with those above the dashed line being slower and those below being faster. The quartiles on each of the splits are shown in blue, making it possible to easily see whether a runner is getting better or worse (relative to the rest of the field) as the race progresses. The plot indicates that I got off to a relatively slow start (just outside the 75th percentile in Pinetown) but gathered ground over the hills of Natal (finishing slightly beyond the median).

Here’s what that looks like for a *quality* athlete:

Analysis and visualisation using R.

This is what the medal categories correspond to:

- **Gold** — first 10 men and women
- **Wally Hayward (men)** — 11th position to sub-6:00
- **Isavel Roche-Kelly (women)** — 11th position to sub-7:30
- **Silver (men)** — 6:00 to sub-7:30
- **Bill Rowan** — 7:30 to sub-9:00
- **Robert Mtshali** — 9:00 to sub-10:00
- **Bronze** — 10:00 to sub-11:00
- **Vic Clapham** — 11:00 to sub-12:00
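As a rough sketch, the time-based bands above can be captured in a lookup (a hypothetical function; Gold and the 11th-place medals also depend on finishing position, which is ignored here):

```r
comrades_medal <- function(hours) {
  # Men's time bands only. Position-based medals (Gold, Wally Hayward)
  # are approximated purely by time in this sketch.
  cut(hours,
      breaks = c(0, 6, 7.5, 9, 10, 11, 12),
      labels = c("Wally Hayward", "Silver", "Bill Rowan",
                 "Robert Mtshali", "Bronze", "Vic Clapham"),
      right = FALSE)   # each band includes its lower boundary
}
comrades_medal(c(5.9, 8.2, 11.5))
```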

Analysis and visualisation using R.

Using data from the 2019 edition of the Comrades Marathon I set out to answer this question.

We’ll start off by looking at summary statistics broken down by batch.

```
batch min max avg median
1 Elite 00'12" 03'24" 00'18" 00'15"
2 A 00'11" 10'25" 00'30" 00'28"
3 B 00'14" 11'10" 01'07" 00'59"
4 C 00'21" 10'50" 02'15" 02'09"
5 CC 00'35" 10'34" 02'24" 02'11"
6 D 00'27" 10'33" 04'05" 04'03"
7 E 00'29" 10'49" 05'41" 05'44"
8 F 00'35" 10'52" 07'07" 07'05"
9 G 00'27" 10'53" 08'13" 08'36"
10 H 00'20" 11'08" 08'52" 09'23"
```

It’s apparent that the average delay increases consistently as you progress from the front of the field (the Elite and A batches) through to the back (batches G and H). What’s somewhat surprising is that there are runners who should ostensibly be starting towards the back of the field but still manage to cross the starting mat with only a short delay (see the `min` values for batches E through H).

The above table hides a lot of details. Below is a plot showing the distribution of start delays broken down by batch. As one would expect the delays for the first few batches are small and sharply peaked. However, the distribution of delays becomes broader for other batches. As hinted above, there are a significant number of runners who manage to cross the start mat very quickly given their nominal starting batch.

There’s a problem with the above plot: the scale of the y-axis is linear and this means that small values are hard to see. If we apply a `sqrt()` transform to this axis we get a much clearer view.
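In ggplot2 terms this is just a scale change. A minimal sketch with synthetic delays (the shape parameters are invented for illustration):

```r
library(ggplot2)

set.seed(5)
# Synthetic start delays (minutes) for a front batch and a back batch.
delays <- data.frame(
  batch = rep(c("A", "H"), each = 500),
  delay = c(rgamma(500, shape = 2, rate = 4),    # short, sharply peaked
            rgamma(500, shape = 20, rate = 2))   # long, broad
)

# A square-root y-axis keeps the sharply peaked A-batch curve from
# swamping the broad H-batch one.
p <- ggplot(delays, aes(x = delay, fill = batch)) +
  geom_density(alpha = 0.5) +
  scale_y_sqrt()
```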

Now we can see that the start batches are really not being very strictly controlled: there are H batch runners who are evidently starting from very close to the front of the field. Conversely, there are also numerous runners who are starting further back in the field than they are entitled to based on their qualifying batch. Of course, the latter case is allowed (starting in a slower batch), while the former (starting in a faster batch) is not.

It’s important to note that these results are subject to significant selection bias: only those runners who finished the race are accounted for. It’d be great to have more extensive data which includes all runners who started the race.

A few years ago I put together a simple spreadsheet for generating a Comrades Marathon pacing strategy. But the spreadsheet was clunky to use and laborious to maintain. Plus I was frustrated by the crude plots (largely due to my limited spreadsheet proficiency). It seemed like an excellent opportunity to create a Shiny app.

You need to specify the following:

- projected finish time (hours and minutes, limited to range between 05:00 and 12:00)
- minutes to cross the start line (this can be 8 to 10 minutes if you start in H batch)
- fade (how much you’ll slow down during the race; set to zero if you plan to run at a constant pace) and
- whether or not to include a heuristic for the effect of the Big 5 hills.

The output is displayed in two tabs: Plot and Table.

On the Plot tab is a figure which presents the following as a function of distance:

- time
- pace (the “instantaneous pace” as a function of distance) and
- average pace (averaged up to that point in the race).

The average pace curve is rather interesting because it will generally have a “bucket” shape. At the beginning of the race average pace will be slow due to the delay in crossing the line. However, over time the effect of this delay decays and average pace improves. Then, as you begin to slow down towards the end of the race (assuming that you’ve set fade to a value other than zero) average pace begins to climb once again. I find average pace much more useful than instantaneous pace because it smoothes out effects of perturbations like water tables and wee breaks.
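The bucket shape is easy to reproduce with a toy calculation (assumed model: constant base pace, a linear fade and a fixed start-line delay, all numbers hypothetical):

```r
km <- 1:90                         # distance marks along the route
base_pace <- 7.2                   # min/km at the start (hypothetical)
fade <- 0.10                       # 10% slowdown by the finish
delay <- 8                         # minutes to cross the start line

# Instantaneous pace drifts up linearly with distance.
pace <- base_pace * (1 + fade * km / max(km))
# Average pace includes the start delay, amortised over distance covered.
avg_pace <- (delay + cumsum(pace)) / km

# Average pace first falls (the delay decays), then rises (the fade bites).
which.min(avg_pace)
```

The minimum sits where instantaneous pace crosses average pace, which is exactly the bottom of the bucket.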

On the Table tab are the time and average pace projected at a number of landmarks along the route.

There are a variety of ways to predict running times over the standard marathon distance (42.2 km). You could dust off your copy of *The Lore of Running* (Tim Noakes). My treasured Third Edition discusses predicting likely marathon times on p. 366, referring to tables published by other authors to actually make predictions. There’s also a variety of online services, for example:

- Runners' World’s Race Time Predictor (based on Riegel’s Formula),
- Running for Fitness’s Race Predictor and
- Race Result Predictor.

Of these I particularly like the offering from Running for Fitness which produces a neatly tabulated set of predicted times over an extensive range of distances using a selection of techniques including Riegel’s Formula and Cameron’s Model.

While the sites listed above certainly provide useful predictions, I have a niggling feeling that they aren’t fully exploiting the large amount of data that we currently have available (both as individual athletes and as a global fraternity of runners). I’ve developed a relentless itch to provide a better solution. I wanted to do the following:

- incorporate information for multiple measurements (other solutions just use a single time over another distance);
- illustrate how the prediction is updated (and hopefully improved) by adding additional measurements;
- provide an indication of uncertainty in the prediction.

Using data accumulated from a number of races in South Africa I put together a Bayesian model for predicting marathon times. The likelihood function, which encodes the observed data for the model, was constructed using `npcdensbw()` and `npcdens()` from the `{np}` package (nonparametric kernel smoothing methods for mixed data types).

| Distance [km] | Time [HH:MM] | Marathon (Riegel's Formula) [HH:MM] |
|---|---|---|
| 10.0 | 00:38 | 02:55 |
| 21.1 | 01:17 | 02:40 |
| 25.0 | 01:34 | 02:43 |
| 32.0 | 01:59 | 02:40 |

The third column is the predicted marathon time using Riegel’s Formula on the time achieved over each distance.
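Those third-column predictions can be reproduced with a quick sketch of Riegel’s Formula (the input times are rounded to the minute, so the results may differ from the table by a minute or so):

```r
# Riegel's Formula with exponent 1.06, mapping a time (minutes) over
# distance km to a predicted time over the marathon distance.
riegel_marathon <- function(minutes, km) minutes * (42.2 / km)^1.06

round(riegel_marathon(c(38, 77, 94, 119), c(10.0, 21.1, 25.0, 32.0)))
```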

Actual marathon time for this runner is 2:42.

But the mode of the posterior is at 170 minutes.

We start with a belief, called a prior. Then we obtain some data and use it to update our belief. The outcome is called a posterior. Should we obtain even more data, the old posterior becomes a new prior and the cycle repeats.
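Here’s a toy, discretised illustration of that cycle (all numbers invented; the real likelihood uses the nonparametric machinery described above):

```r
# Grid of candidate marathon times (minutes) with a broad prior belief.
grid <- 150:400
prior <- dnorm(grid, mean = 260, sd = 50)
prior <- prior / sum(prior)

# Observe a 38-minute 10 km. The likelihood scores how plausible that
# observation is for each candidate marathon time, via a Riegel-style
# mapping with an assumed measurement spread.
implied_10k <- grid * (10 / 42.2)^1.06
likelihood <- dnorm(38, mean = implied_10k, sd = 3)

# Bayes' Theorem: posterior is proportional to prior times likelihood.
posterior <- prior * likelihood
posterior <- posterior / sum(posterior)

sum(grid * posterior)   # posterior mean shifts towards the data
```

Feeding in a second race time would simply use this `posterior` as the new `prior` and repeat the last step.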

Below is the default prior distribution constructed as the distribution of all marathon times in the data set. In the absence of further information regarding a particular runner, this is a reasonable guess for the distribution of possible marathon times. It represents our initial belief of what’s possible.

Based on the default prior the expected time for finishing a marathon is 04:19, with a 95% confidence interval that extends from 02:51 to 05:43.

Once we have some data though, we are able to update the initial belief using Bayes' Theorem to generate a posterior distribution.

Having incorporated the 10 km finishing time, the expected marathon time drops to 03:24. Quite an improvement! The 95% confidence interval also narrows to between 02:38 and 04:07.

When further information becomes available, the current posterior distribution becomes the prior for the next application of Bayes' Theorem. This cycle repeats itself with each new piece of information, the posterior progressively becoming a more accurate representation of the information captured in the measurements.

Adding a time of 01:17 over 21.1 km into the mix gives an expected marathon time of 03:13, slicing off another 11 minutes.

A time of 01:34 for 25 km gives the expected marathon time another boost to 03:07.

Finally, adding in a time of 01:59 for 32 km drops the anticipated marathon time down to 02:54, with a 95% confidence interval from 02:36 to 03:16.

They will probably also be thinking a lot about Sunday’s race. What will the weather be like? Will it be cold at the start? (Unlikely since it’s been so warm in Durban.) How will they feel on the day? Will they manage to find their seconds along the route?

For the more performance oriented among them (and, let’s face it, that’s most runners!), there will also be thoughts of what time they will do on the day and what medal they’ll walk away with. I’ve considered ways for projecting finish times in a previous article. Today I’m going to focus on a somewhat simpler goal: making a Comrades Marathon medal prediction.

In the process I have put together a small application which will make medal predictions based on recent race times.

I’m not going to delve too deeply into the details, but if you really don’t have the patience, feel free to skip forward to the results or click on the image above, which will take you to the application. If you have trouble accessing the application it’s probable that you are sitting behind a firewall that is blocking it. Try again from home.

The data for this analysis were compiled from a variety of sources. I scraped the medal results off the Comrades Marathon results archive. This archive no longer seems to be available but you can get a large chunk of Comrades Marathon data from Kaggle. Times for other distances were cobbled together from Two Oceans Marathon Results, RaceTec Results and the home pages of some well organised running clubs.

The distribution of the data is broken down below as a function of gender, Comrades Marathon medal and other distances for which I have data. For instance, I have data for 45 female runners who got a Bronze medal and for whom a 32 km race time was available.

Unfortunately the data are pretty sparse for Gold, Wally Hayward and Silver medalists, especially for females. I’ll be collecting more data over the coming months and the coverage in these areas should improve. Athletes that are contenders for these medals should have a pretty good idea of what their likely prospects are anyway, so the model is not likely to be awfully interesting for them. This model is intended more for runners who are aiming at a Bill Rowan, Bronze or Vic Clapham medal.

The first step in the modelling process was to build a decision tree. Primarily this was to check whether it was feasible to predict a medal class based on race times for other distances (I’m happy to say that it was!). The secondary motivation was to assess what the most important variables were. The resulting tree is plotted below. Open this plot in a new window so that you can zoom in on the details. As far as the labels on the tree are concerned, “min” stands for “minimum” time over the corresponding distance and times (labels on the branches) are given in decimal hours.
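A sketch of that first modelling step, using `{rpart}` on synthetic data (the `min56` column and its thresholds are made up; the real model uses best race times over several distances):

```r
library(rpart)

set.seed(42)
# Synthetic stand-in: medal class driven mainly by 56 km time (hours),
# plus some noise so the classes overlap.
runners <- data.frame(min56 = runif(500, 3, 7))
runners$medal <- cut(runners$min56 + rnorm(500, 0, 0.3),
                     breaks = c(-Inf, 3.6, 5.2, 6.2, Inf),
                     labels = c("Silver", "Bill Rowan", "Bronze", "Vic Clapham"))

# Fit a classification tree predicting medal from the 56 km time.
tree <- rpart(medal ~ min56, data = runners, method = "class")
tree
```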

The first thing to observe is that the most important predictor is 56 km race time. This dominates the first few levels in the tree hierarchy. Of slightly lesser importance is 42.2 km race time, followed by 25 km race time. It’s interesting to note that 32 km and 10 km results do not feature at all in the tree, probably due to the relative scarcity of results over these distances in the data.

Some specific observations from the tree are:

- Male runners who can do 56 km in less than 03:30 have around 20% chance of getting a Gold medal.
- Female runners who can do 56 km in less than 04:06 have about 80% chance of getting a Gold medal.
- Runners who can do 42.2 km in less than about 02:50 are very likely to get a Silver medal.
- Somewhat more specifically, runners who do 56 km in less than 05:53 and 42.2 km in more than 04:49 are probably in line for a Vic Clapham.

Note that the first three observations above should be taken with a pinch of salt since, due to a lack of data, the model is not well trained for Gold, Wally Hayward and Silver medals.

You’d readily be forgiven for thinking that this decision tree is an awfully complex piece of apparatus for calculating something as simple as the colour of your medal.

Well, yes, it is. And I am going to make it simpler for you. But before I make it simpler, I am going to make it slightly more complicated.

Instead of just using a single decision tree, I built a Random Forest consisting of numerous trees, each of which was trained on a subset of the data. Unfortunately the resulting model is not as easy to visualise as a single decision tree, but the results are far more robust.

To make this a little more accessible I bundled the model up in a Shiny application which I deployed here. Give it a try. You’ll need to enter the times you achieved over one or more race distances during the last few months. Note that these are *race* times, not training run times. The latter are not good predictors for your Comrades medal.

Let’s have a quick look at some sample predictions. Suppose that you are a male athlete who has recent times of 00:45, 01:45, 04:00 and 05:00 for 10, 21.1, 42.2 and 56 km races respectively, then according to the model you have a 77% probability of getting a Bronze medal and around 11% chance of getting either a Bill Rowan or Vic Clapham medal. There’s a small chance (less than 1%) that you might be in the running for a Silver medal.

What about a male runner who recently ran 03:20 for 56 km? There is around 20% chance that he would get a Gold medal. Failing that he would most likely (60% chance) get a Silver.

If you happen to have race results for the last few years that I could incorporate into the model, please get in touch. I’m keen to collaborate on improving this tool.

There are various approaches to predicting Comrades Marathon finishing times. Lindsey Parry, for example, suggests that you use two and a half times your recent marathon time. Sports Digest provides a calculator which predicts finishing time using recent times over three distances. I understand that this calculator is based on the work of Norrie Williamson.

Let’s give them a test. I finished the 2013 Comrades Marathon in 09:41. Based on my marathon time from February that year, which was 03:38, Parry’s formula suggests that I should have finished at around 09:07. Throwing in my Two Oceans time for that year, 04:59, and a 21.1 km time of 01:58 a few weeks before Comrades, the Sports Digest calculator gives a projected finish time of 08:59. Clearly, relative to both of those predictions, I under-performed that year! Either that or the predictions were way off the mark.

It seems to me that, given the volume of data we gather on our runs, we should be able to generate better predictions. If the thought of maths or data makes you want to doze off, feel free to jump ahead, otherwise read on.

In 1977 Peter Riegel published a formula for predicting running times, which became popular due to its simplicity. The formula itself looks like this:

$$ \Delta t_2 = \Delta t_1 \left( \frac{d_2}{d_1} \right)^{1.06} $$

which allows you to predict \(\Delta t_2\), the time it will take you to run distance \(d_2\), given that you know it takes you time \(\Delta t_1\) to run distance \(d_1\). Riegel called this his “endurance equation”.

Riegel’s formula is an empirical model: it’s based on data. In order to reverse engineer the model we are going to need some data too. Unfortunately I do not have access to data for a cohort of elite runners. However, I do have ample data for one particular runner: me. Since I come from the diametrically opposite end of the running spectrum (I believe the technical term would be “bog standard runner”), I think these data are probably more relevant to most runners anyway.

I compiled my data for the last three years based on the records kept by my trusty Garmin 910XT. A plot of time versus distance is given below.

At first glance it looks like you could fit a straight line through those points. And you can, indeed, make a pretty decent linear fit.

```
> fit <- lm(TimeHours ~ Distance, data = training)
>
> summary(fit)
Call:
lm(formula = TimeHours ~ Distance, data = training)
Residuals:
Min 1Q Median 3Q Max
-0.64254 -0.04592 -0.00618 0.02361 1.24900
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1029964 0.0107648 -9.568 <2e-16 ***
Distance 0.1012847 0.0008664 116.902 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1394 on 442 degrees of freedom
Multiple R-squared: 0.9687, Adjusted R-squared: 0.9686
F-statistic: 1.367e+04 on 1 and 442 DF, p-value: < 2.2e-16
```

However, applying a logarithmic transform to both axes gives a more uniform distribution of the data, which also now looks more linear.

Riegel observed that data for a variety of disciplines (running, swimming, cycling and race walking) conformed to the same pattern. Figure 1 from this paper is included below.

If we were to fit a straight line to the data on logarithmic axes then the relationship we’d be contemplating would have the form

$$\log \Delta t = m \log d + c$$

or, equivalently,

$$\Delta t = k d^m$$

which is a power law relating elapsed time to distance. It’s pretty easy to get Riegel’s formula from this. Taking two particular points on the power law, \(\Delta t_1 = k d_1^m\) and \(\Delta t_2 = k d_2^m\), and eliminating \(k\) gives

$$\Delta t_2 = \Delta t_1 \left( \frac{d_2}{d_1} \right)^m$$

which is Riegel’s formula with an unspecified value for the exponent. We’ll call the exponent the “fatigue factor” since it determines the degree to which a runner slows down as distance increases.
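Given just two measurements, the fatigue factor follows directly from the power law (the times below are hypothetical):

```r
# Eliminating k from t = k * d^m gives m = log(t2 / t1) / log(d2 / d1).
fatigue_factor <- function(t1, d1, t2, d2) log(t2 / t1) / log(d2 / d1)

# e.g. 50 minutes over 10 km and 1.85 hours over 21.1 km.
fatigue_factor(50 / 60, 10, 1.85, 21.1)
```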

How do we get a value for the fatigue factor? Well, by fitting the data, of course!

```
> fit <- lm(log(TimeHours) ~ log(Distance), data = training)
>
> summary(fit)
Call:
lm(formula = log(TimeHours) ~ log(Distance), data = training)
Residuals:
Min 1Q Median 3Q Max
-0.27095 -0.04809 -0.01843 0.01552 0.80351
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.522669 0.018111 -139.3 <2e-16 ***
log(Distance) 1.045468 0.008307 125.9 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.09424 on 442 degrees of freedom
Multiple R-squared: 0.9729, Adjusted R-squared: 0.9728
F-statistic: 1.584e+04 on 1 and 442 DF, p-value: < 2.2e-16
```

The fitted value for the exponent is 1.05 (rounding up), which is pretty close to the value in Riegel’s formula. The fitted model is included in the logarithmic plot above as the solid line, with the 95% prediction confidence interval indicated by the coloured ribbon. The linear plot below shows the data points used for training the model, the fit and confidence interval as well as dotted lines for constant paces ranging from 04:00 per km to 07:00 per km.

Our model also provides us with an indication of the uncertainty in the fitted exponent: the 95% confidence interval extends from 1.029 to 1.062.

```
> confint(fit)
2.5 % 97.5 %
(Intercept) -2.558264 -2.487074
log(Distance) 1.029142 1.061794
```

A fatigue factor of 1 would correspond to a straight line, which implies a constant pace regardless of distance. A value less than 1 implies faster pace at larger distances (rather unlikely in practice). Finally, a value larger than 1 implies progressively slower pace at larger distances.

The problem with the fatigue factor estimate above, which was based on a single model fit to *all* of the data, is that it’s probably biased by the fact that most of the data are for runs of around 10 km. In this regime the relationship between time and distance is approximately linear, so that the resulting estimate of the fatigue factor is probably too small.

To get around this problem, I employed a bootstrap technique, creating a large number of subsets from the data. In each subset I weighted the samples to ensure that there was a more even distribution of distances in the mix. I calculated the fatigue factor for each subset, resulting in a range of estimates. Their distribution is plotted below.
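In outline, the resampling looks something like this. This is only a sketch: the data below are synthetic (generated from a known exponent of 1.07 so the answer is checkable), and the binning and weighting scheme is a reconstruction rather than my exact code.

```r
set.seed(1)
# Synthetic stand-in for the real training data: distances spread
# log-uniformly between 5 and 60 km, times generated with exponent 1.07.
training <- data.frame(Distance = exp(runif(300, log(5), log(60))))
training$TimeHours <- 0.09 * training$Distance^1.07 * exp(rnorm(300, sd = 0.05))

fatigue.factor <- replicate(500, {
  bins <- cut(training$Distance, breaks = 8)
  # Weight inversely by bin occupancy so that the long runs are not
  # swamped by the many short ones.
  w <- 1 / as.numeric(table(bins)[as.character(bins)])
  i <- sample(nrow(training), replace = TRUE, prob = w)
  coef(lm(log(TimeHours) ~ log(Distance), data = training[i, ]))[[2]]
})
median(fatigue.factor)
```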

According to this analysis, my personal fatigue factor is around 1.07 (the median value indicated by the dashed line in the plot above). The Shapiro-Wilk test suggests that the data is sufficiently non-Normal to justify a non-parametric estimate of the 95% confidence interval for the fatigue factor, which runs from 1.03 to 1.11.

```
> shapiro.test(fatigue.factor)
Shapiro-Wilk normality test
data: fatigue.factor
W = 0.9824, p-value = 1.435e-12
> median(fatigue.factor)
[1] 1.072006
> quantile(fatigue.factor, c(0.025, 0.975))
2.5% 97.5%
1.030617 1.107044
```

Riegel’s analysis also led to a range of values for the fatigue factor. As can be seen from the table below (extracted from his paper), the values range from 1.01 for Nordic skiing to 1.14 for roller skating. Values for running range from 1.05 to 1.08, depending on age group and gender.

The rules mentioned above for predicting finishing times are generally applied to data for a single race (or a selection of three races). But, again, given that we have so much data on hand, would it not make sense to generate a larger set of predictions?

The distributions above indicate the predictions for this year’s Comrades Marathon (which is apparently going to be 89 km) based on all of my training data this year and using both the default (1.06) and personalised (1.07) values for the fatigue factor. The distributions are interesting, but what we are really interested in is the *expected* finish times, which are 08:59 and 09:18 depending on what value you use for the fatigue factor. I have a little more confidence in my personalised value, so I am going to be aiming for 09:18 this year.

Comrades is a long day and a variety of factors can affect your finish time. It’s good to have a ball-park idea though. If you would like me to generate a set of personalised predictions for you, just get in touch via the replies below.

I repeated the analysis for one of my friends and colleagues. His fatigue factor also comes out as 1.07 although, interestingly, the distribution is bi-modal. I think I understand the reason for this though: his runs are divided clearly into two groups: training runs and short runs back and forth between work and the gym.

]]>I’m not convinced.

Although 20 runners were charged with misconduct, six of them had a valid story. These runners had retired from the race but the bailers' bus had dropped them back on the course and they were “forced” to cross the finish line. I find this story hard to digest. My understanding is that race numbers are either confiscated, destroyed or permanently marked when a runner boards the bus. If this story is true then it should thus have been immediately obvious to officials that the runners in question had dropped out of the race. Their times should never have been recorded and they certainly should not have received medals (which presumably at least some of them did, since they have been instructed to return them!).

So that leaves the 14 runners who were disqualified. Were any of them among the group of mysterious negative splits identified previously? Unless KZNA or the CMA releases the names or race numbers, I guess we’ll never know.

]]>The use of the chart is explained in a previous post. Any feedback on how this can be improved would be appreciated.

]]>The histograms below show graphically how the distribution of runners' ages at the Comrades Marathon has changed every decade starting in the 1980s and proceeding through to the 2010s. The data are encoded using blue for male and pink for female runners (apologies for the banality!). It is readily apparent how the distributions have shifted consistently towards older ages with the passing of the decades. The vertical lines in each panel indicate the average age for male (dashed line) and female (solid line) runners. Whereas in the 1980s the average age for both genders was around 34, in the 2010s it has shifted to over 40 for females and almost 42 for males.

Maybe clumping the data together into decades is hiding some of the details. The plot below shows the average age for each gender as a function of the race year. The plotted points are the observed average age, the solid line is a linear model fitted to these data and the dashed lines delineate a 95% confidence interval.

Prior to 1990 the average age for both genders was around 35 and varies somewhat erratically from year to year. Interestingly there is a pronounced decrease in the average age for both genders around 1990. Evidently something attracted more young runners that year… Since 1990 though there has been a consistent increase in average age. In 2013 the average age for men was fractionally less than 42, while for women it was over 40.

Of course, the title of this article is hyperbolic. The Comrades Marathon is a long way from being a race for geriatrics. However, there is very clear evidence that the average age of runners is getting higher every year. A linear model, which is a reasonably good fit to the data, indicates that the average age increases by 0.26 years annually and is generally 0.6 years higher for men than women. If this trend continues then, by the time of the 100th edition of the race, the average age will be almost 45.

Is the aging Comrades Marathon field a problem and, if so, what can be done about it?

As before I have used the Comrades Marathon results from 1980 through to 2013. Since my last post on this topic I have refactored these data, which now look like this:

```
head(results)
```

```
key year age gender category status medal direction medal_count decade
1 6a18da7 1980 39 Male Senior Finished Bronze D 20 1980
2 6570be 1980 39 Male Senior Finished Bronze D 16 1980
3 4371bd17 1980 29 Male Senior Finished Bronze D 9 1980
4 58792c25 1980 24 Male Senior Finished Silver D 25 1980
5 16fe5d63 1980 58 Male Master Finished Bronze D 9 1980
6 541c273e 1980 43 Male Veteran Finished Silver D 18 1980
```

The first step in the analysis was to compile decadal and annual summary statistics using plyr.

```
decade.statistics = ddply(
results, .(decade, gender), summarize,
median.age = median(age, na.rm = TRUE),
mean.age = mean(age, na.rm = TRUE)
)
#
year.statistics = ddply(
results, .(year, gender), summarize,
median.age = median(age, na.rm = TRUE),
mean.age = mean(age, na.rm = TRUE)
)
head(decade.statistics)
```

```
decade gender median.age mean.age
1 1980 Female 34 34.352
2 1980 Male 34 34.937
3 1990 Female 36 36.188
4 1990 Male 36 36.440
5 2000 Female 39 39.364
6 2000 Male 39 39.799
```

```
head(year.statistics)
```

```
year gender median.age mean.age
1 1980 Female 35.0 35.061
2 1980 Male 33.0 34.091
3 1981 Female 33.5 34.096
4 1981 Male 34.0 34.528
5 1982 Female 34.5 35.032
6 1982 Male 34.0 34.729
```

The decadal data were used to generate the histograms. I then considered a selection of linear models applied to the annual data.

```
fit.1 <- lm(mean.age ~ year, data = year.statistics)
fit.2 <- lm(mean.age ~ year + year:gender, data = year.statistics)
fit.3 <- lm(mean.age ~ year + gender, data = year.statistics)
fit.4 <- lm(mean.age ~ year + year * gender, data = year.statistics)
```

The first model applies a simple linear relationship between average age and year. There is no discrimination between genders. The model summary (below) indicates that the average age increases by about 0.26 years annually. Both the intercept and slope coefficients are highly significant.

```
summary(fit.1)
```

```
Call:
lm(formula = mean.age ~ year, data = year.statistics)
Residuals:
Min 1Q Median 3Q Max
-1.3181 -0.5322 -0.0118 0.4971 1.9897
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.80e+02 1.83e+01 -26.2 <2e-16 ***
year 2.59e-01 9.15e-03 28.3 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.74 on 66 degrees of freedom
Multiple R-squared: 0.924, Adjusted R-squared: 0.923
F-statistic: 801 on 1 and 66 DF, p-value: <2e-16
```

The second model considers the effect on the slope of an interaction between year and gender. Here we see that the slope is slightly larger for males than for females. Although this interaction coefficient is statistically significant, it is extremely small relative to the slope coefficient itself. However, given that the value of the abscissa is around 2000, it still contributes roughly 0.6 extra years to the average age for men.

```
summary(fit.2)
```

```
Call:
lm(formula = mean.age ~ year + year:gender, data = year.statistics)
Residuals:
Min 1Q Median 3Q Max
-1.103 -0.522 0.024 0.388 2.287
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.80e+02 1.68e+01 -28.57 < 2e-16 ***
year 2.59e-01 8.41e-03 30.78 < 2e-16 ***
year:genderMale 3.00e-04 8.26e-05 3.63 0.00056 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.68 on 65 degrees of freedom
Multiple R-squared: 0.937, Adjusted R-squared: 0.935
F-statistic: 481 on 2 and 65 DF, p-value: <2e-16
```

The third model considers an offset on the intercept based on gender. Here, again, we see that the effect of gender is small, with the fit for males being shifted slightly upwards. Again, although this effect is statistically significant, it has only a small effect on the model. Note that the value of this coefficient (5.98e-01 years) is consistent with the effect of the interaction term (0.6 years for typical values of the abscissa) in the second model above.

```
summary(fit.3)
```

```
Call:
lm(formula = mean.age ~ year + gender, data = year.statistics)
Residuals:
Min 1Q Median 3Q Max
-1.1038 -0.5225 0.0259 0.3866 2.2885
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.80e+02 1.68e+01 -28.58 < 2e-16 ***
year 2.59e-01 8.41e-03 30.79 < 2e-16 ***
genderMale 5.98e-01 1.65e-01 3.62 0.00057 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.68 on 65 degrees of freedom
Multiple R-squared: 0.937, Adjusted R-squared: 0.935
F-statistic: 480 on 2 and 65 DF, p-value: <2e-16
```

The fourth and final model considers both an interaction between year and gender as well as an offset of the intercept based on gender. Here we see that the data does not differ sufficiently on the basis of gender to support both of these effects, and neither of the resulting coefficients is statistically significant.

```
summary(fit.4)
```

```
Call:
lm(formula = mean.age ~ year + year * gender, data = year.statistics)
Residuals:
Min 1Q Median 3Q Max
-1.0730 -0.5127 -0.0492 0.4225 2.1273
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -460.3631 23.6813 -19.44 <2e-16 ***
year 0.2491 0.0119 21.00 <2e-16 ***
genderMale -38.4188 33.4904 -1.15 0.26
year:genderMale 0.0195 0.0168 1.17 0.25
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.679 on 64 degrees of freedom
Multiple R-squared: 0.938, Adjusted R-squared: 0.935
F-statistic: 322 on 3 and 64 DF, p-value: <2e-16
```

On the basis of the above discussion, the fourth model can be immediately abandoned. But how do we choose between the three remaining models? An ANOVA indicates that the second model is a significant improvement over the first model. There is little to choose, however, between the second and third models. I find the second model more intuitive, since I would expect there to be a slight gender difference in the rate of aging, rather than a simple offset. We will thus adopt the second model, which indicates that the average age of runners increases by about 0.259 years annually, with the men aging slightly faster than the women.

```
anova(fit.1, fit.2, fit.3, fit.4)
```

```
Analysis of Variance Table
Model 1: mean.age ~ year
Model 2: mean.age ~ year + year:gender
Model 3: mean.age ~ year + gender
Model 4: mean.age ~ year + year * gender
Res.Df RSS Df Sum of Sq F Pr(>F)
1 66 36.2
2 65 30.1 1 6.09 13.23 0.00055 ***
3 65 30.1 0 -0.02
4 64 29.5 1 0.62 1.36 0.24833
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

Lastly, I constructed a data frame based on the second model which gives both the model prediction and a 95% uncertainty interval. This was used to generate the second set of plots.

```
fit.data <- data.frame(year = rep(1980:2020, each = 2), gender = c("Female", "Male"))
fit.data <- cbind(fit.data, predict(fit.2, fit.data, level = 0.95, interval = "prediction"))
```

Brad Brown has published evidence that the runner in question, Kitty Chutergon (race number 25058), had another athlete running with his number for most of the race. This is a change in strategy from last year, where he appears to have been assisted along the route, having missed the timing mats at Camperdown and Polly Shortts.

]]>My data for the 2013 up run gives times with one second precision, so these questions could be answered if I relaxed the constraints from “exactly the same time” to “within one second of each other”. We’ll call such simultaneous pairs of runners “twins” and simultaneous threesomes will be known as “tripods”. How many twins are there? How many tripods? The answers are somewhat surprising. What’s even more surprising is another category: “phantoms”.

If you are not interested in the details of the analysis (and I’m guessing that you probably aren’t), please skip forward to the pictures and analysis.

The first step is to subset the data, leaving a data frame containing only the times at halfway and the finish, indexed by a unique runner key.

```
simultaneous = subset(splits,
year == 2013 & !is.na(medal))[, c("key", "drummond.time", "race.time")]
simultaneous = simultaneous[complete.cases(simultaneous),]
#
rownames(simultaneous) = simultaneous$key
simultaneous$key <- NULL
head(simultaneous)
```

```
drummond.time race.time
4bdcb291 320.15 712.42
4e488aab 294.65 656.90
ab59fc97 304.62 643.67
89d3e09b 270.32 646.78
fc728816 211.27 492.95
7b761740 274.60 584.37
```

Next we calculate the “distance” between runners (a distance in time, not in space): the Euclidean distance between each pair of runners' halfway and finish times, which is zero only when both times match. This yields a rather large matrix with rows and columns labelled by runner key. These data are then transformed into a format where each row represents a pair of runners.
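In miniature, what `dist()` computes here (toy data, not the actual race times): each entry is the Euclidean distance between two runners' (halfway, finish) rows, so it vanishes exactly when both times agree.

```r
# Three toy runners: the first two crossed both mats together, the
# third was 10 minutes behind at each.
times <- data.frame(drummond.time = c(300, 300, 310),
                    race.time     = c(650, 650, 660))
round(as.matrix(dist(times)), 2)
```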

```
simultaneous = dist(simultaneous)
library(reshape2)
simultaneous = melt(as.matrix(simultaneous))
head(simultaneous)
```

```
Var1 Var2 value
1 4bdcb291 4bdcb291 0.000
2 4e488aab 4bdcb291 61.093
3 ab59fc97 4bdcb291 70.483
4 89d3e09b 4bdcb291 82.408
5 fc728816 4bdcb291 244.992
6 7b761740 4bdcb291 135.910
```

We can immediately see that there are some redundant entries. We need to remove the matrix diagonal (obviously the times match when a runner is compared to himself!) and keep only one half of the matrix.

```
simultaneous = subset(simultaneous, as.character(Var1) < as.character(Var2))
```

Finally we retain only the records for those pairs of runners who crossed both mats simultaneously (in retrospect, this could have been done earlier!).

```
simultaneous = subset(simultaneous, value == 0)
head(simultaneous)
```

```
Var1 Var2 value
623174 5217dfc9 75a78d04 0
971958 d8c9c403 e6e0d6e3 0
2024105 2e8f7778 9acc46ee 0
2464116 5f18d86f 9a1697ff 0
2467712 63033429 9a1697ff 0
3538608 54a92b96 f574be97 0
```

We can then merge in the data for race numbers and names, leaving us with an (anonymised) data set that looks like this:

```
simultaneous = simultaneous[order(simultaneous$race.time),]
head(simultaneous)[, c(4, 6, 8)]
```

```
race.number.x race.number.y race.time
133 59235 56915 07:54:21
9 26132 23470 08:06:55
62 44008 31833 08:25:58
61 25035 36706 08:35:42
54 28868 25910 08:46:42
26 47703 31424 08:47:08
```

```
tail(simultaneous)[, c(4, 6, 8)]
```

```
race.number.x race.number.y race.time
71 54689 16554 11:55:59
60 8846 23003 11:56:26
44 9235 49251 11:56:47
38 53354 53352 11:56:56
28 19268 59916 11:57:49
20 22499 40754 11:58:26
```

As it turns out, there are a remarkably large number of Comrades twins. In the 2013 race there were more than 100 such pairs. So they are not as rare as I had assumed they would be.

Although there were relatively many Comrades twins, there were only two tripods. In both cases, all three members of the tripod shared the same surname, so they are presumably related.

The members of the first tripod all belong to the same running club, two of them are in the 30-39 age category and the third is in the 60+ group. There’s a clear family resemblance, so I’m guessing that they are father and sons. Dad had gathered 9 medals, while the sons had 2 and 3 medals respectively. What a day they must have had together!

The second tripod also consisted of three runners from the same club. Based on gender and age groups, I suspect that they are Mom, Dad and son. The parents had collected 8 medals each, while junior had 3. What a privilege to run the race with your folks! Lucky guy.

And now things get more interesting…

The runner with race number 26132 appears to have run all the way from Durban to Pietermaritzburg with runner 23470! Check out the splits below.

Not only did they pass through halfway and the finish at the same time, but they crossed *every* mat along the route at *precisely* the same time. Yet, somewhat mysteriously, there is no sign of 23470 in the race photographs…

You might notice that there is another runner with 26132 in all three of the images above. That’s not 23470. He has race number 28151 and he is not the phantom! His splits below show that he only started running with 26132 somewhere between Camperdown and Polly Shortts.

If you search the race photographs for the phantom’s race number (23470), you will find that there are no pictures of him at all! That’s right, nineteen photographs of 26132 and not a single photograph of 23470.

The runner with race number 53367 was also accompanied by a phantom with race number 27587. Again, as can be seen from the splits below, these two crossed every mat on the course at *precisely* the same time.

Yet, despite the fact that 53367 is quite evident in the race photos, there is no sign of 27587.

I would have expected to see a photograph of 53367 embracing his running mate at the finish, yet we find him pictured with two other runners. In fact, if you search the race photographs for 27587 you will find that there are no photographs of him at all. You will, however, find twelve photographs of 53367.

Well done to the tripods, I think you guys are awesome! As for the phantoms (and their running mates), you have some explaining to do.

]]>Well, suppose that it takes me 3 minutes from the gun to get across the starting line. And, furthermore, assume that I will be running around 5% slower towards the end of the race. To still get to Durban under 9 hours I would need to run at roughly 5:52 per km at the beginning and gradually ease back to about 6:11 per km towards the end.

I arrived at these figures using a pacing spreadsheet. To get an idea of your pace requirements you will need to specify your goal time, the number of minutes you anticipate losing before crossing the start line and an estimate of how much you think you will slow down during the course of the race. This is done by editing the blue fields indicated in the image below. The rest of the spreadsheet will update on the basis of your selections.

The spreadsheet uses a simple linear model which assumes that your pace will gradually decline at the rate you have specified. If you give 0% for your slowing down percentage then the calculations are performed on the basis of a uniform pace throughout the race. Of course, neither the linear model nor a uniform pace are truly realistic. We all know that our pace will vary continuously throughout the race as a function of congestion, topography, hydration, fatigue, motivation and all of the other factors which come into play. However, as noted by the eminent statistician George Box, “all models are wrong, but some are useful”. In this case the linear model is a useful way to account for the effects of fatigue.
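The underlying arithmetic is easy to reproduce. Below is my own R reconstruction of the linear model (a hypothetical helper, not the spreadsheet's actual formulas; 89 km is the assumed race distance): pace ramps linearly from a starting value to that value inflated by the slow-down percentage, and with a linear ramp the average pace is simply the midpoint of the two.

```r
# Given a goal time, a delay at the start and a slow-down fraction,
# return start, end and average paces (all in minutes per km).
pacing <- function(goal_min, delay_min, slow, distance_km = 89) {
  # Running time (after the start delay) spread over the route.
  avg <- (goal_min - delay_min) / distance_km
  # Pace ramps linearly from p0 to p0 * (1 + slow), so the average
  # pace is the midpoint: avg = p0 * (1 + slow / 2).
  p0 <- avg / (1 + slow / 2)
  c(start_pace = p0, end_pace = p0 * (1 + slow), avg_pace = avg)
}

# A sub-9:00 finish with a 3 minute start delay and 5% slow-down.
round(pacing(goal_min = 540, delay_min = 3, slow = 0.05), 3)
```

Plugging in the numbers from the opening paragraph lands within a couple of seconds per km of the figures quoted there.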

The spreadsheet will give you an indication of the splits (both relative to the start of the race as well as time of day) and pace (instantaneous and average) required to achieve your goal time. There are also a pair of charts which will be updated with your projected timing and pace information.

My plan on race day is to run according to my average pace. This works well because it smooths out all the perturbations associated with tables and walking breaks.

One interesting thing to play around with on the spreadsheet is the effect of losing time at the start. If you vary this number you should see that it really does not have a massive influence on your pacing requirements for the rest of the race. For example, if I change my estimate from 3 minutes to 10 minutes then my required average pace decreases from 6:02 per km to 5:57 per km. Sure, this amounts to 5 seconds shaved off every km, but it is not unmanageable: the delay at the start gets averaged out over the rest of the race.

Naturally, the faster you are hoping to finish the race, the more significant a delay at the start is going to become. However, if you are aiming for a really fast time then presumably you are in a good seeding batch. For the majority of runners it is probably not going to make an enormous difference and so it is not worth stressing about.

The important thing is to make sure that you just keep on moving forward. Don’t stop. Just keep on putting one foot in front of the other.

The pacing chart by Dirk Cloete is based on the profile of the route. It breaks the route down into undulating, up and down sections and takes this into account when calculating splits.

]]>I have consequently partitioned the data according to “strict” and “extended” cutoffs.

```
novices$extended = factor(novices$year == 2000 | novices$year >= 2003,
labels = c("Strict Cutoff", "Extended Cutoff"))
```

This paints a much more representative picture of the distribution of finish times now that the race has been extended to 12 hours.

The allocation of medals is complicated by the fact that new medals have been introduced at different times over recent years. Specifically, the Bill Rowan medal was first awarded in 2000, then the Vic Clapham medal was introduced in 2003 and, finally, 2007 saw the first Wally Hayward medals.

```
novices$period = cut(novices$year, breaks = c(1900, 2000, 2003, 2007, 3000), right = FALSE,
labels = c("before 2000", "2000 to 2002", "2003 to 2006", "after 2007"))
novice.medals = table(novices$medal, novices$period)
novice.medals = scale(novice.medals, scale = colSums(novice.medals), center = FALSE) * 100
options(digits = 1)
(novice.medals = t(novice.medals))
```

```
Gold Wally Hayward Silver Bill Rowan Bronze Vic Clapham
before 2000 0.07 0.00 4.80 0.00 95.13 0.00
2000 to 2002 0.09 0.00 2.66 12.76 84.49 0.00
2003 to 2006 0.15 0.00 4.05 17.51 47.63 30.66
after 2007 0.08 0.03 2.60 12.28 46.40 38.62
```

So, currently, around 46% of novices get a Bronze medal while slightly fewer, just under 39%, get a Vic Clapham medal. A significant fraction, just over 12%, achieve a Bill Rowan, while only 2.6% get a Silver medal. The number of Wally Hayward and Gold medals among novices is very small indeed.
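Incidentally, the `scale()` trick used above to turn counts into column percentages is equivalent to base R's `prop.table()`. A tiny self-contained check:

```r
# scale() with colSums() and prop.table(margin = 2) both yield
# column-wise percentages.
m <- matrix(c(1, 3, 2, 4), nrow = 2)
a <- scale(m, scale = colSums(m), center = FALSE) * 100
b <- prop.table(m, margin = 2) * 100
all.equal(as.numeric(a), as.numeric(b))  # TRUE
```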

Thanks to Daniel for pointing out this issue!

]]>To paraphrase the dictionary, a *novice* is “a person who is new to or inexperienced in the circumstances in which he or she is placed; a beginner”. In the context of the Comrades Marathon this definition can be interpreted in a few ways:

- a runner who has never run the Comrades Marathon (has never started the race);
- a runner who has never completed the Comrades Marathon (has never finished the race); or
- a runner who has not completed both an “up” and a “down” Comrades Marathon.

For the purposes of this article I will be adopting the first definition. This is probably the one of most interest to runners who are embarking on their first Comrades journey.

I’ll be using the same data sets that I have discussed in previous articles. Before we focus on the data for the novices we’ll start by just retaining the fields of interest.

```
novices = results[, c("key", "year", "category", "gender", "medal", "medal.count", "status", "ftime")]
head(novices)
```

```
key year category gender medal medal.count status ftime
1 100030f4 2008 Ages 20 - 29 Female Vic Clapham 1 Finished 11.3728
2 100030f4 2009 Ages 20 - 29 Female <NA> 1 DNF NA
3 100030f4 2013 Ages 20 - 29 Female <NA> 1 DNS NA
4 10007cb6 2005 Ages 26 - 39 Male Bronze 1 Finished 9.1589
5 10007cb6 2006 Ages 30 - 39 Male Bill Rowan 2 Finished 8.2564
6 10007cb6 2007 Ages 30 - 39 Male Bill Rowan 3 Finished 8.0344
```

To satisfy our definition of novice we’ll need to exclude the “did not start” (DNS) records.

```
novices = subset(novices, status != "DNS")
head(novices)
```

```
key year category gender medal medal.count status ftime
1 100030f4 2008 Ages 20 - 29 Female Vic Clapham 1 Finished 11.3728
2 100030f4 2009 Ages 20 - 29 Female <NA> 1 DNF NA
4 10007cb6 2005 Ages 26 - 39 Male Bronze 1 Finished 9.1589
5 10007cb6 2006 Ages 30 - 39 Male Bill Rowan 2 Finished 8.2564
6 10007cb6 2007 Ages 30 - 39 Male Bill Rowan 3 Finished 8.0344
7 10007cb6 2008 Ages 30 - 39 Male Bill Rowan 4 Finished 8.8514
```

Some runners do not finish the race on their first attempt but they bravely come back to run the race again. We will retain only the first record for each runner, because by their second attempt they are (according to our definition) no longer novices: they already have some race experience.

```
novices <- novices[order(novices$year),]
novices <- novices[which(!duplicated(novices$key)),]
```

I suppose that the foremost question going through the minds of many Comrades novices is “Will I finish?”.

```
table(novices$status) / nrow(novices) * 100
```

```
Finished DNF
80.035 19.965
```

Well, there’s some good news: around 80% of all novices finish the race. Those are quite compelling odds. Of course, a number of factors can influence the success of each individual, but if you have done the training and you run sensibly, then the odds are in your favour.

What medal is a novice most likely to receive?

```
table(novices$medal) / nrow(subset(novices, !is.na(medal))) * 100
```

```
Gold Wally Hayward Silver Bill Rowan Bronze Vic Clapham
0.0829671 0.0051854 4.0264976 5.6469490 79.4708254 10.7675754
```

The vast majority (again around 80%) claim a Bronze medal. There are also a significant proportion (just over 10%) who miss the eleven hour cutoff and get a Vic Clapham medal. Around 6% of novices achieve a Bill Rowan medal and a surprisingly large fraction, just over 4%, manage to finish in a Silver medal time of under seven and a half hours. There are very few Wally Hayward and Gold medals won by novices. The odds for a novice Gold medal are around one in 1200, all else being equal (which it very definitely isn’t!).

As one would expect, the chart slopes up towards the right: progressively more runners come in later in the day. There is very clear evidence of clustering of runners just before the medal cutoffs at 07:30, 09:00, 11:00 and 12:00. There is also a peak before the psychological cutoff at 10:00.

The data for previous years indicates that the outlook for novices is rather good. 80% of them will finish the race and, of those, around 80% will receive Bronze medals.

How can you help ensure that you have a successful race? Here are some of the things I would think about:

- Start slowly. It’s going to be a long day.
- Take regular walking breaks and start doing this early on. A few minutes' recovery will power you up for a number of kms.
- Stay hydrated. Take something at every water table. Just don’t overdo it.
- Be inspired by the other runners: they all have the guts to indulge in this madness with you and every one of them is fighting their own battle.
- Enjoy the support: the hordes of people beside the road have come out to see YOU run by. And they all want you to finish.
- Enjoy the day: as far as entertainment is concerned, the Comrades Marathon is about the best value for money that you can get.

See you in Pietermaritzburg at 05:30 on 1 June!

**Note:** There’s an error in this post which is corrected here.

Thanks to Daniel for suggesting this article.

]]>Let’s have a look at the ten most extreme negative splits from Comrades Marathon 2013:

```
split.ratio.2013 = subset(split.ratio, year == 2013)
#
split.ratio.2013 = head(split.ratio.2013[order(split.ratio.2013$ratio),], 10)
#
rownames(split.ratio.2013) <- 1:nrow(split.ratio.2013)
split.ratio.2013[, c(-2, -7)]
```

```
year key drummond.time race.time ratio
1 2013 3c0ea3bc 368.12 636.50 -0.270929
2 2013 e22d8c74 359.00 633.17 -0.236305
3 2013 5cd624eb 354.87 640.05 -0.196365
4 2013 4d5a86d7 359.45 659.88 -0.164186
5 2013 61fa6b5 345.33 644.38 -0.134025
6 2013 e5d6fa0e 344.33 649.83 -0.112778
7 2013 63a33c8d 368.88 696.88 -0.110830
8 2013 e445f2d1 340.15 647.20 -0.097310
9 2013 fed967de 338.67 647.77 -0.087303
10 2013 553aeb62 364.02 697.90 -0.082780
```

Below are the splits data for these runners (in the same order as the table above).

The top one you have seen before (it was presented in my previous post). And, as previously noted, this runner’s time was not captured by the mat at either Camperdown or Polly Shortts.

But if we look at the runner with the next most extreme negative split (e22d8c74) we see that the same thing happened: mysteriously he too was missed by those timing mats. The mats must have been having a bad day. The next two major negative splits (5cd624eb and 4d5a86d7): same story, no times at either of those mats.

The next runner (61fa6b5) was captured on all five timing mats. And if we look at his splits, he is getting progressively faster during the course of the race. I suspect that this guy actually just had a very well planned and executed race. But the following runner on the list (e5d6fa0e) has also managed to elude both the mats in the second half of the race. Very strange indeed.

The final four runners all have splits registered for every timing mat. And, again, if you look at their pace for each of the legs, it is not too hard to believe that these runners were playing by the rules and just had a very good day on the road.

So, of the top ten runners with extreme negative splits, five of them (yes, that’s 50%) inexplicably missed both timing mats in the second half of the race. Coincidence? I think not.

]]>This story emerged in February this year.

There was quite a fuss.

And then everything went quiet. The suspected runners were instructed to attend disciplinary hearings, but the outcomes of these hearings have not been publicised nor have the names of the suspected runners been released.

I have done some previous analyses using Comrades Marathon data. Here I am going to use the same data set to explore these suspicious negative splits.

I started off by extracting a subset of the columns from my splits data.

```
split.ratio = splits[, c("year", "race.number", "key", "drummond.time", "race.time")]
tail(split.ratio)
```

```
year race.number key drummond.time race.time
2013-9911 2013 9911 eb4b3b0c 303.40 686.68
2013-9912 2013 9912 c8d6cfdd 218.73 484.00
2013-9940 2013 9940 f46204ad 249.87 582.03
2013-9954 2013 9954 4bd1ca76 307.62 669.23
2013-9955 2013 9955 b2b9ed60 286.85 651.87
2013-9964 2013 9964 6f14470d 242.20 573.78
```

The resulting records have fields for the year, athlete’s race number, a unique key identifying the runner, and time taken (in minutes) to reach the little town of Drummond (the half way point at around the marathon distance) and the finish. We will only keep the complete records (valid entries for both half way and the full distance) and then add a new field.
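The filtering step is not shown explicitly in the snippets below; a minimal sketch (assuming that missing splits are recorded as NA) would be:

```r
# Keep only the complete records: valid times at both
# Drummond and the finish (missing splits assumed to be NA).
split.ratio = subset(
  split.ratio,
  !is.na(drummond.time) & !is.na(race.time)
)
```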

```
# This is derived from (race.time - drummond.time) / drummond.time - 1
#
split.ratio = transform(
  split.ratio,
  ratio = race.time / drummond.time - 2
)
head(split.ratio)
```

```
year race.number key drummond.time race.time ratio
2000-10003 2000 10003 f1dffb06 243.65 532.33 0.18483
2000-10009 2000 10009 b06cab7f 274.47 599.95 0.18588
2000-10010 2000 10010 929fd7ee 273.38 620.35 0.26916
2000-10013 2000 10013 5d7aa79c 295.72 633.80 0.14327
2000-10014 2000 10014 c0578dad 247.18 533.80 0.15953
2000-10016 2000 10016 d64e4b42 257.60 657.65 0.55299
```

```
summary(split.ratio$ratio)
```

```
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.5640 0.0934 0.1630 0.1870 0.2540 1.7400
```

The ratio field is a number greater than -1 (and unbounded above) which quantifies the relative time difference between the first and second halves of the race. So, for example, if a runner took 4.5 hours for the first half and then 5.0 hours for the second half, his ratio would be 0.11111, indicating that he ran around 11% slower in the second half of the race.

```
9.5 / 4.5 - 2
```

```
[1] 0.11111
```

Conversely, if a runner took 5.0 hours for the first half and then finished the second half in 4.5 hours, his ratio would be -0.1, indicating that he ran about 10% faster in the second half.

```
9.5 / 5.0 - 2
```

```
[1] -0.1
```

Negative values of this ratio then indicate *negative splits*, while positive values are for *positive splits* and a value of exactly zero would be for *even splits* (same time for both halves of the race). Let’s look at the two extremes.

```
head(split.ratio[order(split.ratio$ratio, decreasing = TRUE),])[, -2]
```

```
year key drummond.time race.time ratio
2009-37874 2009 2c5ad823 178.72 668.70 1.7417
2008-36570 2008 d4033ea2 189.98 710.13 1.7379
2005-30155 2005 5a961d21 175.13 643.78 1.6760
2009-33945 2009 a1e79747 183.08 671.57 1.6681
2009-57185 2009 fdc6a261 186.92 653.70 1.4973
2011-56513 2011 df77e8bb 172.38 598.12 1.4697
```

Large (positive) values of the split ratio mean that a runner ran the second half much slower than the first half. Unless the time for the first half is unrealistically fast, these are not suspicious: it is quite reasonable for a runner to go out really hard, get to half way in good time and then find that the wheels fall off in the second half of the race. Take, for example, the runner with key 2c5ad823, whose time for the first half was blisteringly fast (just less than three hours) but who slowed down a lot in the second half, only finishing the race in around 11 hours.

```
head(split.ratio[order(split.ratio$ratio),])[, -2]
```

```
year key drummond.time race.time ratio
2001-45410 2001 1a605ce5 340.32 488.82 -0.56364
2009-25058 2009 3c0ea3bc 359.08 591.63 -0.35238
2000-2187 2000 ef35f2e6 337.08 569.48 -0.31056
2000-8152 2000 18e59575 324.03 557.25 -0.28027
2013-25058 2013 3c0ea3bc 368.12 636.50 -0.27093
2012-48382 2012 7889f60a 336.85 592.57 -0.24086
```

At the other end of the spectrum we have runners with very low values of the split ratio, meaning that they ran the second half much faster than the first half. Take, for example, the runner with key 1a605ce5: she ran the first half in around five and a half hours but whipped through the second half in less than three hours. Seems a little odd, right?

Note that one runner (key 3c0ea3bc) crops up twice in the top 6 negative split ratios above. More about him later.

Let’s have a look at the empirical distribution of split ratios.
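The figure was presumably generated with ggplot2; a minimal sketch of the sort of call involved (the published figure may differ in detail):

```r
library(ggplot2)

# Empirical density of split ratios, with dashed lines marking
# the extreme values at either end of the distribution.
ggplot(split.ratio, aes(x = ratio)) +
  geom_density() +
  geom_vline(xintercept = range(split.ratio$ratio), linetype = "dashed") +
  xlab("Split ratio")
```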

We can see that only a very small fraction of the field achieves a negative split. And that these runners generally only shave a few percent off their first half times. The dashed lines on the plot indicate the extreme values of the split ratio. Both of these are a long way from the body of the distribution. In statistical terms, either of these extremes is highly improbable.

If we categorise the runners broadly by the number of hours required to finish the race then we get a slightly different view of the data.

```
split.ratio = transform(
  split.ratio,
  ihour = factor(floor(race.time / 60))
)
levels(split.ratio$ihour) = sprintf("%s hour", levels(split.ratio$ihour))
#
(split.ratio.range = ddply(split.ratio, .(ihour), summarize, min = min(ratio), max = max(ratio)))
```

```
ihour min max
1 5 hour -0.061526 0.24595
2 6 hour -0.130848 0.57918
3 7 hour -0.172256 0.84996
4 8 hour -0.563642 1.43530
5 9 hour -0.352379 1.46969
6 10 hour -0.270929 1.67596
7 11 hour -0.115299 1.74168
```

Runners who finish the race in less than 6 hours (in the “5 hour” bin above, which includes the race winner) have split ratios between -0.061526 and 0.24595. The 8 hour bin has ratios which range from -0.563642 to 1.43530. So there was a runner in this group who was more than twice as fast in the second half… The 9 and 10 hour bins also have some inordinately large negative splits.

What about the distribution of splits in each of these categories?

Now that paints an interesting picture. We can clearly see that in the 5 hour bin quite a significant proportion of the elite runners manage to achieve negative splits. The proportion in all the other bins is appreciably smaller, yet the extreme negative splits are very much larger!

Note that the density curve for the 5 hour bin extends slightly beyond the dashed line indicating the smallest value in this group. This is an artifact of the kernel density method used to create these curves, for which there is a trade off between the smoothness of the curve and the fidelity of the curve to the data. With a smoother curve the data are effectively smeared out more.

We can quantify those proportions.

```
negsplit.ihour = with(split.ratio, table(ihour, ratio < 0))
negsplit.ihour = negsplit.ihour / rowSums(negsplit.ihour)
#
negsplit.ihour[,2] * 100
```

```
5 hour 6 hour 7 hour 8 hour 9 hour 10 hour 11 hour
14.2857 2.8740 2.2335 3.0653 3.1862 3.8505 1.9485
```

So, 14.3% of the runners in the 5 hour bin shave off some time in the second half of the race. In the other bins only around 2% to 3% of runners manage to achieve this feat.

Finally, before we dig into the details of some individual runners, let’s see how things vary from year to year.

These data are more or less consistent between years. The median of the ratio is around 10% to 20%; the maximum is always roughly 100% or more; the minimum fluctuates rather wildly, extending from the credible -9.7% all the way down to the incredible -56.4%.

```
ddply(split.ratio, .(year), summarize, median = median(ratio), min = min(ratio), max = max(ratio))
```

```
year median min max
1 2000 0.163330 -0.310556 1.3282
2 2001 0.168550 -0.563642 1.0321
3 2002 0.211599 -0.175799 1.2257
4 2003 0.171931 -0.151615 1.3793
5 2004 0.201743 -0.172256 1.2693
6 2005 0.151614 -0.183591 1.6760
7 2006 0.179430 -0.131274 1.0500
8 2007 0.153102 -0.129477 1.3033
9 2008 0.208643 -0.096563 1.7379
10 2009 0.163242 -0.352379 1.7417
11 2010 0.093532 -0.206322 1.3878
12 2011 0.150365 -0.125141 1.4697
13 2012 0.118362 -0.240859 1.1876
14 2013 0.204870 -0.270929 1.1596
```

We are going to focus our attention on those runners with suspiciously large negative splits. These have been identified on the plot below as those with ratios less than -15% (that is, to the left of the dotted line). The threshold at -15% is somewhat arbitrary, but is certainly conservative.

We extract only those records with ratios less than -15% and discard fields (like race number) to enforce a degree of anonymity. We will also add in a field to indicate how many times a runner appears in the list.

```
RMIN = -0.15  # Threshold for a "suspicious" negative split.
suspect = subset(split.ratio, ratio < RMIN)[, c("year", "key", "race.time", "ratio")]
(suspect = ddply(suspect, .(key), mutate, entries = length(ratio)))
```

```
year key race.time ratio entries
1 2000 12bade96 545.20 -0.15863 1
2 2000 18e59575 557.25 -0.28027 1
3 2001 1a605ce5 488.82 -0.56364 1
4 2002 2edeb04e 556.53 -0.17580 1
5 2009 3c0ea3bc 591.63 -0.35238 2
6 2013 3c0ea3bc 636.50 -0.27093 2
7 2001 4abfd3 526.87 -0.19741 1
8 2013 4d5a86d7 659.88 -0.16419 1
9 2013 5cd624eb 640.05 -0.19636 1
10 2004 5ec9a72b 445.12 -0.17226 1
11 2012 7889f60a 592.57 -0.24086 1
12 2005 81f2015 538.72 -0.18359 1
13 2010 9f83c1a5 639.75 -0.15252 1
14 2003 a229ca86 544.75 -0.15161 1
15 2012 a59982c4 633.30 -0.18235 1
16 2010 a962c295 644.05 -0.20632 1
17 2010 ab59fc97 626.78 -0.15986 1
18 2001 c293e8f5 618.82 -0.17395 1
19 2013 e22d8c74 633.17 -0.23630 1
20 2000 ef35f2e6 569.48 -0.31056 1
21 2000 efdaf288 611.33 -0.22502 1
22 2005 fce308d5 638.98 -0.18083 1
```

That’s interesting, only one runner (the same guy with key 3c0ea3bc) appears twice.

We can take a look at the recent race history for these runners.

For a number of these runners there are only splits data for a few years, so it’s quite difficult to say anything conclusive. The negative split achieved by 1a605ce5 in 2001 looks pretty extreme though… Other runners, like 4d5a86d7, 9f83c1a5 and fce308d5, have a high degree of variability in both their first and second half times, so again it is difficult to spot an anomaly with certainty.

Let’s have a good look at 3c0ea3bc though. He has run the race consistently from 1991 to 2013. He did not finish in 1991 or 1997, but in the other years has managed to rack up 11 Bronze medals and 9 Vic Clapham medals, and in the process earned a double green number. The plot shows that his time to half way has been gradually increasing over the years. Not surprising since we all slow down with age. His finish time has mostly followed the same trend. Except for two major hiccups in 2009 and 2013. It’s hard to say for certain that these unusual negative splits were the result of cheating. But, equally, it’s hard to imagine how else they might have happened.

Here are the splits data for 3c0ea3bc:

So he was not recorded by either of the timing mats at Camperdown or Polly Shortts. It is well known that these mats are not perfect and sometimes they do miss runners. However, the missing splits at these mats plus the extraordinary time for the second half of the race are rather condemning.

I wonder what happened with those disciplinary hearings?

]]>The analysis started off with the same data set that I was working with before, from which I extracted only the records for the winners.

```
winners = subset(results, gender.position == 1, select = c(year, name, gender, race.time))
head(winners)
```

```
year name gender race.time
1 1980 Alan Robb Male 05:38:25
428 1980 Isavel Roche-Kelly Female 07:18:00
3981 1981 Bruce Fordyce Male 05:37:28
4055 1981 Isavel Roche-Kelly Female 06:44:35
7643 1982 Bruce Fordyce Male 05:34:22
7873 1982 Cheryl Winn Female 07:04:59
```

I then added in a field which gives a count of the number of times each person won the race.

```
library(plyr)
winners = ddply(winners, .(name), function(df) {
  df = df[order(df$year),]
  df$count = 1:nrow(df)
  return(df)
})
subset(winners, name == "Bruce Fordyce")
```

```
year name gender race.time count
7 1981 Bruce Fordyce Male 05:37:28 1
8 1982 Bruce Fordyce Male 05:34:22 2
9 1983 Bruce Fordyce Male 05:30:12 3
10 1984 Bruce Fordyce Male 05:27:18 4
11 1985 Bruce Fordyce Male 05:37:01 5
12 1986 Bruce Fordyce Male 05:24:07 6
13 1987 Bruce Fordyce Male 05:37:01 7
14 1988 Bruce Fordyce Male 05:27:42 8
15 1990 Bruce Fordyce Male 05:40:25 9
```

The chart was generated as a scatter plot using ggplot2. The size of the points relates to the number of times each person won the race. The colour scale is as you might imagine: pink for the ladies and blue for the men.

```
library(ggplot2)
ggplot(winners, aes(x = year, y = name, color = gender)) +
  geom_point(aes(size = count), shape = 19, alpha = 0.75) +
  scale_size_continuous(range = c(5, 15)) +
  ylab("") + xlab("") +
  scale_x_discrete(expand = c(0, 1)) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, colour = "black"),
    axis.text.y = element_text(colour = "black"),
    legend.position = "none",
    panel.background = element_blank(),
    panel.grid.major = element_line(linetype = "dotted", colour = "grey"),
    panel.grid.major.x = element_blank()
  )
```

Two of the key aspects of getting this to look just right were:

- the call to
`scale_size_continuous()`

which ensured that a reasonable range of point sizes was used and - the call to
`scale_x_discrete()`

which expanded the plot very slightly so that the points near the borders were not cropped.

Just to recall what the data look like:

```
head(splits.2013)
```

```
gender age.category drummond.time race.time status medal
2013-10014 Male 50-59 5.510833 NA DNF <NA>
2013-10016 Male 60-69 6.070833 NA DNF <NA>
2013-10019 Male 20-29 5.335833 11.87361 Finished Vic Clapham
2013-10031 Male 20-29 4.910833 10.94833 Finished Bronze
2013-10047 Male 50-59 5.076944 10.72778 Finished Bronze
2013-10049 Male 50-59 5.729444 NA DNF <NA>
```

Here the drummond.time and race.time fields are expressed in decimal hours and correspond to the time taken to reach the half-way mark and the finish respectively. The status field indicates whether a runner finished the race or did not finish (DNF).

I am going to consider two models. The first will look at the probability of finishing and the second will look at the distribution of medals. The features which will be used to predict these outcomes will be gender, age category and half-way time at Drummond. To build the first model, first load the party library and then call ctree.

```
library(party)
tree.status = ctree(
  status ~ gender + age.category + drummond.time,
  data = splits.2013,
  control = ctree_control(minsplit = 750)
)
tree.status
```

```
Conditional inference tree with 17 terminal nodes
Response: status
Inputs: gender, age.category, drummond.time
Number of observations: 13917
1) drummond.time <= 5.669167; criterion = 1, statistic = 2985.908
2) drummond.time <= 5.4825; criterion = 1, statistic = 494.826
3) age.category <= 40-49; criterion = 1, statistic = 191.12
4) drummond.time <= 5.078611; criterion = 1, statistic = 76.962
5) gender == {Male}; criterion = 1, statistic = 73.4
6)* weights = 5419
5) gender == {Female}
7)* weights = 836
4) drummond.time > 5.078611
8) gender == {Male}; criterion = 1, statistic = 63.347
9) drummond.time <= 5.379722; criterion = 1, statistic = 15.55
10)* weights = 1123
9) drummond.time > 5.379722
11)* weights = 447
8) gender == {Female}
12)* weights = 634
3) age.category > 40-49
13) drummond.time <= 5.038056; criterion = 1, statistic = 68.556
14) age.category <= 50-59; criterion = 1, statistic = 40.471
15) gender == {Female}; criterion = 1, statistic = 32.419
16)* weights = 118
15) gender == {Male}
17)* weights = 886
14) age.category > 50-59
18)* weights = 170
13) drummond.time > 5.038056
19)* weights = 701
2) drummond.time > 5.4825
20) gender == {Male}; criterion = 1, statistic = 56.149
21) age.category <= 40-49; criterion = 0.995, statistic = 9.826
22)* weights = 636
21) age.category > 40-49
23)* weights = 259
20) gender == {Female}
24)* weights = 352
1) drummond.time > 5.669167
25) drummond.time <= 5.811389; criterion = 1, statistic = 301.482
26) age.category <= 30-39; criterion = 1, statistic = 37.006
27)* weights = 315
26) age.category > 30-39
28)* weights = 553
25) drummond.time > 5.811389
29) drummond.time <= 5.940556; criterion = 1, statistic = 75.164
30) age.category <= 30-39; criterion = 1, statistic = 25.519
31)* weights = 299
30) age.category > 30-39
32)* weights = 475
29) drummond.time > 5.940556
33)* weights = 694
```

There is a deluge of information in the textual representation of the model. Making sense of this is a lot easier with a plot.

```
plot(tree.status)
```

The image below is a little small. You will want to click on it to bring up a larger version.

To interpret the tree, start at the top node (Node 1) labelled drummond.time, indicating that of the features considered, the most important variable in determining a successful outcome at the race is the time to the half-way mark. We are presented with two options: times that are either less than or greater than 5.669 hours. The cutoff time at Drummond is 6.167 hours (06:10:00), so runners reaching half-way after 5.669 hours are already getting quite close to the cutoff time. Suppose that we take the > 5.669 branch. The next node again depends on the half-way time, in this case dividing the population at 5.811 hours. If we take the left branch then we are considering runners who got to Drummond after 5.669 hours but before 5.811 hours. The next node depends on age category. The two branches here are for runners who are 39 and younger (left branch) and older runners (right branch). If we take the right branch then we reach the terminal node. There were 553 runners in this category and the spine plot indicates that around 35% of those runners successfully finished the race.

Rummaging around in this tree, there is a lot of interesting information to be found. For example, female runners who are aged 49 years or younger and pass through Drummond in a time of between 5.079 and 5.482 hours are around 95% likely to finish the race. In fact, this is the most successful group of runners (there were 634 of them in the field). The next best group was male runners in the same age category who got to half-way in less than 5.079 hours: roughly 90% of the 5419 runners in this group finished the race.
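The fitted tree can also be queried directly for predicted probabilities. A sketch using a hypothetical runner (the factor level strings in newdata must match those in splits.2013, which is an assumption here):

```r
# Predicted outcome probabilities for a hypothetical runner:
# male, aged 40-49, reaching Drummond in 5.2 hours.
newdata = data.frame(
  gender = factor("Male", levels = levels(splits.2013$gender)),
  age.category = factor("40-49", levels = levels(splits.2013$age.category)),
  drummond.time = 5.2
)
predict(tree.status, newdata = newdata, type = "prob")
```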

Constructing a model for medal allocation is done in a similar fashion.

```
splits.2013.finishers = subset(splits.2013, status == "Finished" & !is.na(medal))
#
levels(splits.2013.finishers$medal) <- c("G", "WH", "S", "BR", "B", "VC")
```

Here I first extracted the subset of runners who finished the race (and for whom I have information on the medal allocated). Then, to make the plotting a little easier, the names of the levels in the medal factor are changed to a more compact representation.

```
tree.medal = ctree(
  medal ~ gender + age.category + drummond.time,
  data = splits.2013.finishers,
  control = ctree_control(minsplit = 750)
)
tree.medal
```

```
Conditional inference tree with 19 terminal nodes
Response: medal
Inputs: gender, age.category, drummond.time
Number of observations: 10221
1) drummond.time <= 4.124167; criterion = 1, statistic = 7452.85
2) drummond.time <= 3.438889; criterion = 1, statistic = 1031.778
3)* weights = 571
2) drummond.time > 3.438889
4) drummond.time <= 3.812222; criterion = 1, statistic = 342.628
5) drummond.time <= 3.708056; criterion = 1, statistic = 53.658
6)* weights = 549
5) drummond.time > 3.708056
7)* weights = 250
4) drummond.time > 3.812222
8) drummond.time <= 3.976111; criterion = 1, statistic = 37.853
9)* weights = 386
8) drummond.time > 3.976111
10)* weights = 431
1) drummond.time > 4.124167
11) drummond.time <= 5.043611; criterion = 1, statistic = 4144.845
12) drummond.time <= 4.55; criterion = 1, statistic = 596.673
13) drummond.time <= 4.288333; criterion = 1, statistic = 81.996
14)* weights = 603
13) drummond.time > 4.288333
15) gender == {Male}; criterion = 0.996, statistic = 10.468
16)* weights = 993
15) gender == {Female}
17)* weights = 148
12) drummond.time > 4.55
18) drummond.time <= 4.862778; criterion = 1, statistic = 77.052
19) gender == {Male}; criterion = 1, statistic = 34.077
20) drummond.time <= 4.653611; criterion = 0.994, statistic = 9.583
21)* weights = 353
20) drummond.time > 4.653611
22)* weights = 762
19) gender == {Female}
23)* weights = 237
18) drummond.time > 4.862778
24) gender == {Male}; criterion = 1, statistic = 45.95
25)* weights = 756
24) gender == {Female}
26)* weights = 193
11) drummond.time > 5.043611
27) drummond.time <= 5.265833; criterion = 1, statistic = 544.833
28) gender == {Male}; criterion = 1, statistic = 54.559
29) drummond.time <= 5.174444; criterion = 1, statistic = 26.917
30)* weights = 545
29) drummond.time > 5.174444
31)* weights = 402
28) gender == {Female}
32)* weights = 327
27) drummond.time > 5.265833
33) drummond.time <= 5.409722; criterion = 1, statistic = 88.926
34) gender == {Male}; criterion = 1, statistic = 40.693
35)* weights = 675
34) gender == {Female}
36)* weights = 277
33) drummond.time > 5.409722
37)* weights = 1763
```

Apologies for the bit of information overload. A plot brings out the salient information though.

```
plot(tree.medal)
```

Again you will want to click on the image below to make it legible.

Again the most important feature is the time at the half-way mark. If we look at the terminal node on the left (Node 3), which is the only one which contains athletes who received either Gold or Wally Hayward medals, then we see that they all passed through Drummond in a time of less than 3.439 hours. Almost all of the Silver medal athletes were also in this group, along with a good number of Bill Rowan runners. There are still a few Silver medal athletes in Node 6, which corresponds to runners who got to Drummond in less than 3.708 hours.

Shifting across to the other end of the plot, consider the runners who reached half-way in more than 5.266 hours. These are further divided into a group whose half-way time was more than 5.41 hours: these almost all got Vic Clapham medals. Interestingly, the outcome for athletes whose time at Drummond was greater than 5.266 hours but less than 5.41 hours depends on gender: the ladies achieved a higher proportion of Bronze medals than the men.

I could pore over these plots for hours. The take home message from this is that your outcome at the Comrades Marathon is most strongly determined by your pace in the first half of the race. Gender and age don’t seem to be particularly important, although they do exert an influence on your first half pace. Ladies who get to half-way at between 05:00 and 05:30 seem to have hit the sweet spot though with close to 100% success rate. Nice!

]]>I am going to explore the hypothesis that runners with green numbers are more likely to bail.

Let’s start by looking at the proportions of runners who finish the race as opposed to those who do not finish (DNF) and those who enter but do not start (DNS). As can be seen from the plot below, the proportion of runners who finish the race seems to increase with the number of medals that the runners in question have. So, for example, of the runners with one medal, 68.6% finished while only 21.7% were DNF. For runners with ten medals, 87.1% finished and only 9.5% were DNF.
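These proportions come straight out of a status-by-medal-count contingency table. A sketch, assuming a data frame results with status and medal.count fields (the field names are an assumption):

```r
# Proportion (%) of Finished / DNF / DNS within each medal count.
status.by.medals = with(results, table(status, medal.count))
round(100 * prop.table(status.by.medals, margin = 2), 1)
```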

On the face of it, this seems to make sense: there is a natural selection effect. Runners who have more medals are probably a little more hard core and thus less likely to bail. Less experienced runners might be more likely to jump on the bus when the going gets really tough.

But, unfortunately, it is not quite that simple.

The analysis above has a serious problem: consider those runners with one medal. We are comparing the number of finishers (those that have just received that medal) to non-finishers (who already have a medal!). So we are not really comparing apples with apples! What we really should be working with are the number of finishers who had *i-1* medals before the race and the number of non-finishers who had *i* medals.

Compiling these data takes a little work, but nothing too taxing. Let’s consider an anonymous (but real) runner whose Comrades Marathon history looks like this:

```
year status medal.count
1985 Finished 1
1986 Finished 2
1987 Finished 3
1988 Finished 4
1989 Finished 5
1990 Finished 6
1991 Finished 7
1992 DNF 7
1993 DNF 7
1998 DNF 7
1999 Finished 8
2000 Finished 9
2001 DNF 9
2002 DNF 9
2003 DNF 9
2009 DNS 9
2010 DNS 9
2011 DNS 9
2012 DNS 9
2013 DNS 9
```

What we want is a table that shows how many times he ran with a given number of medals. So, for our anonymous hero, this would be:

```
0 1 2 3 4 5 6 7 8 9
Finished 1 1 1 1 1 1 1 1 1 0
DNF 0 0 0 0 0 0 0 3 0 3
DNS 0 0 0 0 0 0 0 0 0 5
```

Things went well for the first seven years. On the first year he had no medal (column 0) but he finished (so there is a 1 in the first row). The same applies for columns 1 to 6. Then on year 7 he finished, gaining his seventh medal (hence the 1 in the first row of column 6: he already had 6 medals when he ran this time!). However, for the next three years (when he already had 7 medals) he got a DNF (hence the 3 in the second row of column 7). On his fourth attempt he got medal number 8 (giving the 1 in the first row of column 7: he already had 7 medals when he ran this time!). And the following year he got medal number 9. Then he suffered a string of 3 DNFs (the 3 in the second row of column 9), followed by a series of 5 DNSs (the 5 in the third row of column 9). To illustrate the proportions, when he had 7 medals he got DNS 0% (0/4) of the time, DNF 75% (3/4) of the time and finished 25% (1/4) of the time.
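The transformation from a runner’s history to this table can be sketched in a few lines, assuming a data frame history with the columns shown above: the number of medals held before a run is medal.count - 1 for a finish (the medal was earned in that run) and medal.count otherwise.

```r
# Medals held *before* each run: finishers earned their latest medal
# during the run itself, so subtract one; DNF/DNS kept their count.
history = transform(
  history,
  medals.before = ifelse(status == "Finished", medal.count - 1, medal.count)
)
with(history, table(status, medals.before))
```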

Those are the data for a single athlete. To make a compelling case it is necessary to compile the same statistics for many, many runners. So I generated the analogous table for all athletes who ran the race between 1984 and 2013. A melted and abridged version of the resulting data look like this:

```
status medal.count number proportion
1 Finished 0 78051 0.83386039
2 DNF 0 11102 0.11860858
3 DNS 0 4449 0.04753104
4 Finished 1 52186 0.83512298
5 DNF 1 7336 0.11739666
6 DNS 1 2967 0.04748036
7 Finished 2 37478 0.83605863
8 DNF 2 5332 0.11894617
9 DNS 2 2017 0.04499520
10 Finished 3 28506 0.83472914
11 DNF 3 4072 0.11923865
12 DNS 3 1572 0.04603221
13 Finished 4 22814 0.83326637
14 DNF 4 3256 0.11892326
15 DNS 4 1309 0.04781037
16 Finished 5 18576 0.83630470
17 DNF 5 2585 0.11637853
18 DNS 5 1051 0.04731677
19 Finished 6 15538 0.83794424
20 DNF 6 2156 0.11627029
21 DNS 6 849 0.04578547
22 Finished 7 13300 0.84503463
23 DNF 7 1706 0.10839316
24 DNS 7 733 0.04657221
25 Finished 8 11809 0.86165633
26 DNF 8 1339 0.09770157
27 DNS 8 557 0.04064210
28 Finished 9 10852 0.81215387
29 DNF 9 1463 0.10948960
30 DNS 9 1047 0.07835653
31 Finished 10 7381 0.82047577
32 DNF 10 974 0.10827034
33 DNS 10 641 0.07125389
61 Finished 20 784 0.80575540
62 DNF 20 98 0.10071942
63 DNS 20 91 0.09352518
91 Finished 30 59 0.83098592
92 DNF 30 9 0.12676056
93 DNS 30 3 0.04225352
```

The important information here is the proportion of DNF entries for each medal count. We can see that 11.8% (0.11860858) of runners DNF on the first time that they ran. Similarly, of those runners who had already completed the race once (so they had one medal in the bag), 11.7% (0.11739666) did not finish. Of those who ran again after just achieving a green number, 10.8% (0.10827034) were DNF. It will be easier to make sense of all this in a plot.

Wow! Now that is interesting. Just to be sure that everything is clear about this plot: every column reflects the proportions of finishers, DNFs and DNSs who **already had** a given number of medals. There are a number of intriguing things about these data:

- all three proportions remain almost identical for runners who already had between 0 and 6 medals;
- the proportion of finishers then starts to ramp up for those with 7 and 8 medals (the DNS proportion remains unchanged, the DNFs decrease);
- there is a decrease in the proportion of finishers who already have 9 medals and a corresponding increase in the proportion of DNSs, while the DNFs remain unchanged;
- the proportion of finishers then increases slightly for those who already have 10 medals.

What conclusions can we draw from this? The second point seems to indicate a growing level of determination: these athletes are really close to their green number and they are less likely to sacrifice their medal. The third point is interesting too: the proportion of DNFs stays roughly the same but the DNS percentage grows from 4.1% for those with 8 medals to 7.8% for those with 9 medals. Why would this be? Well, I am really not sure and I would welcome suggestions. One possibility is that these runners are determined to have a good race so they might overtrain and end up injured or ill.

Are the differences in the proportion of DNFs statistically significant?
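Judging by the header of the output below, this was tested with something like the following call, where medal.table holds the Finished/DNF/DNS counts per prior medal count (row 2 being the DNF counts):

```r
# Test whether the DNF proportion is the same across all
# prior medal counts from 0 to 30 (columns 1 to 31).
prop.test(
  medal.table[2, 1:31],
  colSums(medal.table[, 1:31]),
  correct = FALSE
)
```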

```
31-sample test for equality of proportions without continuity correction
data: medal.table[2, 1:31] out of colSums(medal.table[, 1:31])
X-squared = 139.4798, df = 30, p-value = 4.744e-16
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4 prop 5 prop 6 prop 7 prop 8 prop 9 prop 10
0.11860858 0.11739666 0.11894617 0.11923865 0.11892326 0.11637853 0.11627029 0.10839316 0.09770157 0.10948960
prop 11 prop 12 prop 13 prop 14 prop 15 prop 16 prop 17 prop 18 prop 19 prop 20
0.10827034 0.10204696 0.10013936 0.10500000 0.11237335 0.10784314 0.11079137 0.10659026 0.09327846 0.11298606
prop 21 prop 22 prop 23 prop 24 prop 25 prop 26 prop 27 prop 28 prop 29 prop 30
0.10071942 0.10404624 0.09890110 0.09684685 0.14473684 0.10833333 0.14358974 0.07284768 0.14285714 0.16379310
prop 31
0.12676056
```

The minuscule p-value from the proportion test indicates that there definitely is a significant difference in the proportion of DNFs across the entire data set (for those with between 0 and 30 medals). But it does not tell us anything about which of the proportions are responsible for this difference. We can get some information about this from a pairwise proportion test. Here is the abridged output.

```
Pairwise comparisons using Pairwise comparison of proportions
data: medal.table[2, 1:31] out of colSums(medal.table[, 1:31])
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 1.000 - - - - - - - - - - - - - - -
2 1.000 1.000 - - - - - - - - - - - - - -
3 1.000 1.000 1.000 - - - - - - - - - - - - -
4 1.000 1.000 1.000 1.000 - - - - - - - - - - - -
5 1.000 1.000 1.000 1.000 1.000 - - - - - - - - - - -
6 1.000 1.000 1.000 1.000 1.000 1.000 - - - - - - - - - -
7 0.107 0.734 0.179 0.205 0.457 1.000 1.000 - - - - - - - - -
8 4.8e-10 2.5e-08 3.8e-09 9.0e-09 6.4e-08 1.8e-05 5.8e-05 1.000 - - - - - - - -
9 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.689 - - - - - - -
10 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 - - - - - -
11 0.025 0.099 0.031 0.032 0.056 0.579 0.780 1.000 1.000 1.000 1.000 - - - - -
12 0.038 0.117 0.042 0.042 0.066 0.506 0.651 1.000 1.000 1.000 1.000 1.000 - - - -
13 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 - - -
14 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 - -
15 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 -
16 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
```
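Based on the printed header, the pairwise comparison above appears to be a pairwise.prop.test() call (an assumption), which adjusts the p-values for multiple comparisons (Holm correction by default):

```r
pairwise.prop.test(
  medal.table[2, 1:31],
  colSums(medal.table[, 1:31])
)
```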

For between 0 and 6 medals there is no significant difference (p-value is roughly 1). The DNF proportion for those with 7 medals does start to differ from those with 4 medals or fewer, but the p-values are not significant. When we get to athletes who have 8 medals there is a significant difference in the proportion of DNFs all the way from those with 0 medals to those with 6 medals. However, the proportion of DNFs for those with 9 medals is not significantly different from any of the other categories. Finally, the DNF proportion for those athletes who already have 10 medals does not differ significantly from the athletes with any number of fewer medals.

So, no, it does not seem that runners with green numbers are more likely to bail (a conclusion that makes me personally very happy!). And good luck to the anonymous runner: I hope that you will be back in 2014 and that you will crack your green number!

Oh, and one last thing: as I mentioned before, the analysis above is based on the period 1984 to 2013. There are some serious issues with the data in the earlier years. Here is a breakdown of the number of runners in each of the categories across the years:

```
Finished DNF DNS
1984 7105 2 0
1985 8192 1907 1
1986 9654 1793 0
1987 8376 2458 0
1988 10363 1934 0
1989 10505 3065 2
1990 10272 1351 2
1991 12082 2936 1
1992 10695 2533 5
1993 11322 2270 2
1994 10274 2428 3
1995 10541 2990 1
1996 11269 2277 2
1997 11365 2467 3
1998 10496 2874 5
1999 11291 2835 3
2000 20030 4508 7
2001 11090 4270 1
2002 9027 2276 863
2003 11416 1065 892
2004 10123 1925 9
2005 11729 2163 7
2006 9846 1194 1025
2007 10052 1084 868
2008 8631 1745 813
2009 10008 1501 1441
2010 14339 2226 7000
2011 11058 2023 6506
2012 11889 1739 5916
2013 10278 3643 5986
```

Certainly something is deeply wrong with the 1984 data! In the early years it does not make any sense to discriminate between DNF and DNS since no independent records were kept: we simply know whether or not an athlete finished. The introduction of the ChampionChip timing devices improved the quality of the data dramatically. These chips have been used by all Comrades Marathon runners since 1997, although there is a delayed effect on the quality of the data.

Despite these issues, the conclusions of the analysis above remain essentially unchanged if you simply lump the DNF and DNS data together (since we cannot always draw a meaningful distinction between them!).

]]>There is clear evidence of a Green Number Effect: many people hang on for ten medals and then pack it in. There is also weaker evidence of a Double Green Number Effect. But evidently there are far fewer people with that kind of commitment or level of craziness.

What about the influence of the Back-2-Back medals introduced in 2005? If you look carefully at the plot above, you can see some evidence. However, a simple histogram of medal counts makes the effect irrefutable.
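A histogram like this is a one-liner in `{ggplot2}`. This is a sketch only: it assumes a data frame `athletes` with one row per athlete and a `medal.count` column, both of which are hypothetical names.

```r
# Histogram of medals per athlete, one bar per medal count.
# `athletes` and `medal.count` are assumed names, not from the original code.
library(ggplot2)

ggplot(athletes, aes(x = medal.count)) +
  geom_histogram(binwidth = 1, colour = "white") +
  labs(x = "Number of medals", y = "Number of athletes")
```

With `binwidth = 1` each medal count gets its own bar, so a spike at 10 medals (the Green Number Effect) stands out immediately.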

Thanks for the idea, Tilda.

]]>Since I have been delving into the Comrades Marathon data, this got me thinking about the typical age distribution of athletes taking part. The plot below indicates the ages of athletes who finished the race, going all the way back to 1984. You can clearly spot the two years when Wally Hayward ran (1988 and 1989). My data indicates that he was only 79 on the day of the 1989 Comrades Marathon, but I am not going to quibble over a year and I am more than happy to accept that he was 80!

It is interesting to see that there is a consistent increase in the ages of both male and female finishers, as reflected by both the median and interquartile range (IQR).

The detailed distribution of ages across the period 1984 to 2013 is shown below. The median age of finishers is 37 years. Although in recent years the minimum age has been set at 20, in earlier times younger athletes were allowed to run the race. There are a significant number of runners in their 60s, but far fewer in their 70s. Only 163 runners older than 70 have finished the race since 1984.

What about the effect of age on individual finish times? Men appear to perform best between 20 and 30 years of age, with a gradual but consistent decline in performance with advancing years. Things are not quite as clear-cut with the female runners, where those in the 30 to 40 age bracket appear to perform fractionally better than those between the ages of 20 and 30.
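A comparison like this can be sketched with boxplots of finish time per age bracket, split by gender. The column names (`age`, `finish.time`, `gender`) and the data frame `results` are assumptions, not the original code.

```r
# Finish time by decade of age, one panel per gender.
# `results`, `age`, `finish.time` and `gender` are hypothetical names.
library(dplyr)
library(ggplot2)

results %>%
  mutate(age.bracket = cut(age, breaks = seq(10, 80, 10), right = FALSE)) %>%
  ggplot(aes(x = age.bracket, y = finish.time)) +
  geom_boxplot() +
  facet_wrap(~ gender) +
  labs(x = "Age bracket", y = "Finish time [hours]")
```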

Naturally these race times translate into medal allocations. The mosaic plot below shows both the distribution of runners across the various age categories and the medal allocations within those categories. The majority of runners are between 30 and 40 years of age and the most commonly awarded medal is the Bronze.
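A mosaic plot of this kind comes straight out of base R: cross-tabulate age category against medal and hand the table to `mosaicplot()`. The column names below are hypothetical.

```r
# Mosaic plot of medal allocation within age categories.
# `results`, `age.category` and `medal` are assumed names.
mosaicplot(
  table(results$age.category, results$medal),
  xlab = "Age category", ylab = "Medal",
  main = "Medal allocation by age category"
)
```

The width of each column reflects the number of runners in that age category, while the vertical divisions show the medal mix within it.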

Finally, a breakdown of the gross number of medals awarded between 1984 and 2013. This includes data for the last 30 years and so is an extension of my previous analysis. Here it must be borne in mind that the Bill Rowan medal was only introduced in 2000, the Vic Clapham medal in 2003 and the Wally Hayward medal in 2007.

]]>- Gold medals to the first ten finishers in the men’s race and the ladies' race;
- Wally Hayward medals to finishers in under 06:00;
- Silver medals to finishers under 07:30;
- Bill Rowan medals to finishers under 09:00;
- Bronze medals to finishers under 11:00; and finally
- Vic Clapham medals to finishers before the final gun at 12:00.
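Setting aside the Gold medals, which depend on finishing position rather than time, these cut-offs translate directly into code. A sketch using `dplyr::case_when()`, where `results` and `finish.time` (in hours) are hypothetical names:

```r
# Map finish time (in hours) to a medal. Gold medals are excluded here
# because they are positional, not time-based.
# `results` and `finish.time` are assumed names, not from the original code.
library(dplyr)

results <- results %>%
  mutate(
    medal = case_when(
      finish.time < 6.0  ~ "Wally Hayward",
      finish.time < 7.5  ~ "Silver",
      finish.time < 9.0  ~ "Bill Rowan",
      finish.time < 11.0 ~ "Bronze",
      finish.time < 12.0 ~ "Vic Clapham",
      TRUE               ~ NA_character_
    )
  )
```

Because `case_when()` evaluates its conditions in order, each runner falls through to the first cut-off they beat, which mirrors how the medals are actually awarded.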

This will be followed in a couple of days by an analysis of the relationship between running a negative split and finishing time.

]]>