Historical Weather Data

I’m building a model which requires historical weather data from a selection of locations in South Africa. In this post I demonstrate the process of acquiring the data and doing some simple processing.

I need data for three locations: Brookes and Goje (in KwaZulu-Natal) and Hlangalane (in the Eastern Cape).

# A tibble: 3 × 4
  name       region          lat   lon
  <chr>      <chr>         <dbl> <dbl>
1 Brookes    KwaZulu-Natal -29.6  29.8
2 Goje       KwaZulu-Natal -28.3  31.2
3 Hlangalane Eastern Cape  -31.0  28.6

Here are those locations on a map. They are sufficiently far apart that we would expect them to have different weather histories.

Map showing the location of Brookes, Goje and Hlangalane.

Data Acquisition

I’m getting the data using Weather API. The business plan gives me access to data going back to the beginning of 2010. I like to mix things up, so I’ll hit the API from Python and then use R to do the processing.

The API key is stored in an environment variable.

import os

API_KEY = os.getenv("WEATHER_API_KEY")
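Note that os.getenv returns None when the variable is not set, which would only show up later as rejected requests. A minimal sketch of a guard follows; the helper name require_env is hypothetical and not part of the original code.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set.")
    return value

# Usage (commented out so the sketch runs without a real key):
# API_KEY = require_env("WEATHER_API_KEY")
```

Failing early like this is a design choice: it avoids writing a long series of error responses to disk before the problem is noticed.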

Define the date range.

import pandas as pd

DATE_MIN = "2020-08-01"
DATE_MAX = "2022-08-01"

DATES = pd.date_range(start=DATE_MIN, end=DATE_MAX)
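Before downloading, it is worth estimating how many requests this range implies. The sketch below uses only the standard library and assumes three locations and a 5 second pause between requests (the pause used in the download loop in this post); the constant names are illustrative.

```python
from datetime import date

DATE_MIN = date(2020, 8, 1)
DATE_MAX = date(2022, 8, 1)
PAUSE_SECONDS = 5   # assumed pause between consecutive requests
N_LOCATIONS = 3     # Brookes, Goje and Hlangalane

# Both end points are included, matching the default behaviour of pd.date_range.
n_days = (DATE_MAX - DATE_MIN).days + 1
n_requests = n_days * N_LOCATIONS
min_hours = n_requests * PAUSE_SECONDS / 3600

print(n_days, n_requests, round(min_hours, 1))
# prints: 731 2193 3.0
```

So the pauses alone add up to roughly three hours across all three locations, not counting the time taken by the requests themselves.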

Create a function for retrieving the data and writing it to a file. There will be one JSON file per location and date.

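import re
import time

import requests

def weather_history(name, region):
  location = name+", "+region
  # File name slug, for example "goje-kwazulu-natal".
  slug = re.sub("[, ]+", "-", location.lower())
  
  for date in DATES:
    date = date.date()
    
    url = f"http://api.weatherapi.com/v1/history.json?key={API_KEY}&q={location}&dt={date}"
    
    response = requests.get(url)
    # Stop on an HTTP error rather than saving an error response as data.
    response.raise_for_status()
    
    with open(f"{date}-{slug}.json", "wt") as fid:
      fid.write(response.text)
    
    # Pause between requests to go easy on the API.
    time.sleep(5)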
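Because the download runs for a long time, it can be useful to separate building the URL and file name from the request itself, and to skip files that already exist so that an interrupted run can be resumed. The sketch below is one way to do that, not the code used for this post; build_request and needs_download are hypothetical helpers, and "DEMO_KEY" is a placeholder.

```python
import re
from pathlib import Path
from urllib.parse import urlencode

BASE_URL = "http://api.weatherapi.com/v1/history.json"

def build_request(name, region, date, api_key):
    """Return the (url, path) pair for one location and date."""
    location = f"{name}, {region}"
    slug = re.sub("[, ]+", "-", location.lower())
    # urlencode escapes the comma and space in the location string.
    query = urlencode({"key": api_key, "q": location, "dt": date})
    return f"{BASE_URL}?{query}", Path(f"{date}-{slug}.json")

def needs_download(path):
    """True if the file is missing or empty, so the request should be made."""
    return not path.exists() or path.stat().st_size == 0

url, path = build_request("Goje", "KwaZulu-Natal", "2021-08-01", "DEMO_KEY")
print(url)
print(path)
```

The resulting file name, 2021-08-01-goje-kwazulu-natal.json, follows the same pattern as the files written by weather_history.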

Now retrieve the data.

weather_history("Goje", "KwaZulu-Natal")

Repeat for the other locations.
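Rather than calling the function by hand for each location, the calls could be driven from a list. This is a sketch under the assumption that one failing location should not stop the others; fetch_all is a hypothetical helper that accepts any function taking a name and a region.

```python
LOCATIONS = [
    ("Brookes", "KwaZulu-Natal"),
    ("Goje", "KwaZulu-Natal"),
    ("Hlangalane", "Eastern Cape"),
]

def fetch_all(locations, fetch):
    """Call fetch(name, region) for every location and return those that failed."""
    failed = []
    for name, region in locations:
        try:
            fetch(name, region)
        except Exception as error:
            print(f"Failed for {name}, {region}: {error}")
            failed.append((name, region))
    return failed

# With the download function from this post:
# failed = fetch_all(LOCATIONS, weather_history)
```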

Data Processing

We’ll need a function for loading the JSON data into R. The data are nested, so we’ll include some code to unwrap and rectangle the data.

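library(jsonlite)
# dplyr provides %>%, as_tibble(), select() and mutate(); purrr provides map_dfr().
library(dplyr)
library(purrr)
# tidyr provides unnest(), which is used further down.
library(tidyr)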

prepare_weather <- function(path) {
  weather <- read_json(path)
  
  weather$location %>%
    as_tibble() %>%
    # Drop time fields that relate to data acquisition (download) time.
    select(-starts_with("localtime")) %>%
    mutate(
      hours = weather$forecast$forecastday %>%
        map_dfr(function(day) {
          map_dfr(day$hour, function(hour) {
            hour$condition <- NULL
            hour
          })
        }) %>%
        select(-ends_with("epoch")) %>%
        select(-matches("_(mph|f|in|miles)$")) %>%
        select(-matches("^(will_it|chance_of)_")) %>%
        list()
    )
}

Let’s read the data for Goje on 1 August 2021.

(goje <- prepare_weather("2021-08-01-goje-kwazulu-natal.json"))
# A tibble: 1 × 7
  name  region        country        lat   lon tz_id               hours   
  <chr> <chr>         <chr>        <dbl> <dbl> <chr>               <list>  
1 Goje  KwaZulu-Natal South Africa -28.3  31.2 Africa/Johannesburg <tibble>
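For readers who would rather do this step in Python, the same unwrapping could be sketched as follows. The nesting (a location object, plus forecast.forecastday[].hour[] with a nested condition per hour) is inferred from the R function above. The sample payload is a hand-made, heavily trimmed stand-in with placeholder values for the dropped fields, not a real API response.

```python
import json
import re

# Illustrative payload only: two hourly records using values shown in this post.
SAMPLE = json.loads("""
{
  "location": {"name": "Goje", "region": "KwaZulu-Natal",
               "tz_id": "Africa/Johannesburg"},
  "forecast": {"forecastday": [{"hour": [
    {"time": "2021-08-01 00:00", "time_epoch": 0, "temp_c": 16.6,
     "wind_kph": 17.3, "condition": {"text": "placeholder"}},
    {"time": "2021-08-01 01:00", "time_epoch": 0, "temp_c": 16.2,
     "wind_kph": 16.7, "condition": {"text": "placeholder"}}
  ]}]}
}
""")

# Same column filters as the R code: epochs, imperial units, forecast flags.
DROP = re.compile(r"(epoch$)|(_(mph|f|in|miles)$)|(^(will_it|chance_of)_)")

def flatten_hours(weather):
    """Return one flat dict per hourly record, without the nested condition."""
    rows = []
    for day in weather["forecast"]["forecastday"]:
        for hour in day["hour"]:
            rows.append({
                key: value for key, value in hour.items()
                if key != "condition" and not DROP.search(key)
            })
    return rows

for row in flatten_hours(SAMPLE):
    print(row)
```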

The hours list column contains the hourly weather data, which include the following fields:

  • temperature
  • temperature feels like
  • wind chill
  • heat index
  • dew point
  • wind speed and direction
  • wind gust speed
  • pressure
  • precipitation
  • humidity
  • cloud cover and
  • visibility.

Let’s take a quick look. We’ll only pull out a few columns that are relevant to the model.

goje %>%
  unnest(cols = hours) %>%
  # Use appropriate time zone when converting to date/time type.
  mutate(time = as.POSIXct(time, format = "%Y-%m-%d %H:%M", tz = unique(tz_id))) %>%
  select(time, temp_c, wind_kph, wind_dir, pressure_mb, precip_mm, humidity, cloud)
# A tibble: 24 × 8
   time                temp_c wind_kph wind_dir pressure_mb precip_mm humidity
   <dttm>               <dbl>    <dbl> <chr>          <dbl>     <dbl>    <int>
 1 2021-08-01 00:00:00   16.6     17.3 NNE             1026         0       78
 2 2021-08-01 01:00:00   16.2     16.7 NNE             1025         0       76
 3 2021-08-01 02:00:00   15.9     16.1 NNE             1025         0       74
 4 2021-08-01 03:00:00   15.5     15.5 N               1024         0       71
 5 2021-08-01 04:00:00   15.6     15   N               1024         0       68
 6 2021-08-01 05:00:00   15.6     14.5 N               1024         0       64
 7 2021-08-01 06:00:00   15.7     14   N               1023         0       61
 8 2021-08-01 07:00:00   16.9     13.3 N               1023         0       55
 9 2021-08-01 08:00:00   18.2     12.6 N               1023         0       50
10 2021-08-01 09:00:00   19.4     11.9 N               1023         0       45
11 2021-08-01 10:00:00   21.4     11.5 NNE             1023         0       42
12 2021-08-01 11:00:00   23.5     11.2 NNE             1022         0       38
13 2021-08-01 12:00:00   25.5     10.8 NE              1021         0       35
14 2021-08-01 13:00:00   25.6     11.8 NE              1020         0       38
15 2021-08-01 14:00:00   25.6     12.7 NE              1019         0       40
16 2021-08-01 15:00:00   25.7     13.7 ENE             1018         0       43
17 2021-08-01 16:00:00   24.5     13.8 ENE             1018         0       48
18 2021-08-01 17:00:00   23.3     13.9 NE              1018         0       53
19 2021-08-01 18:00:00   22.1     14   NE              1018         0       58
20 2021-08-01 19:00:00   21.1     12   ENE             1019         0       60
21 2021-08-01 20:00:00   20.1     10   E               1019         0       61
22 2021-08-01 21:00:00   19.1      7.9 ESE             1020         0       63
23 2021-08-01 22:00:00   19.2      8.8 SSE             1021         0       64
24 2021-08-01 23:00:00   19.2      9.6 SSW             1021         0       65
# ℹ 1 more variable: cloud <int>

We’ll wrap up with a few plots of daily aggregated data. First the total daily precipitation.

Daily precipitation for two years at Brookes, Goje and Hlangalane.

Looks like a wet year followed by a dry year. Finally, the daily temperature: the solid line is the daily average and the ribbon spans the daily minimum to maximum.

Daily temperature (minimum, maximum and average) for two years at Brookes, Goje and Hlangalane.
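The daily aggregates behind these plots are simple to compute from the hourly records. The sketch below shows the idea in plain Python; it is not the code used to produce the plots. The input is a small subset of the hourly Goje rows shown earlier, and daily_summary is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

# (time, temp_c, precip_mm): four hourly rows copied from the Goje table.
HOURS = [
    ("2021-08-01 00:00", 16.6, 0.0),
    ("2021-08-01 06:00", 15.7, 0.0),
    ("2021-08-01 12:00", 25.5, 0.0),
    ("2021-08-01 18:00", 22.1, 0.0),
]

def daily_summary(hours):
    """Group hourly rows by date; return total precipitation and temperature stats."""
    by_day = defaultdict(list)
    for time, temp_c, precip_mm in hours:
        # The first ten characters of the timestamp are the date.
        by_day[time[:10]].append((temp_c, precip_mm))
    summary = {}
    for day, rows in sorted(by_day.items()):
        temps = [temp for temp, _ in rows]
        summary[day] = {
            "precip_mm": sum(precip for _, precip in rows),
            "temp_min": min(temps),
            "temp_max": max(temps),
            "temp_mean": mean(temps),
        }
    return summary

print(daily_summary(HOURS))
```

With the full data, the precipitation total feeds the first plot, while the minimum, maximum and mean temperature give the ribbon and line in the second.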

These data are going to be particularly useful for our models.