![Historical Weather Data](https://datawookie.dev/blog/2022/08/historical-weather-data/historical-weather-data_hu_257618d59962975a.webp)
I’m building a model which requires historical weather data from a selection of locations in South Africa. In this post I demonstrate the process of acquiring the data and doing some simple processing.
I need data for three locations: Brookes and Goje (in KwaZulu-Natal) and Hlangalane (in the Eastern Cape).
```
# A tibble: 3 × 4
  name       region          lat   lon
  <chr>      <chr>         <dbl> <dbl>
1 Brookes    KwaZulu-Natal -29.6  29.8
2 Goje       KwaZulu-Natal -28.3  31.2
3 Hlangalane Eastern Cape  -31.0  28.6
```
Here are those locations on a map. They are sufficiently far apart that we would expect them to have different weather histories.
![Map showing the location of Brookes, Goje and Hlangalane.](https://datawookie.dev/blog/2022/08/historical-weather-data/index_files/figure-html/location-map-1.png)
Data Acquisition
I’m getting the data using Weather API. The business plan gives me access to data going back to the beginning of 2010. I like to mix things up, so I’ll hit the API from Python and then use R to do the processing.
The API key is stored in an environment variable.
```python
import os

API_KEY = os.getenv("WEATHER_API_KEY")
```
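Note that `os.getenv()` returns `None` when the variable is missing, which would only show up later as failed API requests. A minimal sketch of a stricter lookup follows; the helper name `require_env` is hypothetical and not part of the original workflow.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Covers both an unset variable and an empty string.
        raise RuntimeError(f"Environment variable {name} is not set.")
    return value
```

With this in place the key could be loaded as `API_KEY = require_env("WEATHER_API_KEY")`, so a missing key stops the script before any requests are made.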
Define the date range.
```python
import pandas as pd

DATE_MIN = "2020-08-01"
DATE_MAX = "2022-08-01"

DATES = pd.date_range(start=DATE_MIN, end=DATE_MAX)
```
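It's worth estimating how long the download will take before starting. The sketch below counts the days in the range (both end points included, as with `pd.date_range()`) and multiplies by the three locations and the five-second pause used in the download function later in this post. It uses only the standard library, and the constant names `N_LOCATIONS` and `PAUSE_SECONDS` are mine.

```python
from datetime import date, timedelta

DATE_MIN = date(2020, 8, 1)
DATE_MAX = date(2022, 8, 1)
N_LOCATIONS = 3      # Brookes, Goje and Hlangalane
PAUSE_SECONDS = 5    # delay between requests in the download loop

# Both end points are included, matching pd.date_range(start, end).
n_days = (DATE_MAX - DATE_MIN).days + 1
n_requests = n_days * N_LOCATIONS
pause_total = timedelta(seconds=n_requests * PAUSE_SECONDS)

print(f"{n_days} days, {n_requests} requests, at least {pause_total} spent pausing")
```

That works out to 731 days and 2193 requests, so the pauses alone add up to a little over three hours.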
Create a function for retrieving the data and writing it to a file. There will be one JSON file per location and date.
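```python
import re
import time

import requests

def weather_history(name, region):
    location = name + ", " + region
    # Build a file-friendly slug, for example "goje-kwazulu-natal".
    slug = re.sub("[, ]+", "-", location.lower())
    for date in DATES:
        date = date.date()
        URL = f"http://api.weatherapi.com/v1/history.json?key={API_KEY}&q={location}&dt={date}"
        response = requests.get(URL)
        # Fail loudly rather than saving an error response to disk.
        response.raise_for_status()
        with open(f"{date}-{slug}.json", "wt") as fid:
            fid.write(response.text)
        # Pause between requests to go easy on the API.
        time.sleep(5)
```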
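With a few thousand requests the download is likely to be interrupted at some point, so it helps if a rerun skips files that already exist. The following is a sketch of one way to do that, not the approach used for this post. The helper `download_once` and its `fetch` argument are hypothetical: `fetch` stands in for whatever call returns the response text.

```python
from pathlib import Path
from typing import Callable


def download_once(path: Path, fetch: Callable[[], str]) -> bool:
    """Write fetch() to path unless a non-empty file is already there.

    Returns True if a download happened and False if it was skipped.
    """
    if path.exists() and path.stat().st_size > 0:
        return False
    text = fetch()
    # Write to a temporary name first so that an interrupted run
    # never leaves a half-written JSON file behind.
    tmp = path.with_name(path.name + ".part")
    tmp.write_text(text)
    tmp.replace(path)
    return True
```

Inside the loop, `fetch` could be a small function that calls `requests.get()` and returns `response.text`, and the pause would only be needed when `download_once()` returns `True`.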
Now retrieve the data.
```python
weather_history("Goje", "KwaZulu-Natal")
```
Repeat for the other locations.
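Rather than calling the function by hand three times, the locations can be kept in a list. The sketch below repeats the slug logic from the download function in a hypothetical helper, `output_path`, and prints the file name that each location would get for a single date.

```python
import re

# The three locations listed at the top of the post.
LOCATIONS = [
    ("Brookes", "KwaZulu-Natal"),
    ("Goje", "KwaZulu-Natal"),
    ("Hlangalane", "Eastern Cape"),
]


def output_path(name, region, date):
    """Build the file name used for one location and date."""
    location = name + ", " + region
    # Runs of commas and spaces collapse into a single hyphen.
    slug = re.sub("[, ]+", "-", location.lower())
    return f"{date}-{slug}.json"


for name, region in LOCATIONS:
    print(output_path(name, region, "2021-08-01"))
```

Replacing the `print()` call with `weather_history(name, region)` would download the data for all three locations in one run.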
Data Processing
We’ll need a function for loading the JSON data into R. The data are nested, so we’ll include some code to unwrap and rectangle the data.
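```r
library(jsonlite)
library(dplyr)
library(purrr)
library(tibble)

prepare_weather <- function(path) {
  weather <- read_json(path)

  weather$location %>%
    as_tibble() %>%
    # Drop time fields that relate to data acquisition (download) time.
    select(-starts_with("localtime")) %>%
    mutate(
      # Stack the hourly records from every day into a single data frame.
      hours = weather$forecast$forecastday %>%
        map_dfr(function(day) {
          map_dfr(day$hour, function(hour) {
            # The condition field is itself a nested list, so drop it.
            hour$condition <- NULL
            hour
          })
        }) %>%
        # Drop epoch timestamps, imperial units and forecast-style fields.
        select(-ends_with("epoch")) %>%
        select(-matches("_(mph|f|in|miles)$")) %>%
        select(-matches("^(will_it|chance_of)_")) %>%
        # Wrap in a list so that the hourly data become a list column.
        list()
    )
}
```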
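For anyone who would rather stay in Python, the same flattening can be sketched with plain dictionaries. This is a rough equivalent of the R function, not a replacement for it. The nesting (`forecast`, `forecastday`, `hour`, `condition`) follows the R code above, but the sample record is made up for illustration and its field names are assumptions inferred from the column patterns being dropped.

```python
import re

# Columns dropped by the R function: epoch timestamps, imperial units
# and the forecast-style "will_it_*" / "chance_of_*" fields.
DROP = re.compile(r"epoch$|_(mph|f|in|miles)$|^(will_it|chance_of)_")


def flatten_hours(weather: dict) -> list:
    """Return one flat dict per hourly record in the parsed JSON."""
    rows = []
    for day in weather["forecast"]["forecastday"]:
        for hour in day["hour"]:
            rows.append({
                key: value
                for key, value in hour.items()
                if key != "condition" and not DROP.search(key)
            })
    return rows


# Made-up sample with the assumed nesting (values are illustrative only).
sample = {
    "location": {"name": "Goje", "region": "KwaZulu-Natal"},
    "forecast": {"forecastday": [{"hour": [{
        "time_epoch": 1627768800,
        "time": "2021-08-01 00:00",
        "temp_c": 16.6,
        "temp_f": 61.9,
        "condition": {"text": "Clear"},
        "wind_kph": 17.3,
        "wind_mph": 10.7,
        "will_it_rain": 0,
        "chance_of_rain": 0,
    }]}]},
}

print(flatten_hours(sample))
```

Only `time`, `temp_c` and `wind_kph` survive for the sample record. A real file would be loaded with `json.load()` before being passed to `flatten_hours()`.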
Let’s read the data for Goje on 1 August 2021.
```r
(goje <- prepare_weather("2021-08-01-goje-kwazulu-natal.json"))
```

```
# A tibble: 1 × 7
  name  region        country        lat   lon tz_id               hours
  <chr> <chr>         <chr>        <dbl> <dbl> <chr>               <list>
1 Goje  KwaZulu-Natal South Africa -28.3  31.2 Africa/Johannesburg <tibble [24 × 16]>
```
The `hours` list column contains the hourly weather data, which include the following fields:
- temperature
- temperature feels like
- wind chill
- heat index
- dew point
- wind speed and direction
- wind gust speed
- pressure
- precipitation
- humidity
- cloud cover and
- visibility.
Let’s take a quick look. We’ll only pull out a few columns that are relevant to the model.
```r
library(tidyr)

goje %>%
  unnest(cols = hours) %>%
  # Use appropriate time zone when converting to date/time type.
  mutate(time = as.POSIXct(time, format = "%Y-%m-%d %H:%M", tz = unique(tz_id))) %>%
  select(time, temp_c, wind_kph, wind_dir, pressure_mb, precip_mm, humidity, cloud)
```
```
# A tibble: 24 × 8
   time                temp_c wind_kph wind_dir pressure_mb precip_mm humidity cloud
   <dttm>               <dbl>    <dbl> <chr>          <dbl>     <dbl>    <int> <int>
 1 2021-08-01 00:00:00   16.6     17.3 NNE             1026         0       78     0
 2 2021-08-01 01:00:00   16.2     16.7 NNE             1025         0       76     0
 3 2021-08-01 02:00:00   15.9     16.1 NNE             1025         0       74     0
 4 2021-08-01 03:00:00   15.5     15.5 N               1024         0       71     0
 5 2021-08-01 04:00:00   15.6       15 N               1024         0       68     1
 6 2021-08-01 05:00:00   15.6     14.5 N               1024         0       64     2
 7 2021-08-01 06:00:00   15.7       14 N               1023         0       61     2
 8 2021-08-01 07:00:00   16.9     13.3 N               1023         0       55     5
 9 2021-08-01 08:00:00   18.2     12.6 N               1023         0       50     7
10 2021-08-01 09:00:00   19.4     11.9 N               1023         0       45     9
11 2021-08-01 10:00:00   21.4     11.5 NNE             1023         0       42     9
12 2021-08-01 11:00:00   23.5     11.2 NNE             1022         0       38     9
13 2021-08-01 12:00:00   25.5     10.8 NE              1021         0       35     8
14 2021-08-01 13:00:00   25.6     11.8 NE              1020         0       38     6
15 2021-08-01 14:00:00   25.6     12.7 NE              1019         0       40     3
16 2021-08-01 15:00:00   25.7     13.7 ENE             1018         0       43     0
17 2021-08-01 16:00:00   24.5     13.8 ENE             1018         0       48     0
18 2021-08-01 17:00:00   23.3     13.9 NE              1018         0       53     0
19 2021-08-01 18:00:00   22.1       14 NE              1018         0       58     0
20 2021-08-01 19:00:00   21.1       12 ENE             1019         0       60     0
21 2021-08-01 20:00:00   20.1       10 E               1019         0       61     0
22 2021-08-01 21:00:00   19.1      7.9 ESE             1020         0       63     0
23 2021-08-01 22:00:00   19.2      8.8 SSE             1021         0       64     0
24 2021-08-01 23:00:00   19.2      9.6 SSW             1021         0       65     0
```
We’ll wrap up with a few plots of daily aggregated data. First the total daily precipitation.
![Daily precipitation for two years at Brookes, Goje and Hlangalane.](https://datawookie.dev/blog/2022/08/historical-weather-data/index_files/figure-html/daily-precipitation-1.png)
It looks like a wet year followed by a dry year. Finally, the daily temperature: the solid line is the daily average and the ribbon spans the daily minimum and maximum.
![Daily temperature (minimum, maximum and average) for two years at Brookes, Goje and Hlangalane.](https://datawookie.dev/blog/2022/08/historical-weather-data/index_files/figure-html/daily-temperature-1.png)
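The code behind these plots isn't shown, but the daily aggregation itself is simple: group the hourly records by calendar date, then sum the precipitation and take the minimum, maximum and mean of the temperature. Here is a standard-library sketch of that idea using a handful of the Goje hours from the table above; the function name `daily_summary` is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# (time, temp_c, precip_mm) for a few of the Goje hours shown earlier.
hours = [
    ("2021-08-01 00:00", 16.6, 0.0),
    ("2021-08-01 03:00", 15.5, 0.0),
    ("2021-08-01 12:00", 25.5, 0.0),
    ("2021-08-01 15:00", 25.7, 0.0),
    ("2021-08-01 23:00", 19.2, 0.0),
]


def daily_summary(hours):
    """Group hourly records by calendar date and aggregate each group."""
    by_day = defaultdict(list)
    for time, temp_c, precip_mm in hours:
        # The first ten characters of the time stamp are the date.
        by_day[time[:10]].append((temp_c, precip_mm))
    return {
        day: {
            "precip_mm": sum(p for _, p in records),
            "temp_min": min(t for t, _ in records),
            "temp_max": max(t for t, _ in records),
            "temp_avg": mean(t for t, _ in records),
        }
        for day, records in by_day.items()
    }


print(daily_summary(hours))
```

Grouping on the local date string keeps each day aligned with the location's own time zone, which is what matters for a daily minimum and maximum.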
These data are going to be particularly useful for our models.