For a side project I needed to scrape data for the NYSE Composite Index going back as far as possible.
The most convenient data source is on Yahoo! Finance.
What is the NYSE Composite Index?
The NYSE Composite Index includes all common stocks listed on the New York Stock Exchange. The contribution of each stock to the index (its weight) is determined by its free-float market capitalisation (the value of shares available for public trading).
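To make the weighting concrete, here's a toy calculation in R with made-up free-float market capitalisations (illustrative only, not real data):

caps <- c(A = 50e9, B = 30e9, C = 20e9)  # hypothetical free-float market caps
weights <- caps / sum(caps)              # each stock's weight in the index
weights

  A   B   C 
0.5 0.3 0.2 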
Scraping
I considered doing the scrape with either R or Python. After looking at the structure of the site and the tabulated data, I decided to go with R.
Libraries
Start by loading a few indispensable libraries.
library(httr2)
library(rvest)
library(dplyr)
library(readr)
library(janitor)
What purpose do these serve?
- httr2 — HTTP requests (retrieving the web page HTML content);
- rvest — parsing HTML (extracting content from the HTML);
- dplyr — data manipulation;
- readr — for the parse_number() function and easy CSV writing; and
- janitor — cleaning up column names.
Query Parameters
There are two query parameters, period1 and period2, that determine the date range of the data returned. Both should be set to integral values, specifically the number of seconds since the UNIX epoch. We could hard-code these, but I prefer to generate them dynamically.
Here’s how we do that for the start date (31 December 1965):
as.integer(as.POSIXct("1965-12-31"))
[1] -126316800
The value is negative because that date is before the UNIX epoch (1 January 1970).
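As a quick check, we can convert that integer back into a date and time (pinning the time zone to UTC here; the exact integer you get above will depend on your local time zone):

as.POSIXct(-126316800, origin = "1970-01-01", tz = "UTC")

[1] "1965-12-31 UTC"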
Create a list with the appropriate parameter values.
params <- list(
period1 = as.integer(as.POSIXct("1965-12-31")),
period2 = as.integer(Sys.time())
)
Headers & Cookies
There’s just one mandatory header: the User-Agent string.
headers <- list(
"User-Agent" = "Mozilla/5.0 (X11; Linux x86_64)"
)
There are also a number of cookies, only one of which appears to be important. The content of that cookie appears to evolve over time, so it will require refreshing. Get the cookies and put them into a cookies list.
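Something along these lines works, with a placeholder name and value (the real cookie name and content should be copied from your browser's developer tools, and will need refreshing from time to time):

cookies <- list(
  # Placeholder only: substitute the actual cookie name and value
  # copied from a browser session on finance.yahoo.com.
  "example_cookie" = "value-copied-from-browser"
)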
Request
Now we have all of the prerequisites in place. I’ll store the URL in a variable rather than hard-coding it into the request.
URL <- "https://finance.yahoo.com/quote/^NYA/history/"
Now build the request using {httr2}. The modular approach here is really nice. Unpack the lists as function arguments using the !!! (bang-bang-bang) syntax, which, for the benefit of Pythonistas, is equivalent to using ** on a dictionary.
response <- request(URL) |>
req_url_query(!!!params) |>
req_cookies_set(!!!cookies) |>
req_headers(!!!headers) |>
req_perform()
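It’s worth confirming that the request succeeded before parsing anything; a status of 200 means all is well.

resp_status(response)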
Parse & Wrangle
Next we take the response object and extract the HTML body. Then a quick bit of web scraping: html_element() and html_table() locate the <table> and transform its content into a data frame.
The resulting column names are not to my liking. A combination of clean_names() followed by some specific renaming does the trick though.
Finally a mutation to clean up the data:
- convert the dates into standard format (there is no world where month, day, year makes sense!);
- handle missing volume values;
- use the parse_number() function to parse the numbers (it handles the commas); and
- scale volume down by a factor of 1000 (because all of the raw values end in “000”).
table <- response |>
resp_body_html() |>
html_element("table") |>
html_table() |>
clean_names() |>
rename_with(~ gsub("_adjusted.*", "", .x)) |>
rename(
close = close_close_price
) |>
mutate(
date = as.Date(strptime(date, "%b %d, %Y")),
volume = ifelse(volume == "-", NA, volume),
across(.cols = -date, .fns = ~ parse_number(.)),
    across(.cols = open:adj_close, .fns = ~ sprintf("%.2f", .)),
volume = volume / 1000
  ) |>
arrange(date)
Let’s take a look at the fruits of this labour. The oldest data doesn’t have values for volume.
head(table)
date open high low close adj_close volume
<date> <chr> <chr> <chr> <chr> <chr> <dbl>
1 1965-12-31 528.69 528.69 528.69 528.69 528.69 NA
2 1966-01-03 527.21 527.21 527.21 527.21 527.21 NA
3 1966-01-04 527.84 527.84 527.84 527.84 527.84 NA
4 1966-01-05 531.12 531.12 531.12 531.12 531.12 NA
5 1966-01-06 532.07 532.07 532.07 532.07 532.07 NA
6 1966-01-07 532.60 532.60 532.60 532.60 532.60 NA
The most recent data is complete and up to date.
tail(table, 10)
date open high low close adj_close volume
<date> <chr> <chr> <chr> <chr> <chr> <dbl>
1 2024-10-14 19711.22 19834.18 19690.54 19814.55 19814.55 3005250
2 2024-10-15 19814.55 19814.55 19669.80 19678.71 19678.71 3882120
3 2024-10-16 19678.71 19838.29 19678.71 19825.49 19825.49 3467230
4 2024-10-17 19825.49 19901.49 19825.49 19855.26 19855.26 3480010
5 2024-10-18 19855.26 19905.65 19808.99 19884.81 19884.81 3374490
6 2024-10-21 19884.81 19884.81 19690.32 19720.32 19720.32 3407010
7 2024-10-22 19720.32 19720.32 19579.84 19664.58 19664.58 3342080
8 2024-10-23 19653.36 19653.36 19493.42 19594.24 19594.24 3532650
9 2024-10-24 19594.24 19620.02 19496.22 19560.73 19560.73 3543030
10 2024-10-25 19560.73 19648.74 19438.80 19456.27 19456.27 3501280
Persist
Save it to a CSV file.
write_csv(table, "nyse-composite.csv")
You can grab the CSV file here (not updated).
Visualise
Finally, let’s create a quick visualisation of the data, just to be sure that it looks legit.
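Here’s a minimal sketch using {ggplot2} (assuming the table built above; ggplot2 isn’t loaded earlier, so load it first):

library(ggplot2)

# The price columns are character after the sprintf() formatting above, so
# convert close back to numeric for plotting. The dashed vertical line marks
# 3 January 2003, the start of the Google Finance plot referenced below.
ggplot(table, aes(x = date, y = as.numeric(close))) +
  geom_line() +
  geom_vline(xintercept = as.Date("2003-01-03"), linetype = "dashed") +
  labs(x = NULL, y = "Close")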
Here’s the corresponding plot for the last decade from Google Finance. The vertical dashed line in the above plot indicates the initial date (3 January 2003) in the plot below.