Retail Data: Scraping & API

Tags: web scraping, Python
Published: 15 Mar 2020 12:00

I’ve been wanting to gather data on retail prices for quite some time. Finally, just before Christmas 2019, I had some time on my hands, so I started to put something together.

The Plan

This was the plan:

  • a fleet of scrapers, each focusing on a specific retailer;
  • the scrapers submit data to a service which validates it and persists it in a database; and
  • the data are exposed via a simple REST API.

Implementation

This was the tech stack I envisaged: Scrapy for the scrapers, Celery (with a Redis backend) for queuing, and PostgreSQL for storage.

I used Scrapy to manage the scraping, via either a Spider, CrawlSpider or SitemapSpider, depending on the site.

The scraped data are pushed into a Celery queue (with a Redis backend), from which they are processed and entered into the PostgreSQL database.

Environments

I ended up with three environments: development, staging and production.

Development

In the development environment I’ve got both Redis and PostgreSQL running as Docker containers.
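For reference, a setup along these lines (image tags, ports and credentials are illustrative):

```shell
# Local Redis and PostgreSQL for development.
docker run -d --name dev-redis -p 6379:6379 redis:5
docker run -d --name dev-postgres -p 5432:5432 \
  -e POSTGRES_PASSWORD=devpassword -e POSTGRES_DB=pricing postgres:11
```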

Staging & Production

The staging and production environments are on EC2 instances (t3a.medium and t3.large respectively). Both have Redis on Docker. A PostgreSQL database hosted on RDS is accessed from both of these environments. Each environment has a separate dedicated schema, with a shared schema for common (static) data.
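The schema layout might be set up roughly like this (schema names are assumed; setting the search path lets each environment's queries fall back to the shared schema):

```sql
-- Illustrative layout on the shared RDS instance.
CREATE SCHEMA shared;      -- common (static) reference data
CREATE SCHEMA staging;
CREATE SCHEMA production;

-- Each environment's connection sets its search path, e.g. for production:
SET search_path TO production, shared;
```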

The API

Once I had the scrapers up and running, reliably sending data through to the database, it was time to allow access via an API. For this I used Flask and Flask-RESTPlus. This combination had the added advantage of immediately generating a Swagger interface for the API.
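As a rough sketch of the shape of such an endpoint (shown with plain Flask for brevity; Flask-RESTPlus layers Resource classes and the Swagger UI on top of the same routing, and the route and data here are illustrative):

```python
from flask import Flask, jsonify

app = Flask(__name__)

# In the real service this would query PostgreSQL; a static list stands in here.
RETAILERS = [{"id": 34, "name": "Cape Union Mart"}]


@app.route("/retailers")
def list_retailers():
    """Return the list of known retailers as JSON."""
    return jsonify(RETAILERS)
```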

This is what the Swagger interface looks like at present.

Swagger interface for the Retail Pricing Data API.


There are endpoints which expose:

  • a list of retailers;
  • a list of products for each retailer; and
  • historical price information for each product.

An API key is required to access some of these endpoints. Get in touch if you’d like a key.
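Using a key might look something like this (the host and header name are assumptions, not the API's actual scheme; this only builds the request without sending it):

```python
import requests

# Hypothetical endpoint and header name; adjust to the real API's scheme.
API_URL = "https://api.example.com/products/349324/prices"
request = requests.Request("GET", API_URL, headers={"X-API-Key": "your-key-here"})
prepared = request.prepare()
# prepared can now be sent with requests.Session().send(prepared)
```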

Here’s data for a specific product:

{
  "id": 349324,
  "retailer_id": 34,
  "url": "https://www.capeunionmart.co.za/tyrell-short-mens",
  "name": "Tyrell Men's Shorts",
  "brand": "Old Khaki",
  "sku": "153048",
  "barcodes": []
}

And the associated price history, showing both normal and promotion prices as well as availability:

[
  {
    "product_id": 349324,
    "time": "2020-03-08T01:22:32+00:00",
    "price": 599,
    "price_promotion": 399,
    "available": true
  },
  {
    "product_id": 349324,
    "time": "2020-02-23T02:16:44+00:00",
    "price": 599,
    "price_promotion": 399,
    "available": true
  },
  {
    "product_id": 349324,
    "time": "2020-02-16T02:25:35+00:00",
    "price": 599,
    "price_promotion": 399,
    "available": true
  },
  {
    "product_id": 349324,
    "time": "2020-02-09T01:47:47+00:00",
    "price": 599,
    "price_promotion": null,
    "available": true
  },
  {
    "product_id": 349324,
    "time": "2020-02-02T02:02:42+00:00",
    "price": 599,
    "price_promotion": null,
    "available": true
  },
  {
    "product_id": 349324,
    "time": "2020-01-26T02:23:15+00:00",
    "price": 599,
    "price_promotion": null,
    "available": true
  }
]
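From a history like this, the effective price at any observation is the promotion price when one is set, otherwise the normal price. A small stdlib-only helper, using a trimmed copy of the data above:

```python
history = [
    {"time": "2020-03-08T01:22:32+00:00", "price": 599, "price_promotion": 399, "available": True},
    {"time": "2020-02-09T01:47:47+00:00", "price": 599, "price_promotion": None, "available": True},
]


def effective_price(obs):
    """Promotion price when a promotion is running, otherwise the normal price."""
    return obs["price_promotion"] if obs["price_promotion"] is not None else obs["price"]


def current_discount(history):
    """Percentage discount at the most recent observation (0 if no promotion)."""
    latest = max(history, key=lambda obs: obs["time"])  # ISO-8601 sorts lexically
    return round(100 * (1 - effective_price(latest) / latest["price"]), 1)
```

For the sample history above, the latest observation carries a 399 promotion against a 599 normal price, a discount of roughly a third.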

The API is useful. But I do most of my analytical work in R, so it made sense to wrap the API in an R package. More details on that here.