Iterating over a Paginated List of Links

A common web crawler requirement is to iterate over a paginated list of links, following each link to retrieve detailed data. For example:

  • iterating over a list of profiles on LinkedIn and then retrieving employment history from each profile;
  • traversing a list of products on Amazon and then retrieving reviews for each product; or
  • navigating the list of items on BBC News and retrieving the full articles.

Tools like Scrapy are specifically designed for this application. Often, however, a simpler, hand-crafted solution is better than rolling out the heavy machinery.

The Problem

I want the solution to incorporate the following:

  • a pause between requests so that I’m not hammering the site with a flood of requests;
  • a possible limit on the total number of requests (for example, I only want to retrieve the first 20 items); and
  • no repeated or redundant code.

A Solution

First, a couple of imports to support pauses and type annotations.

import time
from typing import Generator

A couple of global constants. In practice these could be set via command line options (see the sketch below) or a configuration file.

# Possibly limit the total number of items scraped.
#
LIMIT = None

# How long (in seconds) to wait between requests.
#
PAUSE = 5
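
For illustration, here is one way the constants might instead be populated from command line options using argparse (the option names are arbitrary and just a sketch):

import argparse

parser = argparse.ArgumentParser(description="Crawl a paginated list of links.")
parser.add_argument("--limit", type=int, default=None,
                    help="Maximum number of items to scrape (default: no limit).")
parser.add_argument("--pause", type=float, default=5,
                    help="Delay in seconds between requests (default: 5).")
args = parser.parse_args()

LIMIT = args.limit
PAUSE = args.pause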

Now a pair of placeholder functions, which would do the work of scraping the target site. Their implementation would depend on the structure of the site (a possible version is sketched after the stubs). You might replace get_item_links() with get_item_ids(), which returns a list of IDs rather than a list of links.

def get_item_links(page: int) -> list[str]:
    """
    Retrieve a list of links.

    Args:
        page (int): The page number used to construct the URL for fetching links.

    Returns:
        list[str]: A list of item links.

    Example:
        For page=2 the links might be scraped from
        https://www.amazon.co.uk/s?k=running+shoes&page=2.
    """
    pass
    
def get_link(url: str) -> bytes:
    """
    Retrieve content from a link.

    Args:
        url (str): The URL for a specific item.

    Returns:
        bytes: The HTML content of the item page.

    Example:
        https://www.amazon.co.uk/dp/B0CPN5HJBT/ (Brooks Men's Divide 5 Sneaker)
    """
    pass
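
For concreteness, here is a sketch of how these placeholders might be filled in using requests and BeautifulSoup. The SEARCH_URL template and the a.item-link selector are invented for illustration; the real values would depend entirely on the target site.

import requests
from bs4 import BeautifulSoup

# Hypothetical URL template for a page of links (illustrative only).
SEARCH_URL = "https://www.example.com/search?page={page}"

def get_item_links(page: int) -> list[str]:
    response = requests.get(SEARCH_URL.format(page=page))
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    # The CSS selector is a placeholder; adjust it to match the site's markup.
    # If the site uses relative links, resolve them with urllib.parse.urljoin().
    return [a["href"] for a in soup.select("a.item-link")]

def get_link(url: str) -> bytes:
    response = requests.get(url)
    response.raise_for_status()
    return response.content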

Finally, the implementation of the downloader, which is where all the looping happens.

def download() -> Generator[bytes, None, None]:
    """
    Iterate over a paginated list of links, yielding the content retrieved from each link.
    
    Yields:
        bytes: HTML as bytes.
    """
    page = 0
    count = 0

    while True:
        links = get_item_links(page)

        if len(links) == 0:
            break

        for link in links:
            time.sleep(PAUSE)
            count += 1
            yield get_link(link)

            if LIMIT and count >= LIMIT:
                break
        else:
            # Move to next page.
            page += 1
            # Next iteration of outer loop.
            time.sleep(PAUSE)
            continue

        # Only get here if exited inner loop prematurely.
        break

The outer loop is an infinite while loop. Breaking out of this loop is the key!

We keep track of two counters:

  • page — the page counter (to construct the URL for a page of links); and
  • count — the scraped item counter (to keep track of the total number of items scraped).

Within the outer loop we first retrieve a list of links. We check if that list is empty (which would indicate that we have gone past the last page of links) and possibly break out of the outer loop.

Next we iterate over the list of links. There’s a pause at the start of each iteration that ensures (i) a delay between requesting successive items and (ii) a delay between the request for a list of items and the first item request.

Within the inner loop we check to see if we have reached the item limit and, if so, we break out of the inner loop. If we break out of the inner loop then we skip over the else clause and reach the break at the end of the outer loop, which of course terminates the outer iteration too. At this point the downloader has finished its job.

If we don’t exit the inner loop prematurely then we enter the else clause, where the page counter is incremented. There’s another pause before returning to the start of the outer loop to grab the next list of links. The continue statement ensures that we immediately proceed to the next iteration of the outer loop and don’t get to the break at the end of the outer loop.
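
If the else clause on a for loop is unfamiliar, the rule is simple: the else runs only when the loop finishes without hitting a break. A tiny standalone illustration:

for n in [1, 2, 3]:
    if n == 99:
        break
else:
    print("No break occurred.")  # Printed: the loop ran to completion.

for n in [1, 2, 3]:
    if n == 2:
        break
else:
    print("No break occurred.")  # Not printed: the loop was broken out of.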

The crux of this solution is the use of break and continue in the inner and outer loops.
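
To put the downloader to work, the generator can be consumed with an ordinary for loop. This assumes the placeholder functions have been implemented; the body of the loop is just a stand-in for whatever parsing or storage is actually required.

if __name__ == "__main__":
    for html in download():
        # Substitute real parsing or storage logic here.
        print(f"Retrieved {len(html)} bytes.")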

Conclusion

I’ve developed this approach to iterating over a paginated list through some trial and error. It works well and can easily be adapted to fit most sites with this structure (which covers the majority of sites that require crawling).