Persisting Data with Pickle & S3

I occasionally write scripts where I need to persist some information between runs. These scripts are often wrapped in a Docker image and deployed on Amazon ECS, which means that there's no persistent local storage. I could use a database, but that would be overkill for the volume of data involved. This post describes a simple approach to storing these data on S3 using a pickle file.

Setup

Import the pickle, boto3 and botocore packages (botocore is only required for the ClientError exception).

import pickle
import boto3
import botocore.exceptions

Create an S3 client object.

s3 = boto3.client("s3")

How does authentication work? I store my credentials in ~/.aws/credentials, with each of my AWS accounts identified by a unique profile name. I set the AWS_PROFILE environment variable to choose a specific account, and the AWS_DEFAULT_REGION environment variable to select a suitable region.

export AWS_PROFILE=fathom
export AWS_DEFAULT_REGION=eu-west-1
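
For reference, the credentials file contains a section per profile and looks something like this (the keys below are placeholders):

[default]
aws_access_key_id = <access-key-id>
aws_secret_access_key = <secret-access-key>

[fathom]
aws_access_key_id = <access-key-id>
aws_secret_access_key = <secret-access-key>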

Now store the S3 bucket name and a name for the pickle file.

BUCKET = "state-persist"
PICKLE = "state.pkl"

Retrieve

First try to load the data. On the first run this won't work because nothing has been persisted yet. But after the first complete run, these steps will load the data from the previous one.

Attempt to download the pickle file from S3. If it’s not there, handle the error gracefully.

try:
    s3.download_file(BUCKET, PICKLE, PICKLE)
except botocore.exceptions.ClientError:
    # You'll arrive here on the first run.
    pass
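
Catching every ClientError will also swallow unrelated failures such as missing permissions. If you want to be stricter, you can check the error code; a missing key surfaces as a 404 (download_file issues a HEAD request under the hood). A sketch:

try:
    s3.download_file(BUCKET, PICKLE, PICKLE)
except botocore.exceptions.ClientError as error:
    if error.response["Error"]["Code"] == "404":
        # No pickle file yet (first run).
        pass
    else:
        # Something else went wrong (permissions, networking, ...).
        raise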

Read the pickle file. On failure, set data to None (or some other appropriate default value).

try:
    with open(PICKLE, "rb") as file:
        data = pickle.load(file)
except (FileNotFoundError, EOFError):
    # You'll arrive here on the first run.
    data = None

Since the first two steps will normally fail together, it might make sense to move the second step into an else clause of the first exception handler.
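
That combined version might look like this:

try:
    s3.download_file(BUCKET, PICKLE, PICKLE)
except botocore.exceptions.ClientError:
    # Nothing persisted yet (first run).
    data = None
else:
    with open(PICKLE, "rb") as file:
        data = pickle.load(file)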

Store

As the script runs, the state information is assigned to (or updated in) data. At the end we need to persist these data.

Create or update the pickle file.

with open(PICKLE, "wb") as file:
    pickle.dump(data, file)

Write that file to S3.

s3.upload_file(PICKLE, BUCKET, PICKLE)
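
As an aside, if you'd rather not touch the local filesystem at all, you can move the pickled bytes directly with put_object and get_object rather than the file-based helpers. A sketch:

# Store: serialise straight to S3.
s3.put_object(Bucket=BUCKET, Key=PICKLE, Body=pickle.dumps(data))

# Retrieve: deserialise straight from S3.
response = s3.get_object(Bucket=BUCKET, Key=PICKLE)
data = pickle.loads(response["Body"].read())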

Conclusion

A simple procedure for persisting information between jobs.

This approach is vulnerable to race conditions if multiple instances of the script run simultaneously. You could handle this with a lock file (also stored on S3; a sketch follows) or simply by taking care to avoid simultaneous execution.
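
For completeness, here's a minimal sketch of the lock file idea. The LOCK key name and the attempts and delay parameters are illustrative choices. Note the caveat in the comments: S3 has no atomic check-then-create, so this narrows the window for a race rather than eliminating it.

import time

LOCK = "state.lock"

def acquire_lock(attempts=10, delay=5):
    # Poll until the lock object is absent, then claim it.
    for _ in range(attempts):
        try:
            s3.head_object(Bucket=BUCKET, Key=LOCK)
        except botocore.exceptions.ClientError:
            # No lock found: claim it. There's still a small window in
            # which another instance could slip in between the check
            # and the put.
            s3.put_object(Bucket=BUCKET, Key=LOCK, Body=b"")
            return True
        time.sleep(delay)
    return False

def release_lock():
    s3.delete_object(Bucket=BUCKET, Key=LOCK)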