Driving AWS from the Command Line

Although it’s very handy (and easy) to set up some cloud resources using the AWS Management Console, once you know what you need, it makes a lot of sense to automate the process. Fortunately there’s a handy little command-line tool, aws, which makes this eminently possible. The AWS CLI Command Reference is the definitive resource for this tool. There’s a mind-boggling array of possibilities. We’ll take a look at a small selection of them.

Install

The aws tool is a Python script. Installation is very simple: just follow the detailed documentation.
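For example, if you have pip available then installation is typically a one-liner (consult the documentation for the approach best suited to your platform):

$ pip install awscli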

Configure

Specify your AWS Access Key ID, Secret Access Key and default region.

$ aws configure
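You’ll be prompted for each value in turn. The exchange will look something like this (credentials redacted):

AWS Access Key ID [None]: ****************
AWS Secret Access Key [None]: ****************
Default region name [None]: ap-southeast-2
Default output format [None]: json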

SSH Key

You’ll need an SSH key to connect to your remote resources.

$ aws ec2 create-key-pair --key-name datawookie-sydney | jq -r .KeyMaterial >~/.ssh/datawookie-sydney.pem

The result from aws ec2 create-key-pair is a JSON document, from which we extract the value for KeyMaterial using the command-line JSON processor jq.

Apply restrictive access permissions to the resulting PEM file.

$ chmod 0400 ~/.ssh/datawookie-sydney.pem

Security Group

Create a Security Group. This determines what traffic will be allowed to reach your resources.

$ aws ec2 create-security-group --group-name general --description "SSH / HTTP / HTTPS"
{
    "GroupId": "sg-933cf8f5"
}

Add some rules for inbound connections. Here we allow ports 22 (SSH), 80 (HTTP) and 443 (HTTPS).

$ aws ec2 authorize-security-group-ingress --group-name general --protocol tcp --port 22 --cidr 0.0.0.0/0
$ aws ec2 authorize-security-group-ingress --group-name general --protocol tcp --port 80 --cidr 0.0.0.0/0
$ aws ec2 authorize-security-group-ingress --group-name general --protocol tcp --port 443 --cidr 0.0.0.0/0
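Since only the port number changes between those three commands, they can be rolled into a loop:

$ for port in 22 80 443; do
>   aws ec2 authorize-security-group-ingress --group-name general --protocol tcp --port $port --cidr 0.0.0.0/0
> done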

Then check that everything is configured as expected. In light of the volume and complexity of the output of this command, you might find it more expedient to simply use the AWS Management Console.

$ aws ec2 describe-security-groups
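Alternatively you can restrict the output to a single group:

$ aws ec2 describe-security-groups --group-names general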

Environment Variables

Since we’ll be running aws from the shell it’ll make our lives easier if we first set up a few environment variables.

Region

Specify the region in which the resources are going to be deployed.

$ export REGION="ap-southeast-2"

Keyfile

The name assigned to your SSH key.

$ export KEYNAME="datawookie-sydney"

Elastic Compute Cloud (EC2)

Launch an EC2 instance using aws ec2 run-instances. You can find an appropriate image ID in Step 1 of the EC2 Launch Instance wizard.

$ aws ec2 run-instances --image-id ami-e2021d81 \
                        --security-group-ids sg-933cf8f5 \
                        --count 1 \
                        --instance-type t2.micro \
                        --key-name $KEYNAME

Provided that the above command executed without error, you should have a running EC2 instance. Check out the Instances tab on the AWS Management Console. You can now connect to the remote instance using SSH.
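For example, you can extract the instance’s public DNS name with a --query filter and then connect using the key created earlier. (The ec2-user login is an assumption: the appropriate user name depends on the AMI, for instance ubuntu for Ubuntu images.)

$ aws ec2 describe-instances --query 'Reservations[0].Instances[0].PublicDnsName' --output text
$ ssh -i ~/.ssh/datawookie-sydney.pem ec2-user@<public-dns-name>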

Elastic MapReduce (EMR)

There’s a wide variety of clusters that can be deployed using EMR. We’ll put together a small Spark cluster.

First we’ll need to create two new Security Groups, spark-master and spark-slave (sketched after this list), where

  • spark-master has the same permissions as general but also allows inbound TCP connections on port 8001;
  • spark-slave has no inbound permissions.
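A minimal sketch of those two groups (the descriptions are arbitrary; a freshly created group has no inbound rules, which is precisely what spark-slave requires):

$ aws ec2 create-security-group --group-name spark-master --description "Spark master"
$ aws ec2 create-security-group --group-name spark-slave --description "Spark slave"
$ for port in 22 80 443 8001; do
>   aws ec2 authorize-security-group-ingress --group-name spark-master --protocol tcp --port $port --cidr 0.0.0.0/0
> done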

Then run the script below, which will create a cluster with four nodes (one master and three workers). The nodes are provisioned with Spark and a few other pertinent applications. A bootstrap script also sets up JupyterHub.

aws emr create-cluster \
  --name 'Spark Cluster' \
  --release-label emr-5.2.0 \
  --applications Name=Hadoop Name=Hive Name=Spark Name=Pig Name=Tez Name=Ganglia Name=Presto \
  --region $REGION \
  --ec2-attributes '{
    "KeyName": "'$KEYNAME'",
    "InstanceProfile": "EMR_EC2_DefaultRole",
    "EmrManagedMasterSecurityGroup": "sg-4a4f8b2c",
    "EmrManagedSlaveSecurityGroup": "sg-d4498db2"
  }' \
  --service-role EMR_DefaultRole \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=c3.4xlarge,Name=Master \
    InstanceGroupType=CORE,InstanceCount=3,InstanceType=c3.4xlarge,Name=Worker \
  --bootstrap-actions '[{
    "Path": "s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh",
    "Args": ["--toree","--ds-packages","--jupyterhub","--jupyterhub-port","8001","--password","jupyter"],
    "Name": "Jupyter Notebook"
  }]'
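aws emr create-cluster returns a small JSON document containing a ClusterId field. It’s convenient to capture this, since it’s needed to identify the cluster in subsequent commands. A sketch, reusing the jq trick from earlier (the ... stands for the full set of options shown above):

$ export CLUSTERID=$(aws emr create-cluster ... | jq -r .ClusterId)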

The --password parameter sets up the JupyterHub password for the hadoop user.

There’s a host of other parameters that can be passed to the bootstrap script. Of particular interest are:

  • --r — install a kernel for R;
  • --julia — install a kernel for Julia;
  • --ruby — install a kernel for Ruby;
  • --ml-packages — install Python Machine Learning and Deep Learning packages;
  • --python-packages — install arbitrary named Python packages;
  • --port — port for Jupyter notebook (defaults to 8888);
  • --password — password for Jupyter Notebook.

It might take a while to bring up the cluster. The bootstrap process appears to be somewhat time-consuming. However, if you’re patient then in good time (an hour or so!) you’ll have a fully provisioned Spark cluster with JupyterHub running on it.
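Rather than repeatedly checking on the console, you can block until the cluster is ready using the wait subcommand (assuming the cluster ID was captured as above):

$ aws emr wait cluster-running --cluster-id $CLUSTERID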

The JupyterHub interface will be available on port 8001 on the master node. Find out more about this setup here.
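You can dig the master node’s public DNS name out of aws emr describe-cluster with a --query filter and use it to construct the JupyterHub URL:

$ aws emr describe-cluster --cluster-id $CLUSTERID --query Cluster.MasterPublicDnsName --output text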

Simple Storage Service (S3)

S3 provides storage space which can be readily accessed from other resources on AWS.

Creating an S3 Bucket

Storage on S3 is divided into containers called “buckets”. Creating a bucket is simple with aws s3 mb.

$ aws s3 mb s3://datawookie-bucket

Copying Files to a Bucket

Local files can be copied across to an S3 bucket using aws s3 cp. You can restrict access to a file using the --grants option.

$ aws s3 cp iris.csv s3://datawookie-bucket
upload: ./iris.csv to s3://datawookie-bucket/iris.csv
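For example, to make the file publicly readable you could grant read access to the AllUsers group (a sketch; consult the aws s3 cp documentation for the full grant syntax):

$ aws s3 cp iris.csv s3://datawookie-bucket --grants read=uri=http://acs.amazonaws.com/groups/global/AllUsers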

The commands aws s3 mv and aws s3 rm are analogous to their UNIX equivalents, moving and deleting files on S3.

The command aws s3 sync is used to synchronise the contents of folders, either local or on S3.
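A few illustrative examples (the paths are arbitrary):

$ aws s3 mv s3://datawookie-bucket/iris.csv s3://datawookie-bucket/data/iris.csv
$ aws s3 rm s3://datawookie-bucket/data/iris.csv
$ aws s3 sync data s3://datawookie-bucket/data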

Listing Buckets and Their Contents

You can get a list of available buckets using aws s3 ls.

$ aws s3 ls
2017-09-01 13:22:58 datawookie-bucket

If you provide the URL for a particular bucket then you can also see its contents.

$ aws s3 ls s3://datawookie-bucket
2017-09-01 13:29:07       3716 iris.csv

Destroying an S3 Bucket

When you’re done with your bucket you can destroy it with aws s3 rb. The --force argument is required to delete a bucket which still contains files.

$ aws s3 rb s3://datawookie-bucket --force

S3 and Static Web Sites

You can use the aws s3 website command to turn the contents of an S3 bucket into a static web site.
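For example, assuming the bucket holds suitable index.html and error.html documents (hypothetical file names), this would enable website hosting:

$ aws s3 website s3://datawookie-bucket --index-document index.html --error-document error.html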