There are a couple of files which can have an impact on the SEO performance of a site: (1) a sitemap and (2) a robots.txt. In a previous post we set up a sitemap which includes only the canonical pages on the site. In this post we’ll add a robots.txt.
A Gatsby site will not have a robots.txt file by default. There’s a handy package which makes adding one simple though. We’ll take a look at how to add it to the site and a couple of ways to configure it too.
🚀 TL;DR
Show me the code. Look at the 15-robots branch. This site is deployed here.
What’s robots.txt?
The robots.txt file is a guide for search engine crawlers. It indicates which parts of a site crawlers are allowed or forbidden to crawl and index. It can help to ensure that a site is indexed accurately and should prevent sensitive or duplicate content from being crawled and indexed.
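To give a feel for the format, here’s a minimal (hypothetical) robots.txt — the path and URL are placeholders, not from this site:
User-agent: *
Disallow: /private/
Sitemap: https://www.example.com/sitemap-index.xml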
Add Package
Add the gatsby-plugin-robots-txt package to package.json. Then reinstall site packages with npm or yarn.
npm install
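Alternatively, installing the package directly will add it to package.json for you (use yarn add instead if the site uses yarn):
npm install gatsby-plugin-robots-txt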
Configure Package
The simplest configuration would involve just adding gatsby-plugin-robots-txt to the list of plugins in gatsby-config.js.
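As a rough sketch, that minimal setup in gatsby-config.js might look something like this (the URL is a placeholder; the plugin can use siteUrl from siteMetadata to build its defaults):
module.exports = {
  siteMetadata: {
    // used by the plugin to build default host and sitemap entries
    siteUrl: 'https://www.example.com'
  },
  plugins: ['gatsby-plugin-robots-txt']
}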
You can also add some details to the plugin configuration:
{
  resolve: 'gatsby-plugin-robots-txt',
  options: {
    policy: [
      {
        userAgent: 'Googlebot',
        allow: '/',
        crawlDelay: 5
      },
      {
        userAgent: 'bingbot',
        allow: '/'
      },
      {
        userAgent: 'CCBot',
        disallow: '/'
      },
      {
        userAgent: '*',
        allow: '/'
      }
    ]
  }
}
Here we’ve set up different policies for three specific bots (Google, Bing and Common Crawl) and another policy for all other bots. More examples of the information that can be included in the policy can be found here.
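As one further sketch, a policy entry can mix allow and disallow rules for particular paths — the paths below are hypothetical, just to illustrate the shape:
{
  resolve: 'gatsby-plugin-robots-txt',
  options: {
    policy: [
      {
        userAgent: '*',
        allow: '/',
        // hypothetical paths, to show that disallow can list several entries
        disallow: ['/admin/', '/drafts/']
      }
    ]
  }
}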
There are a couple of other things that can be configured here:
host — the site URL (generally obtained from siteUrl); and
sitemap — the location of a sitemap (the default will work immediately for sitemaps generated via gatsby-plugin-sitemap).
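If you do want to set these explicitly, they are just two more keys in options. The URLs here are placeholders:
{
  resolve: 'gatsby-plugin-robots-txt',
  options: {
    host: 'https://www.example.com',
    sitemap: 'https://www.example.com/sitemap-index.xml',
    policy: [{ userAgent: '*', allow: '/' }]
  }
}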
You can also specify different configurations depending on the value of the GATSBY_ACTIVE_ENV environment variable. This can be useful, for example, if you want to have different content in robots.txt depending on whether it’s a production, development or preview build.
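Here’s a sketch of what that can look like using the plugin’s env and resolveEnv options — the policies themselves are just illustrative (block everything in development, allow everything in production):
{
  resolve: 'gatsby-plugin-robots-txt',
  options: {
    // choose the configuration based on GATSBY_ACTIVE_ENV, falling back to NODE_ENV
    resolveEnv: () => process.env.GATSBY_ACTIVE_ENV || process.env.NODE_ENV,
    env: {
      development: {
        policy: [{ userAgent: '*', disallow: ['/'] }]
      },
      production: {
        policy: [{ userAgent: '*', allow: '/' }]
      }
    }
  }
}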
This is what the resulting robots.txt looks like:
User-agent: Googlebot
Allow: /
Crawl-delay: 5
User-agent: bingbot
Allow: /
User-agent: CCBot
Disallow: /
User-agent: *
Allow: /
Sitemap: https://gatsby-whimsyweb-15-robots.netlify.app/sitemap-index.xml
Host: https://gatsby-whimsyweb-15-robots.netlify.app
You can see the live version of this file here.
Conclusion
Adding a robots.txt file to a Gatsby site is quick and easy. It is likely to improve the site’s SEO performance (perhaps only slightly?). And it can certainly do no harm (provided that you don’t block an important crawler bot!).