Update Sitemap for Canonical Pages

The principal purpose of a sitemap file is to inform search engines about the pages on a website that are available for crawling. It provides a list of URLs along with additional metadata about each URL to help search engines more intelligently crawl the site. If there are multiple page versions on a site then the sitemap should include only the canonical versions of those pages.

In the previous post we set up canonical links so that a search engine crawler can determine which version of a page should be indexed. However, we can make this process more efficient. If the sitemap includes only the canonical versions of pages, then the crawler is less likely to waste time crawling non-canonical pages.

🚀 TL;DR Show me the code. Look at the 14-canonical branch. This site is deployed here.

Updating the Sitemap

The GraphQL schema already contains all of the information that we need. For each page the pageContext contains both version and latest_version fields.

In gatsby-config.js we’ll make the entry for the gatsby-plugin-sitemap plugin more detailed:

  1. add an explicit GraphQL query which includes pageContext; and
  2. use resolvePages to supply a function which keeps only the canonical nodes and drops the rest.

Canonical Sitemap

With this change the sitemap only includes the URLs for the latest version of each of the pages.

<?xml version="1.0" encoding="UTF-8"?>
<urlset>
  <url>
    <loc>https://www.whimsyweb.dev/1.2/</loc>
    <changefreq>daily</changefreq>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>https://www.whimsyweb.dev/1.2/what-is-asciidoc/</loc>
    <changefreq>daily</changefreq>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>https://www.whimsyweb.dev/1.2/what-is-gatsby/</loc>
    <changefreq>daily</changefreq>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>https://www.whimsyweb.dev/1.2/what-is-graphql/</loc>
    <changefreq>daily</changefreq>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>https://www.whimsyweb.dev/1.2/what-is-javascript/</loc>
    <changefreq>daily</changefreq>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>https://www.whimsyweb.dev/1.2/what-is-tailwind/</loc>
    <changefreq>daily</changefreq>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>https://www.whimsyweb.dev/1.2/what-is-typescript/</loc>
    <changefreq>daily</changefreq>
    <priority>0.7</priority>
  </url>
</urlset>

The attributes of the <urlset> tag have been truncated for clarity and brevity.
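
If you want to check the generated file rather than read it by eye, something like the following could help. It is only a sketch: the function name findNonCanonicalUrls is hypothetical, it assumes the version is the first path segment of every URL (as in the listing above), and it assumes a single latest version for the whole site.

```javascript
// Hypothetical helper: return every <loc> URL whose first path segment
// is not the latest version. An empty result means the sitemap is clean.
function findNonCanonicalUrls(sitemapXml, latestVersion) {
  // Pull the text of each <loc> element out of the XML string.
  const locs = [...sitemapXml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map(
    (match) => match[1]
  );

  return locs.filter((loc) => {
    // "/1.2/what-is-gatsby/" -> ["1.2", "what-is-gatsby"]
    const [firstSegment] = new URL(loc).pathname.split("/").filter(Boolean);
    return firstSegment !== latestVersion;
  });
}

// Example usage with an inline sitemap fragment.
const sample = `
<urlset>
  <url><loc>https://www.whimsyweb.dev/1.2/</loc></url>
  <url><loc>https://www.whimsyweb.dev/1.2/what-is-gatsby/</loc></url>
</urlset>`;

console.log(findNonCanonicalUrls(sample, "1.2"));
```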

The deployed sitemap index can be found here. It references the actual sitemap.

Conclusion

A more detailed gatsby-plugin-sitemap entry in gatsby-config.js filters out the URLs of all pages except the canonical versions. The resulting sitemap lists only the pages that should be indexed, which helps search engines crawl the site more efficiently.

🚀 Action items:

  1. Update your sitemap to only include the canonical version of each page.
  2. Submit your updated sitemap to search engines.
  3. Monitor your search engine rankings and traffic to see the impact of the changes.
