Extract Chapter Headings from PDF

From time to time I want to extract the table of contents from a PDF. Here’s how I do that using simple shell tools.

PDFtk

I’ll be using the PDF Toolkit, PDFtk, to extract the complete table of contents. Make sure that this is installed.

apt-get update
apt-get install -y pdftk

Extract Table of Contents

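For the purpose of illustration I’ll use the book 101 Small Business Ideas for Under $5,000, for which I have a PDF copy.

PDFtk’s dump_data operation writes out the PDF’s metadata, which includes the bookmarks that make up the table of contents. To capture it in a file you can run: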

pdftk 101-small-business-ideas.pdf dump_data >toc.txt

Take a look at toc.txt.

NumberOfPages: 338
BookmarkBegin
BookmarkTitle: 101 Small Business Ideas for Under $5,000
BookmarkLevel: 1
BookmarkPageNumber: 3
BookmarkBegin
BookmarkTitle: Contents
BookmarkLevel: 2
BookmarkPageNumber: 7
BookmarkBegin
BookmarkTitle: Preface
BookmarkLevel: 2
BookmarkPageNumber: 11
BookmarkBegin
BookmarkTitle: Acknowledgments
BookmarkLevel: 2
BookmarkPageNumber: 13
BookmarkBegin
BookmarkTitle: How to Use This Book
BookmarkLevel: 2
BookmarkPageNumber: 15
BookmarkBegin
BookmarkTitle: When to Seek Professional Advice
BookmarkLevel: 3
BookmarkPageNumber: 15
BookmarkBegin
BookmarkTitle: Informational Icons Used
BookmarkLevel: 3
BookmarkPageNumber: 17
BookmarkBegin
BookmarkTitle: Chapter 1: Business Insurance and Risk Management
BookmarkLevel: 2
BookmarkPageNumber: 19

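The information that we are after is in the BookmarkTitle records, but those records exist not only for chapters but also for sections (and probably sub-sections too). The BookmarkLevel record that follows each title tells them apart: in this file the chapters sit at level 2 and their sections at level 3.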

I only want the chapters.
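Before filtering, it can help to confirm which level the chapters actually sit at, since that depends on how the PDF was produced. The sketch below counts the bookmarks at each level. The function name level_counts and the three-bookmark sample are made up for illustration; with a real file you would pass toc.txt instead.

```shell
#!/bin/sh
# Count how many bookmarks appear at each level of a dump_data file.
# level_counts is a hypothetical helper name, not part of pdftk.
level_counts() {
    awk '/^BookmarkLevel: / { n[$2]++ }
         END { for (l in n) print l, n[l] }' "$1" | sort -n
}

# Small sample in the same format as toc.txt (assumed, for illustration).
sample=$(mktemp)
cat >"$sample" <<'EOF'
BookmarkBegin
BookmarkTitle: How to Use This Book
BookmarkLevel: 2
BookmarkPageNumber: 15
BookmarkBegin
BookmarkTitle: When to Seek Professional Advice
BookmarkLevel: 3
BookmarkPageNumber: 15
BookmarkBegin
BookmarkTitle: Chapter 1: Business Insurance and Risk Management
BookmarkLevel: 2
BookmarkPageNumber: 19
EOF

# Prints one "level count" pair per line:
#   2 2
#   3 1
level_counts "$sample"
rm -f "$sample"
```

A level with roughly as many entries as the book has chapters (plus front and back matter) is a good candidate for the filter.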

Filter with grep and awk

We can invoke a couple of reliable shell tools, grep and awk, to finish the job.

grep -B 1 'Level: 2' toc.txt | awk '/Title/ {print substr($0, index($0, ": ") + 2)}'
Contents
Preface
Acknowledgments
How to Use This Book
Chapter 1: Business Insurance and Risk Management
Chapter 2: Legalities and Taxes
Chapter 3: Setting Your Price
Chapter 4: Financing a Small Business
Chapter 5: Home Services (Exterior)
Chapter 6: Home Services (Interior)
Chapter 7: Home Services (Specialty)
Chapter 8: Parties, Entertainment, and Special Events
Chapter 9: Personal Services
Chapter 10: Children, Family, and Pet Services
Chapter 11: Educational Services
Chapter 12: Arts, Crafts, Jewelry, Clothing, and Musical Instruments
Chapter 13: Transportation, Delivery, and Auto Services
Chapter 14: Computers, Graphics, and Photography
Chapter 15: Office and Professional Services
Chapter 16: Sales
Appendix: Government and Private Resources for Small Businesses
Index
About the Authors

A quick dissection:

  • Use grep to select the rows that contain Level: 2. The -B 1 option also includes the line preceding each match, which is the corresponding BookmarkTitle record.
  • Use awk to keep only the rows in that output that contain Title and strip off everything before the actual title text.

An alternative implementation using sed (and an extra step in the pipeline):

grep -B 1 'Level: 2' toc.txt | grep 'Title' | sed 's/.*Title: //'
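Both pipelines rely on the title sitting on the line directly before its level, and the unanchored pattern Level: 2 would also match a level such as 20. As a rough sketch of a variation, a single awk pass can remember the most recent title and print it when the level matches exactly. The function name chapter_titles and the sample data are assumptions for illustration; with a real file you would pass toc.txt.

```shell
#!/bin/sh
# Print the title of every bookmark at the given level (default: 2).
# chapter_titles is a hypothetical helper name, not part of pdftk.
chapter_titles() {
    awk -v level="${2:-2}" '
        # Remember the most recent title; it precedes its level line.
        /^BookmarkTitle: / { title = substr($0, length("BookmarkTitle: ") + 1) }
        # Compare the level as a number so that 2 does not match 20.
        /^BookmarkLevel: / && $2 + 0 == level + 0 { print title }
    ' "$1"
}

# Small sample in the same format as toc.txt (assumed, for illustration).
sample=$(mktemp)
cat >"$sample" <<'EOF'
BookmarkBegin
BookmarkTitle: How to Use This Book
BookmarkLevel: 2
BookmarkPageNumber: 15
BookmarkBegin
BookmarkTitle: When to Seek Professional Advice
BookmarkLevel: 3
BookmarkPageNumber: 15
BookmarkBegin
BookmarkTitle: Chapter 1: Business Insurance and Risk Management
BookmarkLevel: 2
BookmarkPageNumber: 19
EOF

# Prints:
#   How to Use This Book
#   Chapter 1: Business Insurance and Risk Management
chapter_titles "$sample"
rm -f "$sample"
```

Passing a second argument, for example chapter_titles toc.txt 3, would list the section titles instead.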