Installation and Usage Guide for pdfannots

Posted by Praveen on July 24, 2023 · 3 mins read

When dealing with PDF files, many readers, including Linux’s default evince (such as in the case of Manjaro), offer convenient options to highlight text and add comments directly within the document. However, one drawback of this approach is that the added annotations are stored within the specific PDF file, making it challenging to extract the annotated text separately. This limitation becomes apparent when you want to extract the highlighted information for reference, analysis, or sharing with others.

In the past, to address this issue, I relied on a tool called PDF Studio, which provided comprehensive features for handling PDF annotations and text extraction. However, since I don’t frequently require text extraction, I sought a more streamlined solution that wouldn’t burden me with unnecessary complexities.

My search led me to discover an elegant and efficient command-line tool called pdfannots. It is a Python package designed to simplify the process of extracting annotations and highlighted text from PDFs. The great advantage of pdfannots lies in its simplicity and ease of use, making it an ideal choice for those seeking a quick and straightforward solution for text extraction.

To get started with pdfannots, you can find the package’s repository on GitHub: pdfannots. The installation process is straightforward and can be accomplished using the Python package manager, pip3.

Here are the steps to install pdfannots:

  1. Open your terminal or command prompt.
  2. Ensure you have pip3 installed. If not, you can install it via the package manager specific to your operating system.
  3. Execute the following command to install pdfannots:
    pip3 install pdfannots
    

In case you encounter the error message “pip3 error: externally-managed-environment”, there is an alternative approach using pipx for installation. Follow these steps instead:

  1. Install pipx using pip3:
    pip3 install pipx
    
  2. Proceed to install pdfannots using pipx:
    pipx install pdfannots
    

After successful installation, you can begin using pdfannots right from your terminal or command prompt. To explore the available options and functionalities, simply type the following command:

pdfannots --help

This will provide you with a list of commands and usage examples, helping you make the most out of the tool.

To perform a straightforward text extraction, use the following command structure:

pdfannots input.pdf -o output.md

Replace input.pdf with the path to the PDF file you wish to extract annotations from, and output.md with the desired name for the output Markdown file that will contain the extracted text.

By leveraging pdfannots, you can now effortlessly extract the annotated text from your PDFs, making it accessible for further analysis, sharing, or personal reference. Its command-line interface and Python-based design offer both convenience and flexibility, ensuring a smooth and hassle-free experience with PDF annotations extraction.