When dealing with PDF files, many readers, including Linux’s default evince (such as in the case of Manjaro), offer convenient options to highlight text and add comments directly within the document. However, one drawback of this approach is that the added annotations are stored within the specific PDF file, making it challenging to extract the annotated text separately. This limitation becomes apparent when you want to extract the highlighted information for reference, analysis, or sharing with others.
In the past, to address this issue, I relied on a tool called PDF Studio, which provided comprehensive features for handling PDF annotations and text extraction. However, since I don’t frequently require text extraction, I sought a more streamlined solution that wouldn’t burden me with unnecessary complexities.
My search led me to discover an elegant and efficient command-line tool called pdfannots
. It is a Python package designed to simplify the process of extracting annotations and highlighted text from PDFs. The great advantage of pdfannots
lies in its simplicity and ease of use, making it an ideal choice for those seeking a quick and straightforward solution for text extraction.
To get started with pdfannots
, you can find the package’s repository on GitHub: pdfannots. The installation process is straightforward and can be accomplished using the Python package manager, pip3
.
Here are the steps to install pdfannots
:
pip3
installed. If not, you can install it via the package manager specific to your operating system.pdfannots
:
pip3 install pdfannots
In case you encounter the error message “pip3 error: externally-managed-environment”, there is an alternative approach using pipx
for installation. Follow these steps instead:
pipx
using pip3
:
pip3 install pipx
pdfannots
using pipx
:
pipx install pdfannots
After successful installation, you can begin using pdfannots
right from your terminal or command prompt. To explore the available options and functionalities, simply type the following command:
pdfannots --help
This will provide you with a list of commands and usage examples, helping you make the most out of the tool.
To perform a straightforward text extraction, use the following command structure:
pdfannots input.pdf -o output.md
Replace input.pdf
with the path to the PDF file you wish to extract annotations from, and output.md
with the desired name for the output Markdown file that will contain the extracted text.
By leveraging pdfannots
, you can now effortlessly extract the annotated text from your PDFs, making it accessible for further analysis, sharing, or personal reference. Its command-line interface and Python-based design offer both convenience and flexibility, ensuring a smooth and hassle-free experience with PDF annotations extraction.