WEB PAGE TO EPUB CONVERSION: A Guide to Using Wget and Pandoc

Posted by Praveen on July 20, 2023 · 5 mins read

In today’s digital age, access to information is abundant on the internet. However, sometimes we may find ourselves in situations where we need to access this data offline or simply prefer to read content in a more organized format, like an eBook. Converting web pages to EPUB format can be a convenient solution, and in this blog post, we’ll explore two methods to achieve this: using Wget and Pandoc, and utilizing a Chrome extension called EpubPress.

Method 1: Using Wget and Pandoc

First, create a text file (e.g., list.txt) containing the links of the web pages you wish to download and convert to EPUB format. Make sure the links are listed one per line.

Step 2: Download the Web Pages

With the list of links ready, use the command-line tool Wget to download the entire site data and make it available offline. Use the following command:

wget -E -H -k -p -i list.txt

This command will recursively download the web pages, convert links for offline browsing (-k), and save them as HTML files in the current directory.

Step 3: Convert to EPUB

Navigate to the folder containing the downloaded HTML files and use Pandoc, a versatile document converter, to convert the files to EPUB format. Execute the following command:

pandoc -i filename-*.html -o filename.epub

Replace “filename” with the appropriate name for your EPUB file. This command will merge all the HTML files and convert them into a single EPUB file.

Alternative Step 4: Using an Index File

Instead of providing a list of links in a text file, you can create an index.html file that includes all the links you want to convert. Format the file as follows:

<!doctype html>
<html>
  <head>
    <title>This is the title of the webpage!</title>
  </head>
  <body>
	<a href="chapter-1.html">1</a>
	<a href="chapter-2.html">2</a>
	<a href="chapter-3.html">3</a>
	<a href="chapter-4.html">4</a>
  </body>
</html>

Then, use Calibre’s ebook-convert tool to convert the index.html to EPUB, providing title, authors, and a cover image (if available):

ebook-convert index.html book.epub --title "Title" --authors "Author" --cover cover.png

Note: Ensure you have Calibre installed and added to your system’s PATH to use ebook-convert.

If your links end with numbers in ascending order (e.g., chapter-1.html, chapter-2.html, etc.), you can use LibreOffice Sheets to generate all the links. Simply copy the first link and extend it until the last number in the sequence.

Deleting Unwanted Strings in the Scrapped Text

Sometimes, after converting web pages to EPUB format, you may want to remove specific unwanted strings from the text. To do this, create a file (e.g., delete.txt) containing each string to be removed, one per line. Then use the grep command to filter out those strings:

grep -vf delete.txt filename.md > tmpfile && mv tmpfile newname.md

Method 2: Using the Chrome Extension EpubPress

For a more user-friendly approach to converting web pages to EPUB, you can utilize the EpubPress Chrome extension. Follow these steps:

  1. Install the EpubPress extension from the Chrome Web Store using the link here.

  2. Open each web page you want to convert into a separate browser tab.

  3. Click on the EpubPress extension icon in the Chrome toolbar.

  4. Select the desired pages you want to include in the EPUB.

  5. Click the “Process” button to initiate the conversion.

Limitations of EpubPress

While EpubPress offers a convenient way to convert web pages to EPUB, it comes with some limitations:

  • Each EPUB book can only contain up to 50 articles.
  • For email delivery to work, the EPUB book must be 10 MB or less in size.
  • Images within articles must be 1 MB or smaller; otherwise, they will be removed during conversion.
  • The extension will only download a maximum of 30 images.

In conclusion, converting web pages to EPUB format can be beneficial for offline reading and easier access to information. Using the combination of Wget and Pandoc allows more flexibility and customization, while the EpubPress Chrome extension offers a straightforward approach for quick conversions. Choose the method that suits your needs best and enjoy a seamless reading experience with your newly created EPUBs. Happy reading!