Creating a single PDF from a series of HTML files in Linux

Today I found a great tutorial about digital photography. Unfortunately, it was split into chapters, and each chapter was on a separate HTML page. I wanted to print the tutorial, but it was not available as a single printable file. I decided to download the whole set of HTML pages with wget. Then I wrote the following Perl script:

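#!/usr/bin/perl
use strict;
use warnings;

# Collect the HTML files in the current directory.
opendir(my $dir, ".") or die "Cannot open current directory: $!";
my @files = grep { /\.html$/ } readdir($dir);
closedir($dir);

# Convert each HTML file to a PDF named after the part before the first dot.
foreach my $file (@files) {
    my @parts = split(/\./, $file);

    my $command = "htmldoc --webpage -f $parts[0].pdf $file";
    print "$command\n";
    system($command);
}

# Collect the generated PDFs.
opendir($dir, ".") or die "Cannot open current directory: $!";
@files = grep { /\.pdf$/ } readdir($dir);
closedir($dir);

# Join them in alphabetical order.
@files = sort(@files);

my $command = "pdfjoin ";

foreach my $file (@files) {
    $command = $command . $file . " ";
}

print "$command\n";
system($command);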

It lists the HTML files in the current folder and converts each one to a PDF file by calling htmldoc. Then it lists the PDFs, sorts them alphabetically by filename and joins them by calling pdfjoin. The output filename is generated from the last PDF filename. Look for “joined”.
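The naming and ordering steps can be pulled out into two small helpers. This is only a sketch, not part of the original script: pdf_name_for and pdfs_to_join are hypothetical names, and the filter assumes the joined output ends in "-joined.pdf". Replacing only the trailing ".html" keeps names such as chapter.1.html and chapter.2.html from colliding, and skipping earlier joined output lets the script be run more than once in the same folder.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: map an HTML filename to its PDF name by
# replacing only the trailing ".html" extension.
sub pdf_name_for {
    my ($html) = @_;
    (my $pdf = $html) =~ s/\.html$/.pdf/;
    return $pdf;
}

# Hypothetical helper: pick the PDFs to join in alphabetical order,
# skipping output of an earlier run (assumed to end in "-joined.pdf").
sub pdfs_to_join {
    my @names  = @_;
    my @picked = sort grep { /\.pdf$/ && !/-joined\.pdf$/ } @names;
    return @picked;
}

print pdf_name_for("chapter.1.html"), "\n";    # prints "chapter.1.pdf"
print join(" ", pdfs_to_join("b.pdf", "a.pdf", "b-joined.pdf", "notes.txt")), "\n";    # prints "a.pdf b.pdf"
```

In the script, the first helper would replace the split-based name and the second would replace the plain grep and sort over the PDF list.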

You will need perl, htmldoc and pdfjoin installed on your machine to run the script. I got pdfjoin by installing pdfjam.
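To check for the two external tools before running the conversion, something like the following could be used. It is a minimal sketch: find_in_path is a hypothetical helper, and it only looks for an executable file with the given name in the directories listed in PATH.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return the full path of a program found on
# PATH, or undef when no executable file with that name exists there.
sub find_in_path {
    my ($program) = @_;
    foreach my $dir (File::Spec->path()) {
        my $candidate = File::Spec->catfile($dir, $program);
        return $candidate if -f $candidate && -x $candidate;
    }
    return;
}

# Report where each required tool lives, or that it is missing.
foreach my $tool ("htmldoc", "pdfjoin") {
    my $path = find_in_path($tool);
    print defined $path ? "$tool: $path\n" : "$tool: not found on PATH\n";
}
```

If either tool is reported as missing, install it first, since the main script does not stop when a command fails.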

Posted in Linux tips, Ubuntu.
