How we ended up generating beautiful PDFs from an HTML page

The challenge

Recently we worked on a project that required generating reports in PDF format. Usually this would be nothing extraordinary, but in this case the report contained complicated charts. On top of that, we were supposed to display the same report as a normal web page.

We considered two approaches: converting the HTML page to PDF, or generating the PDF “by hand”. Client-side generation was not an option because we also needed to send the report via e-mail. We also could not rely on the end user’s browser or introduce additional steps, because this would seriously hinder the UX flow.

The first approach seemed best, as it guarantees a consistent look across both media. The second seemed very cumbersome: we would need to generate the charts as images, sacrificing interactivity and probably also the looks. This was a no-go.

The journey

The first thing we tried was the wkhtmltopdf converter. Underneath, it uses the Qt framework, specifically the WebKit engine bundled with it. Unfortunately, that WebKit version was quite old and we could not easily get consistent renders of JS charts. With a lot of tweaking it would have been possible to get a decent result, but we wanted the best possible solution.

Another option was the excellent PhantomJS headless browser. It is quite heavy on memory, though, and its author recently announced that he is stepping down as maintainer. The reason he is stepping down is that there is a new player on the horizon: the Chrome browser with a headless mode.

Headless beast

Several Google searches later, the beta version (v59) of the Chrome browser with headless support was installed on our development CentOS VM. This was possible thanks to Google providing excellent rpm and deb packages.

It turned out that the new version has a handy --print-to-pdf CLI option. Combined with --headless and --disable-gpu, we were ready to go. The output looked promising, but the canvas-based chartjs library that we used produced pixelated output. No matter what we tried, we could not get it to render crisp charts. The --force-device-scale-factor option did nothing.
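As a rough illustration of the command line described above, here is a minimal Node.js sketch that assembles those flags and runs Chrome once. The binary name `google-chrome-beta`, the helper names `buildPrintArgs` and `printWithCli`, the scale factor of 2 and the example URL and output path are all assumptions made for this example, not details of our setup.

```javascript
const { execFile } = require('child_process');

// Hypothetical binary name; the real one depends on the installed package.
const CHROME_BINARY = 'google-chrome-beta';

// Build the argument list for a one-shot headless print.
function buildPrintArgs(url, outputPath, scaleFactor = 2) {
  return [
    '--headless',
    '--disable-gpu',
    `--force-device-scale-factor=${scaleFactor}`,
    `--print-to-pdf=${outputPath}`,
    url,
  ];
}

// Run Chrome and resolve with the output path once the process exits.
function printWithCli(url, outputPath) {
  return new Promise((resolve, reject) => {
    execFile(CHROME_BINARY, buildPrintArgs(url, outputPath), (error) => {
      if (error) reject(error);
      else resolve(outputPath);
    });
  });
}

// Usage (not executed here):
// printWithCli('http://localhost:8080/report', '/tmp/report.pdf').then(console.log);
```

Passing the arguments as an array to `execFile` avoids shell quoting problems when the URL contains query strings.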

Fortunately, Google also provides an unstable version (v60), which obeyed the pixel ratio setting. Unfortunately, it prints the default header and footer, which would look plainly unprofessional on the report. As we discovered in the source, this setting is hardcoded for now.

Another issue is that the --print-to-pdf CLI option sometimes doesn’t wait long enough for the page to fully render. This depends on the page and other factors, but it raises questions about reliability.

Morale went down, but we kept searching and eventually caught a lucky break. The new version of Chrome had some additions to the remote debugging API, print-to-PDF functionality included.

We created a Node.js script that launches the browser and lets the page’s JS run for 5 seconds of virtual time; only then does it render the PDF. Because the browser does not have to show anything on the screen, JS execution usually takes significantly less real time.
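The idea can be sketched roughly as follows. This is not our actual script: it assumes the third-party `chrome-remote-interface` npm package as the protocol client, and it assumes that Chrome is already running with `--headless --remote-debugging-port=9222` rather than being launched by the script. The helper names, the port and the chosen options are hypothetical, and `Emulation.setVirtualTimePolicy` is an experimental protocol method whose parameters may differ between Chrome versions.

```javascript
const fs = require('fs');

// Options for Page.printToPDF; switching off the header and footer is the point here.
const PDF_OPTIONS = { displayHeaderFooter: false, printBackground: true };

// Page.printToPDF returns the document as a base64 string in `data`.
function pdfBufferFromResult(result) {
  return Buffer.from(result.data, 'base64');
}

async function renderPdf(url, outputPath, budgetMs = 5000, port = 9222) {
  // Loaded lazily so the helpers above work without the package installed.
  const CDP = require('chrome-remote-interface');
  const client = await CDP({ port });
  try {
    const { Page, Emulation } = client;
    await Page.enable();

    // Register the load listener before navigating so the event is not missed.
    const loaded = Page.loadEventFired();
    await Page.navigate({ url });
    await loaded;

    // Grant the page a budget of virtual time and wait until it is used up.
    const expired = Emulation.virtualTimeBudgetExpired();
    await Emulation.setVirtualTimePolicy({
      policy: 'pauseIfNetworkFetchesPending',
      budget: budgetMs,
    });
    await expired;

    const result = await Page.printToPDF(PDF_OPTIONS);
    fs.writeFileSync(outputPath, pdfBufferFromResult(result));
    return outputPath;
  } finally {
    await client.close();
  }
}

// Usage (not executed here):
// renderPdf('http://localhost:8080/report', '/tmp/report.pdf').then(console.log);
```

Waiting for the virtual time budget to expire, instead of sleeping for a fixed wall-clock interval, is what lets timers and chart animations finish without the script idling for the full 5 seconds.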

Conclusion

Apart from the JS wrapper script / library, we have produced a dockerized solution and an Ansible role to easily use it on CentOS.

In the end, we were not satisfied with the size of the PDF due to the rasterized charts and we switched to d3.js to get vector output from SVG. The produced document looks even more phenomenal this way.

The screen media styles and page-break CSS properties allow creating virtually any layout in the PDF document.

I suppose that once headless becomes stable it will devour all other players in this market (looking at you PrinceXML ;))

Author: Filip Sobalski

pluswerk