Introduction

XML Worker is an add-on for iText®. It allows developers to convert XML files to PDF documents in a programmer-friendly way. As a proof of concept, we're shipping XML Worker with simple XHTML to PDF functionality, taking into account the styles stored in a CSS2 file. This functionality has been tested on straightforward HTML created with WYSIWYG editors such as TinyMCE and CKEditor.

Simple XHTML/CSS2 to PDF conversion was one of the "most wanted" features that emerged from a user survey, but there's more: with XML Worker, it's possible to parse all kinds of XML, provided you write a specific implementation for that type of XML.

Default configuration for HTML

XML Worker uses HTML TagProcessors in the HtmlPipeline to convert HTML to PDF. The default configuration uses the following settings:

Let's take a look at a first example.

Example using the default setup

Parsing HTML using the default setup is as easy as creating a PDF in five steps:

  1. Create a Document object.
  2. Get a PdfWriter instance.
  3. Open the Document.
  4. Invoke XMLWorkerHelper.getInstance().parseXHtml().
  5. Close the Document.

Let's take a look at a code snippet that converts the loremipsum.html file to PDF. In this snippet we use the XMLWorkerHelper class and its parseXHtml() method to do all the work:


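    // Step 1: create a Document object
    Document document = new Document();
    // Step 2: get a PdfWriter instance
    PdfWriter writer = PdfWriter.getInstance(document,
        new FileOutputStream("results/loremipsum.pdf"));
    // Step 3: open the Document
    document.open();
    // Step 4: parse the XHTML and add the result to the Document
    XMLWorkerHelper.getInstance().parseXHtml(writer, document,
        new FileInputStream("/html/loremipsum.html"));
    // Step 5: close the Document
    document.close();

See HTMLParsingDefault and the resulting PDF, loremipsum.pdf.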

Parsing HTML into a list of Element objects

When iText parses an XML file, it interprets the different tags and, whenever possible, creates a corresponding Element object. Suppose you're not interested in creating a PDF, but you just want to parse the HTML into a list of iText Element objects. In that case, you can pass an ElementHandler to the parseXHtml() method:


    XMLWorkerHelper.getInstance().parseXHtml(new ElementHandler() {
        public void add(final Writable w) {
            // Only WritableElement instances carry iText Element objects.
            if (w instanceof WritableElement) {
                List<Element> elements = ((WritableElement) w).elements();
                // write class names of elements to file
            }
        }
    }, new FileInputStream("/html/walden.html"));

See HTMLParsingToList.

The first object is a Title header. It will result in a bookmark. The first real Element is a Chunk. This is the chunk of text between the <pre> tags in the HTML file:

    <pre>

    The Project Gutenberg EBook of Walden, and On The Duty Of Civil
    Disobedience, by Henry David Thoreau

    This eBook is for the use of anyone anywhere at no cost and with
    almost no restrictions whatsoever. ...
    </pre>

This snippet is converted to a Chunk. A Chunk is an Element that doesn't have a leading of its own, which explains the illegible result: all the lines between the <pre> tags are written on top of each other until the first <p> tag is encountered, which results in a Paragraph object.

Fixing the leading problem

When no content has been added to a Document, the initial leading is 0. As soon as a Paragraph is added to the Document, the default leading changes to the leading of that Paragraph. In our first example, we started by adding a Chunk object, and all of its content was written on the first line because the initial leading is 0. We can avoid this problem by setting an initial leading. In the next example, we set the initial leading to 12.5 points.


    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document,
        new FileOutputStream("results/xmlworker/loremipsum.pdf"));
    // Set the initial leading before the Document is opened.
    writer.setInitialLeading(12.5f);
    document.open();
    XMLWorkerHelper.getInstance().parseXHtml(writer, document,
        new FileInputStream("/html/loremipsum.html"));
    document.close();

See HTMLParsingDefault2.

The content inside the <pre> tags in the HTML file is now legible in the PDF.
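
To see why a leading of 0 makes every line land on the same baseline, consider the following minimal sketch. It isn't iText code: the class LeadingSketch and its baselines() method are hypothetical, and the sketch assumes that each new line is simply placed one leading below the previous one. The starting value 806 is an arbitrary illustration.

```java
import java.util.Arrays;

// Hypothetical model, not part of iText: computes the y-coordinate of each baseline.
class LeadingSketch {

    // Each new line starts 'leading' points below the previous one.
    static float[] baselines(float top, float leading, int lines) {
        float[] y = new float[lines];
        for (int i = 0; i < lines; i++) {
            y[i] = top - (i + 1) * leading;
        }
        return y;
    }

    public static void main(String[] args) {
        // Leading 0: all three lines share one baseline and overlap.
        System.out.println(Arrays.toString(baselines(806f, 0f, 3)));
        // Leading 12.5: consecutive baselines are 12.5 points apart.
        System.out.println(Arrays.toString(baselines(806f, 12.5f, 3)));
    }
}
```

With a leading of 0, all baselines coincide, which matches the overlapping text we saw before setting the initial leading.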

Understanding XML Worker

In the previous examples, we've used the XMLWorkerHelper class and its parseXHtml() method to parse the HTML to PDF or to a list of Element objects. Now suppose that this class and method didn't exist. What would we have to do?

First, you need an instance of the XMLWorker class. With this instance, you can create an XMLParser. Finally, use its parse() method to parse an HTML file.

Let's take a look at an example, and examine every step.

Parsing HTML step by step

The following snippet shows what happens in step 4 of the PDF creation process in more detail.


    // Configure how HTML tags are turned into iText Element objects.
    HtmlPipelineContext htmlContext = new HtmlPipelineContext(null);
    htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());
    // Resolve CSS, including the default.css shipped with XML Worker.
    CSSResolver cssResolver =
        XMLWorkerHelper.getInstance().getDefaultCssResolver(true);
    // Chain the pipelines: CSS resolution, HTML processing, PDF output.
    Pipeline<?> pipeline =
        new CssResolverPipeline(cssResolver,
            new HtmlPipeline(htmlContext,
                new PdfWriterPipeline(document, writer)));
    // true: treat the XML as HTML.
    XMLWorker worker = new XMLWorker(pipeline, true);
    XMLParser p = new XMLParser(worker);
    p.parse(new FileInputStream("/html/loremipsum.html"));

See HTMLParsingProcess.

Let's do a bottom-up examination of this snippet.

HTML input

As you can see, we parse the HTML as an InputStream. We could also have used a Reader object to read the HTML file.

XMLParser

The XMLParser class expects an implementation of the XMLParserListener interface. XMLWorker is such an implementation. Another implementation (ParserListenerWriter) was written for debugging purposes.

XMLWorker

The XMLWorker constructor expects two parameters: a Pipeline<?> and a boolean indicating whether or not the XML should be treated as HTML. If true, all tags will be converted to lowercase and whitespace used to indent the HTML syntax will be ignored. Internally, XMLWorker creates Tag objects that are processed using implementations of the TagProcessor interface (for instance com.itextpdf.tool.xml.html.Anchor is the tag processor for the <a>-tag).

Pipeline<?>

In this case, we're parsing XHTML and CSS to PDF; we define the Pipeline<?> as a chain of three Pipeline implementations:

  1. a CssResolverPipeline,
  2. an HtmlPipeline, and
  3. a PdfWriterPipeline.

You create the first pipeline passing the second one as a parameter; the second pipeline is instantiated passing the third as a parameter; and so on.

    Pipeline<?> pipeline =
        new CssResolverPipeline(cssResolver,
            new HtmlPipeline(htmlContext,
                new PdfWriterPipeline(document, writer)));


The PdfWriterPipeline marks the end of the pipeline: it creates the PDF document.
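
The chaining idea can be shown without iText. In the following sketch, the Stage class and the stage names are hypothetical; they only mimic how each pipeline receives its successor through the constructor and hands the event to the next in line. It is not the XML Worker Pipeline interface.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a pipeline; not the XML Worker Pipeline interface.
class Stage {
    private final String name;
    private final Stage next; // null marks the end of the chain

    Stage(String name, Stage next) {
        this.name = name;
        this.next = next;
    }

    // Records this stage, then forwards the event to the next stage, if any.
    void process(String tag, List<String> trace) {
        trace.add(name + ":" + tag);
        if (next != null) {
            next.process(tag, trace);
        }
    }
}

class PipelineChainSketch {

    static List<String> run(String tag) {
        // Built inside out, just like the nested constructors in the snippet above.
        Stage chain = new Stage("css", new Stage("html", new Stage("pdf", null)));
        List<String> trace = new ArrayList<>();
        chain.process(tag, trace);
        return trace;
    }

    public static void main(String[] args) {
        System.out.println(run("p"));
    }
}
```

The last stage receives no successor, which is the role the PdfWriterPipeline plays in the real chain.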

CssResolverPipeline

The style of your HTML document is probably defined using Cascading Style Sheets (CSS). The CssResolverPipeline is responsible for adding the correct CSS properties to each Tag that is created by XMLWorker. Without a CssResolverPipeline, the document would be parsed without style. The CssResolverPipeline constructor needs a CSSResolver instance. The getDefaultCssResolver() method in the XMLWorkerHelper class provides a default CSSResolver:

    CSSResolver cssResolver = XMLWorkerHelper.getInstance().getDefaultCssResolver(true);


The boolean parameter indicates whether or not the default.css (shipped with XML Worker) should be added to the resolver.

HtmlPipeline

Next in line is the HtmlPipeline. Its constructor expects an HtmlPipelineContext.

    HtmlPipelineContext htmlContext = new HtmlPipelineContext(null);
    htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());


Using the setTagFactory() method of the HtmlPipelineContext, you can configure how the HtmlPipeline should interpret the tags encountered by the parser. We've created a default implementation of the TagProcessorFactory interface for parsing HTML. It can be obtained using the getHtmlTagProcessorFactory() method in the Tags class.

If you want to parse other types of XML, you'll need to write your own Pipeline implementations, for instance an SvgPipeline.

PdfWriterPipeline

This is the end of the pipeline. The PdfWriterPipeline constructor expects the Document and the PdfWriter instance you've created in steps 1 and 2 of the PDF creation process.

In some cases, using the default configuration won't be sufficient, and you'll need to configure XML Worker yourself. This is the case if you want to parse HTML with images and links.

Changing the default configuration

Let's find out how to handle fonts, images, and links by looking at the following example:


    // Register the fonts found in the usual system font directories.
    FontFactory.registerDirectories();
    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document,
        new FileOutputStream("results/xmlworker/thoreau1.pdf"));
    document.open();
    HtmlPipelineContext htmlContext = new HtmlPipelineContext(null);
    htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());
    // Resolve relative image paths against the directory of the HTML file.
    htmlContext.setImageProvider(new AbstractImageProvider() {
        public String getImageRootPath() {
            return "src/main/resources/html/";
        }
    });
    // Turn relative links into absolute links.
    htmlContext.setLinkProvider(new LinkProvider() {
        public String getLinkRoot() {
            return "http://tutorial.itextpdf.com/src/main/resources/html/";
        }
    });
    CSSResolver cssResolver =
        XMLWorkerHelper.getInstance().getDefaultCssResolver(true);
    Pipeline<?> pipeline =
        new CssResolverPipeline(cssResolver,
            new HtmlPipeline(htmlContext,
                new PdfWriterPipeline(document, writer)));
    XMLWorker worker = new XMLWorker(pipeline, true);
    XMLParser p = new XMLParser(worker);
    p.parse(new FileInputStream("/html/thoreau.html"));
    document.close();

See HTMLParsingImagesLinks1.java

As you can see, we don't need a special XML Worker configuration to fix the font problem.

Registering fonts

In the style attribute of the body tag of thoreau.html, we've told the HTML renderer that we prefer the font Nimbus Roman No9 L (a font you can usually find on Linux distributions). If that font can't be found, we want the font to be Times New Roman (a font that is usually distributed with Windows). If that font isn't found, the default font can be used (which is what happened in thoreau0.pdf).

XML Worker uses the FontFactory class to retrieve fonts. Initially, this class is only aware of the standard Type 1 fonts. If we want iText to find Nimbus Roman No9 L or Times New Roman, we need to register these fonts (see section 11.4.1 of iText in Action — Second Edition). In this example, we just register the "usual suspects":

    FontFactory.registerDirectories();

On Windows, this method will find the font files times.ttf, timesbd.ttf, timesbi.ttf, and timesi.ttf. iText will use these fonts to render all the text in the HTML. On Linux, iText will find the Type 1 font stored in n021004l.afm/n021004l.pfb and use it whenever a regular font is needed. Unfortunately, it will be more difficult to find the corresponding bold, italic, and bold italic fonts. Choose your font wisely if you want to avoid this problem.
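
The fallback order described above can be modeled in a few lines. This is only a sketch under assumptions: FontFallbackSketch and pick() are hypothetical names, not iText classes, the font-family string is an invented example rather than the exact content of thoreau.html, and "Helvetica" merely stands in for the default font.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not an iText class: picks the first registered font of a font-family list.
class FontFallbackSketch {

    static String pick(String fontFamily, Set<String> registered, String fallback) {
        for (String candidate : fontFamily.split(",")) {
            // Strip whitespace and optional quotes around the family name.
            String name = candidate.trim().replaceAll("^['\"]|['\"]$", "");
            if (registered.contains(name.toLowerCase(Locale.ROOT))) {
                return name;
            }
        }
        // None of the preferred fonts is registered.
        return fallback;
    }

    public static void main(String[] args) {
        // Assumed set of registered fonts on a Windows machine.
        Set<String> windows = new HashSet<>(Arrays.asList("times new roman", "arial"));
        System.out.println(pick("'Nimbus Roman No9 L', 'Times New Roman'", windows, "Helvetica"));
    }
}
```

If neither preferred font is registered, the sketch returns the fallback, which corresponds to what happened in thoreau0.pdf.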

Adding an ImageProvider

If the HTML file you're parsing is stored in a directory that is different from the working directory, iText won't be able to create Image objects. We have to supply an implementation of the ImageProvider interface that tells iText what to do if an img tag is encountered. This interface has four methods: retrieve(), store(), reset(), and getImageRootPath().

You can write your own class implementing these four methods, or you can subclass AbstractImageProvider. The latter is preferred. XML Worker will use the store() method of the AbstractImageProvider class to cache all the Image objects that are encountered in a Map. These objects will be reused when the retrieve() method is called for an image with the same src. If you don't cache images, your PDF will be bloated, because the same image bits and bytes will be written to the PDF more than once. The reset() method clears the cache; it is used when an ImageProvider is cloned. Finally, the getImageRootPath() method isn't implemented in AbstractImageProvider. You have to implement it yourself, as is done in the following snippet:

    htmlContext.setImageProvider(new AbstractImageProvider() {
        public String getImageRootPath() {
            return "src/main/resources/html/";
        }
    });


The relative path from our workdir to our loremipsum.htm file is "src/main/resources/html/". By using this ImageProvider in the HtmlPipelineContext, relative paths in the src attribute of an img tag will be adapted. iText will add src/main/resources/html/ to the src attribute of the tag (e.g. img/Henry_David_Thoreau_1861.jpg), resulting in the path src/main/resources/html/img/Henry_David_Thoreau_1861.jpg. This path is valid relative to the working directory.
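
The two ideas behind the image provider, prefixing the src with a root path and caching each image by its src, can be sketched in plain Java. The class ImageCacheSketch below is hypothetical and is not iText's AbstractImageProvider; a byte array stands in for the Image object, and no file is actually read.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of an image provider; it is not iText's AbstractImageProvider.
class ImageCacheSketch {
    private final String rootPath;
    private final Map<String, byte[]> cache = new HashMap<>();
    private int loads = 0;

    ImageCacheSketch(String rootPath) {
        this.rootPath = rootPath;
    }

    // Prefixes a relative src with the root path.
    String resolve(String src) {
        return rootPath + src;
    }

    // Returns the cached bytes for src, creating them only on first use.
    byte[] retrieve(String src) {
        return cache.computeIfAbsent(src, s -> {
            loads++;
            // Placeholder for reading the image file at the resolved path.
            return resolve(s).getBytes(StandardCharsets.UTF_8);
        });
    }

    // Number of times an image had to be created instead of reused.
    int loads() {
        return loads;
    }

    // Clears the cache.
    void reset() {
        cache.clear();
    }
}
```

Retrieving the same src twice returns the same object, so the image data would only need to be written once.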

Adding a LinkProvider

It makes perfect sense to create a PDF with http links that open a URL in a browser window. It's more delicate to add a relative link to a PDF document. If the document is downloaded and consulted offline, the relative link requires that the document that is referred to is present on your system at the correct location. Looking at loremipsum.htm, we see that there's a link to the test.html file that is in the same folder as the loremipsum.htm file. However, the PDF we're creating is written to a different directory. If we want the link to work, we need to change the base URL. This can be done by implementing LinkProvider, an interface with a single method, getLinkRoot(), and adding it to the HtmlPipelineContext using the setLinkProvider() method.

    htmlContext.setLinkProvider(new LinkProvider() {
        public String getLinkRoot() {
            return "http://tutorial.itextpdf.com/src/main/resources/html/";
        }
    });


In this case, the relative link <a href="test.html">Walden</a> is changed into an absolute link to http://tutorial.itextpdf.com/src/main/resources/html/test.html.
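
The effect of a link root on an href can be reproduced with the standard java.net.URI class. This is only an approximation of the behavior described above: LinkRootSketch and absolute() are hypothetical names, and the sketch makes no claim about how XML Worker combines the link root and the href internally.

```java
import java.net.URI;

// Hypothetical helper, not iText's LinkProvider: shows the effect of a link root on an href.
class LinkRootSketch {

    static String absolute(String linkRoot, String href) {
        // URI.resolve() leaves absolute hrefs untouched and resolves relative ones against the root.
        return URI.create(linkRoot).resolve(href).toString();
    }

    public static void main(String[] args) {
        String root = "http://tutorial.itextpdf.com/src/main/resources/html/";
        System.out.println(absolute(root, "test.html"));
    }
}
```

Note that the trailing slash in the link root matters for this kind of resolution: without it, the last path segment would be replaced instead of extended.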

Congratulations! You can now parse HTML. Now let's take a look at ways to extend XML Worker so that you can also parse other XML documents.

Extending the XMLWorker

Depending on the nature of your XML file, you can either write your own Pipeline implementations, or you can extend the HtmlPipeline by adding your own TagProcessor classes.

Let's start by extending the HtmlPipeline.

How to extend the HtmlPipeline class

We've already configured an HtmlPipeline by changing the HtmlPipelineContext. We've defined an ImageProvider and a LinkProvider and applied them using the setImageProvider() and setLinkProvider() methods, but there's more.

Each time a new XMLWorker/XMLParser is started with the same HtmlPipeline, the context is cloned using some defaults. You can change these defaults with the following methods:

In previous examples, we've also used the setTagFactory() method. We can completely change the way HtmlPipeline interprets tags by creating a custom TagProcessorFactory.

XMLWorker creates Tag objects that contain attributes, styles, and a hierarchy (one parent, zero or more children). HtmlPipeline transforms these Tags into com.itextpdf.text.Element objects with the help of TagProcessors. You can find a series of precanned TagProcessor implementations in the com.itextpdf.tool.xml.html package.

The default TagProcessorFactory can be obtained from the Tags class, using the getHtmlTagProcessorFactory() method. Not all tags are enabled by default. Some tags are linked to the DummyTagProcessor (a processor that doesn't do anything), other tags result in a TagProcessor with a very specific implementation. You can extend the HtmlPipeline by adding your own TagProcessor implementations to the TagProcessorFactory with the addProcessor() method. This will either replace the default functionality of already supported tags, or add functionality for new tags.

Suppose that you have HTML code in which you've used a custom tag that should trigger a call to a database, for example a <userdata> tag. XMLWorker will detect this tag and pass it to the HtmlPipeline. As a result, HtmlPipeline looks for the appropriate TagProcessor in its HtmlPipelineContext. You can implement the TagProcessor interface or extend the AbstractTagProcessor class in such a way that it performs a database query, adding its ResultSet to the Document in the form of a (list of) Element object(s). You should prefer extending AbstractTagProcessor, as this class comes with precanned page-break-before, page-break-after, and fontsize handling.
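
The lookup mechanism can be illustrated with a small registry. The sketch below is plain Java with hypothetical names (TagRegistrySketch, process()); it doesn't use the TagProcessor or TagProcessorFactory interfaces, a function replaces the processor class, and a string replaces the database result.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical registry; it mimics the idea of a tag processor factory without using iText.
class TagRegistrySketch {
    // Dummy processor: produces nothing, like a tag that isn't enabled.
    private static final Function<String, List<String>> DUMMY = c -> Collections.emptyList();

    // Maps a tag name to a function that turns the tag's content into output elements.
    private final Map<String, Function<String, List<String>>> processors = new HashMap<>();

    // Adds a processor for a new tag or replaces the one registered for an existing tag.
    void addProcessor(String tag, Function<String, List<String>> processor) {
        processors.put(tag, processor);
    }

    // Unknown tags fall back to the dummy processor.
    List<String> process(String tag, String content) {
        return processors.getOrDefault(tag, DUMMY).apply(content);
    }

    public static void main(String[] args) {
        TagRegistrySketch registry = new TagRegistrySketch();
        // Stand-in for a database lookup triggered by a <userdata> tag.
        registry.addProcessor("userdata", id -> Collections.singletonList("data for user " + id));
        System.out.println(registry.process("userdata", "42"));
        System.out.println(registry.process("unknown", "ignored"));
    }
}
```

Registering a processor under a name that already exists replaces the previous one, which is the same replace-or-add behavior described for addProcessor().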

Note that your TagProcessor can use CSS if you introduce a CssResolverPipeline before each pipeline that wants to apply styles. The CssResolverPipeline is responsible for setting the right CSS properties on each tag. This pipeline requires a CSSResolver that contains your CSS file. Let's take a look at the StyleAttrCSSResolver that is shipped with XML Worker.

The StyleAttrCSSResolver explained

The major function of a CSSResolver is detecting the right CSS for a given tag. The StyleAttrCSSResolver uses the following criteria:

  1. All inheritable CSS from the parent tag is added to the current tag.
  2. All CssFile objects are checked for rules that apply to the given tag, in the order they were added to the CSSResolver. Rules defined on the same property are overridden.
  3. Finally, any CSS found in the tag's style attribute is added to the tag's CSS.
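
The three criteria above can be sketched as a simple merge of property maps. This is a hypothetical model, not the resolver's source code: the class CssOrderSketch, the set of inheritable properties, and the reading that a later CSS file overrides an earlier one for the same property are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the three resolution steps; not the resolver shipped with XML Worker.
class CssOrderSketch {
    // Assumed set of inheritable properties, for illustration only.
    static final Set<String> INHERITABLE = new HashSet<>(Arrays.asList("color", "font-family"));

    static Map<String, String> resolve(Map<String, String> parentCss,
                                       List<Map<String, String>> cssFileRules,
                                       Map<String, String> styleAttribute) {
        Map<String, String> css = new LinkedHashMap<>();
        // 1. Inheritable properties of the parent come first.
        for (Map.Entry<String, String> e : parentCss.entrySet()) {
            if (INHERITABLE.contains(e.getKey())) {
                css.put(e.getKey(), e.getValue());
            }
        }
        // 2. Matching rules from each CSS file, in the order the files were added.
        for (Map<String, String> rules : cssFileRules) {
            css.putAll(rules);
        }
        // 3. The style attribute is applied last.
        css.putAll(styleAttribute);
        return css;
    }
}
```

Because each step writes into the same map, a property set in a later step replaces the value from an earlier one.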

Note that CssFiles can be added to a CSSResolver at any time. When you add a CssFile to the StyleAttrCSSResolver, it's used by the resolving process immediately. There's no method to remove a CssFile in the CSSResolver interface, but that doesn't mean you can't add such a method in your custom implementation.

You can provide inheritance rules for the StyleAttrCSSResolver class with the setCssInheritanceRules() method. By default, the DefaultCssInheritanceRules are used, but you can always write your own implementation of the CssInheritanceRules interface, for instance if you don't want certain CSS properties to be inherited from a tag.

All of this is very interesting if your XML is very similar to HTML and if your styles are defined using CSS. But if your XML is completely different, and if you need to produce content that doesn't fit in iText's Element objects, you'll need to write your own Pipeline.

Write your own Pipeline

Pipelines were introduced to separate Tag processing from content received from the XMLWorker. Different pipelines will result in different actions. Creating PDF is only one of the many possible actions.

If you need functionality that goes beyond HTML to PDF rendering, you need to implement the Pipeline interface.


    public interface Pipeline<T extends CustomContext> {
        Pipeline init(final WorkerContext context) throws PipelineException;
        Pipeline open(WorkerContext context, Tag t, ProcessObject po) throws PipelineException;
        Pipeline content(WorkerContext context, Tag t, byte[] content, ProcessObject po) throws PipelineException;
        Pipeline close(WorkerContext context, Tag t, ProcessObject po) throws PipelineException;
        Pipeline getNext();
    }


For your convenience, the AbstractPipeline class already implements all these methods. It's always a good idea to write a subclass. This allows you to inherit all the default behavior, so that you only have to implement the open(), content(), and close() methods. These methods are called when XMLWorker detects that a tag is opened, when it detects content inside a tag, and when it detects that a tag is closed.

XMLWorker passes a Tag object (containing the name of the tag, attributes, styles, its parent, and its children) as well as a ProcessObject to these methods. In the case of the content() method, you also get a byte array with whatever was found in between tags. The lifecycle of a ProcessObject starts in the first pipeline that is encountered and ends in the final pipeline. It contains a list of Writable objects. Writable is used as a marker interface, allowing you to pass virtually anything from one pipeline to another. For instance, the PdfWriterPipeline expects the ProcessObject to contain lists of WritableElements. These contain lists of Element objects that can be added to a document. In the HTML to PDF implementation, the HtmlPipeline adds Element objects to a WritableElement and puts them in the ProcessObject that is passed on to the PdfWriterPipeline.

The WorkerContext lives as long as the parsing is going on. The context can be filled with CustomContext implementations used by pipelines. This way, a pipeline can access the CustomContext of another pipeline. In the existing pipelines, this is done in the init() method, which is called by the XMLParser's various parse methods.
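
The open/content/close sequence can be tried out in isolation. The following sketch is hypothetical and doesn't extend AbstractPipeline: TextCollectorSketch only mirrors the three callbacks, a list of strings plays the role of the ProcessObject, and the byte array is assumed to hold UTF-8 text.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical event handler; it mirrors the open/content/close callbacks without using iText.
class TextCollectorSketch {
    // Plays the role of the ProcessObject: collects what is passed on to the next stage.
    private final List<String> writables = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();

    // Called when a tag is opened: start with an empty buffer.
    void open(String tag) {
        current.setLength(0);
    }

    // Called for content found inside a tag: buffer the decoded text.
    void content(String tag, byte[] content) {
        current.append(new String(content, StandardCharsets.UTF_8));
    }

    // Called when a tag is closed: emit the buffered text, if any.
    void close(String tag) {
        if (current.length() > 0) {
            writables.add(tag + "=" + current);
            current.setLength(0);
        }
    }

    List<String> writables() {
        return writables;
    }
}
```

A real pipeline would put Writable objects in the ProcessObject instead of strings, but the order of the callbacks is the same.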

Please consult the source code of the existing pipelines for inspiration when writing your own Pipeline implementation.

References

Before you start coding, please consult the requirements to find out if you're using the correct version of iText and XML Worker. Not all HTML tags and CSS styles are supported. See the tables "HTML Tag Support" and "CSS Support" for more information. Note that we've also added support for some styles that are specific to XML Worker.

Requirements

XMLWorker Specific CSS

Links

iText®
XMLWorker on sourceforge®
XMLWorker demo
Example code
Css conformance list