schlitt.info - php, photography and private stuff

Convert from and to OpenDocument

Yesterday the latest stable release of the eZ Components project, number 2009.2, was rolled. For this release I worked on support for OpenDocumentText (ODT) in the Document component. In this article I show you how you can import OpenDocumentText documents and convert them into any of the supported formats of the component, how to export data into ODT and how to apply styles to the generated documents. You will also see how ODT and PDF can be exported with the very same styling information to make them look almost identical.

Before I get going with the technical stuff, I'd like to thank Derick Rethans for his efforts in the eZ Components project and the amazing cooperation in the past ~4 years at eZ Systems. Derick is leaving the company by the end of the year and I want to wish him all the best for the future and especially the upcoming changes. It was a pleasure working with you, mate!

In the first version with ODT support, the Document component only supports the FODT (flat ODT) format for import and export. This variant of OpenDocumentText consists of a single plain XML file, not a ZIP package like normal ODT files. However, such files should be supported by most OpenOffice.org versions and they can even contain images and other media data. I hope I can implement support for real ODT files as a first step for the next release.

To create an FODT with OpenOffice.org, simply choose the file format in the save dialog. If your instance of OpenOffice.org does not have this format, check your distribution for a package like openoffice.org-filter-binfilter (Ubuntu). Supporting versions of OpenOffice.org will also open existing FODT files. If your desktop environment did not register the filetype correctly, just force it to open .fodt files with OpenOffice.org.
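Since a flat ODT is a single XML file while a regular ODT is a ZIP package, the two can usually be told apart by their first bytes. The following is only a minimal sketch of that idea: guessOdtContainer() is a hypothetical helper of mine and not part of eZ Components, and it does not validate the document in any way.

```php
<?php
// Hypothetical helper, not part of eZ Components: guess the container type
// of an ODT file from the beginning of its content.
function guessOdtContainer( $content )
{
    // ZIP archives (regular .odt packages) start with the signature "PK".
    if ( strncmp( $content, "PK", 2 ) === 0 )
    {
        return 'zip';
    }

    // Flat ODT is plain XML: skip an optional UTF-8 BOM and leading
    // whitespace, then expect the start of a tag or XML declaration.
    $trimmed = ltrim( preg_replace( '/^\xEF\xBB\xBF/', '', $content ) );
    if ( strncmp( $trimmed, '<', 1 ) === 0 )
    {
        return 'flat-xml';
    }

    return 'unknown';
}

// Prints "flat-xml"
echo guessOdtContainer( '<?xml version="1.0"?><office:document/>' ), "\n";
?>
```

You would feed it the beginning of a file, e.g. the result of file_get_contents(), to find out whether the file is a flat ODT that this release can import.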

Importing ODT

As of version 1.3, the Document component is capable of importing FODT files. As usual, you can convert the content of such files into the internal format of the component (Docbook XML) and from there you can export to any of the other supported formats. The following example shows importing an ODT and exporting RST:

    <?php
    // Initialize eZ Components …

    $odt = new ezcDocumentOdt();
    $odt->loadFile( 'example.fodt' );

    $docbook = $odt->getAsDocbook();

    $converter = new ezcDocumentDocbookToRstConverter();
    $rst = $converter->convert( $docbook );

    file_put_contents( 'example.txt', $rst );
    ?>

In $odt a new instance of ezcDocumentOdt is created. This object is capable of writing, validating and, as seen in the example, reading FODT files. The method getAsDocbook() performs the actual conversion to Docbook XML and returns an instance of ezcDocumentDocbook.

You could now save the Docbook XML at this stage or, as shown, go on converting it. An ezcDocumentDocbookToRstConverter is used to convert from Docbook to reStructuredText, a format commonly used for documentation in software projects. As you might have guessed, $rst contains an instance of ezcDocumentRst, which is saved in the last line of the snippet. You could have called save() on the RST document to get its text content. This is not necessary here since the magic __toString() method has the very same effect.
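The two alternatives mentioned above, saving the intermediate Docbook XML and calling save() explicitly, could look roughly like this. It is only a sketch that continues the import example ($docbook and $rst come from there), and the file name example.xml is just an assumption for illustration:

```php
<?php
// … continues the import example: $docbook and $rst already exist

// Store the intermediate Docbook XML (file name chosen for illustration).
file_put_contents( 'example.xml', $docbook->save() );

// Explicit variant of the last step; same effect as relying on __toString().
file_put_contents( 'example.txt', $rst->save() );
?>
```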

You can review the input FODT and the output RST online to see the conversion. The import mechanism does not only convert the semantic elements contained in the ODT file, which are quite few, but also performs some heuristic magic to recognize e.g. emphasized text. I hope I find time to work some more on this aspect and expose an API for it in the future.

Generating ODT

The opposite way, generating FODT from a Docbook XML document, can be done as easily as the import. For just a plain conversion, you can use the ezcDocumentOdt class, import an ezcDocumentDocbook and save it. For more tuning opportunities, you should go for an ezcDocumentDocbookToOdtConverter. You'll see why in the next section.

    <?php
    // Initialize eZ Components …

    $xhtml = new ezcDocumentXhtml();
    $xhtml->setFilters(
        array(
            new ezcDocumentXhtmlElementFilter(),
            new ezcDocumentXhtmlMetadataFilter(),
            new ezcDocumentXhtmlXpathFilter(
                '//div[@id="opensource_blog_0712_scalar_type_hints_in_php"]'
            ),
        )
    );
    $xhtml->loadFile( '0712_scalar_type_hints_in_php.html' );

    $docbook = $xhtml->getAsDocbook();

    $converter = new ezcDocumentDocbookToOdtConverter();
    $odt = $converter->convert( $docbook );

    file_put_contents( '0712_scalar_type_hints_in_php.fodt', $odt );
    ?>

This example loads an XHTML file, which I stored locally from my website using Firefox. First, the XHTML content is read. Since web pages usually contain more than just the plain content, e.g. navigation and ads, an additional filter is appended to the ezcDocumentXhtml instance. The ezcDocumentXhtmlXpathFilter extracts the nodes identified by the given XPath expression and uses them as the document content instead of the full document.

The loaded XHTML document is again converted to Docbook XML, in the same way you saw when FODT was loaded. Similarly, a converter instance (ezcDocumentDocbookToOdtConverter) performs the conversion to FODT, and saving the document works the same way, too.

You can find the source article online in my blog and the generated FODT for download.
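For comparison, the plain conversion mentioned at the start of this section, without a dedicated converter object, might look like the following. Take it as a sketch: it assumes that ezcDocumentOdt offers createFromDocbook() and save() like the other document classes of the component, and the output file name is made up for this example.

```php
<?php
// … $docbook is the ezcDocumentDocbook instance from the example above

$odt = new ezcDocumentOdt();
// Import the Docbook document using the default conversion settings.
$odt->createFromDocbook( $docbook );

// Hypothetical output file name, chosen for this sketch.
file_put_contents( 'plain_conversion.fodt', $odt->save() );
?>
```

This is shorter, but it gives you no converter object whose options you could tune.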

Styling ODT

Styling an exported FODT document works almost exactly like PDF styling in the Document component: using a sub-set of CSS. The coolest thing about this is that you can create PDF and ODT from the same source with the same style sheet and they will look almost identical. The following example is a bit longish and therefore split into two parts. It's based on the previous example, so I won't repeat the loading of the XHTML.

To apply styling information to an ODT or PDF you need to define a style sheet (file or string) in a format similar to CSS, the so-called PCSS:

    article {
        font-family: DejaVuSans;
        font-size: 10pt;
        font-weight: normal;
        color: #000000;
    }

    article > section > title {
        color: #444578;
        font-size: 24pt;
        font-family: DejaVuSans;
        font-weight: bold;
    }

    article > section > section > title {
        color: #444578;
        font-size: 20pt;
        font-family: DejaVuSans;
        font-weight: normal;
        border-bottom: 1pt solid #444578;
    }

    /* ... */

    literallayout {
        margin: 10pt 30pt;
        padding: 10pt;
    }

    emphasis {
        color: #444578;
        font-weight: bold;
    }

    ulink {
        color: #444578;
        text-decoration: underline;
    }

The addressing rules are based on Docbook XML elements. This example defines default formatting for the pages on the <article> element, which is the root of the XML. Docbook uses a nesting model to define sections and therefore headlines, so you need the shown rules to define different levels of headings. I skipped some of these here to shorten the example.

The <literallayout> element is commonly used for listings, <emphasis> is an in-line tag to accent text passages and <ulink> is used for web links. As you can see, the formatting rules look exactly like CSS, except that not all of the style attributes defined in CSS are supported (yet).

To utilize this style sheet, it needs to be loaded during the ODT export. Note that a default style sheet is always loaded, so your custom style sheet only needs to re-define what you want to change.

    <?php
    // …

    $converter = new ezcDocumentDocbookToOdtConverter();
    $converter->options->styler->addStylesheetFile( 'ezc.pcss' );

    $odt = $converter->convert( $docbook );

    file_put_contents( '0712_scalar_type_hints_in_php_styled.fodt', $odt );

    // …
    ?>

The ODT converter comes with a default styling mechanism which uses a PCSS definition. In the future it will be possible to implement custom styling mechanisms, for whatever reason. To load the style sheet file, you just need to call the addStylesheetFile() method on the default styler. This way, you can also add multiple style sheets, which may override each other's definitions in the order they get loaded. You know this mechanism from CSS.
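Loading more than one style sheet could then look like this. It is a sketch only: custom.pcss is a hypothetical second file, and I assume here that a later style sheet wins over an earlier one wherever both define the same property, as you would expect from CSS.

```php
<?php
// …
$converter = new ezcDocumentDocbookToOdtConverter();

// Loaded first: the general style sheet from the example above.
$converter->options->styler->addStylesheetFile( 'ezc.pcss' );

// Loaded second, so its rules should take precedence where both files
// define the same property. 'custom.pcss' is a hypothetical file name.
$converter->options->styler->addStylesheetFile( 'custom.pcss' );

$odt = $converter->convert( $docbook );
?>
```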

That's all you need to do for styling the ODT. Download the result to validate its beauty. :)

To render a PDF with the very same style sheet, a little bit more work is necessary:

    <?php
    // …

    $pdf = new ezcDocumentPdf();
    $pdf->options->driver = new ezcDocumentPdfHaruDriver();

    $pdf->options->driver->registerFont(
        'DejaVuSans',
        ezcDocumentPdfHaruDriver::FONT_PLAIN,
        array( '/usr/share/fonts/truetype/ttf-dejavu/DejaVuSans.ttf' )
    );
    $pdf->options->driver->registerFont(
        'DejaVuSans',
        ezcDocumentPdfHaruDriver::FONT_BOLD,
        array( '/usr/share/fonts/truetype/ttf-dejavu/DejaVuSans-Bold.ttf' )
    );

    $pdf->loadStyles( 'ezc.pcss' );
    $pdf->createFromDocbook( $docbook );

    file_put_contents( '0712_scalar_type_hints_in_php_styled.pdf', $pdf );
    ?>

This example uses the libharu driver for rendering the PDF. You need the pecl/haru extension installed to reproduce it. Since the PDF converter, in contrast to ODT, needs to have the actual font files at hand, you need to register the fonts used in your style sheet.

The method registerFont() expects the name of the font used in the PCSS, the variation of the font (e.g. bold) for which the font file is used and a list of font files that contain the font definition. You can specify multiple font files here, since some drivers cannot cope with some font formats. Using TTF is fine with Haru, though.

After registering the fonts, you need to load the style sheet and are ready to render the PDF. That's all. You can compare the results of the ODT and PDF exports, to see they look almost exactly the same.

Note: At the time of writing, the Haru output driver suffers from a little bug which keeps the PDF from being rendered. You can download a patch included in the bug report to make it work, and I expect the issue to be fixed soonish after X-mas.

Conclusion

In the first release of ODT support in the eZ Document component you can import and export FODT files, which can be read and written by most OpenOffice.org distributions. You can style exported ODTs using PCSS, a CSS sub-set, as you can do when exporting PDF documents.

For future versions of the Document component I want to implement real ODT support, which should not be that hard since it's basically a matter of handling the ZIP archives. In addition I want to implement even better and customizable detection of semantics in ODT, which has very few semantics included. Another idea is to provide templating for adding headers and footers, as well as supporting additional style information.

If you want to reproduce the above examples and maybe use them for your own experiments, you can download a complete package including sources, examples and results. More documentation on the Document component can be found on the eZ Components website. Feedback is very welcome as a comment to this blog.

In this sense, merry X-Mas and a happy new 2010! Cheers!
