Property-based testing XSLT
Property-based testing is a wonderful tool for verifying the correctness of your programs. However, some people struggle to find reasonable use cases for the technique. In this entry, I'll try to prove that property-based testing can be used in everyday coding by showing a rather exotic example: applying property-based tests to check the results of XSL transformations.
Prelude
I had been constantly postponing the creation of my own blog.
There were books and video courses which encouraged me to start, but I wasn't confident enough to do so.
The F# Advent event convinced me to break the ice - "at least I'll have some readers", I thought.
This is my first blog post ever, so chances are it's not gonna be the best F# article you've ever read.
Anyway, bear with me - I think I've got something interesting to share.
Automatic publishing
My job is to develop and maintain a Content Management System (CMS). The company I work for is a big corporation, and because a big corporation often means enterprise software, we deal a lot with the enterprise format beloved by programmers (especially Java ones), namely XML. We store all documents as XML conforming to the DITA XML standard (with slight customizations).
A crucial part of the system is its publishing capability. To render PDF documents, we use third-party software. Within that software, the whole process of rendering printable documents can be cleverly automated by reusing a common template and applying different sets of content to it. To apply content to the template, the content must be supplied as XML that conforms to a provided schema (different from DITA). How do we prepare the input for the template? XML in, XML out - you guessed it, we do XSLT.
XSLT is probably not one of the finest tools that every developer likes to work with. Verbose syntax (an XSL transformation must itself be valid XML), dynamic typing, immature tooling and template matching ambiguity are IMO the biggest cons of working with XSLT. It is based on functional concepts though, which after a while makes it a bit more attractive than it initially seemed. Be warned, one day I might even write a post or two on XSLT alone (that's what they call Stockholm syndrome, isn't it?). You may think I'm playing devil's advocate, but there's one gloomy thing about this DSL I have to admit: XSLT can get tricky and really hard to maintain once it reaches a certain level of complexity. That's why we have an automated test suite in our code-base just for XSL transformations. The majority of the tests are written in F# using a powerful library for property-based testing, FsCheck.
If you're new to the concept of property-based testing and to the FsCheck library, I highly recommend reading the introductory article by Scott Wlaschin. There is also another great post on his blog, which talks about choosing properties for testing.
DITA XML
Let's have a quick glance at the DITA XML standard first, to grasp the idea of how the documents are stored.
The above snippet describes a basic document, which contains a title element as well as a body with a p (paragraph) inside that body. Both title and body are enclosed in the root topic element.
Such notation may look familiar to you already - DITA XML is akin to HTML markup.
Inside the body we can also have such elements as image, table, h1, h2, etc.
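As a minimal sketch of such a document (the element text is made up, and real DITA topics typically also carry attributes such as an id, which I leave out here), a topic of the shape described above can be loaded and inspected with System.Xml.Linq like this:

```fsharp
open System.Xml.Linq

// Illustrative document only: the text content is invented for this sketch.
let sampleTopic =
    XElement.Parse("""<topic>
  <title>Sample title</title>
  <body>
    <p>Sample paragraph.</p>
  </body>
</topic>""")

// Both title and body are direct children of the root topic element.
let titleText = sampleTopic.Element(XName.Get "title").Value
let paragraphs = sampleTopic.Element(XName.Get "body").Elements(XName.Get "p")

printfn "title: %s, paragraphs: %d" titleText (Seq.length paragraphs)
```

The same object model is what the generators and tests further down operate on.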
To make things simple, snippets that follow are kind of "slimmed" versions of original code. I was too lazy to verify if they even compile - so please treat them as pseudo-code rather than copy-pastable pieces. After all, I think what matters the most is the idea itself of applying property-based tests to various use cases.
Generator
FsCheck can automatically generate random input for tests as long as generators for the corresponding data types are registered within the assembly. The library comes with pre-registered generators for primitive types, F# records and discriminated unions. However, to generate fancier data structures, we have to do some manual work.
To produce DITA XML documents, I use the XML object model from the System.Xml.Linq namespace and the gen computation expression from FsCheck. Given such granular generators, it's very convenient to compose them together - e.g. the topic element generator makes use of the title and body element generators.
Other generators which are used by, but not listed in, the above snippet include:
- contents, for contents of a paragraph or title - literal strings with possible formatting (bold, italics, etc.),
- p, for plain paragraph elements,
- table, for tables which conform to the CALS Table schema,
- image, for images with a source reference to a given graphic file.
The computation expression lets us define generators in a fashion similar to the imperative paradigm, which makes them easier for my colleagues to comprehend.
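To give a feel for how such generators compose, here is a rough sketch, assuming FsCheck 2.x (where the gen builder is available after opening the FsCheck namespace). The word list and the simplified contents generator are my own stand-ins, not the generators from our code-base, and tables and images are left out entirely.

```fsharp
open System.Xml.Linq
open FsCheck

// Hypothetical word source, just to have some text to put into elements.
let word : Gen<string> = Gen.elements [ "lorem"; "ipsum"; "dolor" ]

// Simplified stand-in for `contents`: plain text or text wrapped in <b>.
let contents : Gen<XNode list> =
    let plain = word |> Gen.map (fun w -> XText(w + " ") :> XNode)
    let bold = word |> Gen.map (fun w -> XElement(XName.Get "b", w) :> XNode)
    Gen.nonEmptyListOf (Gen.oneof [ plain; bold ])

let p : Gen<XElement> =
    gen {
        let! nodes = contents
        return XElement(XName.Get "p", nodes)
    }

let title : Gen<XElement> =
    gen {
        let! nodes = contents
        return XElement(XName.Get "title", nodes)
    }

let body : Gen<XElement> =
    gen {
        let! paragraphs = Gen.listOf p
        return XElement(XName.Get "body", paragraphs)
    }

// The topic generator simply reuses the title and body generators.
let topic : Gen<XElement> =
    gen {
        let! t = title
        let! b = body
        return XElement(XName.Get "topic", [ t; b ])
    }
```

Each generator builds fresh XElement instances on every run, which matters because XElement is mutable and sharing nodes between generated documents would make failures hard to reproduce.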
Tests
Implementation of a couple of helper functions is skipped for brevity as well, let's just assume we are given the following:
All properties are written with the help of the FsCheck.Xunit adapter - each property is a separate test marked with the [<Property>] attribute.
Conforming to XML schema
The first test verifies that for any valid input XML (as determined by our generator), the output of the transformation conforms to an XML Schema provided by the vendor of the PDF rendition software.
The @| operator used in line 4 allows us to "label" failing properties. If for some reason the transform produces illegal XML and the property does not hold, the test failure report will include both the input topic and the output. This makes it possible to quickly spot the cause of failure.
In its extended version, this test can also report schema validation error messages, e.g. that a specific element is not a valid child of its parent.
Thanks to the above test, we can eliminate any issue related to producing XML that violates the schema, which would always result in a rendition failure. XML Schema safety within XSLT can also be guaranteed with Schema-Aware XSLT. While Schema-Aware XSLT processors usually require a commercial license, we can maintain the schema-conforming test in our code-base for free.
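A self-contained sketch of such a schema-conformance property could look like the one below. Everything specific in it is an assumption made for illustration: the toy stylesheet, the toy schema, the DOCUMENT element name, the reduced topic generator and the helper names transform and schemaErrors. It also assumes FsCheck 2.x and runs the property with Check.QuickThrowOnFailure instead of the [<Property>] attribute, so that it does not need a test runner.

```fsharp
open System.IO
open System.Xml
open System.Xml.Linq
open System.Xml.Schema
open System.Xml.Xsl
open FsCheck

// Toy stylesheet (hypothetical): every paragraph becomes one RICHTEXT.
let stylesheet = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/topic">
    <DOCUMENT><xsl:apply-templates select="body/p"/></DOCUMENT>
  </xsl:template>
  <xsl:template match="p">
    <RICHTEXT><xsl:value-of select="."/></RICHTEXT>
  </xsl:template>
</xsl:stylesheet>"""

// Toy schema (hypothetical) standing in for the vendor's schema.
let schema = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="DOCUMENT">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="RICHTEXT" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

let schemas = XmlSchemaSet()
schemas.Add("", XmlReader.Create(new StringReader(schema))) |> ignore

// Applies the stylesheet text to an input element and returns the result document.
let transform (stylesheetText: string) (input: XElement) : XDocument =
    let xslt = XslCompiledTransform()
    use stylesheetReader = XmlReader.Create(new StringReader(stylesheetText))
    xslt.Load(stylesheetReader)
    let output = XDocument()
    let run () =
        use reader = input.CreateReader()
        use writer = output.CreateWriter()
        xslt.Transform(reader, writer)
    run () // the writer is disposed here, which flushes the result into output
    output

// Collects all schema validation messages for a document.
let schemaErrors (doc: XDocument) : string list =
    let errors = ResizeArray<string>()
    doc.Validate(schemas, ValidationEventHandler(fun _ e -> errors.Add(e.Message)))
    List.ofSeq errors

// Reduced generator: a topic whose body holds a random number of paragraphs.
let topic : Gen<XElement> =
    gen {
        let! texts = Gen.listOf (Gen.elements [ "lorem"; "ipsum" ])
        let paragraphs = texts |> List.map (fun t -> XElement(XName.Get "p", t))
        return XElement(XName.Get "topic", XElement(XName.Get "body", paragraphs))
    }

let outputConformsToSchema =
    Prop.forAll (Arb.fromGen topic) (fun input ->
        let output = transform stylesheet input
        let errors = schemaErrors output
        // The label is printed only when the property fails.
        (sprintf "input: %O\noutput: %O\nerrors: %A" input output errors)
        @| (List.isEmpty errors))

Check.QuickThrowOnFailure outputConformsToSchema
```

Putting the validation messages into the label is one way to get the "extended version" of the report mentioned above; whether the real test does it this way is not something this sketch claims.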
Bolded text
The next test makes sure that all text enclosed in the <b> tag is indeed bold in the output PDF:
The provided schema specifies RICHTEXT elements as containers for text. All text within a single RICHTEXT is formatted uniformly. Therefore, RICHTEXT elements form a flattened list of formatted text.
To verify that the @BOLD attribute is set on the proper RICHTEXT, we first collect all textNodes from the input and richtexts from the output (lines 4 and 5). In the next step, we use Seq.zip to create a sequence of pairs of corresponding items (line 9). Then, in line 10, we keep only those pairs where the text node has a <b> ancestor tag. Finally, we check with Seq.forall that all the remaining RICHTEXT elements have the @BOLD attribute set to TRUE (line 11).
Likewise, we can write tests for checking other types of text formatting, i.e. italic, underline, superscript or subscript.
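The steps just described can be sketched as a plain function over an input/output pair, which a property would then call with a generated topic and its transformed result. This is only an approximation under stated assumptions: I assume the transform emits exactly one RICHTEXT per input text node, in document order, and the DOCUMENT wrapper in the usage example is an invented name.

```fsharp
open System.Xml.Linq

let name (n: string) = XName.Get n

// True when every text node under a <b> ancestor maps to a RICHTEXT with BOLD="TRUE".
let boldTextStaysBold (input: XElement) (output: XElement) : bool =
    // All text nodes of the input, in document order.
    let textNodes =
        input.DescendantNodes()
        |> Seq.choose (fun node ->
            match node with
            | :? XText as text -> Some text
            | _ -> None)
    // All RICHTEXT elements of the output, in document order.
    let richtexts = output.Descendants(name "RICHTEXT")
    Seq.zip textNodes richtexts
    // Keep only the pairs whose text node sits somewhere inside a <b> element.
    |> Seq.filter (fun (text, _) -> not (Seq.isEmpty (text.Ancestors(name "b"))))
    |> Seq.forall (fun (_, richtext) ->
        richtext.Attribute(name "BOLD")
        |> Option.ofObj
        |> Option.exists (fun a -> a.Value = "TRUE"))

// Hand-written example pair (made up for this sketch).
let input =
    XElement.Parse("<topic><title>T</title><body><p>plain <b>bold</b></p></body></topic>")
let output =
    XElement.Parse("""<DOCUMENT><RICHTEXT>T</RICHTEXT><RICHTEXT>plain </RICHTEXT><RICHTEXT BOLD="TRUE">bold</RICHTEXT></DOCUMENT>""")

printfn "bold preserved: %b" (boldTextStaysBold input output)
```

Note that Seq.zip silently stops at the shorter sequence, so a check like this relies on the one-to-one correspondence holding; a stricter version could compare the two counts first.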
Width of images and tables
The third and last property-based test presented in this entry checks that every "object" (image or table) with a specified layout has the correct width in the output:
The symbol pairs in line 4 is bound to a sequence of pairs of input * output objects, filtered to only those which have the @layout attribute (i.e. which don't fall back to the default layout). This sequence is then mapped in lines 6-9 to pairs of the input's @layout and the output's @WIDTH attributes. Finally, in lines 11-13, with the help of Seq.forall and the layoutToWidth helper function, all pairs are checked for correct width.
At first the test may seem needless, because it concerns a simple mapping from one value to another. In practice, however, it turned out quite useful, mainly for regression purposes, since restructuring the code or introducing new features could break this property.
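Here is a rough sketch of the width check as a plain function over an input/output pair. Several things in it are invented, because the real values are not part of this entry: the layout names and widths inside layoutToWidth, the OBJECT element name in the output, and the assumption that each input image or table yields exactly one output OBJECT in document order.

```fsharp
open System.Xml.Linq

let name (n: string) = XName.Get n

// Reads an attribute value, or None when the attribute is absent.
let attr (attrName: string) (e: XElement) : string option =
    e.Attribute(name attrName) |> Option.ofObj |> Option.map (fun a -> a.Value)

// Hypothetical mapping: the real layout names and widths are not given here.
let layoutToWidth (layout: string) : string option =
    match layout with
    | "wide" -> Some "100%"
    | "narrow" -> Some "50%"
    | _ -> None

let objectsHaveCorrectWidth (input: XElement) (output: XElement) : bool =
    let isObject (e: XElement) =
        e.Name.LocalName = "image" || e.Name.LocalName = "table"
    let inputObjects = input.Descendants() |> Seq.filter isObject
    let outputObjects = output.Descendants(name "OBJECT")
    // Pair input objects with output objects, skipping those on the default layout.
    let pairs =
        Seq.zip inputObjects outputObjects
        |> Seq.filter (fun (i, _) -> (attr "layout" i).IsSome)
    pairs
    |> Seq.map (fun (i, o) -> attr "layout" i, attr "WIDTH" o)
    |> Seq.forall (fun (layout, width) ->
        match Option.bind layoutToWidth layout, width with
        | Some expected, Some actual -> expected = actual
        | _ -> false)

// Hand-written example pair (made up for this sketch).
let input =
    XElement.Parse("""<topic><body><image layout="wide"/><table/><image layout="narrow"/></body></topic>""")
let output =
    XElement.Parse("""<DOCUMENT><OBJECT WIDTH="100%"/><OBJECT/><OBJECT WIDTH="50%"/></DOCUMENT>""")

printfn "widths correct: %b" (objectsHaveCorrectWidth input output)
```

An unknown layout or a missing WIDTH attribute makes the check fail rather than pass silently, which seems like the safer default for a regression test.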
I chose only 3 examples of properties to show here, but we've got a lot more of them in our code-base. Some of them are quite complex and require thorough knowledge of domain and the third-party software.
Shrinker
Another important concept of property-based testing is shrinking.
In short, shrinkers make it possible to find a minimal input which fails the test.
Let's consider the property which checked proper "boldness" of a piece of text.
If we break the code which turns on the @BOLD attribute, that property could fail, for example, on the following input:
At first glance it can be hard to say what exactly caused the test to fail, as this XML document is already quite complex. How about shrinking the above XML to a smaller version:
The latter, shrunk data shows much more clearly where the transform went wrong. It would also be hard to find a smaller input which still makes the test go red.
That's exactly what happens with properties in FsCheck. If a property fails for a given input, the library uses the shrinker to make the input smaller until it finds a minimal dataset which still causes the property to fail.
Implementing a shrinker for XML document turned out to be quite challenging and won't be described here, but maybe one day I'll put up a separate post dedicated only to this issue.
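Just to illustrate the idea, and explicitly not our actual shrinker, below is a naive sketch that proposes smaller documents by dropping one child node at a time, at any depth. The minimize function mimics, in a simplified greedy way, how a library can use such candidates to walk towards a minimal failing input; FsCheck's own search may differ in details. The function names are mine.

```fsharp
open System.Xml.Linq

// Proposes smaller variants of an element: each candidate has one node removed
// somewhere in the tree. The element itself is never modified.
let rec shrink (element: XElement) : seq<XElement> =
    seq {
        let children = element.Nodes() |> List.ofSeq
        // 1. Drop one direct child node at a time.
        for i in 0 .. children.Length - 1 do
            let copy = XElement(element) // deep copy
            (copy.Nodes() |> Seq.item i).Remove()
            yield copy
        // 2. Keep all children, but replace one child element with a shrunk version.
        for i in 0 .. children.Length - 1 do
            match children.[i] with
            | :? XElement as child ->
                for smaller in shrink child do
                    let copy = XElement(element)
                    (copy.Nodes() |> Seq.item i).ReplaceWith(smaller)
                    yield copy
            | _ -> ()
    }

// Greedy minimization: keep taking the first smaller candidate that still fails.
let rec minimize (fails: XElement -> bool) (input: XElement) : XElement =
    match shrink input |> Seq.tryFind fails with
    | Some smaller -> minimize fails smaller
    | None -> input

// Example: pretend the property fails whenever the document contains a <b> element.
let containsBold (e: XElement) =
    not (Seq.isEmpty (e.DescendantsAndSelf(XName.Get "b")))

let failing =
    XElement.Parse("<topic><title>T</title><body><p>a<b>x</b>c</p><p>d</p></body></topic>")

printfn "%O" (minimize containsBold failing)
```

With FsCheck 2.x, a function of this shape can be paired with a generator through Arb.fromGenShrink. A production-quality shrinker would also need to keep the shrunk documents valid with respect to the DITA schema, which this sketch ignores and which is a large part of what makes the real thing hard.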
Conclusions
Property-based tests prove helpful when working with XSLT code. This rather unusual application of properties brings a number of advantages:
- It is much easier to maintain the so-called arrange phase of the tests, because you can rely on a generator and don't have to create new XML documents each time for a new test,
- Randomly generated input discovers various edge cases, many of which could otherwise be found only in production,
- Thanks to the shrinker functionality, minimal faulty input can be spotted,
- All different tests use the same generator, hence a high degree of consistency is achieved.
It must be noted that applying the technique in this case also comes with some costs:
- While the arrange phase of tests is straightforward, the assert phase can sometimes get tricky and requires some brainstorming,
- It might be hard to troubleshoot problems when a property fails and we're not immediately sure why,
- A solid shrinker implementation for XML documents is not easy (actually, our shrinker still has a lot of gaps).
I'd like to thank Sergey Tihon once again for organizing F# Advent - if not for this opportunity, the blog wouldn't have seen the light of day.
If you found this entry interesting, you may check out my presentation on this topic (created with FsReveal) which I prepared for my colleagues at work. That's it for now - Till next time!