libxslt: An Extended Tutorial

Panos Louridas

Copyright (c) 2004 Panagiotis Louridas

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Introduction

The Extensible Stylesheet Language Transformations (XSLT) specification defines an XML template language for transforming XML documents. An XSLT engine reads an XSLT file and an XML document and transforms the document accordingly.

We want to perform a series of XSLT transformations to a series of documents. An obvious solution is to use the operating system's pipe mechanism and start a series of transformation processes, each one taking as input the output of the previous transformation. It would be interesting, though, and perhaps more efficient if we could do our job within a single process.

libxslt is a library for doing XSLT transformations. It is built on libxml, which is a library for handling XML documents. libxml and libxslt are used by the GNOME project.

Although developed in the *NIX world, both libxml and libxslt have been ported to the MS-Windows platform.
In principle an application using libxslt should be easily portable between the two systems. In practice, however, various wrinkles arise. These do not have anything to do with libxml or libxslt per se, but rather with the different compilation and linking procedures of each system.

The presented solution is an extension of John Fleck's libxslt tutorial, but the present tutorial tries to be self-contained. It develops a minimal libxslt application (libxslt_pipes) that can perform a series of transformations on a series of files in a pipe-like manner. An invocation might be:

    libxslt_pipes --out results.xml foo.xsl bar.xsl doc1.xml doc2.xml

The foo.xsl stylesheet will be applied to doc1.xml and the bar.xsl stylesheet will be applied to the resulting document; then the two stylesheets will be applied in the same sequence to doc2.xml. The results are sent to results.xml (if no output is specified they are sent to standard output).

The application is compiled on both *NIX systems and MS-Windows, where by *NIX systems we mean Linux, BSD, and other members of the family. The gcc suite is used on the *NIX platform and the Microsoft compiler and linker are used on the MS-Windows platform.

Setting the Scene

We need to include the necessary libraries:

    /* General C headers. The header names were lost from this copy of the
       text; these are inferred from the functions the program calls
       (printf, strcmp, calloc). */
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>

    /* libxslt headers named in the text below */
    #include <libxslt/transform.h>
    #include <libxslt/xsltutils.h>

The first group of include directives includes general C libraries. The libraries we need to make libxslt work are in the second group. The transform.h header file declares the API that does the bulk of the actual processing. The xsltutils.h header file declares the API for some generic utility functions of the XSLT engine; among other things, saving to a file, which is what we need it for.

If our input files contain entities through external subsets, we need to tell libxslt to load them. The global variable xmlLoadExtDtdDefaultValue, defined in libxml/globals.h, is responsible for that.
As the variable is defined outside our program we must specify external linkage:

    extern int xmlLoadExtDtdDefaultValue;

The program is called from the command line. We anticipate that the user may not call it the right way, so we define a function for describing its usage:

    static void usage(const char *name)
    {
        printf("Usage: %s [options] stylesheet [stylesheet ...] file [file ...]\n",
               name);
        printf("      --out file: send output to file\n");
        printf("      --param name value: pass a (parameter,value) pair\n");
    }

Program Start

We need to define a few variables that are used throughout the program:

    int main(int argc, char **argv)
    {
        int arg_indx;
        const char *params[16 + 1];
        int params_indx = 0;
        int stylesheet_indx = 0;
        int file_indx = 0;
        int i, j, k;
        FILE *output_file = stdout;
        xsltStylesheetPtr *stylesheets =
            (xsltStylesheetPtr *) calloc(argc, sizeof(xsltStylesheetPtr));
        xmlDocPtr *files = (xmlDocPtr *) calloc(argc, sizeof(xmlDocPtr));
        int return_value = 0;

The arg_indx integer is an index used to iterate over the program arguments. The params string array is used to collect the XSLT parameters. In XSLT, additional information may be passed to the processor via parameters. The user of the program specifies these in key-value pairs in the command line following the --param command line argument. We accept up to 8 such key-value pairs, which we track with the params_indx integer. libxslt expects the parameters array to be null-terminated, so we have to allocate one extra place (16 + 1) for it. The file_indx is an index to iterate over the files to be processed. The i, j, k integers are additional indices for iteration purposes, and return_value is the value the program returns to the operating system.

We expect the result of the transformation to be the standard output in most cases, but the user may wish otherwise via the command line option, so we need to keep track of the situation with the output_file file pointer.
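The null-terminated layout of the parameters array can be tried out in isolation, without libxslt. The following is a minimal sketch under stated assumptions: add_param, MAX_PAIRS and MAX_STRINGS are hypothetical names introduced here for illustration, not part of libxslt or of the tutorial's program. The helper checks for room before it stores anything, so exactly eight pairs fit and the terminating null always has a slot.

```c
#include <stddef.h>

#define MAX_PAIRS 8                    /* the limit stated in the text   */
#define MAX_STRINGS (2 * MAX_PAIRS)    /* one name and one value a pair  */

/* Hypothetical helper: append one (name, value) pair to params and keep
   the array null-terminated. params must have MAX_STRINGS + 1 slots and
   *count is the number of strings stored so far.
   Returns 0 on success, -1 if all eight pairs are already in use. */
static int add_param(const char **params, int *count,
                     const char *name, const char *value)
{
    if (*count + 2 > MAX_STRINGS)
        return -1;                     /* no room: leave params untouched */
    params[(*count)++] = name;
    params[(*count)++] = value;
    params[*count] = NULL;             /* terminator libxslt looks for    */
    return 0;
}
```

A caller would declare const char *params[MAX_STRINGS + 1] and an int counter set to zero, call add_param once per --param option, and hand params to the transformation. Note that libxslt treats parameter values as XPath expressions, so a plain string value normally has to carry its own quotes, as in "'draft'".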
In libxslt, XSLT stylesheets are internally stored in xsltStylesheet structures; similarly, in libxml XML documents are stored in xmlDoc structures. xsltStylesheetPtr and xmlDocPtr are simply typedefs of pointers to them.

The user may specify any number of stylesheets that will be applied to the documents one after the other. To save time we parse the stylesheets and the documents as we read them from the command line and keep the parsed representation of them. The parsed results are kept in arrays. These are dynamically allocated and sized to the number of arguments; this wastes some space, but not much (the size of xsltStylesheetPtr and xmlDocPtr is the size of a pointer) and simplifies code later on. The array memory is allocated with calloc to ensure contents are initialised to zero.

Arguments Collection

If the program gets no arguments at all, we print the usage description, set the program return value to 1 and exit. Instead of returning directly we go (literally) to the end of the program text, where some housekeeping takes place.

        /* The opening of this listing is missing from this copy of the
           text; the lines down to the capacity test are reconstructed
           from the description and may differ from the original. */
        if (argc <= 1) {
            usage(argv[0]);
            return_value = 1;
            goto finish;
        }

        /* Options come first; stop at the first argument without a dash. */
        for (arg_indx = 1; arg_indx < argc; arg_indx++) {
            if (argv[arg_indx][0] != '-')
                break;
            if (!strcmp(argv[arg_indx], "--param")) {
                /* Test before storing, so that the eighth pair is accepted. */
                if (params_indx >= 16) {
                    fprintf(stderr, "too many params\n");
                    return_value = 1;
                    goto finish;
                }
                params[params_indx++] = argv[++arg_indx];
                params[params_indx++] = argv[++arg_indx];
            } else if ((!strcmp(argv[arg_indx], "-o"))
                       || (!strcmp(argv[arg_indx], "--out"))) {
                arg_indx++;
                output_file = fopen(argv[arg_indx], "w");
            } else {
                fprintf(stderr, "Unknown option %s\n", argv[arg_indx]);
                usage(argv[0]);
                return_value = 1;
                goto finish;
            }
        }
        params[params_indx] = 0;

If the user passes arguments we have to collect them. This is a matter of iterating over the program argument list for as long as we encounter arguments starting with a dash. The XSLT parameters are put into the params array and the output_file is set to the user request, if any. After processing all the parameter key-value pairs we set the element after the last stored parameter to null.

Parsing

The rest of the argument list is taken to be stylesheets and files to be transformed. Stylesheets are identified by their suffix, which is expected to be xsl (case sensitive).
All other files are assumed to be XML documents, regardless of suffix.

Stylesheets are parsed using the xsltParseStylesheetFile function. xsltParseStylesheetFile takes as argument a pointer to an xmlChar, a typedef of an unsigned char; in effect, the filename of the stylesheet. The resulting xsltStylesheetPtr is placed in the stylesheets array. In the same vein, XML files are parsed using the xmlParseFile function that takes as argument the file's name; the resulting xmlDocPtr is placed in the files array.

File Processing

All stylesheets are applied to each file one after the other. Stylesheets are applied with the xsltApplyStylesheet function that takes as arguments the stylesheet to be applied, the file to be transformed and any parameters we have collected. The in-memory representation of an XML document takes space, which we free using the xmlFreeDoc function. The file is then saved to the specified output.

To output an XML document we have in memory we use the xsltSaveResultToFile function, where we specify the destination, the document and the stylesheet that has been applied to it. The stylesheet is required so that output-related information contained in the stylesheet, such as the encoding to be used, is honoured in the output. If no transformation has taken place, which will happen when the user specifies no stylesheets at all in the command line, we use the xmlDocDump libxml function that saves the source document to the file without further ado.

As parsed stylesheets take up space in memory, we take care to free that memory after use with a call to xsltFreeStylesheet. When all work is done, we clean up all global variables used by the XSLT library using xsltCleanupGlobals. Likewise, all global memory allocated for the XML parser is reclaimed by a call to xmlCleanupParser. Before returning we deallocate the memory allocated for holding the pointers to the XML documents and stylesheets.
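The suffix rule from the Parsing section (a name ending in xsl, case sensitive, is a stylesheet; anything else is an XML document) can be sketched without any libxslt calls. This is one possible way to express the rule, not necessarily how the tutorial's listing does it; is_stylesheet_name is a hypothetical name, and looking only at the text after the last dot is an assumption of this sketch.

```c
#include <string.h>

/* Hypothetical helper: returns 1 when filename ends in ".xsl"
   (case sensitive) and 0 otherwise. Only the text from the last dot
   onwards is examined, so "a.b.xsl" counts as a stylesheet while
   "foo.xsl.bak" does not. */
static int is_stylesheet_name(const char *filename)
{
    const char *dot = strrchr(filename, '.');

    return dot != NULL && strcmp(dot, ".xsl") == 0;
}
```

In the argument loop, a name for which the helper returns 1 would be handed to xsltParseStylesheetFile and every other name to xmlParseFile. Because the comparison is case sensitive, a file called FOO.XSL would be parsed as an ordinary document.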
*NIX Compiling and Linking

Compiling and linking in a *NIX environment is easy, as the required libraries are almost certain to be already in place (remember that libxml and libxslt are used by the GNOME project, so they are present in most installations). The program can be dynamically linked so that its footprint is minimized, or statically linked, so that it stands by itself, carrying all required code.

For dynamic linking the following one-liner will do (the source file is listed before the libraries, since some linkers resolve symbols strictly from left to right):

    gcc -o libxslt_pipes -Wall -I/usr/include/libxml2 libxslt_pipes.c -L/usr/lib -lxslt -lxml2

We assume that the necessary header files are in /usr/include/libxml2 and that the required libraries (libxslt.so, libxml2.so) are in /usr/lib.

In general, a program may need to link to additional libraries, depending on the processing it actually performs. A good way to start is to use the xslt-config script. Its help option displays usage information. Running

    xslt-config --cflags

we get compile flags, while running

    xslt-config --libs

we get the library settings for the linker.

For static linking we must list more libraries than we did for dynamic linking, as the libraries on which the libxml and libxslt libraries depend are also needed. Using xslt-config on a particular installation we create the following one-liner:

    gcc -o libxslt_pipes -Wall -I/usr/include/libxml2 libxslt_pipes.c -static -L/usr/lib -lxslt -lxml2 -lz -lpthread -lm

If we get warnings to the effect that some function in statically linked applications requires at runtime the shared libraries from the glibc version used for linking, that means that the binary is not completely static. Although we statically linked against the GNU C runtime library glibc, glibc uses external libraries to perform some of its functions. Libraries of the same version must be present on the system where we want the application to run.
One way to avoid this is to use an alternative C runtime, for example uClibc, which requires obtaining and building a uClibc toolchain first (if the reason for trying to get a statically linked version of the program is to embed it somewhere, using uClibc might be a good idea anyway).

MS-Windows Compiling and Linking

Compiling and linking in MS-Windows requires some attention. First, the MS-Windows ports must be downloaded and installed on the development workstation. The ports are available on Igor Zlatković's site. We need the ports for iconv, zlib, libxml, and libxslt.

In contrast to *NIX environments, we cannot assume that the libraries needed will be present on other computers where the program will be used. One solution is to distribute the program along with the necessary dynamic libraries. Another solution is to statically link the program so that only a single executable file will have to be distributed.

We assume that we have decompressed the downloaded ports and have placed the required contents of their include directories in an include directory in our file system. The required contents include everything apart from the libexslt directory of the libxslt port, as we are not using EXSLT (an initiative to provide extensions to XSLT) in this project.

In order to compile the program we have to make sure that all necessary header files can be found. When using the Microsoft compiler this translates to adding the required switches in the command line. If using a Visual Studio product the same effect is attained by specifying additional include directories in the compilation options. In the end, if the headers have been copied to C:\include, the command line must pass that directory to the compiler as an additional include directory.

This being a C program, it needs to be compiled against an implementation of the C libraries. Microsoft provides various implementations.
The ports, however, have been compiled against the msvcrt.dll implementation, so it is wise to use the same runtime in our project, unless we wish to run into unexpected runtime crashes. The msvcrt.dll is a multi-threaded implementation and is selected through the corresponding compiler option. Unfortunately, the correspondence between that switch and msvcrt.dll breaks after version 6 of the Microsoft compiler. In version 7 and later (i.e., Visual Studio .NET), the same switch links against a different DLL; in version 7.1 this is msvcr71.dll. The end result of this bit of esoterica is that if you dynamically link your application with a compiler whose version is greater than 6, your program and the ports end up using two different C runtimes, and the program is likely to crash unexpectedly. The alternative is to compile iconv, zlib, libxml and libxslt yourself, using the new runtime library. This is not a tall order, and some details are given below.

There are three kinds of libraries in MS-Windows. Dynamically Linked Libraries (DLLs), like the msvcrt.dll we met above, are used for dynamic linking; an application links to them at runtime, so the application does not include the code contained in them. Static libraries are used for static linking; an application adds the libraries' code to its own code at link time. Import libraries are used when building an application that uses DLLs. For the application to be built, the linker must somehow find the definitions of the functions that will be provided at runtime by the DLLs, otherwise it will complain about unresolved references. Import libraries contain function stubs that, for each DLL function we want to call, know where to look for it in the DLL. In essence, in order to use a DLL we must link against its corresponding import library. DLLs have a .dll suffix; static and import libraries both have a .lib suffix.
In the MS-Windows ports of libxml and libxslt static libraries are distinguished by their name ending in _a.lib, while in the zlib port the import library is zdll.lib and the static library is zlib.lib. In what follows we assume we have a lib directory in our filesystem where we place the libraries we need for linking.

If we want to link dynamically we must make sure the lib directory contains iconv.lib, libxslt.lib, libxml2.lib, and zdll.lib. When using the Microsoft linker this translates to adding the required switch and the necessary libraries in the command line. In Visual Studio we must specify an additional library directory for lib and put the necessary libraries in the additional dependencies. In the end, provided the libraries' directory is C:\lib, the command line must give that directory to the linker as a library directory and list the four import libraries above. In order for the resulting executable to run, the ports' DLLs must be present; one way is to place all DLLs contained in the ports in the home directory of our application, and make sure they are distributed together.

If we want to link statically we must make sure the lib directory contains iconv_a.lib, libxslt_a.lib, libxml2_a.lib, and zlib.lib. Adding lib as a library directory and putting the necessary libraries in the additional dependencies, we get a command line that should name the same library directory and list these four static libraries instead. The resulting executable is much bigger than if we linked dynamically; it is, however, self-contained and can be distributed more easily, in theory at least. In practice, however, the executable is not completely static. We saw that the ports are compiled against msvcrt.dll, so the program does require that DLL at runtime. Moreover, when using a version of the Microsoft developer tools greater than 6, our own code no longer uses msvcrt.dll but another runtime, such as msvcr71.dll, and we then need that DLL as well. In contrast to msvcrt.dll it may not be present on the target computer, so we may have to copy it along.
Building the Ports in MS-Windows

The source code of the ports is readily available on the web; one has to check the ports' sites. Each port can be built without problems in an MS-Windows environment using Microsoft development tools. The necessary command line tools (compiler, linker, nmake) must be available. This means running a batch file called vcvars32.bat that comes with Visual Studio (its exact location in the directory tree may vary depending on the version of Visual Studio, but a file search will find it anyway).

Makefiles for the Microsoft tools are found in all ports. They are distinguished by their suffix, e.g., Makefile.msvc or Makefile.msc. To build zlib it suffices to run nmake against Makefile.msc (i.e., naming that file through nmake's makefile option); similarly, to build iconv it suffices to run nmake against Makefile.msvc.

Building libxml and libxslt requires an extra configuration step; we must run the configure.js configuration script with the cscript command. configure.js is found in the win32 directory in the distributions. It is written in JScript, Microsoft's implementation of the ECMA 262 language specification (ECMAScript Edition 3), a JavaScript offspring. The configuration script takes a number of parameters detailing our environment and needs;

    cscript configure.js help

documents them.

It is wise to read all documentation files in the source distributions before starting; moreover, pay attention to the dependencies between the ports. If we configure libxml and libxslt to use iconv and zlib we must build these two first and make sure their headers and libraries can be found by the compiler and the linker when building libxml and libxslt.

zlib, iconv and All That

We saw that libxml and libxslt depend on various other libraries, for instance zlib, iconv, and so forth. Taking a look into them gives us clues on the capabilities of libxml and libxslt.

zlib is a free general purpose lossless data compression library.
It is a venerable workhorse; more than 500 applications (both commercial and open source) seem to use the library. libxml uses zlib so that it can read from or write to compressed files directly. The xmlParseFile function can transparently parse a compressed document to produce an xmlDoc. If we want to create a compressed document with libxml we can use an xmlTextWriterPtr (obtained through xmlNewTextWriterDoc), or another related structure from libxml/xmlwriter.h, with compression enabled.

XML allows documents to use a variety of different character encodings. iconv is a free library for converting between different character encodings. libxml provides a set of default converters for some encodings: UTF-8, UTF-16 (little endian and big endian), ISO-8859-1, ASCII, and HTML (a specific handler for the conversion of UTF-8 to ASCII with HTML predefined entities like &copy; for the copyright sign). However, when compiled with iconv support, libxml and libxslt can handle the full range of encodings provided by iconv; these should cover most needs.

libxml and libxslt can be used in multi-threaded applications. In MS-Windows they are linked against MSVCRT.DLL (or one of its descendants, as we saw above). In *NIX the pthreads (POSIX threads) library is used.

The Complete Program

The complete program listing is given below. The program is also available online.