Panos Louridas
2004
Panagiotis Louridas
Permission is hereby granted, free of charge, to
any person obtaining a copy of this software and associated
documentation files (the "Software"), to deal in the Software
without restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software
is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
libxslt: An Extended Tutorial
Introduction
The Extensible Stylesheet Language Transformations (XSLT)
specification defines an XML template language for transforming XML
documents. An XSLT engine reads an XSLT file and an XML document and
transforms the document accordingly.
We want to perform a series of XSLT transformations on a series
of documents. An obvious solution is to use the operating system's
pipe mechanism and start a series of transformation processes, each
one taking as input the output of the previous transformation. It
would be interesting, though, and perhaps more efficient if we could
do our job within a single process.
libxslt is a library for doing XSLT transformations. It is built
on libxml, which is a library for handling XML documents. libxml and
libxslt are used by the GNOME project. Although developed in the
*NIX world, both libxml and libxslt have been
ported to the MS-Windows platform. In principle an application using
libxslt should be easily portable between the two systems. In
practice, however, there arise various wrinkles. These do not have
anything to do with libxml or libxslt per se, but rather with the
different compilation and linking procedures of each system.
The presented solution is an extension of John
Fleck's libxslt tutorial, but the present tutorial tries to be
self-contained. It develops a minimal libxslt application
(libxslt_pipes) that can perform a series of transformations to a
series of files in a pipe-like manner. An invocation might be:
libxslt_pipes --out results.xml foo.xsl bar.xsl doc1.xml doc2.xml
The foo.xsl stylesheet will be applied to
doc1.xml and the bar.xsl
stylesheet will be applied to the resulting document; then the two
stylesheets will be applied in the same sequence to
doc2.xml. The results are sent to
results.xml (if no output is specified they are
sent to standard output).
The application is compiled on both *NIX
systems and MS-Windows, where by *NIX systems we
mean Linux, BSD, and other members of the
family. The gcc suite is used on the *NIX platform
and the Microsoft compiler and linker are used on the
MS-Windows platform.
Setting the Scene
We need to include the necessary header files:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#include <libxslt/transform.h>
#include <libxslt/xsltutils.h>
The first group of include directives includes general C
libraries. The libraries we need to make libxslt work are in the
second group. The transform.h header file
declares the API that does the bulk of the actual processing. The
xsltutils.h header file declares the API for some
generic utility functions of the XSLT engine; among other things,
saving to a file, which is what we need it for.
If our input files contain entities through external subsets, we need
to tell libxslt to load them. The global variable
xmlLoadExtDtdDefaultValue, defined in
libxml/globals.h, is responsible for that. As the
variable is defined outside our program we must specify external
linkage:
extern int xmlLoadExtDtdDefaultValue;
The program is called from the command line. We anticipate that the
user may not call it the right way, so we define a function for
describing its usage:
static void usage(const char *name) {
    printf("Usage: %s [options] stylesheet [stylesheet ...] file [file ...]\n",
           name);
    printf(" --out file: send output to file\n");
    printf(" --param name value: pass a (parameter,value) pair\n");
}
Program Start
We need to define a few variables that are used throughout the
program:
int main(int argc, char **argv) {
    int arg_indx;
    const char *params[16 + 1];
    int params_indx = 0;
    int stylesheet_indx = 0;
    int file_indx = 0;
    int i, j, k;
    FILE *output_file = stdout;
    xsltStylesheetPtr *stylesheets =
        (xsltStylesheetPtr *) calloc(argc, sizeof(xsltStylesheetPtr));
    xmlDocPtr *files = (xmlDocPtr *) calloc(argc, sizeof(xmlDocPtr));
    int return_value = 0;
The arg_indx integer is an index used to
iterate over the program arguments. The params
string array is used to collect the XSLT parameters. In XSLT,
additional information may be passed to the processor via
parameters. The user of the program specifies these in key-value pairs
in the command line following the --param
command line argument. We accept up to 8 such key-value pairs, which
we track with the params_indx integer. libxslt
expects the parameters array to be null-terminated, so we have to
allocate one extra place (16 + 1) for it. The
file_indx is an index to iterate over the files to
be processed. The i, j,
k integers are additional indices for iteration
purposes, and return_value is the value the program
returns to the operating system. We expect the result of the
transformation to be the standard output in most cases, but the user
may wish otherwise via the --out command line
option, so we need to keep track of the situation with the
output_file file pointer.
In libxslt, XSLT stylesheets are internally stored in
xsltStylesheet structures; similarly, in
libxml XML documents are stored in xmlDoc
structures. xsltStylesheetPtr and xmlDocPtr
are simply typedefs of pointers to them. The user may specify any
number of stylesheets that will be applied to the documents one after
the other. To save time we parse the stylesheets and the documents as
we read them from the command line and keep the parsed representation
of them. The parsed results are kept in arrays. These are dynamically
allocated and sized to the number of arguments; this wastes some
space, but not much (the size of xsltStylesheetPtr and
xmlDocPtr is the size of a pointer) and simplifies code
later on. The array memory is allocated with
calloc to ensure contents are initialised to
zero.
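The bookkeeping for the parameter array can be sketched as follows. This is a minimal illustration rather than the program's own code: the add_param helper and the MAX_PARAMS constant are hypothetical names introduced here. The helper stores a name and its value in consecutive slots and keeps the array null-terminated after every insertion.

```c
#include <stddef.h>

#define MAX_PARAMS 16   /* 8 (name, value) pairs; hypothetical constant */

/*
 * Hypothetical helper: append one (name, value) pair to params, which must
 * have room for MAX_PARAMS + 1 pointers. *count is the number of slots in
 * use. Returns 0 on success, -1 if the array is already full.
 */
static int add_param(const char **params, int *count,
                     const char *name, const char *value)
{
    if (*count + 2 > MAX_PARAMS)
        return -1;
    params[(*count)++] = name;
    params[(*count)++] = value;
    params[*count] = NULL;   /* keep the array null-terminated at all times */
    return 0;
}
```

Note that libxslt treats each parameter value as an XPath expression, so a plain string value has to carry its own quotes (for instance the value 'draft', quotes included).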
Arguments Collection
If the program gets no arguments at all, we print the usage
description, set the program return value to 1 and exit. Instead of
returning directly we go (literally) to the end of the program text
where some housekeeping takes place.
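    if (argc <= 1) {
        usage(argv[0]);
        return_value = 1;
        goto finish;
    }

    /* Collect the options: the leading arguments that start with a dash */
    for (arg_indx = 1; arg_indx < argc; arg_indx++) {
        if (argv[arg_indx][0] != '-')
            break;
        if (!strcmp(argv[arg_indx], "--param")) {
            /* check before storing, so that 8 (name, value) pairs fit */
            if (params_indx >= 16) {
                fprintf(stderr, "too many params\n");
                return_value = 1;
                goto finish;
            }
            arg_indx++;
            params[params_indx++] = argv[arg_indx++];
            params[params_indx++] = argv[arg_indx];
        } else if ((!strcmp(argv[arg_indx], "-o"))
                   || (!strcmp(argv[arg_indx], "--out"))) {
            arg_indx++;
            output_file = fopen(argv[arg_indx], "w");
        } else {
            fprintf(stderr, "Unknown option %s\n", argv[arg_indx]);
            usage(argv[0]);
            return_value = 1;
            goto finish;
        }
    }
    params[params_indx] = 0;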
If the user passes arguments we have to collect them. This is a
matter of iterating over the program argument list while we encounter
arguments starting with a dash. The XSLT parameters are put into the
params array and the output_file
is set to the user request, if any. After processing all the parameter
key-value pairs we set the last element of the params
array to null.
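A slightly more defensive variant of this scan would also verify that every option is followed by the values it needs. The sketch below is an illustration under our own assumptions, not part of the program: scan_options is a hypothetical helper that only counts the parameter pairs and records the output file name, returning the index of the first argument that is not an option.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan the leading options of argv.
 * On success returns the index of the first non-option argument;
 * on error prints a message and returns -1.
 * *pairs receives the number of --param pairs seen, *out_name the
 * argument of --out (or NULL if the option is absent).
 */
static int scan_options(int argc, char **argv, int *pairs,
                        const char **out_name)
{
    int i;

    *pairs = 0;
    *out_name = NULL;
    for (i = 1; i < argc && argv[i][0] == '-'; i++) {
        if (!strcmp(argv[i], "--param")) {
            if (i + 2 >= argc) {       /* a name and a value must follow */
                fprintf(stderr, "--param needs a name and a value\n");
                return -1;
            }
            (*pairs)++;
            i += 2;                    /* skip over the name and the value */
        } else if (!strcmp(argv[i], "-o") || !strcmp(argv[i], "--out")) {
            if (i + 1 >= argc) {       /* a file name must follow */
                fprintf(stderr, "%s needs a file name\n", argv[i]);
                return -1;
            }
            *out_name = argv[++i];
        } else {
            fprintf(stderr, "Unknown option %s\n", argv[i]);
            return -1;
        }
    }
    return i;
}
```

The caller would then open the output file itself and check the result of fopen before writing to it.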
Parsing
The rest of the argument list is taken to be stylesheets and
files to be transformed. Stylesheets are identified by their suffix,
which is expected to be xsl (case sensitive). All other files are
assumed to be XML documents, regardless of suffix.
Stylesheets are parsed using the
xsltParseStylesheetFile
function. xsltParseStylesheetFile takes as
argument a pointer to an xmlChar, a typedef of an
unsigned char; in effect, the filename of the stylesheet. The
resulting xsltStylesheetPtr is placed in the
stylesheets array. In the same vein, XML files are
parsed using the xmlParseFile function that takes
as argument the file's name; the resulting xmlDocPtr is
placed in the files array.
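The suffix test that separates stylesheets from documents can be sketched as below. The has_xsl_suffix helper is a hypothetical name used for illustration; it assumes, as described above, that a file is a stylesheet exactly when its name ends in the three characters xsl, compared case-sensitively.

```c
#include <string.h>

/*
 * Hypothetical helper: returns 1 if filename ends in "xsl" (case
 * sensitive), 0 otherwise. Names shorter than the suffix never match.
 */
static int has_xsl_suffix(const char *filename)
{
    static const char suffix[] = "xsl";
    size_t len = strlen(filename);
    size_t suffix_len = sizeof suffix - 1;

    if (len < suffix_len)
        return 0;
    return strcmp(filename + (len - suffix_len), suffix) == 0;
}
```

A name for which the helper returns 1 would be handed to xsltParseStylesheetFile, and any other name to xmlParseFile. Since only the last three characters are compared, a stylesheet named foo.XSL or foo.xslt would be treated as an XML document.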
File Processing
All stylesheets are applied to each file one after the
other. Stylesheets are applied with the
xsltApplyStylesheet function that takes as
argument the stylesheet to be applied, the file to be transformed and
any parameters we have collected. The in-memory representation of an
XML document takes space, which we free using the
xmlFreeDoc function. The file is then saved to the
specified output.
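The ownership pattern behind this loop (apply a stage, free its input, feed its output to the next stage) does not depend on libxslt, so we can sketch it with plain strings standing in for documents. Everything in the sketch is hypothetical scaffolding: stage_fn stands in for a call to xsltApplyStylesheet with a given stylesheet, free stands in for xmlFreeDoc, and upcase and wrap are toy transformations.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for a transformation: returns a newly allocated result,
 * or NULL on failure, and leaves its input untouched. */
typedef char *(*stage_fn)(const char *input);

/*
 * Apply every stage of the NULL-terminated stages array in order.
 * Takes ownership of doc: each intermediate document is freed as soon as
 * the next one has been produced, so at most two are alive at any time.
 * The caller frees the returned result.
 */
static char *apply_stages(char *doc, stage_fn *stages)
{
    int j;

    for (j = 0; doc != NULL && stages[j] != NULL; j++) {
        char *res = stages[j](doc);
        free(doc);   /* the input of this stage is no longer needed */
        doc = res;   /* the output feeds the next stage */
    }
    return doc;
}

/* Toy stage: upper-case copy of the input */
static char *upcase(const char *input)
{
    size_t i, len = strlen(input);
    char *res = malloc(len + 1);

    if (res == NULL)
        return NULL;
    for (i = 0; i <= len; i++)   /* <= copies the terminator too */
        res[i] = (char) toupper((unsigned char) input[i]);
    return res;
}

/* Toy stage: copy of the input wrapped in square brackets */
static char *wrap(const char *input)
{
    size_t len = strlen(input);
    char *res = malloc(len + 3);

    if (res == NULL)
        return NULL;
    res[0] = '[';
    memcpy(res + 1, input, len);
    res[len + 1] = ']';
    res[len + 2] = '\0';
    return res;
}
```

With no stages at all the function returns the document it was given, which corresponds to the case where the user specifies no stylesheets.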
To output an XML document we have in memory we use the
xsltSaveResultToFile function, where we specify
the destination, the document and the stylesheet that has been applied
to it. The stylesheet is required so that output-related information
contained in the stylesheet, such as the encoding to be used, is used
in output. If no transformation has taken place, which will happen
when the user specifies no stylesheets at all in the command line, we
use the xmlDocDump libxml function that saves the
source document to the file without further ado.
As parsed stylesheets take up space in memory, we take care to
free that memory after use with a call to
xsltFreeStylesheet. When all work is done, we
clean up all global variables used by the XSLT library using
xsltCleanupGlobals. Likewise, all global memory
allocated for the XML parser is reclaimed by a call to
xmlCleanupParser. Before returning we deallocate
the memory allocated for holding the pointers to the XML documents
and stylesheets.
*NIX Compiling and Linking
Compiling and linking in a *NIX environment
is easy, as the required libraries are almost certain to be already in
place (remember that libxml and libxslt are used by the GNOME project,
so they are present in most installations). The program can be
dynamically linked so that its footprint is minimized, or statically
linked, so that it stands by itself, carrying all required code.
For dynamic linking the following one-liner will do:
gcc -o libxslt_pipes -Wall -I/usr/include/libxml2 libxslt_pipes.c \
    -L/usr/lib -lxslt -lxml2
We assume that the necessary header files are in /usr/include/libxml2 and that the
required libraries (libxslt.so,
libxml2.so) are in /usr/lib.
In general, a program may need to link to additional libraries,
depending on the processing it actually performs. A good way to start
is to use the xslt-config script. The
--help option displays usage
information. Running
xslt-config --cflags
we get compile flags, while running
xslt-config --libs
we get the library settings for the linker.
For static linking we must list more libraries than we did for
dynamic linking, as the libraries on which the libxml and libxslt
libraries depend are also needed. Using xslt-config
on a particular installation we create the following one-liner:
gcc -o libxslt_pipes -Wall -I/usr/include/libxml2 libxslt_pipes.c \
    -static -L/usr/lib -lxslt -lxml2 -lz -lpthread -lm
If we get warnings to the effect that some function in
statically linked applications requires at runtime the shared
libraries from the glibc version used for linking, that means
that the binary is not completely static. Although we statically
linked against the GNU C runtime library glibc, glibc uses external
libraries to perform some of its functions. Libraries of the same
version must be present on the system where we want the application
to run. One way to avoid this is to use an alternative C runtime, for
example uClibc, which requires obtaining
and building a uClibc toolchain first (if the reason for trying to get
a statically linked version of the program is to embed it somewhere,
using uClibc might be a good idea anyway).
MS-Windows Compiling and Linking
Compiling and linking in MS-Windows requires
some attention. First, the MS-Windows ports must be
downloaded and installed on the programming workstation. The ports are
available on Igor
Zlatković's site. We need the ports for iconv, zlib, libxml,
and libxslt. In contrast to *NIX environments, we
cannot assume that the libraries needed will be present in other
computers where the program will be used. One solution is to
distribute the program along with the necessary dynamic
libraries. Another solution is to statically link the program so that
only a single executable file will have to be distributed.
We assume that we have decompressed the downloaded ports and
have placed the required contents of their include directories in an
include directory in our file system. The
required contents include everything apart from the libexslt directory
of the libxslt port,
as we are not using EXSLT (an initiative to provide extensions to
XSLT) in this project. In order to compile the program we have to make
sure that all necessary header files are included. When using the
Microsoft compiler this translates to adding the required /I
switches in the command line. If using a Visual
Studio product the same effect is attained by specifying additional
include directories in the compilation options. In the end, if the
headers have been copied in C:\include the command line must contain
/IC:\include.
This being a C program, it needs to be compiled against an
implementation of the C libraries. Microsoft provides various
implementations. The ports, however, have been compiled against the
msvcrt.dll implementation, so it is wise to use
the same runtime in our project, unless we wish to run into
unexpected runtime crashes. The msvcrt.dll is a
multi-threaded implementation and is specified by giving
/MD as a compiler option. Unfortunately, the
correspondence between the /MD switch and
msvcrt.dll breaks after version 6 of the
Microsoft compiler. In version 7 and later (i.e., Visual Studio .NET),
/MD links against a different DLL; in version 7.1
this is msvcr71.dll. The end result of this bit
of esoterica is that if you try to dynamically link your application
against the prebuilt ports with a compiler whose version is greater
than 6, your program is
likely to crash unexpectedly. Alternatively, you may wish to compile
all of iconv, zlib, libxml and libxslt yourself, using the new runtime
library. This is not a tall order, and some details are given
below.
There are three kinds of libraries in MS-Windows. Dynamically
Linked Libraries (DLLs), like msvcrt.dll we met
above, are used for dynamic linking; an application links to them at
runtime, so the application does not include the code contained in
them. Static libraries are used for static linking; an application
adds the libraries' code to its own code at link time. Import
libraries are used when building an application that uses DLLs. For
the application to be built, the linker must somehow find the
definitions of the functions that will be provided at runtime by the
DLLs, otherwise it will complain about unresolved references. Import
libraries contain function stubs that, for each DLL function we want
to call, know where to look for it in the DLL. In essence, in order to
use a DLL we must link against its corresponding import library. DLLs
have a .dll suffix; static and import libraries
both have a .lib suffix. In the MS-Windows ports
of libxml and libxslt static libraries are distinguished by their name
ending in _a.lib, while in the zlib port the
import library is zdll.lib and the static library
is zlib.lib. In what follows we assume we have a
lib directory in our filesystem
where we place the libraries we need for linking.
If we want to link dynamically we must make sure the lib directory
contains
iconv.lib, libxslt.lib,
libxml2.lib, and
zdll.lib. When using the Microsoft linker this
translates to adding the required /LIBPATH
switch and the necessary libraries in the command line. In Visual
Studio we must specify an additional library directory for lib and
put the necessary libraries in
the additional dependencies. In the end, the command line must include
/LIBPATH:C:\lib iconv.lib libxslt.lib libxml2.lib zdll.lib,
provided the libraries'
directory is C:\lib. In order
for the resulting executable to run, the ports' DLLs must be present;
one way is to place all DLLs contained in the ports in the home
directory of our application, and make sure they are distributed
together.
If we want to link statically we must make sure the lib directory
contains
iconv_a.lib, libxslt_a.lib,
libxml2_a.lib, and
zlib.lib. Adding lib as a library directory and putting
the necessary libraries in the additional dependencies, we get a
command line that should include
/LIBPATH:C:\lib iconv_a.lib libxslt_a.lib libxml2_a.lib zlib.lib.
The resulting executable is much bigger
than if we linked dynamically; it is, however, self-contained and can
be distributed more easily, in theory at least. In practice, however,
the executable is not completely static. We saw that the ports are
compiled against msvcrt.dll, so the program does
require that DLL at runtime. Moreover, when using a version of
Microsoft developer tools with a version number greater than 6, we are
no longer using msvcrt.dll, but another runtime
like msvcr71.dll, and we then need that DLL. In
contrast to msvcrt.dll it may not be present on
the target computer, so we may have to copy it along.
Building the Ports in MS-Windows
The source code of the ports is readily available on the web;
one has to check the ports' sites. Each port can be built without
problems in an MS-Windows environment using Microsoft development
tools. The necessary command line tools (compiler, linker,
nmake) must be available. This means running a
batch file called vcvars32.bat that comes with
Visual Studio (its exact location in the directory tree may vary
depending on the version of Visual Studio, but a file search will find
it anyway). Makefiles for the Microsoft tools are found in all
ports. They are distinguished by their suffix, e.g.,
Makefile.msvc or
Makefile.msc. To build zlib it suffices to run
nmake against Makefile.msc
(i.e., with the /f option); similarly, to build
iconv it suffices to run nmake
against Makefile.msvc. Building libxml and
libxslt requires an extra configuration step; we must run the
configure.js configuration script with the
cscript command. configure.js
is found in the win32 directory
in the distributions. It is written in JScript, Microsoft's
implementation of the ECMA 262 language specification (ECMAScript
Edition 3), a JavaScript offspring. The configuration script takes a
number of parameters detailing our environment and needs;
cscript configure.js help documents
them.
It is wise to read all documentation files in the source
distributions before starting; moreover, pay attention to the
dependencies between the ports. If we configure libxml and libxslt to
use iconv and zlib we must build these two first and make sure their
headers and libraries can be found by the compiler and the
linker when building libxml and libxslt.
zlib, iconv and All That
We saw that libxml and libxslt depend on various other
libraries, for instance zlib, iconv, and so forth. Taking a look into
them gives us clues on the capabilities of libxml and libxslt.
zlib is a free general
purpose lossless data compression library. It is a venerable
workhorse; more than 500 applications
(both commercial and open source) seem to use the library. libxml uses
zlib so that it can read from or write to compressed files
directly. The xmlParseFile function can
transparently parse a compressed document to produce an
xmlDoc. If we want to create a compressed
document with libxml we can use an
xmlTextWriterPtr (obtained through
xmlNewTextWriterDoc), or another related
structure from libxml/xmlwriter.h, with
compression enabled.
XML allows documents to use a variety of different character
encodings. iconv is a free
library for converting between different character encodings. libxml
provides a set of default converters for some encodings: UTF-8, UTF-16
(little endian and big endian), ISO-8859-1, ASCII, and HTML (a
specific handler for the conversion of UTF-8 to ASCII with HTML
predefined entities like &copy; for the copyright sign). However,
when compiled with iconv support, libxml and libxslt can handle the
full range of encodings provided by iconv; these should cover most
needs.
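To get a feel for what iconv does underneath libxml, here is a small sketch that calls the iconv API directly. It assumes the POSIX iconv.h interface (provided by glibc, or by the separate iconv library, in which case linking with -liconv may be needed); the latin1_to_utf8 helper is a hypothetical name of our own and is not part of libxml or libxslt.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: convert the NUL-terminated ISO-8859-1 string in
 * to UTF-8, writing at most out_size - 1 bytes plus a terminating NUL.
 * Returns the number of bytes written (excluding the NUL), or -1 on error.
 */
static long latin1_to_utf8(const char *in, char *out, size_t out_size)
{
    iconv_t cd;
    char *in_ptr = (char *) in;   /* iconv() takes a non-const pointer */
    char *out_ptr = out;
    size_t in_left = strlen(in);
    size_t out_left;
    size_t rc;

    if (out_size == 0)
        return -1;
    out_left = out_size - 1;      /* reserve room for the terminator */

    cd = iconv_open("UTF-8", "ISO-8859-1");   /* (to, from) */
    if (cd == (iconv_t) -1)
        return -1;
    rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t) -1)        /* e.g. the output buffer is too small */
        return -1;

    *out_ptr = '\0';
    return (long) (out_ptr - out);
}
```

The single ISO-8859-1 byte 0xE9 (e with an acute accent) becomes the two bytes 0xC3 0xA9 in UTF-8, which is why the output buffer may need to be larger than the input.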
libxml and libxslt can be used in multi-threaded
applications. In MS-Windows they are linked against
MSVCRT.DLL (or one of its descendants, as we saw
above). In *NIX the pthreads
(POSIX threads) library is used.
The Complete Program
The complete program listing is given below. The program is also
available online.