XML::XPathScript - a Perl framework for XML stylesheets
use XML::XPathScript;
use IO::File;
my $xps = XML::XPathScript->new(xml => $xml, stylesheet => $stylesheet);
# The short way:
$xps->process();
# The long way (caching the compiled stylesheet for reuse and
# outputting to multiple files):
my $compiled = XML::XPathScript->new(stylesheetfile => $filename)
    ->compile('$r');
foreach my $xml (@xmlfiles) {
    # Open each output file for writing.
    my $currentIO = IO::File->new(shift @outputfiles, 'w');
    XML::XPathScript->new(xml => $xml, compiledstylesheet => $compiled)
        ->process(sub { $currentIO->print(shift) });
}
# Making extra variables available to the stylesheet dialect:
my $handler=$xps->compile('$r');
&$handler($xmltree,&Apache::print,Apache->request());
This is the XML::XPathScript stylesheet framework, part of the AxKit project at http://axkit.org/.
XPathScript is a stylesheet language similar in many ways to XSLT (in concept, not in appearance), for transforming XML from one format to another (possibly HTML, but XPathScript also shines for non-XML-like output).
Like XSLT, XPathScript offers a dialect to mix verbatim portions of documents and code. Also like XSLT, it leverages the powerful ``templates/apply-templates'' and ``cascading stylesheets'' design patterns, that greatly simplify the design of stylesheets for programmers. The availability of the XPath query language inside stylesheets promotes the use of a purely document-dependent, side-effect-free coding style. But unlike XSLT which uses its own dedicated control language with an XML-compliant syntax, XPathScript uses Perl which is terse and highly extendable.
The result of the merge is an extremely powerful tool for rendering complex XML documents into other formats. Stylesheets written in XPathScript are very easy to create, extend and reuse, even if they manage hundreds of different XML tags.
See http://axkit.org/docs/xpathscript/guide.dkb for a head start. There you will learn how to mark up the embedded dialect and fill in the template hash $t.
This CPAN module is bundled with an ``xpathscript'' shell tool that is to be invoked like this:
xpathscript mydocument.xml mystylesheet.xps
It will produce the resulting document on standard output. For more options, refer to xpathscript's man page.
A number of callback functions are available from the stylesheet proper. They apply against the current document and template hash, which are transparently passed back and forth as global variables (see Global variables). They are defined in the XML::XPathScript::Processor package (see its manpage), which is implicitly imported into all code written in the embedded stylesheet dialect.
The following methods are also available to peek at the internal state of the XPathScript engine from within the stylesheet. Although XML::XPathScript->current()->whatever() may be called from anywhere within the stylesheet (except a BEGIN or END block or similar), it is most unwise to alter the state of the interpreter from within a testcode block, as the order of evaluation of the XML nodes is not specified. Better to tweak the stylesheet globals (e.g. binmode) once and for all at the beginning of the stylesheet.
XML::XPathScript->current() returns the stylesheet object currently being applied. This can be called from anywhere within the stylesheet, except a BEGIN or END block or similar. Beware though that using the return value for altering (as opposed to reading) stuff from anywhere except the stylesheet's top level is unwise.
When interpolation is on, $template->{pre} and similar may contain expressions within braces, which are interpreted as XPath expressions and substituted in place. For example, the following code
$t->{'link'}{pre} = '<a href="{@url}">';
$t->{'link'}{post} = '</a>';
is enough for rendering a <link> element as an HTML hyperlink.
The interpolation-less version is slightly more complex as it requires a testcode:
$t->{'link'}{testcode} = sub {
    my ($currentnode, $t) = @_;
    my $url = findvalue('@url', $currentnode);
    $t->{pre} = "<a href='$url'>";
    $t->{post} = '</a>';
    return DO_SELF_AND_KIDS();
};
Interpolation is on by default. A (now undocumented) global variable used to change the default to off, but don't do that.
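The brace substitution described above can be pictured with a small stand-alone sketch. It is not the engine's own code: interpolate() and the %attributes hash are hypothetical names used for illustration, and a plain hash lookup stands in for the XPath query that the real engine would run against the current node.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each {expression} in $template with the
# value returned by $lookup->(expression). In the real engine the lookup
# would be an XPath query against the current node.
sub interpolate {
    my ($template, $lookup) = @_;
    (my $result = $template) =~ s/\{([^{}]*)\}/$lookup->($1)/ge;
    return $result;
}

# Stand-in for the attributes of a <link url="..."> element.
my %attributes = ('@url' => 'http://axkit.org/');

my $pre = interpolate('<a href="{@url}">', sub { $attributes{ $_[0] } // '' });
print "$pre\n";
```

With the stand-in data above, the pre string comes out with the braces replaced by the value of the url attribute.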
Calling XML::XPathScript->current()->binmode() is an irreversible operation with the consequences outlined in The Unicode mess.
Here are a few things to watch out for when coding stylesheets.
Unicode is a balucitherian character numbering standard, that strives to be a superset of all character sets currently in use by humans and computers. Going Unicode is therefore the way of the future, as it will guarantee compatibility of your applications with every character set on planet Earth: for this reason, all XML-compliant APIs (XML::XPathScript being no exception) should return Unicode strings in all their calls, regardless of the charset used to encode the XML document to begin with.
The gotcha is, the brave Unicode world sells itself in much the same way as XML when it promises that you'll still be able to read your data back in 30 years: that will probably turn out to be true, but until then, you can't :-)
Therefore, you as a stylesheet author will more likely than not need to do some wrestling with Unicode in Perl, XML::XPathScript or not. Here is a primer on how.
Unicode is not a text file format: UTF-8 is. Perl, when doing Unicode, prefers to use UTF-8 internally.
Unicode is a character numbering standard: that is, an abstract registry that associates unique integer numbers to a cast of thousands of characters. For example the ``smiling face'' is character number 0x263a, and the thin space is 0x2009 (there is a URL to a Unicode character table in SEE ALSO). Of course, this means that the 8-bits- (or even, Heaven forbid, 7-bits-?)-per-character idea goes out the window this instant. Coding every character on 16 bits in memory is an option (called UCS-2, the fixed-width ancestor of UTF-16), but not as simple an idea as it sounds: one would have to rewrite nearly every piece of C code for starters, and even then the Chinese aren't quite happy with ``only'' 65536 character code points.
Introducing UTF-8, which is a way of encoding Unicode character numbers (of any size) in an ASCII- and C-friendly way: all 128 ASCII characters (such as ``A'' or ``/'' or ``.'', but not the ISO-8859-1 8-bit extensions) have the same encoding in both ASCII and UTF-8, including the null character (which is good for strcpy() and friends). Of course, this means that the other characters are rendered using several bytes: for example ``é'' is the two bytes 0xC3 0xA9 in UTF-8, which show up as ``Ã©'' when displayed as ISO-8859-1. The result is therefore vaguely intelligible for a Western reader.
The programmer- and C-friendly characteristics of UTF-8 have made it the choice for dealing with Unicode in Perl. The interpreter maintains a ``UTF8-tainted'' bit on every string scalar it handles (much like the taint flag that the perlsec manpage describes for untrusted data). Every function in XML::XPathScript returns a string with this bit set to true: therefore, producing UTF-8 output is straightforward and one does not have to take any special precautions in XPathScript.
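The difference between characters, bytes and Perl's internal UTF-8 bit can be observed with the core Encode module. The snippet below is only an illustration and does not involve XML::XPathScript; the variable names are chosen for this example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "é" is Unicode character number 0xE9; chr() yields a one-character string.
my $char = chr(0xE9);

# Encoding to UTF-8 turns that single character into two bytes: 0xC3 0xA9.
my $bytes = encode('UTF-8', $char);
my $hex   = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;

# Decoding the bytes gives back a character string that Perl flags as UTF-8.
my $text    = decode('UTF-8', $bytes);
my $flagged = utf8::is_utf8($text) ? 1 : 0;

printf "characters: %d, bytes: %d (%s), flag: %d\n",
    length($text), length($bytes), $hex, $flagged;
```

The one-character string occupies two bytes once encoded, and the decoded copy carries the internal UTF-8 bit that the paragraph above refers to.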
When binmode is invoked from the stylesheet body, it signals that the stylesheet output should not be UTF-8, but instead some user-chosen character encoding that XML::XPathScript cannot and will not know or care about. Calling XML::XPathScript->current()->binmode() has the following consequences:
* testcode blocks have to take input in UTF-8 (as per the XML standard, UTF-8 indeed is what will be returned by findvalue() and such, see the XML::XPathScript::Processor manpage) and provide output in binary (in whatever character set is intended for the output), lest translate_node() croaks. The Unicode::String module comes in handy to the stylesheet writer to cast from UTF-8 to an 8-bit-per-character charset such as ISO 8859-1, while laundering Perl's internal UTF-8-string bit at the same time;
* the output filehandle(s) are set up so that a spurious, final charset conversion will not happen at print() time under any locales, versions of Perl, or phases of moon.
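The conversion that a testcode block has to perform once binmode is in effect might look like the sketch below. It uses the core Encode module rather than Unicode::String (a substitution made here because Encode ships with Perl), and to_latin1() is a hypothetical helper name, not part of XML::XPathScript.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a character string (such as findvalue() would
# return) into ISO-8859-1 bytes that no longer carry Perl's UTF-8 bit.
sub to_latin1 {
    my ($text) = @_;
    return encode('ISO-8859-1', $text);
}

# Simulate a UTF-8-flagged input string containing "café".
my $input = "caf\x{e9}";
utf8::upgrade($input);

my $output = to_latin1($input);

printf "input flagged: %d, output flagged: %d, output bytes: %d\n",
    (utf8::is_utf8($input) ? 1 : 0),
    (utf8::is_utf8($output) ? 1 : 0),
    length($output);
```

A testcode block would apply such a helper to every value it places in the template, so that only unflagged byte strings reach the output.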
XML::XPath calls such as findvalue() return objects in an object class designed to map one of the types mandated by the XPath spec (see the XML::XPath manpage for details). This is often not what a Perl programmer comes to expect (e.g. strings and numbers cannot be treated the same). There are some work-arounds built into XML::XPath, using operator overloading: when using those objects as strings (by concatenating them, using them in regular expressions etc.), they become strings, through a transparent call to one of their methods such as ->value(). However, we do not support this for a variety of reasons (from limitations in overload to stylesheet compatibility between XML::XPath and XML::LibXML to Unicode considerations), and that is why our findvalue and friends return a real Perl scalar, in violation of the XPath specification.
On the other hand, findnodes does return a list of objects in list context, and an XML::XPath::NodeSet or XML::LibXML::NodeList instance in scalar context, obeying the XPath specification in full. Therefore you most likely do not want to call findnodes() in scalar context, ever: replace
my $attrnode = findnodes('@url',$xrefnode); # WRONG!
with
my ($attrnode) = findnodes('@url',$xrefnode);
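Why the parentheses matter can be shown without any XML library. In the sketch below, find_things() is a hypothetical stand-in for findnodes(): it returns a list of nodes in list context and a single container in scalar context, which is the behaviour described above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for findnodes(): a list of nodes in list context,
# a single container object in scalar context.
sub find_things {
    my @nodes = ('node-a', 'node-b');
    return wantarray ? @nodes : { container => [@nodes] };
}

my $scalar  = find_things();    # scalar context: the container, not a node
my ($first) = find_things();    # list context: the first node

printf "scalar context gave a %s reference, list context gave %s\n",
    ref($scalar), $first;
```

The first assignment silently stores the container, which is rarely what the caller wanted; the parenthesised form picks the first node.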
The findvalue() and similar functions described in the XML::XPathScript::Processor manpage are not the only way of extracting bits from the XML document. Objects passed as the first argument to the ->{testcode} templates and returned by findnodes() in list context are of one of the XML::XPath::Node::* classes, and they feature some data extraction methods by themselves, conforming to the DOM specification.
However, the names of those methods are not standardized even among DOM parsers (the accessor to the childNodes property, for example, is named childNodes() in XML::LibXML and getChildNodes() in XML::XPath!). In order to write a stylesheet that is portable between XML::LibXML and XML::XPath used as back-ends to XML::XPathScript, one should refrain from calling those methods. The exact same data is available through appropriate XPath formulae, albeit more slowly, and there are also type-checking accessors such as is_element_node() in the XML::XPathScript::Processor manpage.
The rest of this POD documentation is not useful to programmers who just want to write stylesheets; it is of use only to people wanting to call existing stylesheets or more generally embed the XPathScript engine into some wider framework.
XML::XPathScript is an object-oriented class with the following features:
When run, the stylesheet is expected to fill in the template hash $t, which is a lexically-scoped variable made available to it at preprocess time.
Although XPathScript is a core component of AxKit, which depends on this module to be able to process XPathScript stylesheets, there is plenty of motivation for doing stylesheets outside of a WWW application server and so XML::XPathScript is also distributed as a standalone CPAN module. The AxKit XPathScript component inherits from this class and provides the coupling with the application framework by overloading and adding some methods.
XML::XPathScript requires the following Perl packages:
Due to the peculiar syntax allowed in the embedded dialect for accessing the template hash, the stylesheet is not reentrant and cannot (yet) transform several documents at once. However, one should not rely on those variables existing forever.
An XML::XPathScript object without an xml argument to the constructor is only able to compile stylesheets (see SYNOPSIS).
A stylesheet passed through the stylesheet argument may contain <!--#include --> constructs, which will be resolved relative to ``.''.
With the stylesheetfile argument, <!--#include -->s in the stylesheet file will be honored with respect to the dirname of $filename instead of ``.''; this provides SGML-style behaviour for inclusion (it does not depend on the current directory), which is usually what you want.
my $buffer = "";
$xps->process(sub { $buffer .= shift; });
or
$xps->process(sub { print ANOTHERFD (shift); });
(not that the latter would be any good, since $xps->process(\*ANOTHERFD) would do exactly the same, only faster)
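The two ways of directing output, a callback or a filehandle, can be reduced to a single code reference. The sketch below shows one way this could be done; make_printer() is a hypothetical helper and is not claimed to be how process() is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper: accept either a code reference or a filehandle
# and always hand back a code reference that outputs its arguments.
sub make_printer {
    my ($target) = @_;
    return $target if ref($target) eq 'CODE';
    return sub { print {$target} @_ };
}

# Callback form: collect everything in a buffer.
my $buffer = '';
my $to_buffer = make_printer(sub { $buffer .= shift });
$to_buffer->('<p>hello</p>');

# Filehandle form, demonstrated with an in-memory file.
my $captured = '';
open my $fh, '>', \$captured or die "cannot open in-memory file: $!";
my $to_handle = make_printer($fh);
$to_handle->('<p>world</p>');
close $fh;

print "buffer: $buffer\ncaptured: $captured\n";
```

Either form ends up delivering the same chunks of text; the filehandle form merely skips one level of indirection in the caller's code.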
If the stylesheet was compile()d with extra varnames, then the calling code should call process() with a corresponding number of @varvalues. The corresponding lexical variables will be set accordingly, so that the stylesheet code can get at them (looking at SYNOPSIS is the easiest way of getting the meaning of this sentence).
<%= stuff %> constructs in the stylesheet dialect are converted into invocations of XML::XPathScript->current()->print(), while <% stuff %> constructs are transcribed verbatim.
<!--#include --> constructs are expanded by passing their filename argument to include_file along with @includestack (if any) like this:
$self->include_file($includefilename,@includestack);
@includestack is not interpreted by extract() (except for the first entry, to create line tags for the debugger). It is only a bandaid for include_file() to pass the inclusion stack to itself across the mutual recursion existing between the two methods (see include_file). If extract() is invoked from outside include_file(), the last invocation form should not be used.
This method does a purely syntactic job. No special framework declaration is prepended for isolating the code in its own package, defining $t or the like (compile does that). It may be overridden in subclasses to provide different escape forms in the stylesheet dialect.
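The kind of purely syntactic transformation performed by extract() can be illustrated with a toy preprocessor. This is a simplified sketch, not the module's implementation: toy_extract() and emit() are hypothetical names, emit() stands in for the print() method, literal text is assumed to be output as well, and <!--#include --> handling is left out entirely.

```perl
use strict;
use warnings;

# Toy version of the de-embedding step: turn a stylesheet string into Perl
# source. Literal text and <%= expr %> become calls to emit() (a hypothetical
# stand-in for the print() method); <% code %> is copied through verbatim.
sub toy_extract {
    my ($stylesheet) = @_;
    my $perl = '';
    # Split on the escape forms, keeping them thanks to the capture group.
    for my $chunk (split /(<%.*?%>)/s, $stylesheet) {
        next if $chunk eq '';
        if ($chunk =~ /^<%=(.*)%>$/s) {
            $perl .= "emit($1);\n";
        }
        elsif ($chunk =~ /^<%(.*)%>$/s) {
            $perl .= "$1\n";
        }
        else {
            # Quote literal text so that it survives as a Perl string.
            (my $quoted = $chunk) =~ s/([\\'])/\\$1/g;
            $perl .= "emit('$quoted');\n";
        }
    }
    return $perl;
}

my $output = '';
sub emit { $output .= join '', @_ }

my $code = toy_extract('<% my $n = 6 * 7; %>Answer: <%= $n %>!');
eval $code;
die $@ if $@;
print "$output\n";
```

Running the generated code produces the stylesheet's output, which shows that the preprocessor itself never needs to understand the Perl it carries along.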
include_file() resolves a <!--#include file="foo" --> directive on behalf of extract(), that is, returns the script contents of $filename. The return value must be de-embedded too, which means that extract() has to be called recursively to expand the contents of $filename (which may contain more <!--#include -->s etc.)
$filename has to be slash-separated, whatever OS it is you are using (this is the XML way of things). If $filename is relative (i.e. does not begin with ``/'' or ``./''), it is resolved according to the dirname of the stylesheet that includes it (that is, $includestack[0], see below) or ``.'' if we are in the topmost stylesheet. Filenames beginning with ``./'' are considered absolute; this gives stylesheet writers a way to specify that they really really want a stylesheet that lies in the system's current working directory.
@includestack is the include stack currently in use, made up of all values of $filename through the stack, lastly added (innermost) entries first. The toplevel stylesheet is not in @includestack (that is, the outermost call does not specify an @includestack).
This method may be overridden in subclasses to provide support for alternate namespaces (e.g. ``axkit://'' URIs).
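The filename resolution rules described for include_file() can be summed up in a few lines. The helper below is a hypothetical sketch written from that description; the real method may differ in its details, and a subclass handling other namespaces would replace this logic.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);

# Hypothetical helper mirroring the rules described above; the real
# include_file() may differ in its details.
sub resolve_include {
    my ($filename, @includestack) = @_;
    # Names starting with "/" or "./" are used as they are.
    return $filename if $filename =~ m{^\.?/};
    # Otherwise resolve against the directory of the including stylesheet,
    # or "." when called from the topmost stylesheet.
    my $base = @includestack ? dirname($includestack[0]) : '.';
    return "$base/$filename";
}

print resolve_include('common.xps'), "\n";
print resolve_include('common.xps', 'styles/main.xps'), "\n";
print resolve_include('./local.xps', 'styles/main.xps'), "\n";
```

Only the innermost entry of the include stack is consulted, which matches the statement that the rest of @includestack is carried along without being interpreted.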
->extract($stylesheet) will be called first inside compile().
varname1, varname2, etc. are extraneous arguments that will be made available to the stylesheet dialect as lexically scoped variables. SYNOPSIS shows a way to use this feature to pass the Apache handler to AxKit XPathScript stylesheets, which explains this feature better than a lengthy paragraph would do.
The return value is an opaque token that encapsulates a compiled stylesheet. It should not be used, except as the compiledstylesheet argument to new() to initiate new objects and amortize the compilation time. Subclasses may alter the type of the return value, but will need to overload process() accordingly of course.
The compile() method is idempotent. Subsequent calls to it will return the very same token, and calls to it when a compiledstylesheet argument was set at new() time will return said argument.
$self->{printer} by said function. Overloading this method in a subclass provides yet another method to redirect output.
The functions below are not methods.
Reads XML given in $uri, parses it and returns it in a nodeset.
Created by Matt Sergeant <matt@sergeant.org>
Improvements and feature merge with Apache::AxKit::Language::XPathScript by Yanick Champoux <yanick@babyl.dyndns.org> and Dominique Quatravaux <dom@idealx.com>
This is free software. You may distribute it under the same terms as Perl itself.
The XPathScript Guide at
http://axkit.org/wiki/view/AxKit/XPathScriptGuide
XPath documentation from W3C:
http://www.w3.org/TR/xpath
Unicode character table:
http://www.unicode.org/charts/charindex.html