XML::XPathScript - a Perl framework for XML stylesheets



NAME

XML::XPathScript - a Perl framework for XML stylesheets


SYNOPSIS

  use XML::XPathScript;
  use IO::File;

  my $xps = XML::XPathScript->new(xml => $xml, stylesheet => $stylesheet);

  # The short way:
  $xps->process();

  # The long way (caching the compiled stylesheet for reuse and
  # outputting to multiple files):
  my $compiled = XML::XPathScript->new(stylesheetfile => $filename)
         ->compile('$r');
  foreach my $xml (@xmlfiles) {
     # Open the next output file for writing.
     my $currentIO = IO::File->new(shift @outputfiles, 'w');
     XML::XPathScript->new(xml => $xml, compiledstylesheet => $compiled)
         ->process(sub { $currentIO->print(shift) });
  }

  # Making extra variables available to the stylesheet dialect
  # (the printer is passed as a code reference):
  my $handler = $xps->compile('$r');
  $handler->($xmltree, \&Apache::print, Apache->request());


DESCRIPTION

This is the XML::XPathScript stylesheet framework, part of the AxKit project at http://axkit.org/.

XPathScript is a stylesheet language similar in many ways to XSLT (in concept, not in appearance), for transforming XML from one format to another (possibly HTML, but XPathScript also shines for non-XML-like output).

Like XSLT, XPathScript offers a dialect to mix verbatim portions of documents and code. Also like XSLT, it leverages the powerful ``templates/apply-templates'' and ``cascading stylesheets'' design patterns, which greatly simplify the design of stylesheets for programmers. The availability of the XPath query language inside stylesheets promotes the use of a purely document-dependent, side-effect-free coding style. But unlike XSLT, which uses its own dedicated control language with an XML-compliant syntax, XPathScript uses Perl, which is terse and highly extendable.

The result of the merge is an extremely powerful tool for rendering complex XML documents into other formats. Stylesheets written in XPathScript are very easy to create, extend and reuse, even if they manage hundreds of different XML tags.


STYLESHEET WRITER DOCUMENTATION

Creating stylesheets

See http://axkit.org/docs/xpathscript/guide.dkb for a head start. There you will learn how to mark up the embedded dialect and fill in the template hash $t.

xpathscript Invocation

This CPAN module is bundled with an ``xpathscript'' shell tool that is to be invoked like this:

   xpathscript mydocument.xml mystylesheet.xps

It will produce the resulting document on standard output. For more options, refer to xpathscript's man page.

XPathScript methods available from within the stylesheet

A number of callback functions are available from the stylesheet proper. They apply against the current document and template hash, which are transparently passed back and forth as global variables (see Global variables). They are defined in the XML::XPathScript::Processor package (see its manpage), which is implicitly imported into all code written in the embedded stylesheet dialect.

The following methods are also available to peek at the internal state of the XPathScript engine from within the stylesheet. Although XML::XPathScript->current()->whatever() may be called from anywhere within the stylesheet (except a BEGIN or END block or similar), it is most unwise to alter the state of the interpreter from within a testcode block, as the order of evaluation of the XML nodes is not specified. It is better to tweak the stylesheet globals (e.g. binmode) once and for all at the beginning of the stylesheet.

current()
This class method (e.g. XML::XPathScript->current()) returns the stylesheet object currently being applied. This can be called from anywhere within the stylesheet, except a BEGIN or END block or similar. Beware though that using the return value for altering (as opposed to reading) stuff from anywhere except the stylesheet's top level is unwise.

interpolating()
interpolating($boolean)
Gets (first call form) or sets (second form) the XPath interpolation boolean flag. If true, values set in $template->{pre} and similar may contain expressions within braces, that will be interpreted as XPath expressions and substituted in place: for example, when interpolation is on, the following code
   $t->{'link'}{pre} = '<a href="{@url}">';
   $t->{'link'}{post} = '</a>';

is enough for rendering a <link> element as an HTML hyperlink. The interpolation-less version is slightly more complex as it requires a testcode:

   $t->{'link'}{testcode} = sub {
      my ($currentnode, $t) = @_;
      my $url = findvalue('@url', $currentnode);
      $t->{pre}  = "<a href='$url'>";
      $t->{post} = '</a>';
      return DO_SELF_AND_KIDS();
   };

Interpolation is on by default. A (now undocumented) global variable used to change the default to off, but don't do that.

binmode()
Declares that the stylesheet output is not in UTF-8, but instead in an (unspecified) character encoding embedded in the stylesheet source that neither Perl nor XPathScript should have any business dealing with. Calling XML::XPathScript->current()->binmode() is an irreversible operation with the consequences outlined in The Unicode mess.

Stylesheet Guidelines

Here are a few things to watch out for when coding stylesheets.

The Unicode mess

Unicode is a balucitherian character numbering standard, that strives to be a superset of all character sets currently in use by humans and computers. Going Unicode is therefore the way of the future, as it will guarantee compatibility of your applications with every character set on planet Earth: for this reason, all XML-compliant APIs (XML::XPathScript being no exception) should return Unicode strings in all their calls, regardless of the charset used to encode the XML document to begin with.

The gotcha is, the brave Unicode world sells itself in much the same way as XML when it promises that you'll still be able to read your data back in 30 years: that will probably turn out to be true, but until then, you can't :-)

Therefore, you as a stylesheet author will more likely than not need to do some wrestling with Unicode in Perl, XML::XPathScript or not. Here is a primer on how.

Unicode, UTF-8 and Perl

Unicode is not a text file format: UTF-8 is. Perl, when doing Unicode, prefers to use UTF-8 internally.

Unicode is a character numbering standard: that is, an abstract registry that associates unique integer numbers to a cast of thousands of characters. For example the ``smiling face'' is character number 0x263a, and the thin space is 0x2009 (there is a URL to a Unicode character table in SEE ALSO). Of course, this means that the 8-bits- (or even, Heaven forbid, 7-bits-?)-per-character idea goes out of the window this instant. Coding every character on 16 bits in memory is an option (the idea behind UCS-2 and UTF-16), but not as simple an idea as it sounds: one would have to rewrite nearly every piece of C code for starters, and even then the Chinese aren't quite happy with ``only'' 65536 character code points.

Introducing UTF-8, which is a way of encoding Unicode character numbers (of any size) in an ASCII- and C-friendly way: all 128 ASCII characters (such as ``A'' or ``/'' or ``.'', but not the ISO-8859-1 8-bit extensions) have the same encoding in both ASCII and UTF-8, including the null character (which is good for strcpy() and friends). Of course, this means that the other characters are rendered using several bytes: for example ``é'' is stored as the two bytes 0xC3 0xA9, which show up as ``Ã©'' when viewed as ISO-8859-1. The result is therefore vaguely intelligible for a Western reader.
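The byte-level behaviour described above can be checked with the core Encode module. This is only a minimal sketch and the variable names are illustrative:

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00E9 is LATIN SMALL LETTER E WITH ACUTE; "A" is plain ASCII.
my $e_acute = "\x{e9}";
my $ascii_a = "A";

# encode() turns a string of characters into a string of bytes.
my $e_bytes = encode('UTF-8', $e_acute);
my $a_bytes = encode('UTF-8', $ascii_a);

# Render each byte as two hexadecimal digits.
my $e_hex = join ' ', map { sprintf('%02X', ord($_)) } split(//, $e_bytes);
my $a_hex = join ' ', map { sprintf('%02X', ord($_)) } split(//, $a_bytes);

print "e-acute: $e_hex\n";   # C3 A9 (two bytes)
print "A:       $a_hex\n";   # 41 (one byte, same as ASCII)
```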

Output to UTF-8 with XPathScript

The programmer- and C-friendly characteristics of UTF-8 have made it the choice for dealing with Unicode in Perl. The interpreter maintains a ``UTF8-tainted'' bit on every string scalar it handles (much like the taint flag that the perlsec manpage describes for untrusted data). Every function in XML::XPathScript returns a string with this bit set to true: therefore, producing UTF-8 output is straightforward and one does not have to take any special precautions in XPathScript.
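To see the per-string flag at work outside of XPathScript, here is a small sketch using only core Perl (Encode and the built-in utf8::is_utf8 function). It does not call XML::XPathScript at all; it merely illustrates the difference between a byte string and a flagged character string:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Two bytes, as they might be read from a UTF-8 encoded file.
my $bytes = "\xC3\xA9";

# decode() yields a one-character string with the internal UTF-8 flag on.
my $chars = decode('UTF-8', $bytes);

printf "bytes: length %d, flag %s\n",
    length($bytes), (utf8::is_utf8($bytes) ? 'on' : 'off');
printf "chars: length %d, flag %s\n",
    length($chars), (utf8::is_utf8($chars) ? 'on' : 'off');

# Encoding on the way out gives back the original two bytes.
my $out = encode('UTF-8', $chars);
```

Strings coming back from the XPath functions are expected to behave like $chars here, which is why they can be printed to a UTF-8 output without further ado.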

Output to a non-UTF-8 character set with XPathScript

When binmode is invoked from the stylesheet body, it signals that the stylesheet output should not be UTF-8, but instead some user-chosen character encoding that XML::XPathScript cannot and will not know or care about. Calling XML::XPathScript->current()->binmode() has the following consequences:

XPath scalar return values considered harmful

XML::XPath calls such as findvalue() return objects in an object class designed to map one of the types mandated by the XPath spec (see the XML::XPath manpage for details). This is often not what a Perl programmer comes to expect (e.g. strings and numbers cannot be treated the same). There are some work-arounds built into XML::XPath, using operator overloading: when using those objects as strings (by concatenating them, using them in regular expressions etc.), they become strings, through a transparent call to one of their methods such as ->value(). However, we do not support this for a variety of reasons (from limitations in overload to stylesheet compatibility between XML::XPath and XML::LibXML to Unicode considerations), and that is why our findvalue and friends return a real Perl scalar, in violation of the XPath specification.

On the other hand, findnodes does return a list of objects in list context, and an XML::XPath::NodeSet or XML::LibXML::NodeList instance in scalar context, obeying the XPath specification in full. Therefore you most likely do not want to call findnodes() in scalar context, ever: replace

   my $attrnode = findnodes('@url',$xrefnode); # WRONG!

with

   my ($attrnode) = findnodes('@url',$xrefnode);
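The difference between the two call forms is plain Perl context sensitivity. The sketch below uses a hypothetical stand-in, fake_findnodes(), which is not part of XML::XPathScript and returns an array reference where the real function would return a node set object:

```perl
use strict;
use warnings;

# Hypothetical stand-in for findnodes(): a list in list context, a single
# container (here an array reference) in scalar context.
sub fake_findnodes {
    my @nodes = ('url-attribute-node');
    return wantarray ? @nodes : \@nodes;
}

my $scalar_result = fake_findnodes();   # scalar context: the container
my ($first_node)  = fake_findnodes();   # list context: the first node

print ref($scalar_result), "\n";   # ARRAY
print "$first_node\n";             # url-attribute-node
```

The parentheses around the variable on the left-hand side are what force list context.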

Do not use DOM method calls, for they make stylesheets non-portable

The findvalue() and similar functions described in the XML::XPathScript::Processor manpage are not the only way of extracting bits from the XML document. Objects passed as the first argument to the ->{testcode} templates and returned by findnodes() in array context are of one of the XML::XPath::Node::* classes, and they feature some data extraction methods by themselves, conforming to the DOM specification.

However, the names of those methods are not standardized even among DOM parsers (the accessor to the childNodes property, for example, is named childNodes() in XML::LibXML and getChildNodes() in XML::XPath!). In order to write a stylesheet that is portable between XML::LibXML and XML::XPath used as back-ends to XML::XPathScript, one should refrain from calling them. The exact same data is available through appropriate XPath formulae, albeit more slowly, and there are also type-checking accessors such as is_element_node() in the XML::XPathScript::Processor manpage.


TECHNICAL DOCUMENTATION

The rest of this POD documentation is not useful to programmers who just want to write stylesheets; it is of use only to people wanting to call existing stylesheets or more generally embed the XPathScript engine into some wider framework.

XML::XPathScript is an object-oriented class with the following features:

When run, the stylesheet is expected to fill in the template hash $t, which is a lexically-scoped variable made available to it at preprocess time.

Dependencies

Although XPathScript is a core component of AxKit, which depends on this module to be able to process XPathScript stylesheets, there is plenty of motivation for doing stylesheets outside of a WWW application server and so XML::XPathScript is also distributed as a standalone CPAN module. The AxKit XPathScript component inherits from this class and provides the coupling with the application framework by overloading and adding some methods.

XML::XPathScript requires the following Perl packages:

Symbol
For loading files from anonymous filehandles. Symbol is bundled with Perl.

File::Basename
For fetching stylesheets from system files. One may provide other means of fetching stylesheets through object inheritance (this is what AxKit does). File::Basename is bundled with Perl.

XML::Parser
XML::XPath
For the XML parser and XPath interpreter, obviously needed. Plans are to support the XML::LibXML package as an alternative, which does the same as the above in C (and is hence an order of magnitude faster).

Global variables

Due to the peculiar syntax allowed in the embedded dialect for accessing the template hash, the current document and template hash are kept in the global variables listed below. As a consequence, the stylesheet is not reentrant and cannot (yet) transform several documents at once. However, one should not rely on those variables existing forever.

$XML::XPathScript::xp
The XML::XPath object that holds the whole document (created by the new() constructor of XML::XPath)

$XML::XPathScript::trans
The template hash currently in use (known as $t in the AxKit documentation). Its keys are element names, and its values are the matching templates (as hash references).

Methods and class methods

new(key1=>value1,key2=>value2,...)
Creates a new XPathScript translator. The recognized named arguments are
xml => $xml
$xml is a scalar containing XML text, or a reference to a filehandle from which XML input is available, or an XML::XPath or XML::LibXML object (support for the latter object class is very poor for now, as it involves unparsing and parsing back into XML::XPath).

An XML::XPathScript object without an xml argument to the constructor is only able to compile stylesheets (see SYNOPSIS).

stylesheet => $stylesheet
$stylesheet is a scalar containing the stylesheet text, or a reference to a filehandle from which the stylesheet text is available. The stylesheet text may contain unresolved <!--#include --> constructs, which will be resolved relative to ``.''.

stylesheetfile => $filename
Same as stylesheet but let XML::XPathScript do the loading itself. Using this form, relative <!--#include -->s in the stylesheet file will be honored with respect to the dirname of $filename instead of ``.''; this provides SGML-style behaviour for inclusion (it does not depend on the current directory), which is usually what you want.

compiledstylesheet => $function
Re-uses a previous return value of compile() (see SYNOPSIS and compile), typically to apply the same stylesheet to several XML documents in a row.

interpolation_regex => $regex
Sets the interpolation regex to be $regex. Whatever is captured in $1 will be used as the xpath expression. Defaults to qr/{(.*?)}/.
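To make the role of $1 concrete, here is a minimal sketch of brace interpolation in plain Perl. It is not the module's implementation: lookup() and its table are hypothetical stand-ins for evaluating an XPath expression against the current node, and the regex is an escaped equivalent of the documented default (recent Perl versions are stricter about unescaped braces in patterns):

```perl
use strict;
use warnings;

# Escaped equivalent of the documented default qr/{(.*?)}/.
my $interpolation_regex = qr/\{(.*?)\}/;

# Hypothetical stand-in for an XPath evaluation: a plain lookup table.
my %fake_xpath_value = ('@url' => 'http://axkit.org/');

sub lookup {
    my ($expression) = @_;
    return exists $fake_xpath_value{$expression}
        ? $fake_xpath_value{$expression}
        : '';
}

# Replace every match with the value of the expression captured in $1.
sub interpolate {
    my ($template) = @_;
    my $result = $template;
    $result =~ s/$interpolation_regex/lookup($1)/ge;
    return $result;
}

print interpolate('<a href="{@url}">'), "\n";   # <a href="http://axkit.org/">
```

A custom interpolation_regex only has to keep the same contract: whatever ends up in $1 is the expression to evaluate.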

process()
process($printer)
process($printer,@varvalues)
Processes the document and stylesheet set at construction time, and prints the result to STDOUT by default. If $printer is set, it must be either a reference to a filehandle open for output, or a reference to a string, or a reference to a subroutine which does the output, as in
   my $buffer="";
   $xps->process(sub {$buffer.=shift;});

or

   $xps->process(sub {print ANOTHERFD (shift);});

(not that the latter would be any good, since $xps->process(\*ANOTHERFD) would do exactly the same, only faster)

If the stylesheet was compile()d with extra varnames, then the calling code should call process() with a corresponding number of @varvalues. The corresponding lexical variables will be set accordingly, so that the stylesheet code can get at them (looking at SYNOPSIS is the easiest way of getting the meaning of this sentence).
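The three accepted forms of $printer can all be reduced to a code reference that takes one chunk of text. The helper below, make_printer(), is a hypothetical illustration of that idea and not a function exported by XML::XPathScript:

```perl
use strict;
use warnings;

# Hypothetical helper: turn any accepted $printer form into a code
# reference taking one chunk of text.
sub make_printer {
    my ($printer) = @_;
    return sub { print STDOUT $_[0] }    unless defined $printer;
    return $printer                      if ref($printer) eq 'CODE';
    return sub { $$printer .= $_[0] }    if ref($printer) eq 'SCALAR';
    return sub { print {$printer} $_[0] };   # assume a filehandle reference
}

# Reference to a string: chunks are appended to it.
my $buffer = '';
my $to_string = make_printer(\$buffer);
$to_string->('Hello, ');
$to_string->('world');

# Reference to a subroutine: used as is.
my @chunks;
my $to_sub = make_printer(sub { push @chunks, shift });
$to_sub->('chunk');

print "$buffer\n";   # Hello, world
```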

extract($stylesheet)
extract($stylesheet,$filename)
extract($stylesheet,@includestack) # from include_file() only
The embedded dialect parser. Given $stylesheet, which is either a filehandle reference or a string, returns a string that holds all the code in real Perl. Unquoted text and <%= stuff %> constructs in the stylesheet dialect are converted into invocations of XML::XPathScript->current()->print(), while <% stuff %> constructs are transcribed verbatim.

<!-- #include --> constructs are expanded by passing their filename argument to include_file along with @includestack (if any) like this:

   $self->include_file($includefilename,@includestack);

@includestack is not interpreted by extract() (except for the first entry, to create line tags for the debugger). It is only a bandaid for include_file() to pass the inclusion stack to itself across the mutual recursion existing between the two methods (see include_file). If extract() is invoked from outside include_file(), the last invocation form should not be used.

This method does a purely syntactic job. No special framework declaration is prepended for isolating the code in its own package, defining $t or the like (compile does that). It may be overridden in subclasses to provide different escape forms in the stylesheet dialect.

$string = read_stylesheet( $stylesheet )
Reads the $stylesheet (which can be a filehandle or a string). Used by extract; it exists so that it can be overloaded in Apache::AxKit::Language::YPathScript.
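As an illustration of this kind of de-embedding, here is a much simplified sketch. tiny_extract() is hypothetical: it ignores <!--#include --> directives and debugger line tags, and it emits calls to Perl's built-in print rather than to XML::XPathScript->current()->print():

```perl
use strict;
use warnings;

# Hypothetical, much simplified de-embedding routine.
sub tiny_extract {
    my ($stylesheet) = @_;
    my $perl = '';
    # Split on <% ... %> blocks; the capture group keeps them in the list.
    foreach my $piece (split /(<%.*?%>)/s, $stylesheet) {
        next if $piece eq '';
        if ($piece =~ /^<%=(.*)%>$/s) {
            $perl .= "print($1);\n";       # expression: print its value
        } elsif ($piece =~ /^<%(.*)%>$/s) {
            $perl .= "$1\n";               # code: copied verbatim
        } else {
            # Literal text: escape it for a single-quoted Perl string.
            (my $quoted = $piece) =~ s/([\\'])/\\$1/g;
            $perl .= "print('$quoted');\n";
        }
    }
    return $perl;
}

my $code = tiny_extract('<% my $n = 6 * 7; %>Answer: <%= $n %>!');

# Run the generated code, capturing what it prints in $out.
my $out = '';
open(my $fh, '>', \$out) or die "cannot open in-memory handle: $!";
my $old = select($fh);
eval $code;
my $error = $@;
select($old);
close($fh);
die $error if $error;

print "$out\n";   # Answer: 42!
```

Keeping this step purely textual is what lets compile() decide afterwards in which package and with which variables the code is evaluated.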

include_file($filename)
include_file($filename,@includestack)
Resolves a <!--#include file="foo" --> directive on behalf of extract(), that is, returns the script contents of $filename. The return value must be de-embedded too, which means that extract() has to be called recursively to expand the contents of $filename (which may contain more <!--#include -->s etc.)

$filename has to be slash-separated, whatever OS it is you are using (this is the XML way of things). If $filename is relative (i.e. does not begin with ``/'' or ``./''), it is resolved relative to the directory of the stylesheet that includes it (that is, the dirname of $includestack[0], see below) or ``.'' if we are in the topmost stylesheet. Filenames beginning with ``./'' are considered absolute; this gives stylesheet writers a way to specify that they really really want a stylesheet that lies in the system's current working directory.

@includestack is the include stack currently in use, made up of all values of $filename through the stack, lastly added (innermost) entries first. The toplevel stylesheet is not in @includestack (that is, the outermost call does not specify an @includestack).

This method may be overridden in subclasses to provide support for alternate namespaces (e.g. ``axkit://'' URIs).
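The resolution rules above can be sketched as follows. resolve_include() is a hypothetical helper written for this illustration (the file names are made up too); it only computes a path and does not read or de-embed anything:

```perl
use strict;
use warnings;
use File::Basename qw(dirname);

# Hypothetical helper. @includestack holds the including files, innermost
# first; it is empty when called for the top-level stylesheet.
sub resolve_include {
    my ($filename, @includestack) = @_;
    # Names starting with "/" or "./" are used as they are.
    return $filename if $filename =~ m{^\.?/};
    my $base = @includestack ? dirname($includestack[0]) : '.';
    return "$base/$filename";
}

print resolve_include('common.xps'), "\n";
# ./common.xps
print resolve_include('common.xps', '/srv/styles/main.xps'), "\n";
# /srv/styles/common.xps
print resolve_include('./local.xps', '/srv/styles/main.xps'), "\n";
# ./local.xps
```

A subclass supporting other namespaces would replace this path arithmetic with its own lookup.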

compile()
compile(varname1, varname2,...)
Compiles the stylesheet set at new() time and returns an anonymous CODE reference. $stylesheet shall be written in the unparsed embedded dialect (in other words ->extract($stylesheet) will be called first inside compile()).

varname1, varname2, etc. are extraneous arguments that will be made available to the stylesheet dialect as lexically scoped variables. SYNOPSIS shows a way to use this feature to pass the Apache handler to AxKit XPathScript stylesheets, which explains this feature better than a lengthy paragraph would do.

The return value is an opaque token that encapsulates a compiled stylesheet. It should not be used, except as the compiledstylesheet argument to new() to initiate new objects and amortize the compilation time. Subclasses may alter the type of the return value, but will need to overload process() accordingly of course.

The compile() method is idempotent. Subsequent calls to it will return the very same token, and calls to it when a compiledstylesheet argument was set at new() time will return said argument.
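One way to picture how extra variable names become lexical variables is to wrap the extracted code in a generated subroutine. The sketch below is an assumption about the general technique, not the module's actual code: tiny_compile() is hypothetical, and the XML tree argument shown in SYNOPSIS is left out for brevity:

```perl
use strict;
use warnings;

# Hypothetical sketch: wrap already-extracted Perl code in a sub whose
# arguments are the printer followed by the variables named at compile time.
sub tiny_compile {
    my ($perl_body, @varnames) = @_;
    my $vars    = join ', ', '$printer', @varnames;
    my $source  = "sub { my ($vars) = \@_; $perl_body }";
    my $handler = eval $source;
    die "compilation failed: $@" unless $handler;
    return $handler;
}

# The stylesheet code may now refer to $r as an ordinary lexical.
my $handler = tiny_compile('$printer->("request is $r");', '$r');

my $output = '';
$handler->(sub { $output .= shift }, 'fake-request');
print "$output\n";   # request is fake-request
```

Because the result is an ordinary code reference, it can be kept around and called again for each new document, which is the point of the compiledstylesheet argument.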

print($text)
Outputs a chunk of text on behalf of the stylesheet. The default implementation is to use the second argument to process, which was stashed in $self->{printer} by said function. Overloading this method in a subclass provides yet another method to redirect output.

Utility functions

The functions below are not methods.

gen_package_name()
Generates a fresh package name in which we would compile a new stylesheet. Never returns the same name twice.

$nodeset = $xps->document( $uri )
Reads the XML document found at $uri, parses it and returns it as a nodeset.
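A counter hidden in a closure is enough to obtain this guarantee within one process. The following is a hypothetical implementation, and the package prefix is made up for the example:

```perl
use strict;
use warnings;

{
    my $counter = 0;   # private to the generator below

    # Hypothetical implementation; the prefix is made up.
    sub fresh_package_name {
        return 'My::Stylesheet::Package' . $counter++;
    }
}

my $first  = fresh_package_name();
my $second = fresh_package_name();
print "$first\n$second\n";
```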


AUTHORS

Created by Matt Sergeant <matt@sergeant.org>

Improvements and feature merge with Apache::AxKit::Language::XPathScript by Yanick Champoux <yanick@babyl.dyndns.org> and Dominique Quatravaux <dom@idealx.com>


LICENSE

This is free software. You may distribute it under the same terms as Perl itself.


SEE ALSO

The XPathScript Guide at

  http://axkit.org/wiki/view/AxKit/XPathScriptGuide

XPath documentation from W3C:

  http://www.w3.org/TR/xpath

Unicode character table:

  http://www.unicode.org/charts/charindex.html
