Parse::Lex - Generator of lexical analyzers
  require 5.005;

  use Parse::Lex;
  @token = (
    qw(
       ADDOP    [-+]
       LEFTP    [\(]
       RIGHTP   [\)]
       INTEGER  [1-9][0-9]*
       NEWLINE  \n
      ),
    qw(STRING), [qw(" (?:[^"]+|"")* ")],
    qw(ERROR  .*), sub {
      die qq!can\'t analyze: "$_[1]"!;
    }
  );

  Parse::Lex->trace;  # Class method
  $lexer = Parse::Lex->new(@token);
  $lexer->from(\*DATA);
  print "Tokenization of DATA:\n";

  TOKEN:while (1) {
    $token = $lexer->next;
    if (not $lexer->eoi) {
      print "Line $.\t";
      print "Type: ", $token->name, "\t";
      print "Content:->", $token->text, "<-\n";
    } else {
      last TOKEN;
    }
  }

  __END__
  1+2-5
  "a multiline
  string with an embedded "" in it"
  an invalid string with a "" in it"
The classes Parse::Lex and Parse::CLex create lexical analyzers. They use different analysis techniques:

1. Parse::Lex steps through the analysis by moving a pointer within the character strings to be analyzed (using pos() together with \G),

2. Parse::CLex steps through the analysis by consuming the data recognized (using s///).
Analyzers of the Parse::CLex class do not allow the use of anchoring in regular expressions. In addition, the subclasses of Parse::Token are not implemented for this type of analyzer.
A lexical analyzer is specified by means of a list of tokens passed as arguments to the new() method. Tokens are instances of the Parse::Token class, which comes with Parse::Lex. The definition of a token usually comprises two arguments: a symbolic name (like INTEGER), followed by a regular expression. If a reference to an anonymous subroutine is given as the third argument, the subroutine is called when the token is recognized. Its arguments are the Parse::Token instance and the string recognized by the regular expression. The anonymous subroutine's return value is used as the new string contents of the Parse::Token instance.
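For example, here is a minimal sketch (the NUMBER token and the normalization it performs are illustrative) in which the callback's return value replaces the token contents:

  use Parse::Lex;

  # The callback receives the Parse::Token instance and the matched
  # string; its return value becomes the new contents of the token.
  my $lexer = Parse::Lex->new(
    'NUMBER' => '0*\d+', sub {
      my ($token, $text) = @_;
      $text =~ s/^0+(?=\d)//;   # strip leading zeros
      $text;
    },
  );
  $lexer->from("007");
  print $lexer->next->text, "\n";   # prints "7"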
The order in which the lexical analyzer examines the regular expressions is determined by the order in which these expressions are passed as arguments to the new() method. The token returned by the lexical analyzer corresponds to the first regular expression which matches (this strategy differs from that of Lex, which returns the longest match possible out of all that can be recognized).
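A sketch illustrating the difference (the token names are illustrative): with the rules below, the input "ifdef" is split into a KEYWORD and an IDENT, whereas a longest-match analyzer would return a single identifier:

  use Parse::Lex;

  # First match wins: KEYWORD is tried first and matches the "if"
  # prefix of "ifdef"; IDENT then matches the remaining "def".
  my $lexer = Parse::Lex->new(
    'KEYWORD' => 'if|while',
    'IDENT'   => '[A-Za-z_]\w*',
  );
  $lexer->from("ifdef");
  my $t = $lexer->next;
  print $t->name, " ", $t->text, "\n";   # KEYWORD if
  $t = $lexer->next;
  print $t->name, " ", $t->text, "\n";   # IDENT def

Writing the keyword pattern as 'if\b|while\b' avoids matching the prefix of a longer identifier.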
The lexical analyzer can recognize tokens which span multiple records. If the definition of the token comprises more than one regular expression (placed within a reference to an anonymous array), the analyzer reads as many records as required to recognize the token (see the documentation for the Parse::Token class). When the start pattern is found, the analyzer looks for the end, and if necessary reads more records. No backtracking is done in case of failure.
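As a sketch modeled on the STRING token of the synopsis (the COMMENT token is illustrative), a C-style comment spanning several records can be defined with three sub-patterns, start, body and end, placed in an anonymous array:

  use Parse::Lex;

  # The analyzer finds the start pattern, then reads as many records
  # as needed until the end pattern matches.
  my $lexer = Parse::Lex->new(
    'COMMENT' => ['/\*', '(?:[^*]|\*(?!/))*', '\*/'],
  );
  $lexer->from("/* a comment\nspanning two records */");
  print $lexer->next->text, "\n";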
The analyzer can be used to analyze an isolated character string or
a stream of data coming from a file handle. At the end of the input
data the analyzer returns a Parse::Token
instance named
EOI
(End Of Input).
You can associate start conditions with the token-recognition rules that comprise your lexical analyzer (this is similar to what Flex provides). When start conditions are used, the rule which succeeds is no longer necessarily the first rule that matches.
A token symbol may be preceded by a start condition specifier for the associated recognition rule. For example:
  qw(C1:TERMINAL_1  REGEXP), sub {
    # associated action
  },
  qw(TERMINAL_2  REGEXP), sub {
    # associated action
  },
Symbol TERMINAL_1 will be recognized only if start condition C1 is active. Start conditions are activated/deactivated using the start(CONDITION_NAME) and end(CONDITION_NAME) methods. start('INITIAL') resets the analysis automaton.
Start conditions can be combined using AND/OR operators as follows:

  C1:SYMBOL       condition C1
  C1:C2:SYMBOL    condition C1 AND condition C2
  C1,C2:SYMBOL    condition C1 OR condition C2
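For example (a sketch; the conditions A and B and the token names are illustrative):

  use Parse::Lex;

  my @token = (
    'A:B:HEADER' => '#\w+',   # active only when A AND B are active
    'A,B:NUMBER' => '\d+',    # active when A OR B is active
    'WORD'       => '\w+',    # unconditional rule
  );
  Parse::Lex->inclusive('A');
  Parse::Lex->inclusive('B');
  my $lexer = Parse::Lex->new(@token);
  $lexer->start('A');   # NUMBER becomes active; HEADER still requires B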
There are two types of start conditions: inclusive and exclusive, which are declared by the class methods inclusive() and exclusive() respectively. With an inclusive start condition, all rules are active regardless of whether or not they are qualified with the start condition. With an exclusive start condition, only the rules qualified with the start condition are active; all other rules are deactivated.

Example (borrowed from the documentation of Flex):
  use Parse::Lex;
  @token = (
    'EXPECT', 'expect-floats', sub {
      $lexer->start('expect');
      $_[1]
    },
    'expect:FLOAT', '\d+\.\d+', sub {
      print "found a float: $_[1]\n";
      $_[1]
    },
    'expect:NEWLINE', '\n', sub {
      $lexer->end('expect');
      $_[1]
    },
    'NEWLINE2', '\n',
    'INT', '\d+', sub {
      print "found an integer: $_[1] \n";
      $_[1]
    },
    'DOT', '\.', sub {
      print "found a dot\n";
      $_[1]
    },
  );

  Parse::Lex->exclusive('expect');
  $lexer = Parse::Lex->new(@token);
The special start condition ALL is always verified.
The following methods are available:

analyze(EXPR)

Analyzes EXPR and returns a list of pairs consisting of a token name followed by the recognized text. EXPR can be a character string or a reference to a filehandle.

Examples:

  @tokens = Parse::Lex->new(qw(PLUS [+] NUMBER \d+))->analyze("3+3+3");
  @tokens = Parse::Lex->new(qw(PLUS [+] NUMBER \d+))->analyze(\*STREAM);
buffer EXPR
buffer

Returns the contents of the internal buffer of the lexical analyzer. With an expression as argument, sets the buffer contents. It is not advisable to directly change the contents of the buffer without also adjusting the position of the analysis pointer (pos()) and the length of the buffer (length()).
configure(HASH)

Instance method which permits specifying a lexical analyzer. This method accepts the following attribute-value pairs:

From => EXPR

This attribute plays the same role as the from(EXPR) method. EXPR can be a filehandle or a character string.

Tokens => ARRAY_REF

ARRAY_REF must contain the list of attribute values specifying the tokens to be recognized (see the documentation for Parse::Token).

Skip => REGEX

This attribute plays the same role as the skip(REGEX) method. REGEX describes the patterns to skip over during the analysis.
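A minimal sketch using these attributes (the token list and input are illustrative):

  use Parse::Lex;

  # Equivalent to passing the tokens to new(), then calling
  # from() and skip() on the analyzer.
  my $lexer = Parse::Lex->new();
  $lexer->configure(
    From   => "10, 20, 30",
    Tokens => [ 'COMMA' => ',', 'INTEGER' => '\d+' ],
    Skip   => '\s+',
  );
  print $lexer->next->text, "\n";   # prints "10"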
end EXPR

Deactivates condition EXPR.
every(SUB)

Avoids having to write a reading loop in order to analyze a stream of data. SUB is an anonymous subroutine executed after the recognition of each token. For example, to lex the string "1+2" you can write:

  use Parse::Lex;

  $lexer = Parse::Lex->new(
    qw(
       ADDOP   [-+]
       INTEGER \d+
      ));

  $lexer->from("1+2");
  $lexer->every (sub {
    print $_[0]->name, "\t";
    print $_[0]->text, "\n";
  });

The first argument of the anonymous subroutine is the Parse::Token instance recognized.
flush

If saving of consumed strings has been activated with hold(), flush() returns and clears the buffer containing the character strings recognized up to now.
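A sketch, assuming hold() accepts a boolean argument as described under hold below:

  use Parse::Lex;

  my $lexer = Parse::Lex->new('WORD' => '\w+');
  $lexer->hold(1);             # activate saving of consumed strings
  $lexer->from("alpha beta");
  $lexer->next;                # consume "alpha"
  $lexer->next;                # consume "beta"
  print $lexer->flush, "\n";   # print the text consumed so far, then clear it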
from(EXPR)

Specifies the source of the data to be analyzed. The argument of this method can be a string (or list of strings), or a reference to a filehandle. If no argument is given, from() returns the filehandle if defined, or undef if the input is a string. When an argument EXPR is used, the return value is the calling lexer object itself. By default it is assumed that data are read from STDIN.

Examples:

  $handle = new IO::File;
  $handle->open("< filename");
  $lexer->from($handle);

  $lexer->from(\*DATA);
  $lexer->from('the data to be analyzed');
getSub

Returns the anonymous subroutine that performs the lexical analysis.

Example:

  my $token = '';
  my $sub = $lexer->getSub;
  while (($token = &$sub()) ne $Token::EOI) {
    print $token->name, "\t";
    print $token->text, "\n";
  }

  # or

  my $token = '';
  local *tokenizer = $lexer->getSub;
  while (($token = tokenizer()) ne $Token::EOI) {
    print $token->name, "\t";
    print $token->text, "\n";
  }
getToken

Same as the token() method.

hold EXPR
hold

Activates/deactivates saving of the consumed strings. The return value is the current setting (TRUE or FALSE). Can be used as a class method. You can obtain the contents of the buffer using the flush method, which also empties the buffer.
length EXPR
length

Returns the length of the current record. length EXPR sets the length of the current record.
line EXPR
line

Returns the current line number. line EXPR sets the value of the line number. Always returns 1 if a character string is being analyzed. The readline() method increments the line number.
name EXPR

Lets you give a name to the lexical analyzer. name() returns the value of this name.
next

Causes searching for the next token. Returns the recognized Parse::Token instance. Returns the Token::EOI instance at the end of the data.

Examples:

  $lexer = Parse::Lex->new(@token);
  print $lexer->next->name;   # print the token type
  print $lexer->next->text;   # print the token content
nextis(SCALAR_REF)

Variant of the next() method. Tokens are placed in SCALAR_REF. The method returns 1 as long as the token is not EOI.

Example:

  while ($lexer->nextis(\$token)) {
    print $token->text();
  }
new LIST

Creates and returns a new lexical analyzer. The argument of the method is a list of Parse::Token instances, or a list of triplets permitting their creation. The triplets consist of: the symbolic name of the token, the regular expression necessary for its recognition, and possibly an anonymous subroutine that is called when the token is recognized. For each triplet, an instance of type Parse::Token is created in the calling package.
pos EXPR
pos

pos EXPR sets the position of the beginning of the next token to be recognized in the current line (this doesn't work with analyzers of the Parse::CLex class). pos() returns the number of characters already consumed in the current line.
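For example (a sketch):

  use Parse::Lex;

  my $lexer = Parse::Lex->new('INTEGER' => '\d+');
  $lexer->from("123 456");
  $lexer->next;              # consume "123"
  print $lexer->pos, "\n";   # characters consumed so far in this line: 3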
readline

Reads data from the input specified by the from() method. Returns the result of the reading.

Example:

  use Parse::Lex;

  $lexer = Parse::Lex->new();
  while (not $lexer->eoi) {
    print $lexer->readline()   # read and print one line
  }
restart

Reinitializes the analysis automaton. The only active condition becomes the condition INITIAL.
setToken(TOKEN)

Sets the token to TOKEN. Useful to requalify a token inside the anonymous subroutine associated with this token.
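A sketch (the KEYWORD token, the reserved-word list, and the Parse::Token->new() call are illustrative; see the Parse::Token documentation for the exact constructor):

  use Parse::Lex;
  use Parse::Token;

  # Reclassify reserved words from within the IDENT callback.
  my %reserved = map { $_ => 1 } qw(if else while);
  my $KEYWORD  = Parse::Token->new('KEYWORD', 'if|else|while');  # assumed constructor: name, regexp
  my $lexer;
  $lexer = Parse::Lex->new(
    'IDENT' => '[A-Za-z_]\w*', sub {
      $lexer->setToken($KEYWORD) if $reserved{$_[1]};
      $_[1];   # keep the recognized text
    },
  );
  $lexer->from("while x");
  print $lexer->next->name, "\n";   # expected: KEYWORD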
skip EXPR
skip

EXPR is a regular expression defining the token separator pattern (by default [ \t]+). skip('') sets this to no pattern. With no argument, skip() returns the value of the pattern. skip() can be used as a class method. Changing the skip pattern causes recompilation of the lexical analyzer.
Example:
  Parse::Lex->skip('\s*#(?s:.*)|\s+');
  @tokens = Parse::Lex->new('INTEGER' => '\d+')->analyze(\*DATA);
  print "@tokens\n";   # prints INTEGER 1 INTEGER 2 INTEGER 3 INTEGER 4 EOI

  __END__
  1 # first string to skip
  2 3# second string to skip
  4
token

Returns the instance corresponding to the last recognized token. In case no token was recognized, returns the special token named DEFAULT.
tokenClass CLASS_NAME
tokenClass

Indicates the class of the tokens to be created from the list passed as argument to the new() method. If no argument is given, returns the name of the class. By default the class is Parse::Token.
trace OUTPUT
trace

Class method which activates trace mode. OUTPUT can be a file name or a reference to a filehandle where the trace will be redirected.
To handle cases of token non-recognition, you can define a specific token at the end of the list of tokens comprising your lexical analyzer. If the search for this token succeeds, it is then possible to call an error handling function:

  qw(ERROR  (?s:.*)), sub {
    print STDERR "ERROR: buffer content->", $_[0]->lexer->buffer, "<-\n";
    die qq!can\'t analyze: "$_[1]"!;
  }
Several examples are provided with the distribution:

ctokenizer.pl - Scan a stream of data using the Parse::CLex class.

tokenizer.pl - Scan a stream of data using the Parse::Lex class.

every.pl - Use of the every method.

sexp.pl - Interpreter for prefix arithmetic expressions.

sexpcond.pl - Interpreter for prefix arithmetic expressions, using conditions.
Analyzers of the Parse::CLex class do not allow the use of regular expressions with anchoring.
See also Parse::Token, Parse::LexEvent, Parse::YYLex.
Philippe Verdret. Documentation translated to English by Vladimir Alexiev and Ocrat.
Version 2.0 owes much to suggestions made by Vladimir Alexiev. Ocrat has significantly contributed to improving this documentation. Thanks also to the numerous people who have sent me bug reports and occasionally fixes.
Friedl, J.E.F. Mastering Regular Expressions. O'Reilly & Associates, 1996.

Mason, T. & Brown, D. Lex & Yacc. O'Reilly & Associates, 1990.

FLEX - A scanner generator (available at ftp://ftp.ee.lbl.gov/ and elsewhere).
Copyright (c) 1995-1999 Philippe Verdret. All rights reserved. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.