Parse::Token - Definition of tokens used by Parse::Lex
require 5.005;

use Parse::Lex;
@token = qw(
    ADDOP    [-+]
    INTEGER  [1-9][0-9]*
);

$lexer = Parse::Lex->new(@token);
$lexer->from(\*DATA);

$content = $INTEGER->next;
if ($INTEGER->status) {
    print "$content\n";
}
$content = $ADDOP->next;
if ($ADDOP->status) {
    print "$content\n";
}
if ($INTEGER->isnext(\$content)) {
    print "$content\n";
}
__END__
1+2
The Parse::Token class and its derived classes permit defining the tokens used by Parse::Lex or Parse::LexEvent.
Tokens can be created by means of the new() or factory() methods. The Lex::new() method of the Parse::Lex package indirectly creates instances of the tokens to be recognized.
The next() and isnext() methods of the Parse::Token package permit interfacing the lexical analyzer with a recursive-descent syntactic analyzer. For interfacing with byacc, see the Parse::YYLex package.
Parse::Token is included indirectly by means of use Parse::Lex or use Parse::LexEvent.
Parse::Token
object.
The factory(LIST) method creates a list of tokens from a list of specifications, which include for each token: a name, a regular expression, and possibly an anonymous subroutine. The list can also include objects of class Parse::Token or of a class derived from it.
The factory(ARRAY_REF) method permits creating tokens from specifications given as attribute-value pairs:
Parse::Token->factory([Type => 'Simple', Name => 'EXAMPLE', Regex => '.+']);
Type indicates the type of each token to be created (the package prefix is not indicated).
factory() creates a series of tokens but does not import these tokens into the calling package.
You could for example write:
%keywords = qw (
    PROC    undef
    FUNC    undef
    RETURN  undef
    IF      undef
    ELSE    undef
    WHILE   undef
    PRINT   undef
    READ    undef
);
@tokens = Parse::Token->factory(%keywords);
and install these tokens in a symbol table in the following manner:
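foreach $name (keys %keywords) {
    # shift, not pop: this assumes factory() returns the tokens in the
    # order of the specifications, which is the order keys() yields here
    ${$name} = shift @tokens;
    $symbol{"\L$name"} = [${$name}, ''];
}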
${$name} is the token instance.
During the lexical analysis phase, you can use the tokens in the following manner:
qw(IDENT [a-zA-Z][a-zA-Z0-9_]*), sub {
    $symbol{$_[1]} = [] unless defined $symbol{$_[1]};
    my $type = $symbol{$_[1]}[0];
    $lexer->setToken((not defined $type) ? $VAR : $type);
    $_[1]; # THE TOKEN TEXT
}
This treats any symbol of unknown type as a variable. In this example we have used $_[1], which corresponds to the text recognized by the regular expression. This text, associated with the token, must be returned by the anonymous subroutine.
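The lookup performed by the anonymous subroutine can be tried out in isolation. The sketch below is plain Perl and does not load Parse::Lex: the strings T_IF, T_WHILE, and T_VAR are hypothetical stand-ins for token instances, and classify() is a made-up helper that only mirrors the subroutine's logic.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for token instances; in a real analyzer these
# would be the objects returned by Parse::Token->factory().
my %symbol = (
    if    => ['T_IF',    ''],
    while => ['T_WHILE', ''],
);
my $VAR = 'T_VAR';

# Mirror of the anonymous subroutine: look the text up, register unknown
# symbols with an empty entry, and fall back to $VAR when no type is known.
sub classify {
    my ($text) = @_;
    $symbol{$text} = [] unless defined $symbol{$text};
    my $type = $symbol{$text}[0];
    return defined $type ? $type : $VAR;
}

print "$_ -> ", classify($_), "\n" for qw(if count while);
```

An unknown identifier such as count is registered in %symbol on first sight, so later passes can attach a type to it.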
get obtains the value of the attribute named by the result of evaluating EXPR. You can also use the name of the attribute as a method name.
Parse::Token
object.
Same as the text() method.
next returns the string found and sets the status of the object to true.
new() creates an object of type Parse::Token::Simple or Parse::Token::Segmented. The arguments of the new() method are, respectively: a symbolic name, a regular expression, and possibly an anonymous subroutine. The subclasses of Parse::Token permit specifying tokens by means of a list of attribute-value pairs.
REGEXP is either a simple regular expression, or a reference to an array containing from one to three regular expressions. In the first case, the instance belongs to the Parse::Token::Simple class. In the second case, the instance belongs to the Parse::Token::Segmented class. Tokens of this type permit recognizing structures such as character strings delimited by quotation marks, comments in a C program, etc. The regular expressions are used to recognize:
1. The beginning of the lexeme.
2. The "body" of the lexeme; if this second expression is missing, Parse::Lex uses "(?:.*?)".
3. The end of the lexeme; if this last expression is missing, then the first one is used. (Note: the end of the lexeme cannot span several lines.)
Example:
qw(STRING), [qw(" (?:[^"\\\\]+|\\\\(?:.|\n))* ")],
These regular expressions can recognize multi-line strings delimited by quotation marks, where the backslash is used to quote the quotation marks appearing within the string. Notice the quadrupling of the backslash.
Here is a variation of the previous example which uses the s option to include newline in the characters recognized by ".":
qw(STRING), [qw(" (?s:[^"\\\\]+|\\\\.)* ")],
(Note: it is possible to write regular expressions which are more efficient in terms of execution time, but this is not our objective with this example. See Mastering Regular Expressions.)
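The effect of the quadrupled backslash can be checked without Parse::Lex. The following sketch is plain Perl: it takes the three expressions of the first STRING example, joins them into one pattern by hand (an assumption made for illustration, not the module's own matching code), and applies the result to a hypothetical input line.

```perl
use strict;
use warnings;

# The three parts exactly as written in the first STRING example. Inside
# qw(), each \\\\ collapses to two backslashes, which the regex engine
# then reads as one literal backslash.
my ($start, $body, $end) = qw(" (?:[^"\\\\]+|\\\\(?:.|\n))* ");

# Hypothetical input: a quoted string containing escaped quotation marks.
my $input = 'say "a \"quoted\" word" now';

# Concatenate start, body, and end into a single capturing pattern.
my ($lexeme) = $input =~ /($start$body$end)/;

print "body regex: $body\n";
print "lexeme:     $lexeme\n";
```

Printing $body shows the pattern as the regex engine receives it, with only two backslashes at each place where the source has four.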
The anonymous subroutine is called when the lexeme is recognized by the lexical analyzer. This subroutine takes two arguments: $_[0] contains the token instance, and $_[1] contains the string recognized by the regular expression. The scalar returned by the anonymous subroutine defines the character string memorized in the token instance.
In the anonymous subroutine you can use the positional variables $1, $2, etc., which correspond to the groups of parentheses in the regular expression.
Token
object.
An attribute name can be used as a method name.
EXPR defines the character string associated with the lexeme.
Same as the text(EXPR) method.
status EXPR overrides the existing value and sets it to the value of EXPR.
text() returns the character string recognized by means of the token. The value of EXPR sets the character string associated with the lexeme.
OUTPUT can be a file name or a reference to a filehandle to which the trace will be directed.
Subclasses of the Parse::Token class are being defined. They permit recognizing specific structures such as, for example, strings within double quotes, C comments, etc. Here are the subclasses which I am working on:
Parse::Token::Simple: tokens of this class are defined by means of a single regular expression.
Parse::Token::Segmented: tokens of this class are defined by means of three regular expressions. Reading of new data is done automatically.
Parse::Token::Delimited: permits recognizing, for example, C language comments.
Parse::Token::Quoted: permits recognizing, for example, character strings within quotation marks.
Parse::Token::Nested: permits recognizing nested structures such as parenthesized expressions. NOT DEFINED.
These classes are recently created and no doubt contain some bugs.
Tokens of the Parse::Token::Action class permit inserting arbitrary Perl expressions within a lexical analyzer. An expression can be used, for instance, to print out internal variables of the analyzer:
$LEX_BUFFER: contents of the buffer to be analyzed.
$LEX_LENGTH: length of the character string being analyzed.
$LEX_RECORD: number of the record being analyzed.
$LEX_OFFSET: number of characters already consumed since the start of the analysis.
$LEX_POS: position reached by the analysis, as a number of characters since the start of the buffer.
The class constructor accepts the following attributes:
Name: the name of the token.
Expr: a Perl expression.
Example:
$ACTION = new Parse::Token::Action(
    Name => 'ACTION',
    Expr => q!print "LEX_POS: $LEX_POS\n" .
             "LEX_BUFFER: $LEX_BUFFER\n" .
             "LEX_LENGTH: $LEX_LENGTH\n" .
             "LEX_RECORD: $LEX_RECORD\n" .
             "LEX_OFFSET: $LEX_OFFSET\n"
             ;!,
);
The class constructor accepts the following attributes:
Handler: the value indicates the name of a function to call during an analysis performed by an analyzer of class Parse::LexEvent.
Name: the associated value is the name of the token.
Regex: the associated value is a regular expression corresponding to the pattern to be recognized.
ReadMore: if the associated value is 1, the recognition of the token continues after reading a new record. The strings recognized are concatenated. This attribute only has effect during analysis of a character stream.
Sub: the associated value must be an anonymous subroutine to be executed after the token is recognized. This function is only used with analyzers of class Parse::Lex or Parse::CLex.
Example:
new Parse::Token::Simple(
    Name     => 'remainder',
    Regex    => '[^/\'\"]+',
    ReadMore => 1,
);
The definition of these tokens includes three regular expressions. During analysis of a data stream, new data is read as long as the end of the token has not been reached.
The class constructor accepts the following attributes:
Handler: the value indicates the name of a function to call during an analysis performed by an analyzer of class Parse::LexEvent.
Name: the associated value is the name of the token.
Regex: the associated value must be a reference to an array that contains three regular expressions.
Sub: the associated value must be an anonymous subroutine to be executed after the token is recognized. This function is only used with analyzers of class Parse::Lex or Parse::CLex.
Parse::Token::Quoted is a subclass of Parse::Token::Segmented. It permits recognizing character strings within double quotes or single quotes.
Examples.
---------------------------------------------------------
 Start    End    Escaping
---------------------------------------------------------
  '        '      ''
  "        "      ""
  "        "      \
---------------------------------------------------------
The class constructor accepts the following attributes:
End: the associated value is a regular expression permitting recognizing the end of the token.
Escape: the associated value indicates the character used to escape the delimiter. By default, a double occurrence of the terminating character escapes that character.
Handler: the value indicates the name of a function to be called during an analysis performed by an analyzer of class Parse::LexEvent.
Name: the associated value is the name of the token.
Start: the associated value is a regular expression permitting recognizing the start of the token.
Sub: the associated value must be an anonymous subroutine to be executed after the token is recognized. This function is only used with analyzers of class Parse::Lex or Parse::CLex.
Example:
new Parse::Token::Quoted(
    Name    => 'squotes',
    Handler => 'string',
    Escape  => '\\',
    Quote   => qq!\'!,
);
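The two escaping conventions listed in the table (a doubled delimiter, or a backslash) can be compared with ordinary regular expressions. The patterns below are written for illustration only and are not taken from the Parse::Token::Quoted implementation; the input strings are hypothetical.

```perl
use strict;
use warnings;

# Doubled-delimiter escaping: '' inside '...' stands for one quote.
my $doubled = qr/'(?:[^']|'')*'/;

# Backslash escaping: \' inside '...' stands for one quote.
my $backslash = qr/'(?:[^'\\]|\\.)*'/s;

# q{} keeps the backslash before the quote as a literal character.
my ($doubled_match)   = q{x = 'it''s' ;} =~ /($doubled)/;
my ($backslash_match) = q{x = 'it\'s' ;} =~ /($backslash)/;

print "doubled:   $doubled_match\n";
print "backslash: $backslash_match\n";
```

In both cases the match runs to the final quotation mark, because the escaped delimiter inside the string is consumed as part of the body.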
Parse::Token::Delimited is a subclass of Parse::Token::Segmented. It permits, for example, recognizing C language comments.
Examples.
---------------------------------------------------------
 Start    End    Constraint on the contents
---------------------------------------------------------
  /*       */                      C Comment
  <!--     -->    No '--'          XML Comment
  <!--     -->                     SGML Comment
  <?       ?>                      Processing instruction in SGML/XML
---------------------------------------------------------
The class constructor accepts the following attributes:
End: the associated value is a regular expression permitting recognizing the end of the token.
Handler: the value indicates the name of a function to be called during an analysis performed by an analyzer of class Parse::LexEvent.
Name: the associated value is the name of the token.
Start: the associated value is a regular expression permitting recognizing the start of the token.
Sub: the associated value must be an anonymous subroutine to be executed after the token is recognized. This function is only used with analyzers of class Parse::Lex or Parse::CLex.
Example:
new Parse::Token::Delimited(
    Name  => 'comment',
    Start => '/[*]',
    End   => '[*]/',
);
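To see what the Start and End expressions of this example select, they can be joined by hand with the default body "(?:.*?)" that the documentation gives for segmented tokens. This is a plain Perl sketch over a single hypothetical line of input; it does not reproduce the reading of new records that the module performs for comments spanning several lines.

```perl
use strict;
use warnings;

# Start and End as in the 'comment' example; the body is the documented
# default for a missing second expression.
my ($start, $body, $end) = ('/[*]', '(?:.*?)', '[*]/');

# Hypothetical one-line input holding two comments.
my $source = 'x = 1; /* set x */ y = 2; /* set y */';

# The non-greedy body stops at the first End, so each comment is
# returned separately.
my @comments = $source =~ /($start$body$end)/g;

print "$_\n" for @comments;
```

The character classes [*] avoid having to escape the asterisk, which keeps the expressions readable when they are written inside quoted strings.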
Examples.
----------------------------------------------------------
 Start    End
----------------------------------------------------------
  (        )      Symbolic Expressions
  {        }      Rich Text Format Groups
----------------------------------------------------------
The implementation of subclasses of tokens is not complete for analyzers of the Parse::CLex class. I am not too keen to do it, since an implementation for the classes Parse::Lex and Parse::LexEvent seems quite sufficient.
Philippe Verdret. Documentation translated to English by Vladimir Alexiev and Ocrat.
Version 2.0 owes much to suggestions made by Vladimir Alexiev. Ocrat has significantly contributed to improving this documentation. Thanks also to the numerous persons who have made comments or sometimes sent bug fixes.
Friedl, J.E.F. Mastering Regular Expressions. O'Reilly & Associates, 1996.
Mason, T. & Brown, D. Lex & Yacc. O'Reilly & Associates, Inc., 1990.
Copyright (c) 1995-1999 Philippe Verdret. All rights reserved. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.