Parse::Token - Definition of tokens used by Parse::Lex


        require 5.005;
        use Parse::Lex;
        @token = qw(
            ADDOP    [-+]
            INTEGER  [1-9][0-9]*
        );
        $lexer = Parse::Lex->new(@token);
        $lexer->from(\*DATA);    # read the text to analyze from the DATA section
        $content = $INTEGER->next;
        if ($INTEGER->status) {
          print "$content\n";
        }
        $content = $ADDOP->next;
        if ($ADDOP->status) {
          print "$content\n";
        }
        if ($INTEGER->isnext(\$content)) {
          print "$content\n";
        }
        __END__
        1+2


The Parse::Token class and its derived classes permit defining the tokens used by Parse::Lex or Parse::LexEvent.

Tokens can be created with the new() or factory() methods. The new() method of the Parse::Lex package also creates, indirectly, instances of the tokens to be recognized.

The next() and isnext() methods of the Parse::Token package make it possible to interface the lexical analyzer with a recursive-descent parser. For interfacing with byacc, see the Parse::YYLex package.
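To illustrate how next() and isnext() fit a recursive-descent parser, here is a minimal standalone sketch. It does not use Parse::Lex: MiniToken is a hypothetical stand-in class written only for this example, and its from() class method and shared buffer are assumptions, not the real implementation.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a token class; NOT Parse::Token itself.
package MiniToken;

my $buffer = '';    # shared input, playing the role of the lexer's stream

sub from { my ($class, $text) = @_; $buffer = $text; return; }

sub new {
    my ($class, $name, $regex) = @_;
    return bless { name => $name, regex => qr/\A\s*($regex)/, status => 0 }, $class;
}

# next: try to consume this token at the current position.
sub next {
    my ($self) = @_;
    if ($buffer =~ s/$self->{regex}//) {
        my $text = $1;
        $self->{status} = 1;
        return $text;
    }
    $self->{status} = 0;
    return;
}

# isnext: like next, but returns the status and stores the text
# through a scalar reference.
sub isnext {
    my ($self, $ref) = @_;
    my $text = $self->next;
    $$ref = $text if $self->{status} && ref $ref eq 'SCALAR';
    return $self->{status};
}

package main;

my $INTEGER = MiniToken->new(INTEGER => '[1-9][0-9]*');
my $ADDOP   = MiniToken->new(ADDOP   => '[-+]');

# Grammar rule: expr := INTEGER (ADDOP INTEGER)*
sub expr {
    my ($value, $op, $rhs);
    $INTEGER->isnext(\$value) or die "integer expected\n";
    while ($ADDOP->isnext(\$op)) {
        $INTEGER->isnext(\$rhs) or die "integer expected after '$op'\n";
        $value = $op eq '+' ? $value + $rhs : $value - $rhs;
    }
    return $value;
}

MiniToken->from('1 + 20 - 4');
my $result = expr();
print "$result\n";
```

Each grammar rule becomes one subroutine, and each token test becomes one isnext() call, which is what makes this style of interface convenient for hand-written parsers.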

Parse::Token is included indirectly by means of use Parse::Lex or use Parse::LexEvent.


action
Returns the anonymous subroutine defined within the Parse::Token object.

factory LIST
factory ARRAY_REF
The factory(LIST) method creates a list of tokens from a list of specifications, which include for each token: a name, a regular expression, and possibly an anonymous subroutine. The list can also include objects of class Parse::Token or of a class derived from it.

The factory(ARRAY_REF) method permits creating tokens from attribute-value specifications:

        Parse::Token->factory([Type => 'Simple', 
                               Name => 'EXAMPLE', 
                               Regex => '.+']);

Type indicates the class of each token to be created, given without its package prefix.

factory() creates a series of tokens but does not import these tokens into the calling package.

You could for example write:

        %keywords = 
          qw (
              PROC  undef
              FUNC  undef
              RETURN undef
              IF    undef
              ELSE  undef
              WHILE undef
              PRINT undef
              READ  undef
             );
        @tokens = Parse::Token->factory(%keywords);

and install these tokens in a symbol table in the following manner:

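        foreach $name (keys %keywords) {
          # shift (not pop) keeps each token paired with its name, since
          # %keywords was passed to factory() in the same order as keys %keywords
          ${$name} = shift @tokens;
          $symbol{"\L$name"} = [${$name}, ''];
        }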

${$name} is the token instance.

During the lexical analysis phase, you can use the tokens in the following manner:

        qw(IDENT [a-zA-Z][a-zA-Z0-9_]*),  sub {
           $symbol{$_[1]} = [] unless defined $symbol{$_[1]};
           my $type = $symbol{$_[1]}[0];
           $lexer->setToken((not defined $type) ? $VAR : $type);
           $_[1];  # THE TOKEN TEXT
        },

This permits indicating that any symbol of unknown type is a variable.

In this example we have used $_[1] which corresponds to the text recognized by the regular expression. This text associated with the token must be returned by the anonymous subroutine.
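The lookup logic above can be tried in isolation. The following is a minimal sketch in plain Perl, without Parse::Lex: the %symbol layout and the classify() helper are simplified assumptions made for this illustration, and token types are represented by plain strings rather than token instances.

```perl
use strict;
use warnings;

# Simplified, hypothetical symbol table: lowercase keyword => token type name.
my %symbol = map { ( lc($_) => $_ ) }
             qw(PROC FUNC RETURN IF ELSE WHILE PRINT READ);

# Classify one identifier: a known keyword keeps its own type,
# any symbol of unknown type is treated as a variable (VAR).
sub classify {
    my ($text) = @_;
    my $type = $symbol{$text};
    return defined $type ? $type : 'VAR';
}

my @types = map { classify($_) } qw(while counter print);
print "@types\n";
```

Keeping the keywords in a table rather than in separate regular expressions means a single IDENT pattern is enough, and new keywords only require a new table entry.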

get EXPR
get obtains the value of the attribute named by the result of evaluating EXPR. You can also use the name of the attribute as a method name.

getText
Returns the character string that was recognized by means of this Parse::Token object.

Same as the text() method.

isnext EXPR
Returns the status of the token. The consumed string is put into EXPR if it is a reference to a scalar.

name
Returns the name of the token.

next
Activates searching for the lexeme defined by the regular expression contained in the object. If this lexeme is recognized on the character stream to analyze, next returns the string found and sets the status of the object to true.

new SYMBOL_NAME, REGEXP, SUB
new SYMBOL_NAME, REGEXP
Creates an object of type Parse::Token::Simple or Parse::Token::Segmented. The arguments of the new() method are, respectively: a symbolic name, a regular expression, and possibly an anonymous subroutine. The subclasses of Parse::Token permit specifying tokens by means of a list of attribute-values.

REGEXP is either a simple regular expression or a reference to an array containing from one to three regular expressions. In the first case, the instance belongs to the Parse::Token::Simple class. In the second case, the instance belongs to the Parse::Token::Segmented class. Tokens of this type can recognize structures such as character strings delimited by quotation marks, comments in a C program, etc. The regular expressions are used to recognize:

1. The beginning of the lexeme,

2. The "body" of the lexeme; if this second expression is missing, Parse::Lex uses "(?:.*?)",

3. The end of the lexeme; if this last expression is missing, then the first one is used. (Note: the end of the lexeme cannot span several lines.)
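The defaults listed above can be sketched in plain Perl. The segmented_regex() helper below is hypothetical and is not part of Parse::Lex; it simply joins the parts into one pattern, which is an assumption about the general idea and not a description of how the module works internally.

```perl
use strict;
use warnings;

# Hypothetical helper: build one pattern from 1 to 3 parts,
# applying the defaults described in the list above.
sub segmented_regex {
    my ($start, $body, $end) = @_;
    $body = '(?:.*?)' unless defined $body;    # default body
    $end  = $start    unless defined $end;     # default end: same as the start
    return qr/$start$body$end/;
}

my $quoted = segmented_regex('"');                          # start only
my $cmt    = segmented_regex('/[*]', '(?:.|\n)*?', '[*]/'); # all three parts

my ($q) = 'a "b c" d'       =~ /($quoted)/;
my ($c) = "x /* 1\n2 */ y"  =~ /($cmt)/;
print "$q\n$c\n";
```

With only a start expression, the same pattern serves as the terminator, which is the natural choice for quote-delimited strings.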


          qw(STRING), [qw(" (?:[^"\\\\]+|\\\\(?:.|\n))* ")],

These regular expressions can recognize multi-line strings delimited by quotation marks, where the backslash is used to quote the quotation marks appearing within the string. Notice the quadrupling of the backslash.

Here is a variation of the previous example which uses the s option to include newline in the characters recognized by ".":

          qw(STRING), [qw(" (?s:[^"\\\\]+|\\\\.)* ")],

(Note: it is possible to write regular expressions which are more efficient in terms of execution time, but this is not our objective with this example. See Mastering Regular Expressions.)
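To see why the backslash is quadrupled, the three expressions can be checked in plain Perl without Parse::Lex. Inside qw(), as in a single-quoted string, each pair of backslashes collapses to one, so four backslashes in the source become the two that the regular expression needs to match one literal backslash. Joining the parts into a single pattern, as done here, is an assumption made for the illustration.

```perl
use strict;
use warnings;

# Same three expressions as in the STRING example: start, body, end.
my ($start, $body, $end) = qw(" (?:[^"\\\\]+|\\\\(?:.|\n))* ");

# After qw(), the body holds two backslashes where four were written.
print "$body\n";

my $string_re = qr/$start$body$end/;

# Input with escaped quotation marks and an embedded newline.
my $input = qq{say "a \\"quoted\\" word\nsecond line" done};
my ($lexeme) = $input =~ /($string_re)/;
print "$lexeme\n";
```

The match covers the whole string, including the escaped quotation marks and the line break inside it.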

The anonymous subroutine is called when the lexeme is recognized by the lexical analyzer. This subroutine takes two arguments: $_[0] contains the token instance, and $_[1] contains the string recognized by the regular expression. The scalar returned by the anonymous subroutine defines the character string memorized in the token instance.

In the anonymous subroutine you can use the positional variables $1, $2, etc. which correspond to the groups of parentheses in the regular expression.
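The calling convention can be mimicked in a few lines of plain Perl. In this sketch the token is a simple hash and recognize() is a hypothetical helper, neither of which comes from Parse::Lex. It shows the two arguments passed to the subroutine, the use of $1 inside it, and the returned scalar becoming the text stored in the token.

```perl
use strict;
use warnings;

# Hypothetical token record (not a Parse::Token object).
my %STRING = (
    name   => 'STRING',
    regex  => qr/"((?:[^"\\]+|\\.)*)"/s,
    # $_[0] is the token, $_[1] the recognized text;
    # $1 is the first parenthesized group: the string without its quotes.
    action => sub { my ($token, $matched) = @_; return $1 },
);

# Hypothetical helper: match at the start of the input, call the action,
# and memorize whatever the action returns.
sub recognize {
    my ($token, $input) = @_;
    return unless $input =~ /\A$token->{regex}/;
    my $matched = substr($input, $-[0], $+[0] - $-[0]);
    $token->{text} = $token->{action}
                   ? $token->{action}->($token, $matched)
                   : $matched;
    return $token->{text};
}

my $text = recognize(\%STRING, '"hello world" rest');
print "$text\n";
```

The capture variables remain visible in the subroutine because they are dynamically scoped and the subroutine performs no match of its own.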

regexp
Returns the regular expression of the Token object.

set LIST
Allows marking a token with a list of attribute-value pairs.

An attribute name can be used as a method name.

setText EXPR
The value of EXPR defines the character string associated with the lexeme.

Same as the text(EXPR) method.

status EXPR
Indicates if the last search of the lexeme succeeded or failed. status EXPR overrides the existing value and sets it to the value of EXPR.

text EXPR
text() returns the character string recognized by means of the token. The value of EXPR sets the character string associated with the lexeme.

trace OUTPUT
Class method which activates/deactivates a trace of the lexical analysis.

OUTPUT can be a file name or a reference to a filehandle to which the trace will be directed.

Subclasses of Parse::Token

Subclasses of the Parse::Token class are being defined. They permit recognizing specific structures such as, for example, strings within double-quotes, C comments, etc. Here are the subclasses which I am working on:

Parse::Token::Simple : tokens of this class are defined by means of a single regular expression.

Parse::Token::Segmented : tokens of this class are defined by means of three regular expressions. Reading of new data is done automatically.

Parse::Token::Delimited : permits recognizing, for example, C language comments.

Parse::Token::Quoted : permits recognizing, for example, character strings within quotation marks.

Parse::Token::Nested : permits recognizing nested structures such as parenthesized expressions. NOT DEFINED.

These classes are recently created and no doubt contain some bugs.


Parse::Token::Action

Tokens of the Parse::Token::Action class permit inserting arbitrary Perl expressions within a lexical analyzer. An expression can be used, for instance, to print out internal variables of the analyzer.

The class constructor accepts attributes, including the Name and Expr attributes used in this example:

        $ACTION = new Parse::Token::Action(
                                      Name => 'ACTION',
                                      Expr => q!print "LEX_POS: $LEX_POS\n" .
                                      "LEX_BUFFER: $LEX_BUFFER\n" .
                                      "LEX_LENGTH: $LEX_LENGTH\n" .
                                      "LEX_RECORD: $LEX_RECORD\n" .
                                      "LEX_OFFSET: $LEX_OFFSET\n";!,
                                     );


Parse::Token::Simple

The class constructor accepts attributes, including the Name, Regex and ReadMore attributes used in this example:

        new Parse::Token::Simple(Name => 'remainder', Regex => '[^/\'\"]+', ReadMore => 1);


Parse::Token::Segmented

The definition of these tokens includes three regular expressions. During analysis of a data stream, new data is read as long as the end of the token has not been reached.

The class constructor accepts the following attributes:


Parse::Token::Quoted

Parse::Token::Quoted is a subclass of Parse::Token::Segmented. It permits recognizing character strings within double quotes or single quotes.


       Start    End            Escaping
        '        '              ''
        "        "              ""
        "        "              \

The class constructor accepts attributes, including the Name, Handler, Escape and Quote attributes used in this example:

        new Parse::Token::Quoted(Name => 'squotes', Handler => 'string', Escape => '\\', Quote => qq!\'!);
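The two escaping conventions in the table above, doubling the quote character and prefixing it with a backslash, can be compared using ordinary regular expressions. This is a plain Perl sketch that does not use Parse::Token::Quoted; the two patterns are written for this illustration only.

```perl
use strict;
use warnings;

# Escape by doubling the quote character: '' inside '...'
my $doubled   = qr/'(?:[^']|'')*'/;

# Escape with a backslash: \" inside "..."
my $backslash = qr/"(?:[^"\\]|\\.)*"/s;

my ($sq) = q{x = 'it''s here';}     =~ /($doubled)/;
my ($dq) = q{y = "say \"hi\" now";} =~ /($backslash)/;
print "$sq\n$dq\n";
```

In both cases the escaped quote is consumed as part of the body, so it does not terminate the string early.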


Parse::Token::Delimited

Parse::Token::Delimited is a subclass of Parse::Token::Segmented. It permits, for example, recognizing C language comments.


        Start   End     Constraint
                        on the contents
        /*       */                         C Comment
        <!--     -->      No '--'           XML Comment
        <!--     -->                        SGML Comment
        <?       ?>                         Processing instruction
                                            in SGML/XML

The class constructor accepts attributes, including the Name, Start and End attributes used in this example:

        new Parse::Token::Delimited(Name => 'comment', Start => '/[*]', End => '[*]/' );
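The Start and End patterns from this example can be tried directly in plain Perl. The sketch below does not use Parse::Token::Delimited; combining the two patterns with a non-greedy ".*?" body under the s option is an assumption made for the illustration.

```perl
use strict;
use warnings;

# Same delimiter patterns as in the example above.
my ($start, $end) = ('/[*]', '[*]/');

# Non-greedy body; the s option lets "." cross line boundaries.
my $comment = qr/${start}.*?${end}/s;

my $source   = "int x; /* first\n   comment */ int y; /* second */";
my @comments = $source =~ /($comment)/g;
print scalar(@comments), "\n";
print "$_\n" for @comments;
```

The non-greedy body matters here: a greedy ".*" would swallow everything from the first "/*" to the last "*/" as one comment.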

Parse::Token::Nested - Not defined


        Start   End
        (        )                      Symbolic Expressions
        {        }                      Rich Text Format Groups
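Since this class is not defined, the following is only a sketch of how balanced delimiters could be matched in plain Perl, independently of Parse::Lex. It relies on recursive patterns and possessive quantifiers, which require Perl 5.10 or later, so it goes beyond the 5.005 requirement stated for the module.

```perl
use strict;
use warnings;

# Group 1 matches "(" ... ")" and recurses into itself through (?1),
# so inner parenthesized expressions are matched as part of the outer one.
my $input = '(a (b c) (d (e))) tail';
my ($nested) = $input =~ /(\((?:[^()]++|(?1))*\))/;
print "$nested\n";
```

A pair of fixed start and end patterns cannot express this, because the end delimiter must be matched against the correct nesting level.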


The implementation of subclasses of tokens is not complete for analyzers of the Parse::CLex class. I am not too keen to do it, since an implementation for classes Parse::Lex and Parse::LexEvent seems quite sufficient.


Philippe Verdret. Documentation translated to English by Vladimir Alexiev and Ocrat.


Version 2.0 owes much to suggestions made by Vladimir Alexiev. Ocrat has significantly contributed to improving this documentation. Thanks also to the numerous persons who have made comments or sometimes sent bug fixes.


Friedl, J.E.F. Mastering Regular Expressions. O'Reilly & Associates 1996.

Mason, T. & Brown, D. Lex & Yacc. O'Reilly & Associates 1990.


Copyright (c) 1995-1999 Philippe Verdret. All rights reserved. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
