RTF::Tokenizer - Tokenize RTF
Tokenizes RTF
    use RTF::Tokenizer;

    # Create a tokenizer object
    my $tokenizer = RTF::Tokenizer->new();
    my $tokenizer = RTF::Tokenizer->new( string => '{\rtf1}' );
    my $tokenizer = RTF::Tokenizer->new( file => \*STDIN );
    my $tokenizer = RTF::Tokenizer->new( file => 'lala.rtf' );
    my $tokenizer = RTF::Tokenizer->new( file => 'lala.rtf', sloppy => 1 );
    my $tokenizer = RTF::Tokenizer->new( string => '{\rtf1}', note_escapes => 1 );

    # Populate it from a file
    $tokenizer->read_file('filename.txt');

    # Or a file handle
    $tokenizer->read_file( \*STDIN );

    # Or a string
    $tokenizer->read_string( '{\*\some rtf}' );

    # Get the first token
    my ( $token_type, $argument, $parameter ) = $tokenizer->get_token();

    # Oops, that was wrong...
    $tokenizer->put_token( 'control', 'b', 1 );
This documentation assumes some basic knowledge of RTF. If you lack that, go read The_RTF_Cookbook first.
new() returns a Tokenizer object. It is normally called with no arguments, but you can save yourself a call to read_file() or read_string() by passing new() a hash (well, a list really) containing either a 'file' or a 'string' key, whose value is what you would like passed to the respective routine. The examples in the synopsis make this much clearer than this description does :-)
As of version 1.04, we can also differentiate between control words and escapes. If you pass a note_escapes parameter with a true value, then escapes will have a token type of escape rather than control.
Versions 1.08 and above let you deal with a common RTF error that some programs insist on spitting out, rather than just panicking:

    \control1Plaintext

Here the control word's numeric parameter runs straight into plain text with no delimiter, which is nasty. To accept input like this, pass the 'sloppy' attribute with a true value to new(). You can also use the sloppy() method.
read_string() appends the string to the tokenizer object's buffer. (Earlier versions would overwrite the buffer; this version does not.)
read_file() appends a chunk of data from the filehandle to the buffer and remembers the filehandle, so that if you ask for a token and the buffer is empty, it will try to read the next line from the file. (Earlier versions would overwrite the buffer; this version does not.)
This chunk is 512 characters, plus whatever is left until the next occurrence of the input record separator (a newline character in this case). If, for whatever reason, you want to change that number, $self->{_INITIAL_READ} can be modified.
get_token()
Returns the next token as a three-item list: 'type', 'argument', 'parameter'.
The type is one of: text, control, group, escape or eof.
text
    \{, \}, and \\ are all returned as control words, rather than rendered as text for you, as are \_, \- and friends.
control
group
eof
escape
    With note_escapes turned on, you get an escape type, which is identical to control, except that it's only returned for escapes.
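The token types above can be seen by draining a tokenizer in a loop. This is a minimal sketch, assuming the module is installed; the sample RTF string and the variable names are illustrative only, and the loop stops as soon as an eof token arrives.

```perl
use strict;
use warnings;
use RTF::Tokenizer;

# Tokenize a tiny in-memory document (the sample string is illustrative).
my $tokenizer = RTF::Tokenizer->new( string => '{\rtf1 Hello}' );

# Collect every token, including the final eof, as [type, argument, parameter].
my @tokens;
while (1) {
    my ( $type, $argument, $parameter ) = $tokenizer->get_token();
    push @tokens, [ $type, $argument, $parameter ];
    last if $type eq 'eof';
}

# Print one token per line; undefined fields are shown as empty strings.
for my $token (@tokens) {
    printf "%-8s %-8s %s\n", map { defined $_ ? $_ : '' } @$token;
}
```

Checking for eof inside the loop, rather than testing the truth of the returned list, avoids depending on what the argument and parameter of any particular token happen to be.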
put_token() adds an item to the token cache, so that the next time you call get_token(), the arguments you passed here will be returned. We don't check any of the values, so use this carefully. The cache works on a last-in, first-out basis: the most recently put token is returned first.
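The ordering of the token cache is easiest to see with two pushed-back tokens. The following is a small sketch, assuming the module is installed; the token values are made up for illustration and, as noted, are not validated by the tokenizer.

```perl
use strict;
use warnings;
use RTF::Tokenizer;

my $tokenizer = RTF::Tokenizer->new( string => '{\rtf1}' );

# Push two hypothetical tokens back onto the cache.
$tokenizer->put_token( 'control', 'b', 1 );
$tokenizer->put_token( 'text', 'pushed last', '' );

# The token put most recently comes back first...
my @first  = $tokenizer->get_token();
# ...then the one put before it...
my @second = $tokenizer->get_token();
# ...and only then does the tokenizer go back to its buffer.
my @third  = $tokenizer->get_token();

print join( ', ', $first[0], $second[0], $third[0] ), "\n";
```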
sloppy() decides whether we allow some types of broken RTF. See the docs for new() for a little more explanation about this. Pass it 1 to turn it on, 0 to turn it off. This will always return undef.
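As a rough illustration of sloppy mode, the sketch below feeds the tokenizer a control word whose parameter runs straight into text, the error described in the docs for new(). It assumes the module is installed; the input string is invented for the example, and the exact tokens you get back may depend on the module version.

```perl
use strict;
use warnings;
use RTF::Tokenizer;

my $tokenizer = RTF::Tokenizer->new();

# Turn on tolerance for broken control words before adding any input.
$tokenizer->sloppy(1);

# '\b1' is followed directly by a letter, with no delimiter in between.
$tokenizer->read_string('\b1Bold');

my @control = $tokenizer->get_token();
my @text    = $tokenizer->get_token();

print "first token:  @control\n";
print "second token: @text\n";
```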
Don't call initial_read() unless you actually have a good reason. When the Tokenizer reads from a file, it first attempts to work out what the correct input record separator should be, by reading some characters from the filehandle. The number of characters read starts off as 512, which is twice the number of characters that version 1.7 of the RTF specification says you should write before including a line feed if you're writing RTF.
Called with no argument, this returns the current value of the number of characters we're going to read. Called with a numeric argument, it sets the number of characters we'll read.
You really don't need to use this method.
debug() returns (non-destructively) the next 50 characters from the buffer, or the number of characters you specify. Printing these to STDERR, causing fatal errors, and the like are left as an exercise for the programmer.
Note the part about 'from the buffer'. It really means that: if there's nothing in the buffer, but there is still stuff we're reading from a file, it won't be shown. Chances are, if you're using this function, you're debugging. There's an internal method called _get_line, which is called without arguments ($self->_get_line()); that's how we get more stuff into the buffer when we're reading from filehandles. There's no guarantee that it will stay, or will always work that way, but if you're debugging, that shouldn't matter.
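A short sketch of peeking at the buffer while debugging, assuming the module is installed and that the method described here is exposed as debug(); the sample string and the character count are arbitrary.

```perl
use strict;
use warnings;
use RTF::Tokenizer;

my $tokenizer = RTF::Tokenizer->new( string => '{\rtf1 Hello}' );

# Look at the next five characters without consuming them.
my $peek = $tokenizer->debug(5);
print STDERR "buffer starts with: $peek\n";

# The peek was non-destructive, so tokenizing still starts at the beginning.
my @token = $tokenizer->get_token();
print "first token type: $token[0]\n";
```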
To avoid intrusively deep parsing, if an alternative ASCII representation is available for a Unicode entity, and that ASCII representation contains { or \ by themselves, things will go funky. But I'm not convinced either of those is allowed by the spec.
Pete Sergeant -- rtfr@clueball.com
Copyright 2004 Pete Sergeant.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.