Mail::Webmail::MessageParser -- class to parse HTML webmail messages.


        $p = Mail::Webmail::MessageParser->new();
        $p->message_start(_tag => 'div', id => 'message');
        $body_text = $p->parse_body($html, $style);
        while (($field, $data) = each %html_fields_from_somewhere) {
                $header = $p->parse_header($field, $data);
                push @headers, $header if $header;
        }


Parses header and body HTML and converts both to text, or optionally (for body text) to simpler fully-formed HTML.

The package extends HTML::TreeBuilder to include functionality for parsing email elements from an HTML string.


$parser->message_start(@message_start_tokens);
Sets the tokens to watch for that denote the beginning of a message. This allows email messages to be embedded within a DIV or other enclosing HTML tag, or simply to follow a particular sequence of tags.

The @message_start_tokens array is passed verbatim to the HTML::TreeBuilder/HTML::Element functions for traversing the HTML tree. This is typically a list of items such as

  '_tag', 'a', 'href', ''

which is interpreted to mean "look for an 'a' (anchor) tag with an 'href' attribute of ''".

Since this is a list or array, I typically use the slightly easier-to-read notation of

  '_tag' => 'a', 'href' => ''
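
The equivalence of the two notations can be checked with core Perl alone. The snippet below is a minimal sketch, not part of the module; the variable names @plain and @fat are made up for illustration.

```perl
use strict;
use warnings;

# The two notations from the text. '=>' is a comma that also quotes a
# bareword on its left, so both lines build the same four-item list.
my @plain = ('_tag', 'a', 'href', '');
my @fat   = (_tag => 'a', href => '');

# Compare the lists: same length, then element by element.
my $same = @plain == @fat;
for my $i (0 .. $#plain) {
    $same &&= $plain[$i] eq $fat[$i];
}

print $same ? "identical lists\n" : "lists differ\n";
```

Because the result is an ordinary flat list either way, it can be stored in an array and handed to message_start() unchanged.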

$hdr_text = $parser->parse_header($field, $data);
Attempts to find a valid email header name in $field, and a corresponding value in $data. Potential header names are compared to those in @mail_header_names only if $field matches the $LOOKS_LIKE_A_HEADER regexp.

If a valid field name is found, the returned string contains the header in the form 'Name: Value', for example 'To: "A User" <>'. If no such field name is found, undef is returned.
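
The two-stage check described above (regexp first, then the name list) might look roughly like the following. This is a sketch under stated assumptions, not the module's implementation: the contents of @mail_header_names and $LOOKS_LIKE_A_HEADER are not shown in this document, so the values here are hypothetical stand-ins, as are the helper sketch_parse_header() and the placeholder address.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the module's own data (assumptions).
my @mail_header_names   = qw(From To Cc Bcc Subject Date Reply-To);
my $LOOKS_LIKE_A_HEADER = qr/^\s*([A-Za-z][A-Za-z-]*)\s*:?\s*$/;

# Illustrative helper only; the real parse_header() may differ.
sub sketch_parse_header {
    my ($field, $data) = @_;

    # Only consult the name list if the field looks like a header at all.
    return undef unless defined $field && $field =~ $LOOKS_LIKE_A_HEADER;
    my $candidate = $1;

    # Case-insensitive match against the known header names.
    my ($name) = grep { lc($_) eq lc($candidate) } @mail_header_names;
    return undef unless defined $name;

    $data = '' unless defined $data;
    $data =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    return "$name: $data";
}

my $hdr = sketch_parse_header('To:', ' "A User" <user@example.com> ');
print defined $hdr ? "$hdr\n" : "no header found\n";
```

Returning undef for anything that fails either stage keeps the caller's `push @headers, $header if $header` idiom working without extra checks.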

$normalised_html = $parser->parse_body_as_html($html);
Convenience method; calls parse_body() with a style of 'html'.

$text = $parser->parse_body_as_text($html);
Convenience method; calls parse_body() with a style of 'text'.

$text = $parser->parse_body($html, $style);
Returns the parsed message body from $html, using the value passed to message_start() to determine the beginning of the message. The end of the message will be the corresponding close tag of the beginning, or the end of the string if the value passed to message_start() is not a container tag.

The message body returned is converted to normalised HTML (i.e. wrapped in <html>, <body> tags as appropriate) if $style is 'html'. If $style is empty or 'text', the message body is returned as plain text. Regardless of the style used, certain character conversions are performed, to remove non-standard HTML entities such as those introduced by Microsoft HTML editors.
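
The exact conversions performed are not listed here. As an illustration of the kind of clean-up involved, the sketch below replaces numeric character references in the 128-159 range, which are not valid HTML but are emitted by some editors for Windows-1252 punctuation. The mapping table and the helper sketch_clean_entities() are assumptions made for this example, not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical mapping from Windows-1252 code points to plain ASCII.
my %cp1252_entity = (
    133 => '...',   # ellipsis
    145 => "'",     # left single quote
    146 => "'",     # right single quote
    147 => '"',     # left double quote
    148 => '"',     # right double quote
    150 => '-',     # en dash
    151 => '--',    # em dash
);

# Illustrative helper only; the real conversions may differ.
sub sketch_clean_entities {
    my ($html) = @_;
    # Replace known references; leave anything unrecognised untouched.
    $html =~ s/&#(\d+);/exists $cp1252_entity{$1} ? $cp1252_entity{$1} : "&#$1;"/ge;
    return $html;
}

print sketch_clean_entities('&#147;Hello&#148; &#151; it&#146;s here &#169;'), "\n";
```

Standard references such as `&#169;` pass through unchanged, so a later HTML-to-text step can still decode them normally.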

$parser->start($tagname, $attr, $attrseq, $origtext);
$parser->end($tagname);
Override the corresponding methods in HTML::TreeBuilder, which itself overrides those in HTML::Parser. These methods should not be called directly from an application. They are here mainly to remove surplus HTML tags from around the message body; these tags confuse HTML::TreeBuilder and thus result in poor behaviour.




o Currently the parse_body() method returns only text.
o There may be some issues with the HTML entities being decoded.
o Message bodies should really be enclosed in container tags; I have not tested what happens if a non-container tag is passed to message_start().


  Simon Drabble  <sdrabble@cpan.org>


