Mail::Webmail::MessageParser -- class to parse HTML webmail messages.
    $p = new Mail::Webmail::MessageParser();
    $p->message_start(_tag => 'div', id => 'message');

    $body_text = $p->parse_body($html, $style);

    while (($field, $data) = each %html_fields_from_somewhere) {
        $header = $p->parse_header($field, $data);
        push @headers, $header if $header;
    }
Parses header and body HTML and converts both to text, or optionally (for body text) to simpler fully-formed HTML.
The package extends HTML::TreeBuilder to include functionality for parsing email elements from an HTML string.
The @message_start_tokens array is passed verbatim to the HTML::TreeBuilder/HTML::Element functions for traversing the HTML tree. This is typically a list of items such as
'_tag', 'a', 'href', 'http://foo.bar.com'
which is interpreted to mean "look for an anchor ('a') tag with an 'href' attribute of 'http://foo.bar.com'".
Since this is a list or array, I typically use the slightly easier-to-read notation of
'_tag' => 'a', 'href' => 'http://foo.bar.com'
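As a rough illustration of how such a criteria list behaves, the sketch below passes the same tokens straight to look_down() from HTML::Element. The sample markup and the variable names are assumptions made for this example, and it shows only the underlying tree search, not what this package does internally.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup, invented for this example.
my $html = '<p><a href="http://foo.bar.com">Foo</a> '
         . '<a href="http://other.example/">Other</a></p>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# The same criteria as in the text: an 'a' element whose 'href'
# attribute is exactly 'http://foo.bar.com'.
my @tokens = ('_tag' => 'a', 'href' => 'http://foo.bar.com');

# In scalar context look_down() returns the first matching element,
# or undef if nothing matches.
my $anchor    = $tree->look_down(@tokens);
my $link_text = defined $anchor ? $anchor->as_text : undef;

print defined $link_text ? "Found: $link_text\n" : "No match\n";

$tree->delete;    # release the tree once we are done with it
```

Because the tokens are an ordinary list, the fat-comma form and the plain comma form are interchangeable here.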
If a valid field name is found, the returned string contains the header in the form 'Name: Value', for example 'To: "A User" <user@server.com>'. If no such field name is found, undef is returned.
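The return convention can be sketched with a small stand-alone helper. Everything here is hypothetical: format_header(), the %VALID table of recognised names, and the sample pairs are assumptions for illustration and are not taken from this package, which also handles the HTML-to-text conversion that the sketch skips.

```perl
use strict;
use warnings;

# Assumed set of recognised header names (illustrative only).
my %VALID = map { lc($_) => $_ } qw(To From Cc Bcc Subject Date);

# Hypothetical helper: returns 'Name: Value' for a recognised field
# name, or undef otherwise. Inputs are assumed to be plain text already.
sub format_header {
    my ($field, $data) = @_;
    return unless defined $field && defined $data;

    # Trim whitespace and an optional trailing colon from the name.
    (my $name = $field) =~ s/^\s+|\s*:?\s*$//g;
    my $canonical = $VALID{ lc $name } or return;

    (my $value = $data) =~ s/^\s+|\s+$//g;
    return "$canonical: $value";
}

my @pairs = (
    [ 'To:',    '"A User" <user@server.com>' ],
    [ 'Folder', 'Inbox' ],    # not a header name, so it is skipped
);

my @headers;
for my $pair (@pairs) {
    my $header = format_header(@$pair);
    push @headers, $header if $header;
}

print "$_\n" for @headers;
```

Returning undef for unknown names is what lets the caller filter with a simple "push ... if $header".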
parse_body() with a style of 'html'.
parse_body() with a style of 'text'.
message_start() to determine the beginning of the message. The end of the message will be the corresponding close tag of the beginning, or the end of the string if the value passed to message_start() is not a container tag.
The message body returned is converted to normalised HTML (i.e. wrapped in <html>, <body> tags as appropriate) if $style is 'html'. If $style is empty or 'text', the message body is returned as plain text. Regardless of the style used, certain character conversions are performed, to remove non-standard HTML entities such as those introduced by Microsoft HTML editors.
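The start-tag and style behaviour described above can be approximated with plain HTML::TreeBuilder calls. This is a minimal sketch under stated assumptions, not the package's actual implementation: extract_body() is a hypothetical helper, the sample page is invented, and the character conversions mentioned above are left out.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: locate the element matching @start_tokens and
# return its contents as plain text or as wrapped HTML.
sub extract_body {
    my ($html, $style, @start_tokens) = @_;

    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my $start = $tree->look_down(@start_tokens);

    my $body;
    if (defined $start) {
        if (defined $style && $style eq 'html') {
            # The empty hash asks as_HTML() to emit every end tag.
            $body = '<html><body>'
                  . $start->as_HTML('<>&', undef, {})
                  . '</body></html>';
        }
        else {
            # Empty or 'text' style: text content only.
            $body = $start->as_text;
        }
    }

    $tree->delete;
    return $body;
}

# Sample page, invented for this example.
my $page = '<html><body><div id="nav">Inbox</div>'
         . '<div id="message"><p>Hello <b>world</b></p></div>'
         . '</body></html>';

my $text   = extract_body($page, 'text', _tag => 'div', id => 'message');
my $markup = extract_body($page, 'html', _tag => 'div', id => 'message');

print "$text\n";
print "$markup\n";
```

Since the start element is a container, everything up to its matching close tag is included and the surrounding navigation markup is left out.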
Nothing.
o Currently the parse_body() method returns only text.
o There may be some issues with the HTML entities being decoded.
o Message bodies should really be enclosed in container tags; I have not tested what happens if a non-container tag is passed to message_start().
Simon Drabble <sdrabble@cpan.org>
Mail::Webmail::Yahoo