WebFetch - Perl module to download and save information from the Web
use WebFetch;
The WebFetch module is a general framework for downloading and saving
information from the web, and for display on the web.
It requires another module to inherit it and fill in the specifics of
what and how to download.
WebFetch provides a generalized interface for saving to a file
while keeping the previous version as a backup.
It is intended for periodically-updated information,
with the fetch run as a cron job.
After unpacking the module sources from the tar file, run
perl Makefile.PL
make
make install
Or from a CPAN shell you can simply type ``install WebFetch''
and it will download, build and install it for you.
If you need help setting up a separate area to install the modules
(i.e. if you don't have write permission where perl keeps its modules)
then see the Perl FAQ.
To begin using the WebFetch modules, you will need to test your
fetch operations manually, put them into a crontab, and then
use server-side include (SSI) or a similar server configuration to
include the files in a live web page.
Select a directory which will be the storage area for files created
by WebFetch. This is an important administrative decision -
keep the volatile automatically-generated files in their own directory
so they'll be separated from manually-maintained files.
Choose the specific WebFetch-derived modules that do the work you want.
See their particular manual/web pages for details on command-line arguments.
Test run them first before committing to a crontab.
First of all, if you don't have crontab access or don't know what crontabs are,
contact your site's system administrator(s). Only local help will do any
good on local-configuration issues. No one on the Internet can help.
(If you are the administrator for your system, see the crontab(1) and crontab(5)
manpages and nearly any book on Unix system administration.)
Since the WebFetch command lines are usually very long, you may prefer
to make one or more scripts as front-ends so your crontab entries aren't
so huge.
Do not run the crontab entries too often - be a good net.citizen and
do your updates no more often than necessary.
Popular sites need their users to refrain from making automated
requests too often because they add up on an enormous scale
on the Internet.
Some sites such as Freshmeat prefer no shorter than hourly intervals.
Slashdot prefers no shorter than half-hourly intervals.
When in doubt, ask the site maintainers what they prefer.
(Then again, there are a very few sites like Yahoo and CNN who don't
mind getting the extra hits if you're going to create links to them.
Even so, more often than every 20 minutes would still be excessive
to the biggest web sites.)
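One way to keep crontab entries short is a small front-end script.
The sketch below is only an illustration: the storage directory, the job list,
the WEBFETCH_RUN environment variable, and the "-MModule -e &fetch_main"
invocation form are assumptions rather than anything stated on this page,
so check the manual page of each module you use for its real command line.
Unless WEBFETCH_RUN is set, the script only prints what it would run.

```perl
#!/usr/bin/perl
# Front-end sketch: one short crontab entry can run several fetches.
use strict;
use warnings;

# Hypothetical storage directory for WebFetch output; substitute your own.
my $dir = "/home/www/htdocs/fetch";

# Build the argument list for one fetch. The "-M<module> -e &fetch_main"
# form is an assumption; see the manual page of the module you use.
sub fetch_command {
    my ($module, @args) = @_;
    return ($^X, "-w", "-M$module", "-e", "&fetch_main", "--",
            "--dir", $dir, @args);
}

# Hypothetical job list: a module name followed by extra options.
my @jobs = (
    [ "WebFetch::Slashdot",  "--quiet" ],
    [ "WebFetch::Freshmeat", "--quiet" ],
);

for my $job (@jobs) {
    my @cmd = fetch_command(@$job);
    if ($ENV{WEBFETCH_RUN}) {
        # List-form system() avoids shell quoting problems.
        system(@cmd) == 0 or warn "fetch failed: @cmd\n";
    }
    else {
        print "would run: @cmd\n";
    }
}
```

A single hourly crontab entry could then call this script with WEBFETCH_RUN=1
set, instead of carrying one long command line per fetch.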
See the manual for your web server to make sure you have server-side include
(SSI) enabled for the files that need it.
(It's wasteful to enable it for all your files so be careful.)
When using Apache HTTPD,
a line like this will include a WebFetch-generated file:
<!--#include file="fetch/slashdot.html"-->
The following function definitions assume $obj is a blessed
reference to a module that is derived from (inherits from) WebFetch.
- Do not use the new() function directly from WebFetch.
-
Use the new function from a derived class, not directly from WebFetch.
The WebFetch module itself is just infrastructure for the other modules,
and contains none of the details needed to complete any specific fetches.
- $obj->init( ... )
-
This is called from the new function of all WebFetch modules.
It takes ``name'' => ``value'' pairs which are all placed verbatim as
attributes in $obj.
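The constructor pattern described here can be sketched in a few lines.
This is not WebFetch's own source: the class name My::FetchSketch and the
attribute values are hypothetical, and the code only illustrates how
name/value pairs end up as attributes of the blessed object.

```perl
use strict;
use warnings;

# Hypothetical class standing in for a WebFetch-derived module.
package My::FetchSketch;

sub new {
    my ($class, @pairs) = @_;
    my $self = bless {}, $class;
    $self->init(@pairs);
    return $self;
}

# Copy each name => value pair verbatim into the object.
sub init {
    my ($self, %pairs) = @_;
    $self->{$_} = $pairs{$_} for keys %pairs;
    return $self;
}

package main;

my $obj = My::FetchSketch->new(dir => "/tmp/fetch", quiet => 1);
print "dir=$obj->{dir} quiet=$obj->{quiet}\n";
```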
- $obj->run
-
This function is exported by standard WebFetch-derived modules as fetch_main.
This handles command-line processing for some standard options,
calling the module-specific fetch function and WebFetch's $obj->save
function to save the contents to one or more files.
The standard command-line options are as follows:
- --dir directory
-
(required) the directory in which to write output files
- --group group
-
(optional) the group ID to set the output file(s) to
- --mode mode
-
(optional) the file mode (permissions) to set the output file(s) to
- --export export-file
-
(optional) save a portable WebFetch-export copy of the fetched info
in the file named by this parameter.
The contents of this file can be read by the WebFetch::General module.
You may use this to export your own news to other WebFetch users.
(Exports may be explicitly disabled by some WebFetch-derived modules
simply by omitting the export step from their fetch() functions,
though exporting works with all the modules that come included with the
WebFetch package itself.)
- --xml_export xml-export-file
-
(optional) save a generic XML copy of the fetched info
into the file named by this parameter.
(A module to read this XML output will be included in a near-future
version of WebFetch.)
For more info on XML see
http://www.w3.org/XML/
or
http://www.perlxml.com/faq/perl-xml-faq.html
If you choose to generate and sustain XML content on your site
over the long term,
you may want to have your site listed on the XML Tree
at http://www.xmltree.com/
- --ns_export ns-export-file
-
(optional) save a MyNetscape export copy of the fetched info
into the file named by this parameter.
If this optional parameter is used, three additional parameters
become required: --ns_site_title, --ns_site_link, and --ns_site_desc.
If you want to include an icon in the channel display,
you should also use --ns_image_title and --ns_image_url.
A URL prefix must also be set for this to work correctly,
which can be supplied via the --url_prefix parameter
or in the url-prefix line of the WebFetch::SiteNews news input file.
For more info see
http://my.netscape.com/publish/
and
http://www.w3.org/RDF/
Note that MyNetscape uses Resource Description Framework (RDF),
which is a form of XML, for its imports.
Though this command-line option uses some specific RDF parameters
for the MyNetscape portal,
this format should be readable by any other RDF-capable
and even some XML-capable sites.
You should use the ``.rdf'' suffix on file names that use this format.
- --ns_site_title site-title
-
(required if --ns_export is used)
For exporting to MyNetscape, this sets the name of your site.
It cannot be more than 40 characters.
- --ns_site_link site-link
-
(required if --ns_export is used)
For exporting to MyNetscape, this is the full URL MyNetscape will
use to link to your site.
It cannot be more than 500 characters.
- --ns_site_desc site-description
-
(required if --ns_export is used)
For exporting to MyNetscape, this is a short description of your site.
It cannot be more than 500 characters.
- --ns_image_title image-title
-
(optional)
For exporting to MyNetscape, this is the title (alt) text for the icon image.
- --ns_image_url image-url
-
(optional)
For exporting to MyNetscape, this is the URL MyNetscape will use
for your icon image.
If this is present, the link on the image will be the same as your
--ns_site_link parameter.
- --url_prefix url-prefix
-
(optional) include a URL prefix to use on the saved URLs on --ns_export
output files.
(It could also be used in the future by other output formats that need
URL prefixes.)
This is considered optional by WebFetch though you will probably need
it for MyNetscape to properly link to your site.
This information can also be supplied via the
url-prefix line of the WebFetch::SiteNews news input file.
If it is set in the WebFetch::SiteNews news input file,
it will override the --url_prefix command line parameter.
- --font_size number
-
(optional) choose a font size for generated HTML text.
This will be used in a font tag so it may be relative,
like ``-1'' or ``+1''.
- --font_face string
-
(optional) choose a font face for generated HTML text.
This will be used in a font tag so it may be any standard font name
or a list. For example, for a sans-serif font, use
``Helvetica,Arial,sans-serif''.
- --style style-name-list
-
(optional) select from one or more of various HTML output styles for the
generated HTML text. If more than one style name is listed, they must
be separated by commas (no spaces).
- para
-
use paragraph breaks between lines/links instead of unordered lists
- notable
-
usually WebFetch modules generate HTML table-formatted output text but
this option will disable the use of tables
- bullet
-
use explicit bullet characters (HTML entity #149) and line breaks (br)
to identify and separate each link
- ul
-
(default) use an HTML unnumbered list (ul) block for the list of links
The para, bullet and ul styles are mutually exclusive. Others
may be specified at the same time.
- --quiet
-
(optional) suppress printed warnings for HTTP errors
(applies only to modules which use the WebFetch::get() function)
in case they are not desired for cron outputs
- --debug
-
(optional) print verbose debugging outputs,
only useful for developers adding new WebFetch-based modules
or finding/reporting a bug in an existing module
Modules derived from WebFetch may add their own command-line options
that WebFetch::run() will use by defining a variable called @Options
in the calling module,
using the name/value pairs defined in Perl's Getopt::Long module.
Derived modules can also add to the command-line usage error message by
defining a variable called $Usage with a string of the additional
parameters, as they should appear in the usage message.
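The pair format can be tried out on its own with Getopt::Long.
In the sketch below the option names and the usage text are hypothetical,
and GetOptionsFromArray only stands in for the parsing that WebFetch::run()
performs; how run() itself consumes @Options is not shown on this page.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical module-specific options and their destination variables.
my ($max_items, $section);

# Getopt::Long pairs: an option specification, then where to store the value.
our @Options = (
    "max_items=i" => \$max_items,   # integer argument
    "section=s"   => \$section,     # string argument
);

# Extra text for the usage message, written as it should appear there.
our $Usage = "[--max_items number] [--section name]";

# Sample command line; a real module would receive these from the user.
my @argv = ("--max_items", "5", "--section", "news");

GetOptionsFromArray(\@argv, @Options)
    or die "usage: $0 --dir directory $Usage\n";

print "max_items=$max_items section=$section\n";
```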
- $obj->do_actions
-
do_actions was added in WebFetch 0.10 as part of the
WebFetch Embedding API.
Upon entry to this function, $obj must contain the following attributes:
- data
-
is a reference to a hash containing the following three (required)
keys:
- fields
-
is a reference to an array containing the names of the fetched data fields
in the order they appear in the records of the data array.
This is necessary to define what each field is called
because any kind of data can be fetched from the web.
- wk_names
-
is a reference to a hash which maps from
a key string with a ``well-known'' (to WebFetch) field type
to a field name used in this table.
The well-known names are defined as follows:
- title
-
a one-liner banner or title text
(plain text, no HTML tags)
- url
-
URL/link to the news
(fully-qualified URL only, no HTML tags)
- date
-
a date stamp,
which must be program-readable
by Perl's Date::Calc module in the Parse_Date() function
in order to support timestamp-related comparisons
and processing that some users have requested.
If the date cannot be parsed by Date::Calc,
either translate it when your module captures it,
or do not define this ``well-known'' field
because it wouldn't fit the definition.
(plain text, no HTML tags)
- summary
-
a paragraph of summary text in HTML
- comments
-
number of comments/replies at the news site
(plain text, no HTML tags)
- author
-
a name, handle or login name representing the author of the news item
(plain text, no HTML tags)
- category
-
a word or short phrase representing the category, topic or department
of the news item
(plain text, no HTML tags)
- location
-
a location associated with the news item
(plain text, no HTML tags)
The field names for this table are defined in the fields array.
The hash only maps for the fields available in the table.
If no field representing a given well-known name is present
in the data fields,
that well-known name key must not be defined in this hash.
- records
-
an array containing the data records.
Each record is itself a reference to an array of strings which are
the data fields.
This is effectively a two-dimensional array or a table.
Only one table-type set of data is permitted per fetch operation.
If more are needed, they should be arranged as separate fetches
with different parameters.
- actions
-
is a reference to a hash.
The hash keys are names for handler functions.
The WebFetch core provides internal handler functions called
fmt_handler_html (for HTML output),
fmt_handler_xml (for XML output),
fmt_handler_wf (for WebFetch::General format),
fmt_handler_rdf (for MyNetscape RDF format).
However, WebFetch modules may provide additional
format handler functions of their own by prepending
``fmt_handler_'' to the key string used in the actions hash.
The values are array references containing
``action specs'',
which are themselves arrays of parameters
that will be passed to the handler functions
for generating output in a specific format.
There may be more than one entry for a given format if multiple outputs
with different parameters are needed.
The presence of values in this field means that output is to be
generated in the specified format.
These values would have been chosen by the WebFetch module that
created them - possibly by default settings or by a command-line argument
that directed a specific output format to be used.
For each valid action spec,
a separate ``savable'' (contents to be placed in a file)
will be generated from the contents of the data variable.
The valid (but all optional) keys are
- html
-
the value must be a reference to an array which specifies all the
HTML generation (html_gen) operations that will take place upon the data.
Each entry in the array is itself an array reference,
containing the following parameters for a call to html_gen():
- filename
-
a file name or path string
(relative to the WebFetch output directory unless a full path is given)
for output of HTML text.
- params
-
a hash reference containing optional name/value parameters for the
HTML format handler.
- filter_func
-
(optional)
a reference to code that, given a reference to an entry in
@{$self->{data}{records}},
returns true (1) or false (0) for whether it will be included in the
HTML output.
By default, all records are included.
- sort_func
-
(optional)
a reference to code that, given references to two entries in
@{$self->{data}{records}},
returns the sort comparison value for the order they should be in.
By default, no sorting is done and all records (subject to filtering)
are accepted in order.
- format_func
-
(optional)
a reference to code that, given a reference to an entry in
@{$self->{data}{records}},
returns an HTML representation of the string.
By default, a standard HTML formatting is generated using the
well-known fields in the record.
(This default generation fails if none of the title, url or text
names are defined in %{$self->{data}{wk_names}}.)
- xml
-
the value must be a reference to an array which specifies all the
XML export (xml_export) operations that will take place upon the data.
Each entry in the array is itself an array reference,
containing the following parameters for a call to xml_export():
- filename
-
a file name or path string
(relative to the WebFetch output directory unless a full path is given)
for output of XML text.
- wf
-
the value must be a reference to an array which specifies all the
WebFetch export (wf_export) operations that will take place upon the data.
Each entry in the array is itself an array reference,
containing the following parameters for a call to wf_export():
- filename
-
a file name or path string
(relative to the WebFetch output directory unless a full path is given)
for output of the WebFetch::General export format.
- rdf
-
the value must be a reference to an array which specifies all the
Resource Description Framework (RDF) export (ns_export, used by MyNetscape)
operations that will take place upon the data.
Each entry in the array is itself an array reference,
containing the following parameters for a call to ns_export():
- filename
-
a file name or path string
(relative to the WebFetch output directory unless a full path is given)
for output of RDF format,
for the MyNetscape portal or other sites that can use RDF.
- site_title
-
For exporting to MyNetscape, this sets the name of your site.
It cannot be more than 40 characters.
- site_link
-
For exporting to MyNetscape, this is the full URL MyNetscape will
use to link to your site.
It cannot be more than 500 characters.
- site_desc
-
For exporting to MyNetscape, this is a short description of your site.
It cannot be more than 500 characters.
- image_title
-
(optional)
For exporting to MyNetscape, this is the title (alt) text for the icon image.
- image_url
-
(optional)
For exporting to MyNetscape, this is the URL MyNetscape will use
for your icon image.
If this is present, the link on the image will be the same as your
$site_link parameter.
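The data and actions attributes described above can be pictured with a small
self-contained sketch. The field names, URLs and file names are hypothetical.
The html key is left out because the flattened listing above does not make
clear how params, filter_func, sort_func and format_func nest. The filter and
sort code references are applied by plain Perl here, purely to show the
calling conventions described for them, not how WebFetch invokes them.

```perl
use strict;
use warnings;

# The table of fetched data: field names, well-known-name map, and records.
my %data = (
    fields   => [ "headline", "link", "replies" ],
    wk_names => { title => "headline", url => "link", comments => "replies" },
    records  => [
        [ "First story",  "http://www.example.com/1", 12 ],
        [ "Second story", "http://www.example.com/2", 0  ],
        [ "Third story",  "http://www.example.com/3", 40 ],
    ],
);

# Output requests: one action spec entry per file (file names hypothetical).
my %actions = (
    xml => [ [ "news.xml" ] ],
    wf  => [ [ "news.wf" ] ],
);

# Helper (not part of WebFetch): column index for a well-known name,
# or undef when that name is not mapped for this table.
sub wk_index {
    my ($data, $wk) = @_;
    my $name = $data->{wk_names}{$wk};
    return unless defined $name;
    for my $i (0 .. $#{ $data->{fields} }) {
        return $i if $data->{fields}[$i] eq $name;
    }
    return;
}

my $c = wk_index(\%data, "comments");

# filter_func: given one record, return 1 to include it or 0 to drop it.
my $filter_func = sub { my ($rec) = @_; return $rec->[$c] > 0 ? 1 : 0 };

# sort_func: given two records, return a comparison value (most replies first).
my $sort_func = sub { my ($x, $y) = @_; return $y->[$c] <=> $x->[$c] };

my @shown = sort { $sort_func->($a, $b) }
            grep { $filter_func->($_) } @{ $data{records} };
print "$_->[0]\n" for @shown;
```

With this sample data the sketch prints "Third story" and then "First story";
the record with no replies is filtered out.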
Additional valid keys may be created by modules that inherit from WebFetch
by supplying a method/function named with ``fmt_handler_'' preceding the
string used for the key.
For example, for an ``xyz'' format, the handler function would be
fmt_handler_xyz.
The value (the ``action spec'') of the hash entry
must be an array reference.
Within that array are ``action spec entries'',
each of which is a reference to an array containing the list of
parameters that will be passed verbatim to the fmt_handler_xyz function.
When the format handler function returns, it is expected to have
created entries in the $obj->{savable} array
(even if they only contain error messages explaining a failure),
which will be used by $obj->save() to save the files and print the
error messages.
For coding examples, use the fmt_handler_* functions in WebFetch.pm itself.
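As a rough illustration of the handler naming convention, the sketch below
defines a hypothetical ``tsv'' format handler in a stand-in class.
The class name, the dispatch_actions method and the tab-separated format are
all assumptions made for this example; dispatch_actions only imitates the
lookup by ``fmt_handler_'' prefix described above and is not WebFetch's
do_actions.

```perl
use strict;
use warnings;

package My::HandlerSketch;   # hypothetical stand-in for a derived module

sub new {
    my ($class, %attr) = @_;
    my $self = { savable => [], %attr };
    return bless $self, $class;
}

# Hypothetical handler for a "tsv" key in the actions hash. It receives the
# parameters of one action spec entry verbatim (here just a file name) and
# adds one entry to the savable array.
sub fmt_handler_tsv {
    my ($self, $filename) = @_;
    my @lines = map { join("\t", @$_) } @{ $self->{data}{records} };
    push @{ $self->{savable} },
        { file => $filename, content => join("\n", @lines) . "\n" };
    return;
}

# Stand-in dispatcher: for each key, find fmt_handler_<key> and call it
# once per action spec entry.
sub dispatch_actions {
    my ($self) = @_;
    for my $key (sort keys %{ $self->{actions} }) {
        my $handler = $self->can("fmt_handler_$key");
        unless ($handler) {
            warn "no handler for format $key\n";
            next;
        }
        $self->$handler(@$_) for @{ $self->{actions}{$key} };
    }
    return;
}

package main;

my $obj = My::HandlerSketch->new(
    data => {
        fields   => [ "title", "url" ],
        wk_names => { title => "title", url => "url" },
        records  => [ [ "First story", "http://www.example.com/1" ] ],
    },
    actions => { tsv => [ [ "news.tsv" ] ] },
);
$obj->dispatch_actions;
print $obj->{savable}[0]{content};
```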
- $obj->fetch
-
This function must be provided by each derived module to perform the
fetch operation specific to that module.
It will be called from new() so you should not call it directly.
Your fetch function should extract some data from somewhere
and place it in HTML or other meaningful form in the ``savable'' array.
Upon entry to this function, $obj must contain the following attributes:
- dir
-
The name of the directory to save in.
(If called from the command-line, this will already have been provided
by the required --dir parameter.)
- savable
-
a reference to an array where the ``savable'' items will be placed by
the $obj->fetch function.
(You only need to provide an array reference -
other WebFetch functions can write to it.)
In WebFetch 0.10 and later,
this parameter should no longer be supplied by the fetch function
(unless you wish to use 0.09 backward compatibility)
because it is filled in by the do_actions function
after the fetch function is completed
based on the data and actions variables
that are set in the fetch function.
(See below.)
Each entry of the savable array is a hash reference with the following
attributes:
- file
-
file name to save in
- content
-
scalar w/ entire text or raw content to write to the file
- group
-
(optional) group setting to apply to file
- mode
-
(optional) file permissions to apply to file
Contents of savable items may be generated directly by derived modules
or with WebFetch's html_gen, html_savable or raw_savable functions.
These functions will set the group and mode parameters from the
object's own settings, which in turn could have originated from
the WebFetch command-line if this was called that way.
Note that the fetch function's requirements changed in WebFetch 0.10.
The old requirement (0.09 and earlier) is supported for backward compatibility.
In WebFetch 0.09 and earlier,
upon exit from this function, the $obj->savable array must contain
one entry for each file to be saved.
More than one array entry means more than one file to save.
The WebFetch infrastructure will save them, retaining backup copies
and setting file modes as needed.
Beginning in WebFetch 0.10, the ``WebFetch embedding'' capability was introduced.
In order to do this, the captured data of the fetch function
had to be externalized where other Perl routines could access it.
So the fetch function now only populates data structures
(including code references necessary to process the data).
Upon exit from the function,
the following variables must be set in $obj:
- data
-
is a reference to a hash which will be used by the do_actions function.
(See above.)
- actions
-
is a reference to a hash which will be used by the do_actions function.
(See above.)
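A 0.10-style fetch function might look roughly like the sketch below.
Everything named here is hypothetical: the class, the inline sample text
standing in for downloaded content, the ``title|url'' line format, and the
output file name. The sketch calls fetch directly only to show the resulting
structures; in a real module the WebFetch infrastructure calls it.

```perl
use strict;
use warnings;

package My::HeadlineSketch;   # hypothetical stand-in for a derived module

sub new {
    my ($class, %attr) = @_;
    my $self = { %attr };
    return bless $self, $class;
}

sub fetch {
    my ($self) = @_;

    # A real module would download this text; here it is inline sample data.
    my $content = "Perl released|http://www.example.com/perl\n"
                . "Kernel news|http://www.example.com/kernel\n";

    # Turn each "title|url" line into one record (an array of fields).
    my @records;
    for my $line (split /\n/, $content) {
        my ($title, $url) = split /\|/, $line, 2;
        next unless defined $url;
        push @records, [ $title, $url ];
    }

    # Populate only the data structures; saving happens elsewhere.
    $self->{data} = {
        fields   => [ "title", "url" ],
        wk_names => { title => "title", url => "url" },
        records  => \@records,
    };
    $self->{actions} = { wf => [ [ "headlines.wf" ] ] };
    return;
}

package main;

my $obj = My::HeadlineSketch->new(dir => "/tmp/fetch");
$obj->fetch;
printf "%d records, first title: %s\n",
    scalar @{ $obj->{data}{records} }, $obj->{data}{records}[0][0];
```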
- $obj->get
-
This WebFetch utility function will get a URL and return a reference
to a scalar with the retrieved contents.
Upon entry to this function, $obj must contain the following attributes:
- url
-
the URL to get
- quiet
-
a flag which, when set to a non-zero (true) value,
suppresses printing of HTTP request errors on STDERR
- $obj->wf_export ( $filename, $fields, $lines, [ $comment, [ $param ]] )
-
In WebFetch 0.10 and later, this should be used only in
format handler functions. See do_actions() for details.
This WebFetch utility function generates contents for a WebFetch export
file, which can be placed on a web server to be read by other WebFetch sites.
The WebFetch::General module reads this format.
$obj->wf_export has the following parameters:
- $filename
-
the file to save the WebFetch export contents to;
this will be placed in the savable record with the contents
so the save function knows where to write them
- $fields
-
a reference to an array containing a list of the names of the data fields
(in each entry of the @$lines array)
- $lines
-
a reference to an array of arrays;
the outer array contains each line of the exported data;
the inner array is a list of the fields within that line
corresponding in index number to the field names in the @$fields array
- $comment
-
(optional) a human-readable string comment (probably describing the purpose
of the format and the definitions of the fields used) to be placed at the
top of the exported file
- $param
-
(optional) a reference to a hash of global parameters for the exported data.
This is currently unused but reserved for future versions of WebFetch.
- $obj->ns_export ( $filename, $lines, $site_title, $site_link, $site_desc, $image_title, $image_url)
-
In WebFetch 0.10 and later, this should be used only in
format handler functions. See do_actions() for details.
This WebFetch utility function generates contents for a MyNetscape export
file, which can be placed on a web server to be read by the MyNetscape
site (my.netscape.com) if you create a ``channel'' for your site at MyNetscape.
Of the modules included with WebFetch, only WebFetch::SiteNews and
WebFetch::General call $obj->ns_export().
The others will ignore it (because they're just obtaining data from
other sites themselves.)
You may use $obj->ns_export() in your own modules which inherit from WebFetch.
For more info see http://my.netscape.com/publish/
$obj->ns_export has the following parameters:
- $filename
-
(required)
the file to save the WebFetch export contents to;
this will be placed in the savable record with the contents
so the save function knows where to write them
- $lines
-
(required)
a reference to an array of arrays;
the outer array contains each line of the exported data;
the inner array is a list of two fields within that line
consisting of a text title string in one entry and a
URL in the second entry.
- $site_title
-
(required)
For exporting to MyNetscape, this sets the name of your site.
It cannot be more than 40 characters.
- $site_link
-
(required)
For exporting to MyNetscape, this is the full URL MyNetscape will
use to link to your site.
It cannot be more than 500 characters.
- $site_desc
-
(required)
For exporting to MyNetscape, this is a short description of your site.
It cannot be more than 500 characters.
- $image_title
-
(optional)
For exporting to MyNetscape, this is the title (alt) text for the icon image.
- $image_url
-
(optional)
For exporting to MyNetscape, this is the URL MyNetscape will use
for your icon image.
If this is present, the link on the image will be the same as your
$site_link parameter.
- $obj->html_gen( $filename, $format_func, $links )
-
In WebFetch 0.10 and later, this should be used only in
format handler functions. See do_actions() for details.
This WebFetch utility function generates some common formats of
HTML output used by WebFetch-derived modules.
The HTML output is stored in the $obj->{savable} array,
for which all the files in that array can later be saved by the
$obj->save function.
It has the following parameters:
- $filename
-
the file name to save the generated contents to;
this will be placed in the savable record with the contents
so the save function knows where to write them
- $format_func
-
a reference to code that formats each entry in @$links into a
line of HTML
- $links
-
a reference to an array of arrays of parameters for &$format_func;
each entry in the outer array is contents for a separate HTML line
and a separate call to &$format_func
Upon entry to this function, $obj must contain the following attributes:
- num_links
-
number of lines/links to display
- savable
-
reference to an array of hashes which this function will use as
storage for filenames and contents to save
(you only need to provide an array reference - the function will write to it)
See $obj->fetch for details on the contents of the savable parameter.
- table_sections
-
(optional) if present, this specifies the number of table columns to use;
the number of links from num_links will be divided evenly between the columns
- style
-
(optional) a hash reference with style parameter names/values
that can modify the behavior of the function to use different HTML styles.
The recognized values are enumerated with WebFetch's --style command line
option.
(When they reach this point, they are no longer a comma-delimited string -
WebFetch or another module has parsed them into a hash with the style
name as the key and the integer 1 for the value.)
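The conversion from a --style string to this hash can be sketched as follows.
The function name parse_style is hypothetical and this is not WebFetch's own
parser; it only shows the documented shape of the result, plus a check for
the mutually exclusive para, bullet and ul styles.

```perl
use strict;
use warnings;

# Turn a comma-delimited style list into a hash of style name => 1.
# Dies if more than one of the mutually exclusive list styles is named.
sub parse_style {
    my ($str) = @_;
    my %style = map { $_ => 1 } grep { length } split /,/, $str;

    my @layout = grep { $style{$_} } qw(para bullet ul);
    die "styles @layout are mutually exclusive\n" if @layout > 1;

    return \%style;
}

my $style = parse_style("para,notable");
print join(",", sort keys %$style), "\n";
```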
- $obj->html_savable( $filename, $content )
-
In WebFetch 0.10 and later, this should be used only in
format handler functions. See do_actions() for details.
This WebFetch utility function stores pre-generated HTML in a new entry in
the $obj->{savable} array, for later writing to a file.
It's basically a simple wrapper that puts HTML comments
warning that it's machine-generated around the provided HTML text.
This is generally a good idea so that neophyte webmasters
(and you know there are a lot of them in the world :-)
will see the warning before trying to manually modify
your automatically-generated text.
See $obj->fetch for details on the contents of the savable parameter.
- $obj->raw_savable( $filename, $content )
-
In WebFetch 0.10 and later, this should be used only in
format handler functions. See do_actions() for details.
This WebFetch utility function stores any raw content and a filename
in the $obj->{savable} array,
in preparation for writing to that file.
(The actual save operation may also automatically include keeping
backup files and setting the group and mode of the file.)
See $obj->fetch for details on the contents of the savable parameter.
- $obj->save
-
This WebFetch utility function goes through all the entries in the
$obj->{savable} array and saves their contents,
providing several services such as keeping backup copies,
and setting the group and mode of the file, if requested to do so.
If you call a WebFetch-derived module from the command-line run()
or fetch_main() functions, this will already be done for you.
Otherwise you will need to call it after populating the savable
array with one entry per file to save.
Upon entry to this function, $obj must contain the following attributes:
- dir
-
directory to save files in
- savable
-
names and contents for files to save
See $obj->fetch for details on the contents of the savable parameter.
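The save-with-backup idea can be sketched independently of WebFetch.
In the code below the function name, the ``.new'' and ``.bak'' suffixes, and
the treatment of mode as an octal string are all assumptions; WebFetch's
actual backup naming is not described on this page, and group handling is
left out.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Write each savable entry to its file, keeping the previous version
# as a backup copy.
sub save_with_backup {
    my ($dir, $savable) = @_;
    for my $item (@$savable) {
        my $path = File::Spec->catfile($dir, $item->{file});
        my $new  = "$path.new";

        # Write the new contents to a temporary name first.
        open(my $fh, ">", $new) or die "cannot write $new: $!";
        print {$fh} $item->{content};
        close($fh) or die "cannot close $new: $!";

        # Assumption: mode is given as an octal string such as "0644".
        chmod(oct($item->{mode}), $new) if defined $item->{mode};

        # Keep the old version, then move the new one into place.
        if (-e $path) {
            rename($path, "$path.bak") or die "cannot back up $path: $!";
        }
        rename($new, $path) or die "cannot install $path: $!";
    }
    return;
}

# Demonstration in a temporary directory that is removed on exit.
my $dir = tempdir(CLEANUP => 1);
save_with_backup($dir, [ { file => "news.html", content => "<p>first</p>\n",
                           mode => "0644" } ]);
save_with_backup($dir, [ { file => "news.html", content => "<p>second</p>\n" } ]);
print "saved in $dir\n";
```

Writing to a temporary name and renaming means a reader of the live file
never sees a half-written version.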
The easiest way to make a new WebFetch-derived module is to start
from the module closest to your fetch operation and modify it.
Make sure to change all of the following:
- fetch function
-
The fetch function is the meat of the operation.
Get the desired info from a local file or remote site and place the
contents that need to be saved in the savable parameter.
- module name
-
Be sure to catch and change them all.
- file names
-
The code and documentation may refer to output files by name.
- module parameters
-
Change the URL, number of links, etc as necessary.
- command-line parameters
-
If you need to add command-line parameters, modify both the @Options
and $Usage variables.
Don't forget to add documentation for your command-line options
and remove old documentation for any you removed.
When adding documentation, if the existing formatting isn't enough
for your changes, there's more information about Perl's
POD (``plain old documentation'') embedded documentation format at
http://www.cpan.org/doc/manual/html/pod/perlpod.html
- authors
-
Add yourself as an author if you added any significant functionality.
But if you used anyone else's code, retain the existing author credits
in any module you modify to make a new one.
- export function
-
If it's appropriate for users of your module to be able to export its
data to other sites, add an export() function.
Use the one in WebFetch::SiteNews as an example if you need to.
Please consider contributing any useful changes back to the WebFetch
project at maint@webfetch.org.
WebFetch was written by Ian Kluft
for the Silicon Valley Linux User Group (SVLUG).
Send patches, bug reports, suggestions and questions to
maint@webfetch.org.
WebFetch is Open Source software distributed via the
Comprehensive Perl Archive Network (CPAN),
a worldwide network of Perl web mirror sites.
WebFetch may be copied under the same terms and licensing as Perl itself.
A current copy of the source code and documentation may be found at
http://www.webfetch.org/
perl(1),
WebFetch::CNETnews,
WebFetch::CNNsearch,
WebFetch::COLA,
WebFetch::DebianNews,
WebFetch::Freshmeat,
WebFetch::LinuxDevNet,
WebFetch::LinuxTelephony,
WebFetch::LinuxToday,
WebFetch::ListSubs,
WebFetch::PerlStruct,
WebFetch::SiteNews,
WebFetch::Slashdot,
WebFetch::32BitsOnline,
WebFetch::YahooBiz.