DBIx::TextIndex - Perl extension for full-text searching in SQL databases

$index = DBIx::TextIndex->new(\%args)
$index->initialize
$index->upgrade_collection_table
$index->add_document(\@document_ids)
$index->remove_document(\@document_ids)
$index->disable_document(\@document_ids)
$index->search(\%search_args)
$index->unscored_search(\%search_args)
$index->stat
$index->delete

SUPPORT FOR SEARCH MASKS

$index->add_mask($mask_name, \@document_ids);
$index->delete_mask($mask_name);

PARTIAL PATTERN MATCHING USING WILDCARDS
HIGHLIGHTING OF QUERY WORDS OR PATTERNS IN RESULTING DOCUMENTS
CZECH LANGUAGE SUPPORT
AUTHORS
COPYRIGHT
LICENSE
DISCLAIMER
ACKNOWLEDGEMENTS
BUGS
SEE ALSO

NAME

DBIx::TextIndex - Perl extension for full-text searching in SQL databases

SYNOPSIS

        use DBIx::TextIndex;

        my $index = DBIx::TextIndex->new({
                document_dbh      => $document_dbh,
                document_table    => 'document_table',
                document_fields   => ['column_1', 'column_2'],
                document_id_field => 'primary_key',
                index_dbh         => $index_dbh,
                collection        => 'collection_1',
                db                => 'mysql',
                proximity_index   => 0,
                errors => {
                        empty_query     => "your query was empty",
                        quote_count     => "phrases must be quoted correctly",
                        no_results      => "your search did not produce any results",
                        no_results_stop => "no results, these words were stoplisted: "
                },
                language            => 'en', # cz or en
                stoplist            => [ 'en' ],
                max_word_length     => 12,
                result_threshold    => 5000,
                phrase_threshold    => 1000,
                min_wildcard_length => 5,
                print_activity      => 0
        });

        $index->initialize;

        $index->add_document(\@document_ids);

        my $results = $index->search({
                column_1 => '"a phrase" +and -not or',
                column_2 => 'more words',
        });

        foreach my $document_id (sort {$$results{$b} <=> $$results{$a}} keys %$results ) {
                print "DocumentID: $document_id Score: $$results{$document_id} \n";
        }

        $index->delete;

DESCRIPTION

DBIx::TextIndex was developed for doing full-text searches on BLOB columns stored in a database. Almost any database with BLOB and DBI support should work with minor adjustments to SQL statements in the module.

The module implements a crude parser that tokenizes a user input string into phrases, can-include words, must-include words, and must-not-include words.

Searches are case-insensitive.
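The four token classes can be illustrated with a small standalone sketch. This is not the module's own parser: tokenize_query() is a hypothetical helper written for this example, and it assumes that phrases are enclosed in double quotes and that "+" and "-" prefixes mark must-include and must-not-include words, as in the SYNOPSIS query.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of DBIx::TextIndex: split a query string into
# quoted phrases, +must words, -must-not words, and plain (can-include) words.
sub tokenize_query {
    my ($query) = @_;
    my %tokens = (phrases => [], must => [], must_not => [], can => []);

    # Pull out double-quoted phrases first, replacing each with a space.
    while ($query =~ s/"([^"]*)"/ /) {
        my $phrase = lc $1;
        push @{ $tokens{phrases} }, $phrase if $phrase =~ /\S/;
    }

    # Classify the remaining whitespace-separated words by their prefix.
    for my $word (split ' ', lc $query) {
        if    ($word =~ /^\+(.+)/) { push @{ $tokens{must} },     $1 }
        elsif ($word =~ /^-(.+)/)  { push @{ $tokens{must_not} }, $1 }
        else                       { push @{ $tokens{can} },      $word }
    }
    return \%tokens;
}

my $tokens = tokenize_query('"a phrase" +and -not or');
for my $class (sort keys %$tokens) {
    print "$class: @{ $tokens->{$class} }\n";
}
```

Lowercasing everything up front mirrors the case-insensitive behaviour described above; the real parser may treat edge cases such as unbalanced quotes differently.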

The following methods are available:

$index = DBIx::TextIndex->new(\%args)

Constructor method. The first time an index is created, the following arguments must be passed to new(): document_dbh, document_table, document_fields, document_id_field, index_dbh, and collection.

Other arguments are optional.

document_dbh

DBI connection handle to database containing text documents

document_table

Name of database table containing text documents

document_fields

Reference to a list of column names to be indexed from document_table

document_id_field

Name of a unique integer key column in document_table

index_dbh

DBI connection handle to database containing TextIndex tables. Using a separate database for your TextIndex is recommended, because the module creates and drops tables without warning.

collection

A name for the index. Should contain only alphanumeric characters or underscores [A-Za-z0-9_].

proximity_index

Activates a proximity index for faster phrase searches and word-proximity-based matching. Disabled by default. Only efficient for bigger documents. Takes up a lot of space and slows down the indexing process. Proximity-based matching is activated by a query containing a phrase in the form of:

        ":2 some phrase" => matches "some nice phrase"
        ":1 some phrase" => matches only exact "some phrase"
        ":10 some phrase" => matches "some [1..9 words] phrase"


        Defaults to ":1" when omitted.

Proximity matches work only forwards, not backwards. That means:

        ":3 some phrase" does not match "phrase nice some" or "phrase some"
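The forward-only rule can be sketched in plain Perl. proximity_match() is a hypothetical helper written for this example, not the module's proximity index. It assumes that ":N" means the second word may appear at most N word positions after the first, so ":1" means directly adjacent.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $second occurs after $first in $text,
# at most $max_distance word positions later (1 means directly adjacent).
sub proximity_match {
    my ($text, $max_distance, $first, $second) = @_;
    my @words = split ' ', lc $text;
    for my $i (0 .. $#words) {
        next unless $words[$i] eq lc($first);
        my $last = $i + $max_distance;
        $last = $#words if $last > $#words;
        # Only positions after $i are examined, so matching is forward-only.
        for my $j ($i + 1 .. $last) {
            return 1 if $words[$j] eq lc($second);
        }
    }
    return 0;
}

for my $case (
    [ 'some nice phrase', 2 ],
    [ 'some nice phrase', 1 ],
    [ 'phrase nice some', 3 ],
) {
    my ($text, $n) = @$case;
    my $verdict = proximity_match($text, $n, 'some', 'phrase') ? 'match' : 'no match';
    print "\":$n some phrase\" against \"$text\": $verdict\n";
}
```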

db

The SQL used in this module is database specific in some aspects. To use this module with a variety of databases, a so-called "database module" can be specified. The default is the mysql module. Other modules have yet to be written.

Names of the database modules correspond to the names of DBI drivers and are case sensitive.

errors

This hash reference can be used to override the default error messages. Please refer to the SYNOPSIS for the meaning of the particular keys and values.

language

Accepts a value of 'en' or 'cz'. Default is 'en'.

Passing 'cz' to language activates support for the Czech language. Operates in a diacritics-insensitive manner. This option may also be usable for other iso-8859-2 based Slavic languages. Basically, it converts both index data and queries from iso-8859-2 to pure ASCII.

Requires the module CzFast, which is available on CPAN in the directory of author "TRIPIE".
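The idea of diacritics-insensitive matching can be illustrated without CzFast. The sketch below is a swapped-in technique, not what the module does: instead of converting iso-8859-2 bytes, it assumes the text is already a decoded Perl character string and uses the core Unicode::Normalize module to decompose characters and drop the combining marks. fold_diacritics() is a hypothetical name.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: fold diacritics by decomposing each character and
# removing the combining marks, then lowercase for case-insensitive matching.
sub fold_diacritics {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}//g;
    return lc $decomposed;
}

# "\x{10c}e\x{161}tina" is the Czech word for "Czech language",
# written with escapes so the example does not depend on source encoding.
print fold_diacritics("\x{10c}e\x{161}tina"), "\n";
```

Applying the same folding to both the indexed text and the query is what makes a query without diacritics find words that contain them.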

stoplist

Activates stoplisting of very common words that are present in almost every document. Default is not to use stoplisting. The value of the parameter is a reference to an array of two-letter language codes in lower case. Currently only two stoplists exist:

        en => English
        cz => Czech

max_word_length

Specifies maximum word length resolution. Defaults to 12 characters.

result_threshold

Defaults to 5000 documents.

phrase_threshold

Defaults to 1000 documents.

print_activity

Activates STDOUT debugging. A higher value increases verbosity.

After creating a new TextIndex for the first time, and after calling initialize(), only the index_dbh, document_dbh, and collection arguments are needed to create subsequent instances of a TextIndex.

$index->initialize

This method creates all the inverted tables for the TextIndex in the database specified by index_dbh. This method should be called only once when creating a new index! It drops all the inverted tables before creating new ones.

initialize() also stores the document_table, document_fields, document_id_field, language, stoplist, error attributes, proximity_index, max_word_length, result_threshold, phrase_threshold and min_wildcard_length preferences in a special table called "collection", so subsequent calls to new() for a given collection do not need those arguments.

Calling initialize() will upgrade the collection table created by earlier versions of DBIx::TextIndex if necessary.

$index->upgrade_collection_table

Upgrades the collection table to the latest format. Usually does not need to be called by the programmer, because initialize() handles upgrades automatically.

$index->add_document(\@document_ids)

Adds all the @document_ids from document_id_field to the TextIndex. @document_ids must be sorted from lowest to highest. All further calls to add_document() must use @document_ids higher than those previously added to the index. Reindexing previously-indexed documents will yield unpredictable results!
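One way to respect these ordering rules is to filter and sort candidate ids before indexing. The sketch below is only an illustration: ids_to_add() is a hypothetical helper, and it assumes the caller keeps track of the highest id already indexed.

```perl
use strict;
use warnings;

# Hypothetical helper: return the ids that are safe to pass to add_document(),
# i.e. unique, sorted ascending, and strictly higher than the highest id
# already indexed (tracked by the caller in this sketch).
sub ids_to_add {
    my ($last_indexed_id, @candidate_ids) = @_;
    my %seen;
    return [ sort { $a <=> $b }
             grep { $_ > $last_indexed_id && !$seen{$_}++ } @candidate_ids ];
}

my $document_ids = ids_to_add(100, 205, 101, 99, 150, 101);
print "@$document_ids\n";    # prints "101 150 205"

# With a real index this list would then be passed on:
# $index->add_document($document_ids);
```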

$index->remove_document(\@document_ids)

This method accepts a reference to an array of document ids as its parameter. The specified documents will be removed from the index, but not from the actual documents table that is being indexed. The documents themselves must still be accessible when you remove them from the index. The ids should be sorted from lowest to highest.

It is not possible to completely recover the space taken by the documents that are removed, so it is recommended to rebuild the index when you remove a significant number of documents.

All space reserved in the proximity index is recovered. Approx. 75% of space reserved in the inverted tables and max term frequency table is recovered.

$index->disable_document(\@document_ids)

This method can be used to disable documents. Disabled documents are not included in search results. This method should be used to "remove" documents from the index. Disabled documents are not actually removed from the index, therefore its size will remain the same. It is recommended to rebuild the index when you remove a significant number of documents.

$index->search(\%search_args)

search() returns $results, a reference to a hash. The keys of the hash are document ids, and the values are the relative scores of the documents. If an error occurred while searching, $results will be a scalar containing an error message.

        $results = $index->search({
                first_field  => '+andword -notword orword "phrase words"',
                second_field => ...
                ...
        });

        if (ref $results) {
                print "The score for $document_id is $results->{$document_id}\n";
        } else {
                print "Error: $results\n";
        }
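Because the return value is either a hash reference or a plain error string, callers may want to wrap the check in one place. The following sketch does that with a hypothetical helper, format_results(), and uses made-up stand-in values rather than a real call to search().

```perl
use strict;
use warnings;

# Hypothetical helper: turn the return value of search() into printable lines.
# A hash reference means success; a plain scalar is an error message.
sub format_results {
    my ($results) = @_;
    return ("Error: $results") unless ref($results) eq 'HASH';
    # Highest score first; ties are broken by ascending document id.
    return map  { "DocumentID: $_ Score: $results->{$_}" }
           sort { $results->{$b} <=> $results->{$a} || $a <=> $b }
           keys %$results;
}

# Stand-ins for what search() might return (the values are made up).
print "$_\n" for format_results({ 7 => 0.4, 3 => 1.2, 9 => 0.4 });
print "$_\n" for format_results('your query was empty');
```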

$index->unscored_search(\%search_args)

unscored_search() returns $document_ids, a reference to an array. Since the scoring algorithm is skipped, this method is much faster than search(). If an error occurred while searching, $document_ids will be a scalar containing an error message.

        $document_ids = $index->unscored_search({
                first_field  => '+andword -notword orword "phrase words"',
                second_field => ...
        });

        if (ref $document_ids) {
                print "Here's all the document ids:\n";
                print "$_\n" for @$document_ids;
        } else {
                print "Error: $document_ids\n";
        }

$index->stat

Allows you to obtain some meta information about the index. Accepts one parameter that specifies what you want to obtain.

        $index->stat('total_words')

Returns the total count of words in the index. This number may differ from the total count of words in the documents themselves.

$index->delete

delete() removes the tables associated with a TextIndex from index_dbh.

SUPPORT FOR SEARCH MASKS

DBIx::TextIndex can apply boolean operations on arbitrary lists of document ids to search results.

Take this table:

        doc_id  category  doc_full_text
        1       green     full text here ...
        2       green     ...
        3       blue      ...
        4       red       ...
        5       blue      ...
        6       green     ...

Masks that represent the document ids for each of the three categories can be created:

        $index->add_mask($mask_name, \@document_ids);

        $index->add_mask('green_category', [ 1, 2, 6 ]);
        $index->add_mask('blue_category', [ 3, 5 ]);
        $index->add_mask('red_category', [ 4 ]);

The first argument is an arbitrary string, and the second is a reference to an array of document ids that the mask name identifies.

Mask operations are passed in a second argument hash reference to $index->search():

        %query_args = (
                first_field  => '+andword -notword orword "phrase words"',
                second_field => ...
                ...
        );

        %args = (
                not_mask    => \@not_mask_list,
                and_mask    => \@and_mask_list,
                or_mask     => \@or_mask_list,
                or_mask_set => [ \@or_mask_list_1, \@or_mask_list_2, ... ],
        );

        $index->search(\%query_args, \%args);

not_mask

For each mask in the not_mask list, the intersection of the search query results and all documents not in the mask is calculated.

From our example above, to narrow search results to documents not in the green category:

$index->search(\%query_args, { not_mask => ['green_category'] });

and_mask

For each mask in the and_mask list, the intersection of the search query results and all documents in the mask is calculated.

This would return results only in the blue category:

$index->search(\%query_args, { and_mask => ['blue_category'] });

Instead of using named masks, lists of document ids can be passed on the fly as array references. This would give the same results as the previous example:

        my @blue_ids = (3, 5);
        $index->search(\%query_args, { and_mask => [ \@blue_ids ] });

or_mask_set

With the or_mask_set argument, the union of all the masks in each list is computed individually, and then the intersection of each union set with the query results is calculated.

or_mask

An or_mask is treated as an or_mask_set with only one list. In this example, the union of blue_category and red_category is taken, and then the intersection of that union with the query results is calculated:

$index->search(\%query_args, { or_mask => [ 'blue_category', 'red_category' ] });
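The set semantics of the four mask options can be sketched with ordinary hashes. This is only an illustration under stated assumptions: the module itself relies on Bit::Vector for set operations, and %mask, mask_ids(), intersect() and apply_masks() are hypothetical names that exist only in this example.

```perl
use strict;
use warnings;

# The masks from the example table, kept in a plain hash for this sketch.
my %mask = (
    green_category => [ 1, 2, 6 ],
    blue_category  => [ 3, 5 ],
    red_category   => [ 4 ],
);

# Resolve a mask entry: either a mask name or an array reference of ids.
sub mask_ids {
    my ($entry) = @_;
    return ref($entry) ? $entry : ($mask{$entry} || []);
}

# Remove from %$keep every id that is not listed in @$ids.
sub intersect {
    my ($keep, $ids) = @_;
    my %in = map { $_ => 1 } @$ids;
    delete @{$keep}{ grep { !$in{$_} } keys %$keep };
    return;
}

sub apply_masks {
    my ($result_ids, $opt) = @_;
    my %keep = map { $_ => 1 } @$result_ids;

    # not_mask: remove documents that appear in any listed mask.
    delete @keep{ @{ mask_ids($_) } } for @{ $opt->{not_mask} || [] };

    # and_mask: intersect with each listed mask in turn.
    intersect(\%keep, mask_ids($_)) for @{ $opt->{and_mask} || [] };

    # or_mask is treated as an or_mask_set holding a single list:
    # take the union of each list, then intersect with what is left.
    my @sets = @{ $opt->{or_mask_set} || [] };
    push @sets, $opt->{or_mask} if $opt->{or_mask};
    for my $set (@sets) {
        intersect(\%keep, [ map { @{ mask_ids($_) } } @$set ]);
    }
    return [ sort { $a <=> $b } keys %keep ];
}

my @hits = (1, 3, 4, 5, 6);    # pretend these ids matched the query
my $not_green   = apply_masks(\@hits, { not_mask => ['green_category'] });
my $blue_only   = apply_masks(\@hits, { and_mask => ['blue_category'] });
my $blue_or_red = apply_masks(\@hits, { or_mask => [ 'blue_category', 'red_category' ] });
print "not green:   @$not_green\n";
print "blue only:   @$blue_only\n";
print "blue or red: @$blue_or_red\n";
```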

$index->delete_mask($mask_name);

Deletes a single mask from the mask table in the database.

PARTIAL PATTERN MATCHING USING WILDCARDS

You can use the wildcard characters "%" or "*" at the end of a word to match all words that begin with that word. Example:

    the "%" character means "match any characters"

    car%        ==> matches "car", "cars", "careful", "cartel", ....

    the "*" character means "match also the plural form"

    car*        ==> matches only "car" or "cars"

The option min_wildcard_length sets the minimum length of the word base appearing before the "%" wildcard character. It defaults to five characters to avoid selecting an excessive number of word combinations. Unless this option is set to a lower value, the example above (car%) would not produce any results.
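The two wildcards and the length check can be sketched as regular expressions. This is not how the module evaluates wildcards against its index; term_to_regex() is a hypothetical helper, and it assumes that "the plural form" means an appended "s" and that the minimum length applies only to "%".

```perl
use strict;
use warnings;

# Hypothetical helper: compile a query term into a regex. Returns nothing
# when the base before "%" is shorter than $min_wildcard_length.
sub term_to_regex {
    my ($term, $min_wildcard_length) = @_;
    if ($term =~ /^(.+)%\z/) {
        my $base = $1;
        return if length($base) < $min_wildcard_length;
        return qr/^\Q$base\E/i;          # "%": any continuation of the base
    }
    if ($term =~ /^(.+)\*\z/) {
        my $base = $1;
        return qr/^\Q$base\Es?\z/i;      # "*": the base, or the base plus "s"
    }
    return qr/^\Q$term\E\z/i;            # no wildcard: the exact word
}

my @words = qw(car cars careful cartel);
for my $term ('car%', 'car*') {
    my $regex   = term_to_regex($term, 3);
    my @matches = grep { $_ =~ $regex } @words;
    print "$term matches: @matches\n";
}

my $status = defined(term_to_regex('car%', 5)) ? 'allowed' : 'rejected';
print "car% with a minimum length of 5 is $status\n";
```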

HIGHLIGHTING OF QUERY WORDS OR PATTERNS IN RESULTING DOCUMENTS

The HTML::Highlight module can be used either independently or together with DBIx::TextIndex for this task.

HTML::Highlight provides Google-like highlighting, using different colors for different words or phrases, and can also be used to preview the context in which the query words appear in the resulting documents.

The module works together with DBIx::TextIndex using its new method html_highlight().

Check the example script 'html_search.cgi' in the 'examples/' directory of the DBIx::TextIndex distribution or refer to the documentation of HTML::Highlight for more information.

CZECH LANGUAGE SUPPORT

For Czech diacritics-insensitive operation, set the language option to 'cz'.

        my $index = DBIx::TextIndex->new({
                ....
                language => 'cz',
                ....
        });

This option MUST be set for correct Czech language processing. Diacritics-sensitive operation is not possible.

Requires the module "CzFast", which is available on CPAN in the directory of author "TRIPIE".

AUTHORS

Daniel Koch, dkoch@bizjournals.com. Contributions by Tomas Styblo, tripie@cpan.org.

COPYRIGHT

LICENSE

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, i.e., under the terms of the "Artistic License" or the "GNU General Public License".

DISCLAIMER

This package is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

See the "GNU General Public License" for more details.

ACKNOWLEDGEMENTS

Special thanks to Tomas Styblo, for proximity index support, Czech language support, stoplists, highlighting, document removal and many other improvements.

Thanks to Ulrich Pfeifer for ideas and code from the Man::Index module in the "Information Retrieval, and What pack 'w' Is For" article from The Perl Journal vol. 2 no. 2.

Thanks to Steffen Beyer for the Bit::Vector module, which enables fast set operations in this module. Version 5.3 or greater of Bit::Vector is required by DBIx::TextIndex.

BUGS

Uses quite a bit of memory.

Parser is not very good.

Documentation is not complete.

Please feel free to email me (dkoch@bizjournals.com) with any questions or suggestions.