DBIx::TextIndex - Perl extension for full-text searching in SQL databases
new(\%args)
add_document(\@document_ids)
remove_document(\@document_ids)
disable_document(\@document_ids)
search(\%search_args)
unscored_search(\%search_args)
    use DBIx::TextIndex;

    my $index = DBIx::TextIndex->new({
        document_dbh      => $document_dbh,
        document_table    => 'document_table',
        document_fields   => ['column_1', 'column_2'],
        document_id_field => 'primary_key',
        index_dbh         => $index_dbh,
        collection        => 'collection_1',
        db                => 'mysql',
        proximity_index   => 0,
        errors => {
            empty_query     => "your query was empty",
            quote_count     => "phrases must be quoted correctly",
            no_results      => "your search did not produce any results",
            no_results_stop => "no results, these words were stoplisted: ",
        },
        language            => 'en',       # cz or en
        stoplist            => [ 'en' ],
        max_word_length     => 12,
        result_threshold    => 5000,
        phrase_threshold    => 1000,
        min_wildcard_length => 5,
        print_activity      => 0,
    });

    $index->initialize;

    $index->add_document(\@document_ids);

    my $results = $index->search({
        column_1 => '"a phrase" +and -not or',
        column_2 => 'more words',
    });

    # Print the matching documents, highest score first.
    foreach my $document_id (sort { $$results{$b} <=> $$results{$a} } keys %$results) {
        print "DocumentID: $document_id Score: $$results{$document_id} \n";
    }

    $index->delete;
DBIx::TextIndex was developed for doing full-text searches on BLOB columns stored in a database. Almost any database with BLOB and DBI support should work with minor adjustments to SQL statements in the module.
Implements a crude parser for tokenizing a user input string into phrases, can-include words, must-include words, and must-not-include words.
Searches are case-insensitive.
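The four term classes named above can be illustrated with a small stand-alone tokenizer. This is only a sketch of the idea, not the module's actual parser: the function name tokenize_query and the returned hash layout are made up for this example, and it assumes the "+word", "-word" and double-quoted phrase syntax shown in the synopsis.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of DBIx::TextIndex: split a query string
# into phrases, must-include, must-not-include and can-include words.
sub tokenize_query {
    my ($query) = @_;
    my %terms = (phrases => [], must => [], must_not => [], can => []);

    # Pull out double-quoted phrases first so their words are not
    # classified as individual terms.
    while ($query =~ s/"([^"]*)"/ /) {
        push @{ $terms{phrases} }, lc $1 if length $1;
    }

    # Classify the remaining whitespace-separated words by prefix.
    # Everything is lowercased because matching is case-insensitive.
    for my $word (split ' ', $query) {
        if    ($word =~ /^\+(.+)/) { push @{ $terms{must} },     lc $1 }
        elsif ($word =~ /^-(.+)/)  { push @{ $terms{must_not} }, lc $1 }
        else                       { push @{ $terms{can} },      lc $word }
    }
    return \%terms;
}

my $terms = tokenize_query('"a phrase" +and -not or');
print join(', ', map { "$_: @{ $terms->{$_} }" } sort keys %$terms), "\n";
```

Extracting the phrases before splitting on whitespace keeps a quoted "+" or "-" from being read as an operator.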
The following methods are available:
new(\%args)
Constructor method. The first time an index is created, the following arguments must be passed to new():
    my $index = DBIx::TextIndex->new({
        document_dbh      => $document_dbh,
        document_table    => 'document_table',
        document_fields   => ['column_1', 'column_2'],
        document_id_field => 'primary_key',
        index_dbh         => $index_dbh,
        collection        => 'collection_1'
    });
Other arguments are optional.
    ":2 some phrase"  => matches "some nice phrase"
    ":1 some phrase"  => matches only exact "some phrase"
    ":10 some phrase" => matches "some [1..9 words] phrase"
Defaults to ":1" when omitted.
Proximity matching works forwards only, not backwards. That means:
    ":3 some phrase" does not match "phrase nice some" or "phrase some"
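The forward-only rule can be made concrete with a short stand-alone sketch. It is not the module's proximity index code: proximity_match is a hypothetical function that works on a plain string and a two-word phrase, and it assumes that ":N" means the second word may appear at most N word positions after the first.

```perl
use strict;
use warnings;

# Hypothetical illustration only: true if $second occurs after $first
# in $text, at most $max_distance word positions later.
sub proximity_match {
    my ($max_distance, $first, $second, $text) = @_;
    my @words = split ' ', lc $text;

    for my $i (0 .. $#words) {
        next unless $words[$i] eq lc $first;

        # Look forwards only, never backwards.
        my $last = $i + $max_distance;
        $last = $#words if $last > $#words;
        for my $j ($i + 1 .. $last) {
            return 1 if $words[$j] eq lc $second;
        }
    }
    return 0;
}

for my $case ([2, 'some nice phrase'], [1, 'some nice phrase'], [3, 'phrase nice some']) {
    my ($n, $text) = @$case;
    printf "%-22s %-18s %s\n", qq{":$n some phrase"}, qq{"$text"},
        proximity_match($n, 'some', 'phrase', $text) ? 'match' : 'no match';
}
```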
Names of the database modules correspond to the names of DBI drivers and are case sensitive.
Requires the CzFast module, which is available on CPAN in the directory of author "TRIPIE".
    en => English
    cz => Czech
After creating a new TextIndex for the first time, and after calling initialize(), only the index_dbh, document_dbh, and collection arguments are needed to create subsequent instances of a TextIndex.
This method creates all the inverted tables for the TextIndex in the database specified by document_dbh. This method should be called only once when creating a new index! It drops all the inverted tables before creating new ones.
initialize() also stores the document_table, document_fields, document_id_field, language, stoplist, error attributes, proximity_index, max_word_length, result_threshold, phrase_threshold and min_wildcard_length preferences in a special table called "collection," so subsequent calls to new() for a given collection do not need those arguments.
Calling initialize() will upgrade the collection table created by earlier versions of DBIx::TextIndex if necessary.
Upgrades the collection table to the latest format. Usually does not need to be called by the programmer, because initialize() handles upgrades automatically.
add_document(\@document_ids)
Add all the @document_ids from document_id_field to the TextIndex. @document_ids must be sorted from lowest to highest. All further calls to add_document() must use @document_ids higher than those previously added to the index. Reindexing previously-indexed documents will yield unpredictable results!
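Because a mistake here yields unpredictable results, it may be worth checking the id list before handing it to add_document(). The following is a minimal sketch of such a check. The function check_document_ids is hypothetical, and it assumes the caller keeps track of the highest id already indexed; nothing in this documentation says the module reports that value.

```perl
use strict;
use warnings;

# Hypothetical pre-flight check, not part of DBIx::TextIndex.
# $last_indexed_id is the highest id already in the index, as tracked
# by the caller. Returns an error string, or undef if the list is
# strictly ascending and entirely above $last_indexed_id.
sub check_document_ids {
    my ($ids, $last_indexed_id) = @_;
    my $previous = $last_indexed_id;
    for my $id (@$ids) {
        return "id $id is not higher than $previous" if $id <= $previous;
        $previous = $id;
    }
    return undef;
}

my @document_ids = sort { $a <=> $b } (12, 10, 11);
my $problem = check_document_ids(\@document_ids, 9);
print defined($problem) ? "Refusing to index: $problem\n" : "OK to index\n";
```

Only when the check passes would the list be passed on with $index->add_document(\@document_ids).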
remove_document(\@document_ids)
This method accepts a reference to an array of document ids as its parameter. The specified documents will be removed from the index, but not from the actual documents table that is being indexed. The documents themselves must still be accessible when you remove them from the index. The ids should be sorted from lowest to highest.
It is not possible to completely recover the space taken by removed documents, so it is recommended to rebuild the index after removing a significant number of documents.
All space reserved in the proximity index is recovered. Approx. 75% of space reserved in the inverted tables and max term frequency table is recovered.
disable_document(\@document_ids)
This method can be used to disable documents. Disabled documents are not included in search results. This method should be used to "remove" documents from the index. Disabled documents are not actually removed from the index, so its size will remain the same. It is recommended to rebuild the index after you "remove" a significant number of documents this way.
search(\%search_args)
search() returns $results, a reference to a hash. The keys of the hash are document ids, and the values are the relative scores of the documents. If an error occurred while searching, $results will be a scalar containing an error message.
    $results = $index->search({
        first_field  => '+andword -notword orword "phrase words"',
        second_field => ...
        ...
    });
    if (ref $results) {
        # Success: $results maps document ids to scores.
        foreach my $document_id (keys %$results) {
            print "The score for $document_id is $results->{$document_id}\n";
        }
    } else {
        # Failure: $results is a plain error string.
        print "Error: $results\n";
    }
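Since search() returns either a hash reference or an error string, callers usually need both the ref check and a sort by score. A possible way to wrap the two is sketched below. The helper top_results and the sample scores are invented for this example; a literal hash stands in for the value returned by $index->search(), and ids are assumed to be numeric.

```perl
use strict;
use warnings;

# Hypothetical helper: return up to $limit [id, score] pairs, best first.
# Dies with the error message if $results is not a hash reference.
sub top_results {
    my ($results, $limit) = @_;
    die "Search error: $results\n" unless ref($results) eq 'HASH';

    # Highest score first; ties are broken by ascending document id.
    my @ids = sort { $results->{$b} <=> $results->{$a} || $a <=> $b }
              keys %$results;
    splice @ids, $limit if @ids > $limit;
    return map { [ $_, $results->{$_} ] } @ids;
}

# Stand-in for the return value of $index->search(...).
my $results = { 7 => 0.42, 3 => 1.75, 9 => 0.42, 12 => 0.10 };
printf "%s: %.2f\n", @$_ for top_results($results, 3);
```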
unscored_search(\%search_args)
unscored_search() returns $document_ids, a reference to an array. Since the scoring algorithm is skipped, this method is much faster than search(). If an error occurred while searching, $document_ids will be a scalar containing an error message.
    $document_ids = $index->unscored_search({
        first_field  => '+andword -notword orword "phrase words"',
        second_field => ...
    });
    if (ref $document_ids) {
        print "Here are all the document ids:\n";
        print "$_\n" foreach @$document_ids;
    } else {
        print "Error: $document_ids\n";
    }
Allows you to obtain some meta information about the index. Accepts one parameter that specifies what you want to obtain.
$index->stat('total_words')
Returns the total count of words in the index. This number may differ from the total count of words in the documents themselves.
delete() removes the tables associated with a TextIndex from index_dbh.
DBIx::TextIndex can apply boolean operations on arbitrary lists of document ids to search results.
Take this table:
    doc_id  category  doc_full_text
    1       green     full text here ...
    2       green     ...
    3       blue      ...
    4       red       ...
    5       blue      ...
    6       green     ...
Masks that represent the document ids in each of the three categories can be created:
    $index->add_mask('green_category', [ 1, 2, 6 ]);
    $index->add_mask('blue_category', [ 3, 5 ]);
    $index->add_mask('red_category', [ 4 ]);
The first argument is an arbitrary string, and the second is a reference to an array of the document ids that the mask name identifies.
Mask operations are passed to $index->search() in a second hash reference argument:
    %query_args = (
        first_field  => '+andword -notword orword "phrase words"',
        second_field => ...
        ...
    );
    %args = (
        not_mask    => \@not_mask_list,
        and_mask    => \@and_mask_list,
        or_mask     => \@or_mask_list,
        or_mask_set => [ \@or_mask_list_1, \@or_mask_list_2, ... ],
    );
    $index->search(\%query_args, \%args);
From our example above, to narrow search results to documents not in green category:
$index->search(\%query_args, { not_mask => ['green_category'] });
This would return only results in the blue category:
$index->search(\%query_args, { and_mask => ['blue_category'] });
Instead of using named masks, lists of document ids can be passed on the fly as array references. This would give the same results as the previous example:
    my @blue_ids = (3, 5);
    $index->search(\%query_args, { and_mask => [ \@blue_ids ] });
$index->search(\%query_args, { or_mask => [ 'blue_category', 'red_category' ] });
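To see how the three basic mask types narrow a result set, here is a stand-alone sketch that applies them to a plain list of ids from the example table. It does not use DBIx::TextIndex or Bit::Vector. The function apply_masks is hypothetical, and the semantics in its comments are assumptions: and_mask and not_mask follow the examples above, while the or_mask behavior (keep ids found in at least one listed mask) is not confirmed by this documentation.

```perl
use strict;
use warnings;

# Category masks from the example table.
my %mask = (
    green_category => [ 1, 2, 6 ],
    blue_category  => [ 3, 5 ],
    red_category   => [ 4 ],
);

# Hypothetical filter over a list of matching ids. Assumed semantics:
#   and_mask: keep ids present in every listed mask
#   not_mask: drop ids present in any listed mask
#   or_mask:  keep ids present in at least one listed mask
sub apply_masks {
    my ($ids, $args) = @_;

    # Build a lookup table: mask name => { id => 1 }.
    my %member;
    for my $name (keys %mask) {
        $member{$name} = { map { $_ => 1 } @{ $mask{$name} } };
    }

    my @kept;
    ID: for my $id (@$ids) {
        for my $name (@{ $args->{and_mask} || [] }) {
            next ID unless $member{$name}{$id};
        }
        for my $name (@{ $args->{not_mask} || [] }) {
            next ID if $member{$name}{$id};
        }
        if (my @or = @{ $args->{or_mask} || [] }) {
            next ID unless grep { $member{$_}{$id} } @or;
        }
        push @kept, $id;
    }
    return @kept;
}

my @all_ids = (1 .. 6);
print "not green:   @{[ apply_masks(\@all_ids, { not_mask => ['green_category'] }) ]}\n";
print "and blue:    @{[ apply_masks(\@all_ids, { and_mask => ['blue_category'] }) ]}\n";
print "blue or red: @{[ apply_masks(\@all_ids, { or_mask => ['blue_category', 'red_category'] }) ]}\n";
```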
Deletes a single mask from the mask table in the database.
You can use the wildcard characters "%" or "*" at the end of a word to match all words that begin with that word. Example:
the "%" character means "match any characters"
car% ==> matches "car", "cars", "careful", "cartel", ....
the "*" character means "match also the plural form"
car* ==> matches only "car" or "cars"
The min_wildcard_length option sets the minimum length of the word base that must appear before the "%" wildcard character. It defaults to five characters to avoid selecting an excessive number of word combinations. Unless this option is set to a lower value, the example above (car%) won't produce any results.
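The difference between the two wildcards, and the effect of min_wildcard_length, can be sketched by translating a term into a regular expression. This is an illustration under assumptions, not how the module looks words up in its index: wildcard_to_regex is a hypothetical function, and treating "*" as an optional trailing "s" is a guess based on the "car" or "cars" example.

```perl
use strict;
use warnings;

# Hypothetical translation of a wildcard term into a regex.
# Returns undef when a "%" term has a base shorter than
# $min_wildcard_length (default 5, as in the documentation).
sub wildcard_to_regex {
    my ($term, $min_wildcard_length) = @_;
    $min_wildcard_length = 5 unless defined $min_wildcard_length;

    # "%": match any characters after the base.
    if ($term =~ /^(.+)%$/) {
        return undef if length($1) < $min_wildcard_length;
        my $base = quotemeta lc $1;
        return qr/^$base.*$/;
    }

    # "*": assumed to mean the base plus an optional plural "s".
    if ($term =~ /^(.+)\*$/) {
        my $base = quotemeta lc $1;
        return qr/^${base}s?$/;
    }

    # No wildcard: exact word match.
    my $word = quotemeta lc $term;
    return qr/^$word$/;
}

my @words = qw(car cars careful cartel);
for my $spec (['car%', 3], ['car*', 5], ['car%', 5]) {
    my ($term, $min) = @$spec;
    my $re = wildcard_to_regex($term, $min);
    my @hits = defined $re ? grep { $_ =~ $re } @words : ();
    print "$term (min $min): @hits\n";
}
```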
The HTML::Highlight module can be used either independently or together with DBIx::TextIndex for this task.
HTML::Highlight provides Google-like highlighting that uses different colors for different words or phrases. It can also preview the context in which the query words appear in the resulting documents.
The module works together with DBIx::TextIndex through the html_highlight() method.
See the example script 'html_search.cgi' in the 'examples/' directory of the DBIx::TextIndex distribution, or refer to the documentation of HTML::Highlight for more information.
For diacritics-insensitive operation on Czech text, set the language option to 'cz'.
my $index = DBIx::TextIndex->new({ .... language => 'cz', .... });
This option MUST be set for correct Czech language processing. Diacritics-sensitive operation is not possible.
Requires the "CzFast" module, which is available on CPAN in the directory of author "TRIPIE".
Daniel Koch, dkoch@bizjournals.com. Contributions by Tomas Styblo, tripie@cpan.org.
Copyright 1997, 1998, 1999, 2000, 2001 by Daniel Koch. All rights reserved.
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, i.e., under the terms of the "Artistic License" or the "GNU General Public License".
This package is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the "GNU General Public License" for more details.
Special thanks to Tomas Styblo, for proximity index support, Czech language support, stoplists, highlighting, document removal and many other improvements.
Thanks to Ulrich Pfeifer for ideas and code from the Man::Index module in the "Information Retrieval, and What pack 'w' Is For" article from The Perl Journal vol. 2 no. 2.
Thanks to Steffen Beyer for the Bit::Vector module, which enables fast set operations in this module. Version 5.3 or greater of Bit::Vector is required by DBIx::TextIndex.
Uses quite a bit of memory.
Parser is not very good.
Documentation is not complete.
Please feel free to email me (dkoch@bizjournals.com) with any questions or suggestions.
perl(1).