
"OpenFTS Inside"
---------------

MOTIVATION:

This document is a set of comments on the example scripts. If you have already
worked through the examples and are looking for internal details, you are in
the right place. For a detailed description of the API, see the OpenFTS primer.

INITIALIZATION:

There is no magic at this step. You have to create the tables and configure
OpenFTS. This is what init.pl does.

a) $dbi->do("create table txt ( tid int not null primary key, path varchar, fts_index tsvector );") || die;

Creates the test table 'txt' with three fields: tid - document id,
path - path to the document, and fts_index of type tsvector (see the tsearch2
documentation), which stores the unique lexemes of the document.

b)

my $idx=Search::OpenFTS::Index->init(
    dbi=>$dbi,
    txttid=>'txt.tid',
    tsvector_field=>'fts_index',
    ignore_id_index=>[ qw( 7 13 14 12 23 ) ],
    ignore_headline=>[ qw(13 15 16 17 5) ],
    map=>'{
        \'4\'=>[1], 5=>[1], 6=>[1], 8=>[1], 18=>[1], 19=>[1], # unknown
    }',
    dict=>[
        'Search::OpenFTS::Dict::PorterEng',
        # example how to use snowball stemmer
        # { mod=>'Search::OpenFTS::Dict::Snowball', param=>'{lang=>"english", stop_file=>"/u/megera/app/pgsql/fts/test-suite/Dict/english.stop"}' },
        'Search::OpenFTS::Dict::UnknownDict',
    ]
);

Creates (instantiates) the fts object with some attributes, which are stored
in the table fts_conf of your database. The table looks like:

---------------------------------------------------
openfts=# \d fts_conf
              Table "fts_conf"
 Column |       Type        |      Modifiers
--------+-------------------+---------------------
 name   | character varying | not null
 did    | integer           | not null default -1
 mod    | character varying | not null
 param  | character varying |
Primary key: fts_conf_pkey
---------------------------------------------------

Now we have something to play with. First, you see a bunch of
integers in the attributes 'ignore_id_index', 'ignore_headline' and 'map'.
These numbers designate types of lexemes (see the OpenFTS primer)
that should be ignored (attributes 'ignore_id_index', 'ignore_headline')
or passed to the corresponding dictionaries ('map'). For example,

ignore_id_index=>[ qw( 7 13 14 12 23 ) ]

means that numbers in scientific notation (7), HTML tags (13), HTML entities (23),
the protocol part of a URL (strings like 'http://', 'ftp://') (14), and special
symbols (12) are ignored while indexing a document.

map=>'{
\'4\'=>[1], 5=>[1], 6=>[1], 8=>[1], 18=>[1], 19=>[1], # unknown
}',

means that lexemes of the specified types should be processed by the
dictionary given in [] (dictionaries are numbered from zero!), as
defined in the 'dict' attribute.
Lexeme types are defined in the parser, and the specific numbers
are the ones the default OpenFTS parser uses, so if you write your own parser
(why not?), keep in mind that it has to stay in sync with the 'init' method.

(see simple_parser.pl for an example of a simple parser in perl, which
reads from STDIN and recognizes space-delimited words of length >= 2)
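
As an illustration of the idea only, here is a minimal sketch of such a
word splitter. It is NOT the code of simple_parser.pl: the helper name
simple_words and the sample text are made up for this example, and the
real parser also has to report a lexeme type for every word.

```perl
use strict;
use warnings;

# Hypothetical helper (not the actual simple_parser.pl code):
# return the space-delimited words of length >= 2 read from a filehandle.
sub simple_words {
    my ($fh) = @_;
    my @words;
    while ( my $line = <$fh> ) {
        # split ' ' splits on runs of whitespace and skips leading blanks
        push @words, grep { length($_) >= 2 } split ' ', $line;
    }
    return @words;
}

# Demo on an in-memory filehandle; pass \*STDIN to read standard input.
my $sample = "a crab nebula\nis  M1 x\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my @words = simple_words($fh);
close $fh;
print join( ' ', @words ), "\n";
```

One-letter words ('a', 'x') are dropped, everything else is kept as is.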

NOTICE: Due to a bug in perl5, you should use the \'4\' notation for the
first element of the map attribute!

The parser passes a lexeme to the dictionaries in the order specified in the
'dict' attribute until some dictionary recognizes it. In our example, the
Search::OpenFTS::Dict::UnknownDict dictionary deals with unrecognized
words. You may leave it out, in which case those lexemes will be ignored.

The parser uses the mapping to pass a lexeme to a specific dictionary, which is
not only an optimization but is also useful for indexing mixed-language
documents. Some dictionaries (stemmers, for example) recognize every
lexeme, so the mapping lets us state explicitly which
dictionary should be used for which type of lexeme. In our example,
we use Porter's stemmer (dictionary 0) for English words and UnknownDict
(dictionary 1, the [1] in 'map') for unrecognized words.
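
The dispatch described above can be sketched in a few lines of perl. This
is an illustration under assumptions, not the OpenFTS implementation: the
two "dictionaries" are toy closures, the helper name lexize is made up,
and it is assumed that a type without a 'map' entry is tried against all
dictionaries in 'dict' order.

```perl
use strict;
use warnings;

# Toy stand-ins for dictionaries: each returns a normalized lexeme,
# or undef when it does not recognize the word.
my @dict = (
    sub { my $w = lc $_[0]; $w =~ /^[a-z]+$/ ? $w : undef },    # 0: toy "English" dictionary
    sub { lc $_[0] },                                           # 1: accepts anything
);

# Lexeme type => ordered list of dictionary numbers (like 'map').
my %map = ( 4 => [1], 5 => [1], 6 => [1] );

# Lexeme types that are dropped (like 'ignore_id_index').
my %ignore = map { $_ => 1 } qw( 7 13 14 12 23 );

sub lexize {
    my ( $type, $word ) = @_;
    return if $ignore{$type};

    # Assumption: unmapped types try every dictionary in 'dict' order.
    my @order = $map{$type} ? @{ $map{$type} } : ( 0 .. $#dict );
    for my $i (@order) {
        my $lexeme = $dict[$i]->($word);
        return $lexeme if defined $lexeme;
    }
    return;    # no dictionary recognized it: the lexeme is ignored
}

for my $case ( [ 1, 'Stars' ], [ 4, 'x-42' ], [ 13, '<b>' ] ) {
    my $lexeme = lexize(@$case);
    printf "type %-2d %-6s => %s\n", @$case,
        defined $lexeme ? $lexeme : '(ignored)';
}
```

The first dictionary that returns a defined value wins, which is why the
order of the numbers inside [] matters.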

Some dictionaries require parameters; the commented line
# { mod=>'Search::OpenFTS::Dict::Snowball', param=>'{lang=>"english"}' },
demonstrates how to define a dictionary in this case. Be sure to read
the documentation for the full list of parameters of the Snowball stemmer.

If you have several dictionaries, and especially if you index a
multi-language collection, you may *explicitly* define the order
in which dictionaries process a lexeme. For example, if you want to use
our interface to ISpell dictionaries (which provides a sort of morphology)
together with Porter's stemming algorithm, it is a sound idea to pass a lexeme
to ISpell first and, if ISpell does not recognize it, to Porter's dictionary.
(NOTE: a stemming dictionary recognizes any word by definition, so nothing
listed after it will ever be consulted!)
In this case, you may map Latin and Cyrillic words as:

1=>[0,1], 11=>[0,1], 16=>[0,1], # latin
2=>[2,3], 10=>[2,3], 17=>[2,3], # cyrillic

where the dictionaries are numbered as:

0 - ISpell english
1 - Porter english
2 - ISpell russian
3 - Snowball russian

Read the OpenFTS primer for information about the dictionary API.

c) $idx->create_index;

Creates the index 'gist_key' on the field 'fts_index' of the table 'txt'.
This index speeds up searching, but can
significantly slow down indexing. So, as a rule of thumb,
for batch indexing create the index only after all documents have been
indexed. For online indexing you need to create the index at
initialization. Our test example uses batch mode, but for the sake of
clarity we leave index creation in init.pl.

INDEXING:

Reads filenames from STDIN and invokes the method $idx->index to index each
document. The 'tid' and 'path' are also stored so that the search script
can refer to them. There are actually two database operations:

$sth->execute( $STID, $file ) - inserts 'tid' and 'path'
and
$idx->index($STID, \*INFILE) - indexes the document and inserts 'fts_index'

That is why we need to explicitly invoke $dbi->commit if everything is ok,
and $dbi->rollback if something goes wrong.

Read DBI, DBD::Pg documentation for details about transactions support.
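
A minimal sketch of this commit/rollback pattern is shown below. The
helper index_one is hypothetical (it is not part of OpenFTS or of
index.pl). It assumes that $dbi was opened with AutoCommit off, that
$sth is the prepared insert for 'tid' and 'path', and that both
$sth->execute and $idx->index die on failure (for DBI this means
RaiseError is on).

```perl
use strict;
use warnings;

# Hypothetical helper: store and index one document inside a transaction.
# $dbi, $sth and $idx are expected to behave like the objects in index.pl.
sub index_one {
    my ( $dbi, $sth, $idx, $tid, $file ) = @_;

    my $ok = eval {
        open my $in, '<', $file or die "cannot open $file: $!\n";
        $sth->execute( $tid, $file );    # insert 'tid' and 'path'
        $idx->index( $tid, $in );        # index and store 'fts_index'
        close $in;
        1;
    };

    if ($ok) {
        $dbi->commit;                    # both operations succeeded
    }
    else {
        warn "document $tid skipped: $@";
        $dbi->rollback;                  # undo the partial insert
    }
    return $ok ? 1 : 0;
}
```

Wrapping both operations in one eval keeps 'txt' free of rows that have
a 'path' but no 'fts_index'.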

SEARCHING:

The search script can be used for testing, benchmarking and searching.
Invoke ./search.pl without parameters to see the syntax.

$sql = $fts->_sql( \@ARGV );

The method '_sql' returns the SQL query for a given search query (a reference to @ARGV).
For example: ./search.pl -p openfts -vq supernovae stars

select
txt.tid,
rank( '{0.1, 0.2, 0.4, 1.0}', txt.fts_index, '\'supernova\' & \'star\'', 1 ) as pos
from
txt
where
txt.fts_index @@ '\'supernova\' & \'star\''
order by pos desc

Notice that the query terms are passed through the dictionary (Porter's stemmer):
'supernovae' becomes 'supernova' and 'stars' becomes 'star'.
relkov is a relevance function based on the proximity between search terms
(the query shown above calls rank); it is used for ranking (order by) the
results. The magic numbers can be set when the fts object is created, see
the documentation for Search::OpenFTS (perldoc Search::OpenFTS). We use the
defaults in our example.

Also, for testing purposes, you can invoke search.pl with the -e option
to get the EXPLAIN output for the SQL command used for searching (see above).

$dbi->do("explain $sql" );

For example: ./search.pl -p openfts -qe supernovae stars

NOTICE: QUERY PLAN:

Sort (cost=4.83..4.83 rows=1 width=4)
-> Index Scan using gist_key on txt (cost=0.00..4.82 rows=1 width=4)

For benchmarking, use the 'search' method; \@ARGV is a reference to the array
of search terms.

foreach ( 1..$opt{b} ) {
    my $a=$fts->search( \@ARGV );
    $count=$#{$a};    # index of the last element of the result array
}

Example: ./search.pl -p openfts -b 100 Uma 47

Found documents:2
908;328
Speed gun in use :)...
Found documents:2, total time (100 runs): 0.39 sec; average time: 0.004 sec
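
The timing itself can be done with the core module Time::HiRes. The
sketch below is an assumption about how such a loop may be timed, not
the code of search.pl. The helper bench is made up, and $fake_search is
a stand-in for $fts->search( \@ARGV ) that just returns a reference to
an array of document ids.

```perl
use strict;
use warnings;
use Time::HiRes qw( gettimeofday tv_interval );

# Hypothetical helper: call $code->() $runs times ($runs > 0) and return
# the last result, the total time and the average time per run (seconds).
sub bench {
    my ( $runs, $code ) = @_;
    my $t0 = [gettimeofday];
    my $result;
    $result = $code->() for 1 .. $runs;
    my $total = tv_interval($t0);
    return ( $result, $total, $total / $runs );
}

# Stand-in for $fts->search( \@ARGV ).
my $fake_search = sub { return [ 908, 328 ] };

my $runs = 100;
my ( $found, $total, $avg ) = bench( $runs, $fake_search );
printf "Found documents:%d, total time (%d runs): %.2f sec; average time: %.3f sec\n",
    scalar @$found, $runs, $total, $avg;
```

Note that scalar @$found gives the number of hits, while $#{$found}
gives the index of the last hit.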

In real-life applications, searching usually includes additional
constraints on metadata. The method get_sql returns the SQL parts that can be
used to construct such a query. For example:

my ($out, $condition, $order) = $fts->get_sql( $query, rejected => \@stopwords );

my $sql="
select
txt.tid,
txt.path,
$out
from
txt
where
$condition
order by $order";

@stopwords contains the words recognized by the dictionaries as stop words or
rejected by the S attribute. This is quite useful for giving
feedback to the user.
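
To show where a metadata constraint fits, here is a small sketch that
glues the three parts together with one extra condition. The helper
build_query is hypothetical, and the sample values of $out, $condition
and $order are made up; they are only shaped like the query printed
earlier in this document, the real strings come from get_sql.

```perl
use strict;
use warnings;

# Hypothetical helper: assemble a query from the three parts returned by
# get_sql, optionally adding one more condition on metadata columns.
sub build_query {
    my ( $out, $condition, $order, $extra ) = @_;
    my $where = $condition;
    $where .= " and ( $extra )" if defined $extra and length $extra;
    return join "\n",
        'select', '    txt.tid,', '    txt.path,', "    $out",
        'from',   '    txt',
        'where',  "    $where",
        "order by $order";
}

# Made-up parts, shaped like the query shown for ./search.pl -vq.
my $out       = q{rank( '{0.1, 0.2, 0.4, 1.0}', txt.fts_index, '\'supernova\' & \'star\'', 1 ) as pos};
my $condition = q{txt.fts_index @@ '\'supernova\' & \'star\''};
my $order     = 'pos desc';

print build_query( $out, $condition, $order, q{txt.path like '%/apod/%'} ), "\n";
```

The extra condition is appended with "and", so the full text search
condition from get_sql is never weakened by it.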

To get a real feeling for searching, invoke search.pl with the '-h' option.

In this mode the search uses an explicit SQL command, and the results are
displayed as document fragments with the search terms highlighted.
Highlighting is done using termcap control sequences. You may use HTML
markup instead:

my $headline=$fts->get_headline(query=>$query, src=>\*FH,
                                maxread=>1024, maxlen=>100,
                                otag=>"\e[1m", ctag=>"\e[0m" );   # terminal bold on / off
# for HTML output use tags instead, e.g. otag=>'<b>', ctag=>'</b>'

Please note that 'maxread' is the maximum number of bytes to read from 'src'
and 'maxlen' is the length of the text fragment. You are welcome to use your
own procedure to generate text fragments. The default method supplied by
OpenFTS is currently not smart enough to keep text fragments looking nice,
i.e. without leading or trailing punctuation marks.
Play with the get_headline2 method, which should handle this problem
better. Read the primer for references.
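
If you want to write your own procedure, the sketch below may serve as a
starting point. It is not the OpenFTS implementation of get_headline or
get_headline2: the function simple_headline and its 'text' and 'terms'
arguments are made up for this example, and it works on characters of an
in-memory string rather than on a filehandle.

```perl
use strict;
use warnings;

# Hypothetical fragment builder: take about 'maxlen' characters around
# the first hit, trim punctuation at both ends, wrap hits in otag/ctag.
sub simple_headline {
    my (%arg) = @_;
    my $text   = $arg{text};
    my $maxlen = $arg{maxlen} || 100;
    my $otag   = defined $arg{otag} ? $arg{otag} : '';
    my $ctag   = defined $arg{ctag} ? $arg{ctag} : '';
    my $alt    = join '|', map { quotemeta($_) } @{ $arg{terms} };
    return '' unless length($alt) and $text =~ /\b(?:$alt)/i;

    # The window starts maxlen/2 characters before the first hit.
    my $start = $-[0] - int( $maxlen / 2 );
    $start = 0 if $start < 0;
    my $frag = substr( $text, $start, $maxlen );

    $frag =~ s/^\w*\W+// if $start > 0;    # drop a word cut at the left edge
    $frag =~ s/^\W+//;                     # no leading punctuation
    $frag =~ s/\W+$//;                     # no trailing punctuation
    $frag =~ s/\b((?:$alt)\w*)/$otag$1$ctag/gi;
    return $frag;
}

print simple_headline(
    text   => 'Stars explode. The Crab Nebula is so energetic that it glows.',
    terms  => [ 'crab', 'nebula' ],
    maxlen => 30,
    otag   => '<b>',
    ctag   => '</b>',
), "\n";
```

The terms are matched as word prefixes, so already stemmed query terms
such as 'nebula' also highlight 'Nebulae' in the text.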

Example: ./search.pl -p openfts -h 3 crab nebulae

------TID: 1589 WEIGHT:0.077 PATH:/u/megera/app/pgsql/fts/test-suite/apod/1165090
Energy Crab Nebula Credit: NASA , UIT Explanation: This is the mess that is left when a star explodes. The Crab Nebula is so
------TID: 1121 WEIGHT:0.073 PATH:/u/megera/app/pgsql/fts/test-suite/apod/1163277
. The Crab Nebula is so energetic that it glows in every kind of light known. Shown above are images of the Crab Nebula from
------TID: 667 WEIGHT:0.062 PATH:/u/megera/app/pgsql/fts/test-suite/apod/1162865
M1: Filaments of the Crab Nebula Credit and Copyright: S. Kohle, T. Credner et al. ( AIUB ) Explanation : The Crab Nebula is

(Highlighting is lost here because of cut-and-paste from xterm.)

TID here is the document id as stored in the database, PATH is the path to
the document, and WEIGHT is the weight of the document in terms of the
relevance function.

USING PREFIXES:

OpenFTS can work with several collections stored in one database.
Storing the collections in one database means that no separate
connections to the database are needed. Collections are distinguished
by prefixes (currently, characters from the English alphabet).

You may play with collections using the example scripts - just use
DATABASE:PREFIX instead of DATABASE. For example:

./init.pl openfts:a
find /path/to/test-collection/apod -type f | ./index.pl openfts:a
./search.pl -p openfts:a supernovae stars

Another collection:

./init.pl openfts:x
find /path/to/test-collection/xfiles -type f | ./index.pl openfts:x
./search.pl -p openfts:x spaceship biogenesis
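
How the example scripts split DATABASE:PREFIX is not shown here, so the
following is only a guess at the idea: the helper split_target is
hypothetical, and the check for English letters simply mirrors the
restriction on prefixes mentioned above.

```perl
use strict;
use warnings;

# Hypothetical helper: split "DATABASE:PREFIX" into its two parts.
# A missing prefix is returned as an empty string.
sub split_target {
    my ($spec) = @_;
    my ( $db, $prefix ) = split /:/, $spec, 2;
    $prefix = '' unless defined $prefix;
    die "bad prefix '$prefix': expected English letters only\n"
        unless $prefix =~ /\A[A-Za-z]*\z/;
    return ( $db, $prefix );
}

for my $spec (qw( openfts openfts:a openfts:x )) {
    my ( $db, $prefix ) = split_target($spec);
    print "database=$db prefix='$prefix'\n";
}
```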

The name of the table (template name) used for storing metadata and the
search index is specified at the init stage. It can be changed in the
init.pl script ( my $TABLE = 'txt'; ).

SEE ALSO:

The OpenFTS Primer
perldoc Search::OpenFTS::Search
perldoc Search::OpenFTS::Index
perldoc Search::OpenFTS::Parser
perldoc Search::OpenFTS::Dict::PorterEng
perldoc Search::OpenFTS::Dict::Snowball
perldoc Search::OpenFTS::Dict::UnknownDict
perldoc Search::OpenFTS::Morph::ISpell

TODO:

Simple crawler for indexing a personal web site.
Volunteers are welcome.

Done. See perldoc Search::OpenFTS::Crawler and
example scripts.

FINAL NOTES:

The test suite for OpenFTS is a starting point for novices and can be used
as a base for customization and for writing your own search application.
Consult the OpenFTS primer and the documentation of the perl modules
(use perldoc) for details. The authors appreciate your ideas and
comments about further development and support of OpenFTS;
please use the OpenFTS discussion list
(http://lists.sourceforge.net/lists/listinfo/openfts-general).

--------------------------------------------------------------------
Sat Aug 2 23:08:10 MSD 2003
Comments to Oleg Bartunov