Äîêóìåíò âçÿò èç êýøà ïîèñêîâîé ìàøèíû. Àäðåñ îðèãèíàëüíîãî äîêóìåíòà : http://www.sai.msu.su/~megera/postgres/talks/fts84+.pdf
Äàòà èçìåíåíèÿ: Tue Oct 21 22:51:09 2008
Äàòà èíäåêñèðîâàíèÿ: Thu Jan 15 01:11:58 2009
Êîäèðîâêà:
Text Search in 8.4?+


WIP: Phrase Search


Algebra for text queries





Fast approximate statistics based on GIN index Prefix search support (GIN partial match) Middleware for text search configurations
Oleg Bartunov, Teodor Sigaev, Mikhail Prokhorov Moscow University, SAI, Russia


Phrase Search


What is a phrase ?


It's tsquery


'a b c'::tsquery 'a b c' != 'a c b' 'a b x c' != 'a b c'



Ordering is important




Distance between words is important



Phrase Search


What is a phrase ?


It's tsquery


'a b c'::tsquery 'a b c' != 'a c b' 'a b x c' != 'a b c'



Ordering is important




Distance between words is important



Phrase Search


Why there is no phrase search support ?




There are already support for boolean operations There is positional information for each lexeme


Motivation for Algebra


Existing operators defined at document level


'A & B'::tsquery means intersections of two sets
'A'
Documents with

Documents with

'B'


Motivation for Algebra


Operators AND, OR, AND NOT work with sets
A OR B

A AND B

A N OT B


Motivation for Algebra




Phrase Search requires operation at lexeme level -- operator BEFORE ($) Different semantics - A NOT B




Normal search: Document with A and A B at the same time without B Phrase search: «A $ X» (X is anything, except B)


Motivation for Algebra


Phrase can be very complex Even simplest phrase can be transformed to a complex expression.
to_tsquery('nb', 'telefonsvarer') => 'telefonsvarer' | 'telefon' & 'svar'


to_tsquery('footballklubber $ SMTH') => '( (football & klubber) | (foot & ball & klubber) ) $ SMTH' hard, but it's not the hardest case


Motivation for Algebra


Phrase can be constructed by a program, or manually using casting ( SMTH::tsquery) «A $ ( B $ !(C $ !D))»





We need well-defined algebra for operations: & | ! $ Backward compatibility !


Motivation for Algebra


We introduced «generalized» phrase a $[n] b



Operator BEFORE ($[n]) guarantees


An order of operands -- a BEFORE b Distance between operands, default is 1

a $[n] b == a & b & ( i,j : pos(b)i ­ pos(a)j = n) a $ b == a & b & ( i,j : pos(b)i ­ pos(a)j = 1)


Operations
· · · · · · · a $[ n] b = b $[ - n] a !(a $[n] b) = !a | !b | ( i,j : pos(b)i ­ pos(a)j != n ) !!(a $[n] b) = a $[n] b a $ !b = a & ( i,j : !pos(b)i ­ pos(a)j = 1) !a $ b = b & ( i,j : pos(b)i ­ !pos(a)j = 1 ) !a $ !b = ( i,j : !pos(b)i ­ !pos(a)j = 1 ) a $ (b | c) = a $ b | a $ c (b | c) $ a = b $ a | $ a · a $ (b & c) = b & & (a $ b | a $ c); (b & c) $ a = b & & (b $ a | $ a)


Recursive definition
(a $[n] b) $[m] c
= (a $[n] b) & & (i,j: posL(c)j ­posR("ab")i=m) => (a $[n] b) & c & (i,j: pos(c)j ­ posR("ab")i = m ) => (a & b & (k,l: pos(b)k ­ pos(a)l = n)) & c & (i,j: pos(c)j ­ posR("ab")i = m ) = =a&b&c& (k,l: pos(b)k­pos(a)l=n) & (j: pos(c)j­pos(b)k = m ) = = a & b & c & (j,k,l: pos(b)k­pos(a)l=n & pos(c)j ­ pos(b)k=m) =

a $[n] b $[m] c a $[n] (b $[m] c) = a $[n] b $[m] c ( as above)
=


Query: «black» $ («hole» | «nebulae») ==> «black» $ «hole» | «black» $ «nebulae»

|

&

&

&

&

«black»

«hole»

«black»

«nebulae» pos(«hole»)-pos(«nebulae»)=1

pos(«hole»)-pos(«black»)=1


Example
Query: «close» $ «galaxies» After dictionary: «close» $ («m33» | («andromeda» $ «nebulae» | («magellanic» & «clouds»)) Phrase: «close» $ «m33» | ( «close» $ («andromeda» $ «nebulae» )) | ( «magellanic» & «clouds» & ( «close» $ «magellanic» | «close» $ «clouds») )


| $ $

| «m33» &

«close» «close»

$

&

|

«nebulae»

«andromeda»

«magellanic»

«clouds»

$

$

«close»

«magellanic» «close» «clouds»


Phrase Search


Possible extensions






#[n] -- soft $[n], order doesn't important a<$[n] b -- at most n words between operands a $[n]> b -- at least n words between operands And so on ...


Partial Match for GIN Prefix search for a text search Improve performance LIKE '%foo%'
­ ­





It's not a full text search Btree index (text_pattern_ops) can improve


LIKE '%FOO' LIKE 'FOO%'


Partial Match: Wildspeed
Index all permutations of string !
=# select permute('hello'); permute -------------------------------------{hello$,ello$h,llo$he,lo$hel,o$hell} '$' is used for visualization, we use \0 LI LI LI LI KE KE KE KE ' ' ' ' %l%' h%o' %o' h%' => => => => ~ ~ ~ ~ 'l*' 'o$h*' 'o$*' 'h*$


Partial Match: wildspeed
750,000 words, average length is 8 characters, time in ms
| h% | hel% | h%o | %l% | %lll% | %l | %lll | %ll%o |

--------------+------+------+------+-----+-------+-----+-------+-------+ wildspeed | 28.0 | 8.5 | 1.1 | 1.0 | 1.1 | 434 | 8.6 | 415 | 0.7 408 | 426 | 0.7 | 18 | 404 |

Btree/seqscan |

| 407 | 404.0 |

CREATE INDEX ... USING btree (w text_pattern_ops) : CREATE INDEX ... USING gin (w2 wildcard_ops) :

3.175 seconds 1 hour 10 minutes


Prefix search

The popular request for the text search
SEL ECT ' supers tar on pa rty': :tsve ctor @@ 'su per: *' AS yes; ye s --- -t

SEL ECT ' supern ov :1A s ky:2B '::ts vecto r @@ ' supe r:A*' AS y e ye s --- -t


Prefix Search


Based on partial match algorithm in GIN Syntax -- use flag '*'


'abc:*'::tsquery - search documents with words 'abc*'





Prefix search comes for free, no special actions required ! Dictionary API supports prefix flag


Prefix search


tsquery @@ to_tsquery('supernova:a* & stars') Find supernova* in titles

=# select count(*) from papers where fts @@ to_squery('supernova:a* & stars'); count ---- --838 (1 ro w ) =# select count(*) from papers where fts @@ to_tsquery('supernova:a & stars'); count ---- --835 (1 ro w )


Fast approximate statistics


Gevel extension -- GiST/GIN indexes explorer
(http://www.sai.msu.su/~megera/wiki/Gevel)







Fast -- uses only GIN index (no table access) Approximate -- no table access, which contains visibility information, approx. for long posting lists Statistics looks good for mostly read-only data


Fast approximate statistics


Top-5 most frequent words (463,873 docs)

=# SELECT * FROM gin_stat('gin_idx') as t(word text, ndoc int) order by ndoc desc limit 5; word | ndoc --------+-------page | 340858 figur | 240366 use | 148022 model | 134442 result | 129010 (5 rows) Time: 520.714 ms


Fast approximate statistics


gin_stat() vs ts_stat()

=# select * into stat from ts_stat('select fts from papers') order by ndoc desc, nentry desc,word; ...wait.... =# SELECT a.word, b.ndoc as exact, a.estimation as estimation, round ( (a.estimation-b.ndoc)*100.0/a.estimation,2)||'%' as error FROM (SELECT * FROM gin_stat('gin_x_idx') as t(word text, estimation int) order by estimation desc limit 5 ) as a, stat b WHERE a.word = b.word; word | exact | estimation | error --------+--------+------------+------page | 340430 | 340858 | 0.13% figur | 240104 | 240366 | 0.11% use | 147132 | 148022 | 0.60% model | 133444 | 134442 | 0.74% result | 128977 | 129010 | 0.03% (5 rows) Time: 550.562 ms


Middleware for FTS configuration


Dictionaries should be able to specify how interpret their output. For example, dictionary returns (a,b) . Possible interpretations:


(a,b) (a,b) (a,b) (a,b) Etc.

-

> > > >

a a a a

&b |b $[n] b - 'b' follows 'a', n words max #[n] b - soft $ (no order)


Middleware for FTS configuration




Option for dictionary to return also an original word Manage how word is processed by a stack of dictionaries


Stop if recognized -- current behaviour Process and continue -- filters


Use case: accent removal problem (wrong highlighting) -- cannot be solved by using function
to_tsvector(remove_accent(document))


Text Search in 8.4?+


This work is supported by


jfg://networks -- over-blog.net EnterpriseDB

Oleg Bartunov, Teodor Sigaev, Mikhail Prokhorov Moscow University, SAI, Russia