Документ взят из кэша поисковой машины. Адрес оригинального документа : http://www.sai.msu.su/~megera/postgres/talks/fts_pgsql_tit.pdf
Дата изменения: Thu Apr 18 15:42:51 2013
Дата индексирования: Mon Apr 11 08:01:13 2016
Кодировка:

Поисковые слова: http astrokuban.info astrokuban
"" PostgreSQL. ( )
.. . .., . .. , , , e-mail: obartunov@gmail.com
, , , , . -, , , . , "" , , , , , , .. , PostgreSQL, , . , , , Sphinx.

, - , . , , , , - "Google", - "", "". , , , . "" . - http-, URL , , . . , , , , , , , .. , , . , , . , , , . , , , , , . , , . " " (Hidden Web), . "" . ? . , "" .


-, (.doc, .pdf, .ps, ...). , SQL . , , . ("bag of words"), , .. , . , . SELECT m.title ||' '|| m.author ||' '|| FROM messages m, docs d WHERE m.id = d.id and m.id = 12; m.abstract ||' '|| d.body as document

, . , , , . , , coalesce(), , NULL. , , . "" "", , "" , . , , . - , , , , , -, , .

PostgreSQL
, PostgreSQL [PGSQL] . , , LIKE, ILIKE, ~, ~*. , , * , , 'satisfies' 'satisfy' . , OR , , , ! * () , , . * - , . , ( ) , tsearch2 PostgreSQL, 8.3. , , , , , , . , " ",


, ,

. ,

, , , TRUE, - FALSE.

PostgreSQL ( ) . , , , , . PostgreSQL , . , , , PostgreSQL. PostgreSQL , , , , , btree , . , , , , . (GiST, GIN, SP-GiST) , , -, , , . , , , . - tsvector, , tsquery - , " " @@ GiST ( ) [GiST] GIN ( ) [GIN]. tsvector, , (), . =# select 'cat & rat':: tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector; ?column? ---------t =# select 'fat & cow':: tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector; ?column? ---------f , * to_tsvector tsvector =# select to_tsvector('a fat cat sat on a mat - it ate a fat rats'); to_tsvector ----------------------------------------------------'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4 * to_tsquery - tsquery =# select to_tsquery('fat & cats'); to_tsquery


--------------'fat' & 'cat' , , . . , "" . , , , . , -, . , , , . 23 , SQL: =# select * from ts_token_type('default'); tokid | alias | description -------+-----------------+-----------------------------------------1 | asciiword | Word, all ASCII 2 | word | Word, all letters 3 | numword | Word, letters and digits 4 | email | Email address 5 | url | URL 6 | host | Host 7 | sfloat | Scientific notation 8 | version | Version number 9 | hword_numpart | Hyphenated word part, letters and digits 10 | hword_part | Hyphenated word part, all letters 11 | hword_asciipart | Hyphenated word part, all ASCII 12 | blank | Space symbols 13 | tag | XML tag 14 | protocol | Protocol head 15 | numhword | Hyphenated word, letters and digits 16 | asciihword | Hyphenated word, all ASCII 17 | hword | Hyphenated word, all letters 18 | url_path | URL path 19 | file | File or path name 20 | float | Decimal notation 21 | int | Signed integer 22 | uint | Unsigned integer 23 | entity | XML entity (23 rows) - , "", , , , , , . "" , . - , , '' '', '', '', '' .. '/usr/local/bin', '/usr/ local/share/../bin', '/usr/local/./bin/' '/usr/local/bin', , '13 & ' `XIII '. (, ,....) , , 'footbalklubber' '( football & klubber ) | ( foot & ball & clubber )'. (CREATE TEXT SEARCH CONFIGURATION), . 10


PostgreSQL , pg_catalog. , GUC default_text_search_config, - ALTER DATABASE. , , SQL (CREATE TEXT SEARCH DICTIONARY, CREATE TEXT SEARCH PARSER). psql (\dF, \dFd, \dFp, \dFt) , , , . ts_debug(configuration, text), ,
=# select "alias","token","dictionaries","lexemes" from ts_debug('english', 'as 12 cats'); alias | token | dictionaries | lexemes -----------+-------+----------------+--------asciiword | as | {english_stem} | {} blank | | {} | uint | 12 | {simple} | {12} blank | | {} | asciiword | cats | {english_stem} | {cat} (5 rows)

, 'as' english_stem, - , '12' 'cats' , . ts_lexize() , : =# select ts_lexize('russian_stem',''); ts_lexize ----------{} ( ts_rank, ts_rank_cd) ( ts_headline), .
apod=# select ts_headline(body,q, 'StartSel=<,StopSel=>,MaxWords=10,MinWords=5'), rank from ( select body,q, rank_cd(fts,q) as rank from apod, to_tsquery('supernovae & x-ray') q where fts @@ q order by rank desc limit 5 ) as foo; headline | rank_cd ----------------------------------------------------------------------+-------- remnant E0102-72, however, is giving astronomers a clue | 1.59087 explosion. The picture was taken in - | 1.47733 - glow is produced by multi-million degree | 1.34823 - emitted by this shockwave made by a telescope | 1.14318 - glow. Pictured is the | 1.08116 (5 rows)

C PostgreSQL: * , . * SQL . 10 . * , API. 23 . * , -. - ispell, snowball . , -


thesaurus, synonym. API . * , , UTF-8. . * , , . ­ GiST , GIN . , . * . * , . * , . PostgreSQL [FTSDOC]. « » ?

.
, . , , `black cat' `cat black' . . BEFORE $n, , n, .., a $[n] b == a & b & ( i,j : pos(b)i ­ pos(a)j = n) PostgreSQL ( ). [ALGEBRAFTS].

.
PostgreSQL . , , : 1) , ACID, , , .., , , ( ), ( ). 2) , , - . GIN-. PostgreSQL, (GIN-) . , . (~7 ) avito.ru


8 42 Sphinx 38 . vs 26 . . PostgreSQL , .

38

, . 20 . , ,

156676 Wikipedia articles, , 'title'. =# explain analyze SELECT docid, ts_rank(text_vector, to_tsquery('english', 'title')) AS rank FROM ti2 WHERE text_vector @@ to_tsquery('english', 'title') ORDER BY rank DESC LIMIT 3; :
Limit (cost=8087.40..8087.41 rows=3 width=282) (actual time=433.750..433.752 rows=3 loops=1) -> Sort (cost=8087.40..8206.63 rows=47692 width=282) (actual time=433.749..433.749 rows=3 loops=1) Sort Key: (ts_rank(text_vector, '''titl'''::tsquery)) Sort Method: top-N heapsort Memory: 25kB -> Bitmap Heap Scan on ti2 (cost=529.61..7470.99 rows=47692 width=282) (actual time=15.094..423.452 rows=47855 loops=1) Recheck Cond: (text_vector @@ '''titl'''::tsquery) -> Bitmap Index Scan on ti2_index (cost=0.00..517.69 rows=47692 width=0) (actual time=13.736..13.736 rows=47855 loops=1) Index Cond: (text_vector @@ '''titl'''::tsquery) Total runtime: 433.787 ms

:
=# explain analyze SELECT docid, ts_rank(text_vector, to_tsquery('english', 'title')) AS rank FROM ti2 WHERE text_vector @@ to_tsquery('english', 'title') ORDER BY text_vector>< plainto_tsquery('english','title') LIMIT 3; Limit (cost=20.00..21.65 rows=3 width=282) (actual time=18.376..18.427 rows=3 loops=1) -> Index Scan using ti2_index on ti2 (cost=20.00..26256.30 rows=47692 width=282) (actual time=18.375..18.425 rows=3 loops=1) Index Cond: (text_vector @@ '''titl'''::tsquery) Order By: (text_vector >< '''titl'''::tsquery) Total runtime: 18.511 ms !!!!

(): 1. 2. 3. 4. 5. 3

: 1. 2. 3. 3 - GIN-, , ,


, . , , . PostgreSQL. [FTSMSEC].

(BNF, http://en.wikipedia.org/wiki/Backus-Naur_Form)
::= { } | "'"{ }"'" ::= | "\" "'" /* single quote */ | "'" "'" /* single quote */ | ::= { } ::= A|B|C|D ::= { } | /* empty */ ::= [ ":" ] ::= { "," } ::= [ ] ::= | /* empty */ ::= "&" | "|" | "!" | "(" ")" | [:] ::= "*" | { }[ "*" ] : tsvector: 'star':1B,3A,15,26B,32B,44 'supernovae':2A,25B,31B 'star':1B,3A,15,26B,32B,44 'supernovae':25B,31B tsquery: 'supernovae':A & 'star':A* . @@ , , . , . ( 1- tsvector)


, , , . , , ( -), , . - GiST [GIST] GIN [GIN].

GIN , - , , - , . , , PostgreSQL. , . , , ( , ) . , , N N . . GIN , , , . , GiST "" , .. , , , . RD-Tree (russian doll, ), GiST. - , '0'. - , '1'. . superimposed coding bitwise OR, . word signature ---------------w1 -> 01000000 w2 -> 00010000 w3 -> 10000000 ---------------------11010000 , '11010000' , w1,w2,w3. , . , c . . , . '1' , , . , '1' '1' , , , .


. . 11010000 00000001 01000000 01010000 c Q1, Q2, Q3,

Q2 w1 , , , Q3 - (false drop), , . , , .., '1', . , (signature file), , . RD-Tree R-Tree, . RD-Tree [RDTREE].

ROOT 11011011 Internal nodes: Leaves: 11011001 | 11010000,11010001,11011000 10010011 | 10010010,10010001

, , , , .., '1', , . , '1', , .., , . "false drops", .., , , "heap" () . (lossiness), GiST-, - , . explain analyze (100,000 integer[]) intarray, RD-Tree. =# explain analyze select id from tt where v2 @> '{2}'::int[]; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------Bitmap Heap Scan on tt (cost=21.05..398.87 rows=100 width=4) (actual time=54.673..1001.433 rows=735 loops=1) Recheck Cond: (v2 @> '{2}'::integer[]) Rows Removed by Index Recheck: 11192 -> Bitmap Index Scan on gist_tt_idx2 (cost=0.00..21.03 rows=100 width=0) (actual time=49.550..49.550 rows=11927 loops=1) Index Cond: (v2 @> '{2}'::integer[]) Total runtime: 1001.904 ms GIN-, lossy.


=# explain analyze

select id from tt where v1 @> '{2}'::integer[]; QUERY PLAN -----------------------------------------------------------------------------------------------------------------------Bitmap Heap Scan on tt (cost=20.77..398.59 rows=100 width=4) (actual time=0.480..1.311 rows=735 loops=1) Recheck Cond: (v1 @> '{2}'::integer[]) -> Bitmap Index Scan on gin_tt_idx1 (cost=0.00..20.75 rows=100 width=0) (actual time=0.284..0.284 rows=735 loops=1) Index Cond: (v1 @> '{2}'::integer[]) Total runtime: 1.416 ms , GiST (gist_tt_idx2) 11927 , 735 . (Recheck) 11927 . GiST- ! , , tsvector-, TOAST_INDEX_TARGET , - 512 . . , , , , , . GiST-, Gevel [GEVEL], . gist_idx_50 , . , tsvector-, , - . arxiv=# select gist_stat('gist_idx_90'); gist_stat -------------------------------------------Number of levels: 4 Number of pages: 18296 Number of leaf pages: 17496 Number of tuples: 435661 Number of invalid tuples: 0 Number of leaf tuples: 417366 Total size of tuples: 124776048 bytes Total size of leaf tuples: 119803816 bytes Total size of index: 149880832 bytes -- leaf node arxiv=# select * from gist_print('gist_idx_90') as t(level int,valid bool, fts gtsvector) where level =4; level | valid | fts -------+-------+-------------------------------4|t | 130 true bits, 1886 false bits 4|t | 95 unique words 4|t | 33 unique words 4|t | 77 unique words 4|t | 68 unique words 4|t | 86 unique words 4|t | 77 unique words 4|t | 51 unique words 4|t | 122 unique words 4|t | 127 true bits, 1889 false bits 4|t | 105 unique words 4|t | 170 true bits, 1846 false bits 4|t | 77 unique words 4|t | 121 true bits, 1895 false bits .................................... 4|t | 61 unique words


(417366 rows) -- internal node arxiv=# select * from gist_print('gist_idx_90') as t(level int, valid bool, fts gtsvector) where level =3; level | valid | fts -------+-------+-------------------------------3|t | 852 true bits, 1164 false bits 3|t | 861 true bits, 1155 false bits 3|t | 858 true bits, 1158 false bits 3|t | 872 true bits, 1144 false bits 3|t | 858 true bits, 1158 false bits 3|t | 855 true bits, 1161 false bits 3|t | 853 true bits, 1163 false bits 3|t | 857 true bits, 1159 false bits .................................................. 3|t | 782 true bits, 1234 false bits 3|t | 773 true bits, 1243 false bits (17496 rows)

PostgreSQL (05-07-90225), EnterpriseDB PostgreSQL Development Fund, Mannheim University, jfg:networks, Georgia Public Library Service, .


[PGSQL], " PostgreSQL", ., http://www.sai.msu.su/~megera/postgres/talks/what_is_postgresql.html, , http://www.postgresql.org [GIST], " PostgreSQL GiST", ., . , http://www.sai.msu.su/~megera/postgres/talks/gist_tutorial.html [RDTREE], "THE RD-TREE: AN INDEX STRUCTURE FOR SETS", Joseph M. Hellerstein, http://epoch.cs.berkeley.edu/postgres/papers/UW-CS-TR-1252.pdf [GIN], "Gin for PostgreSQL", http://www.sai.msu.su/~megera/wiki/Gin, "GIN Presentation on PostgreSQL Anniversary Summit, 2006, http://www.sigaev.ru/gin/Gin.pdf [GEVEL], http://www.sai.msu.su/~megera/wiki/Gevel [FTSPGCON2007], advanced lecture "Full-Text Search in PostgreSQL", O.Bartunov, http://www.sai.msu.su/~megera/postgres/talks/fts-pgcon2007.pdf [BLOOM], B.H. Bloom, Space/Time Trade-offs in Hash Coding with Allowable Errors. Communications of the ACM 13(7), 422-426 (1970). [RDTREE], Joseph M. Hellerstein, Avi Pfeffer, THE RD-TREE: An Index Structure for Sets, http://db.cs.berkeley.edu/papers/UW-CS-TR-1252.pdf [ALGEBRAFTS] «Algebra for full-text queries», O.Bartunov, T.Sigaev,


http://www.sai.msu.su/~megera/postgres/talks/algebra-fts.pdf [FTSMSEC] "Full-text search in PostgreSQL in milliseconds», O.Bartunov, A.Korotkov, http://www.sai.msu.su/~megera/postgres/talks/Full-text%20search%20in%20PostgreSQL%20in%20millisecondsextended-version.pdf [FTSDOC] «Full Text Search», http://www.postgresql.org/docs/current/static/textsearch.html