Home > mailing lists

Statistics for tsvector "wildcards". term* - Mailing list pgsql-hackers

From	Jesper Krogh
Subject	Statistics for tsvector "wildcards". term*
Date	May 29, 2010 09:08:20
Msg-id	4C010399.8050806@krogh.cc Whole thread Raw
List	pgsql-hackers

Tree view

Hi.

There seems to be an "unimplemented" area around getting statistics for
wildcard searches done. Wildcards anchored to the left can be matched
up by the gin-index and the ts_match_vq operator:


testdb=# select to_tsvector('project') @@ to_tsquery('proj:*'); ?column?
---------- t
(1 row)

Searching for project gives me this estimate:
testdb=# explain select id,document_tsvector from efam.reference where 
document_tsvector @@ to_tsquery('projects') order by id limit 50;                                           QUERY PLAN
------------------------------------------------------------------------------------------------- Limit
(cost=0.00..3008.54rows=50 width=76)   ->  Index Scan using reference_pkey on reference  
 
(cost=0.00..3762544.72 rows=62531 width=76)         Filter: (document_tsvector @@ to_tsquery('projects'::text))
(3 rows)

whereas searching for proj:* gives:

testdb=# explain select id,document_tsvector from efam.reference where 
document_tsvector @@ to_tsquery('proj:*') order by id limit 50;                                         QUERY PLAN
--------------------------------------------------------------------------------------------- Limit  (cost=73.56..73.58
rows=6width=76)   ->  Sort  (cost=73.56..73.58 rows=6 width=76)         Sort Key: id         ->  Bitmap Heap Scan on
reference (cost=34.55..73.49 rows=6 
 
width=76)               Recheck Cond: (document_tsvector @@ 
to_tsquery('proj:*'::text))               ->  Bitmap Index Scan on reference_fts_idx  
(cost=0.00..34.54 rows=6 width=0)                     Index Cond: (document_tsvector @@ 
to_tsquery('proj:*'::text))
(7 rows)

There are abouvios challenges in getting statistics for submatches where 
there are no real information
in the pg_stats table, but there will also be a huge amount of cases 
where a fairly reliable
guess can be extracted.

-- 
Jesper

pgsql-hackers by date:

From: Jesper Krogh
Date: 29 May 2010, 07:43:51
Subject: Re: tsvector pg_stats seems quite a bit off.

From: Jan Urbański
Date: 29 May 2010, 10:56:58
Subject: Re: tsvector pg_stats seems quite a bit off.

Statistics for tsvector "wildcards". term* - Mailing list pgsql-hackers

Previous

Next