Re: tsvector pg_stats seems quite a bit off. - Mailing list pgsql-hackers

From: Jesper Krogh
Subject: Re: tsvector pg_stats seems quite a bit off.
Msg-id: 4C020EF0.40007@krogh.cc
In response to: Re: tsvector pg_stats seems quite a bit off. (Jan Urbański <wulczer@wulczer.org>)
List: pgsql-hackers
On 2010-05-29 15:56, Jan Urbański wrote:
> On 29/05/10 12:34, Jesper Krogh wrote:
>> On 2010-05-28 23:47, Jan Urbański wrote:
>>> On 28/05/10 22:22, Tom Lane wrote:
>>> Now I tried to substitute some numbers there, and so assuming the
>>> English language has ~1e6 words H(W) is around 6.5. Let's assume the
>>> statistics target to be 100.
>>>
>>> I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
>>> probably be stopwords, so we will never see them in the input.
>>>
>> I think you should skip the assumption about stop words; users may
>> use something where they are needed in the index, or have a language other
>> than the typical (and they don't seem to influence the math that much).
>>
> Turns out it has nearly linear influence on the bucket width and the
> frequency necessary to survive the final pruning. I put some data in a
> spreadsheet, results below.

How about setting it to "some default" in the first analyze round, but
setting it to the count of MCEs with a frequency of 1 in the subsequent
analyze rounds?

>> Isn't it the same "type" of logic that is used for collecting statistics
>> for "array types", say integer arrays and text arrays?
>>
> AFAIK statistics for everything other than tsvectors are built based on
> the values of whole rows. ts_typanalyze is the only typanalyze function
> that takes the trouble of looping over the actual contents of each cell;
> all the others just compare whole arrays (which means that for a text[]
> field you will probably get a quite useless MCV entry).

In another area I was thinking about modelling a complete tree structure
where I would like to extract complete sub-branches as int[] of the
node-ids in the set and then index them using GIN. That seems like a
"really bad idea" based on the above information. Wouldn't it make sense
to treat array types like the tsvectors?

> The results are attached in a text (CSV) file, to preserve formatting.
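(For readers joining the thread: the "bucket width" and the frequency needed to survive the "final pruning" refer to the lossy counting (LC) algorithm that ts_typanalyze uses to find most-common lexemes. A minimal Python sketch of the textbook Manku-Motwani formulation, just to illustrate the mechanics; the names and example stream are mine, and the real C implementation differs in details:)

```python
import math

def lossy_count(stream, epsilon):
    """Manku-Motwani lossy counting: track items whose frequency may
    exceed epsilon using bounded memory; tracked counts undershoot the
    true count by at most epsilon * N."""
    w = math.ceil(1 / epsilon)      # bucket width
    counts = {}                     # item -> (count, max possible undercount)
    bucket = 1                      # current bucket id
    for n, item in enumerate(stream, start=1):
        if item in counts:
            c, d = counts[item]
            counts[item] = (c + 1, d)
        else:
            counts[item] = (1, bucket - 1)
        if n % w == 0:              # bucket boundary: prune rare items
            counts = {k: (c, d) for k, (c, d) in counts.items()
                      if c + d > bucket}
            bucket += 1
    return counts

def frequent(counts, n, s, epsilon):
    """Final pruning: keep items whose estimated frequency clears the
    support threshold s."""
    return {k for k, (c, d) in counts.items() if c >= (s - epsilon) * n}
```

The stop-word count enters the support threshold s (via the 1/(st + 10)*H(W) expression quoted above), and through s the pruning behaviour, which is where the nearly linear influence Jan measured comes from.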
> Based on them I'd like to propose top_stopwords and error_factor to be 100.

I know it is not perceived as the correct way to do things, but I would
really like to keep the "stop words" in the dataset and have something
that is robust to that. There are two issues behind that wish; one is
that the application becomes more general. I really cannot stop my users
from searching for stop words, and they would expect the "full set" and
not the "empty set" as we get now. The list of stop words is by no means
finite and very much depends on the input data set.

I would try to add the stop words to the dictionary, so they still work
but don't occupy that much space in the actual index. That seems to
solve the same task but with fewer issues for the users and more
generalized code around it.

>> I can "fairly easily" try out patches or do other kinds of testing.
>>
> I'll try to come up with a patch for you to try and fiddle with these
> values before Monday.

Excellent.

testdb=# explain select id from testdb.reference where document_tsvector @@ plainto_tsquery('where') order by id limit 50;
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
                                          QUERY PLAN
---------------------------------------------------------------------------------------------
 Limit  (cost=41.02..41.03 rows=1 width=4)
   ->  Sort  (cost=41.02..41.03 rows=1 width=4)
         Sort Key: id
         ->  Bitmap Heap Scan on reference  (cost=34.50..41.01 rows=1 width=4)
               Recheck Cond: (document_tsvector @@ plainto_tsquery('where'::text))
               ->  Bitmap Index Scan on reference_fts_idx  (cost=0.00..34.50 rows=1 width=0)
                     Index Cond: (document_tsvector @@ plainto_tsquery('where'::text))
(7 rows)

testdb=# select id from testdb.reference where document_tsvector @@ plainto_tsquery('where') order by id limit 50;
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
 id
----
(0 rows)

I would indeed have expected the first 50 rows ordered by id; trivial to
extract.

-- 
Jesper