Re: tsvector pg_stats seems quite a bit off. - Mailing list pgsql-hackers

From: Jesper Krogh
Subject: Re: tsvector pg_stats seems quite a bit off.
Msg-id: 4C020EF0.40007@krogh.cc
In response to: Re: tsvector pg_stats seems quite a bit off. (Jan Urbański <wulczer@wulczer.org>)
List: pgsql-hackers
On 2010-05-29 15:56, Jan Urbański wrote:
> On 29/05/10 12:34, Jesper Krogh wrote:
>> On 2010-05-28 23:47, Jan Urbański wrote:
>>> On 28/05/10 22:22, Tom Lane wrote:
>>> Now I tried to substitute some numbers there, and so assuming the
>>> English language has ~1e6 words H(W) is around 6.5. Let's assume the
>>> statistics target to be 100.
>>>
>>> I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
>>> probably be stopwords, so we will never see them in the input.
>>>
>> I think you should skip the assumption about stop words; users may
>> use something where they are needed in the index, or have a language other
>> than the typical (and they don't seem to influence the math that much).
>>
> Turns out it has nearly linear influence on the bucket width and the
> frequency necessary to survive the final pruning. I put some data in a
> spreadsheet, results below.

How about setting it to "some default" in the first analyze round, but
setting it to the count of MCEs with a frequency of 1 in the subsequent
analyze rounds?

>> Isn't it the same "type" of logic that is used for collecting statistics
>> for "array types", say integer arrays and text arrays?
>>
> AFAIK statistics for everything other than tsvectors are built based on
> the values of whole rows. ts_typanalyze is the only typanalyze function
> that takes the trouble of looping over the actual contents of each cell;
> all the others just compare whole arrays (which means that for a text[]
> field you will probably get a quite useless MCV entry).

In another area I was thinking about modelling a complete tree structure
where I would like to extract complete sub-branches as int[] of the
node-ids in the set and then index them using GIN. That seems like a
"really bad idea" based on the above information. Wouldn't it make sense
to treat array types like the tsvectors?

> The results are attached in a text (CSV) file, to preserve formatting.
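(For readers joining the thread: the "bucket width" and the frequency needed to survive the "final pruning" refer to the lossy counting (LC) algorithm that ts_typanalyze uses to find most-common lexemes. A minimal Python sketch of the textbook Manku-Motwani formulation, just to illustrate the mechanics; the names and example stream are mine, and the real C implementation differs in details:)

```python
import math

def lossy_count(stream, epsilon):
    """Manku-Motwani lossy counting: track items whose frequency may
    exceed epsilon using bounded memory; tracked counts undershoot the
    true count by at most epsilon * N."""
    w = math.ceil(1 / epsilon)      # bucket width
    counts = {}                     # item -> (count, max possible undercount)
    bucket = 1                      # current bucket id
    for n, item in enumerate(stream, start=1):
        if item in counts:
            c, d = counts[item]
            counts[item] = (c + 1, d)
        else:
            counts[item] = (1, bucket - 1)
        if n % w == 0:              # bucket boundary: prune rare items
            counts = {k: (c, d) for k, (c, d) in counts.items()
                      if c + d > bucket}
            bucket += 1
    return counts

def frequent(counts, n, s, epsilon):
    """Final pruning: keep items whose estimated frequency clears the
    support threshold s."""
    return {k for k, (c, d) in counts.items() if c >= (s - epsilon) * n}
```

The stop-word count enters the support threshold s (via the 1/(st + 10)*H(W) expression quoted above), and through s the pruning behaviour, which is where the nearly linear influence Jan measured comes from.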
> Based on them I'd like to propose top_stopwords and error_factor to be 100.

I know it is not perceived as the correct way to do things, but I would
really like to keep the "stop words" in the dataset and have something
that is robust to that. There are two issues behind that wish; one is
that the application becomes more general. I really cannot stop my users
from searching for stop words, and they would expect the "full set" and
not the "empty set" as we get now. The list of stop words is by no means
finite and very much depends on the input data set.

I would try to add the stop words to the dictionary, so they still work
but don't occupy that much space in the actual index. That seems to
solve the same task but with fewer issues for the users and more
generalized code around it.

>> I can "fairly easily" try out patches or do other kinds of testing.
>>
> I'll try to come up with a patch for you to try and fiddle with these
> values before Monday.

Excellent.

testdb=# explain select id from testdb.reference where document_tsvector @@ plainto_tsquery('where') order by id limit 50;
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
                                          QUERY PLAN
---------------------------------------------------------------------------------------------
 Limit  (cost=41.02..41.03 rows=1 width=4)
   ->  Sort  (cost=41.02..41.03 rows=1 width=4)
         Sort Key: id
         ->  Bitmap Heap Scan on reference  (cost=34.50..41.01 rows=1 width=4)
               Recheck Cond: (document_tsvector @@ plainto_tsquery('where'::text))
               ->  Bitmap Index Scan on reference_fts_idx  (cost=0.00..34.50 rows=1 width=0)
                     Index Cond: (document_tsvector @@ plainto_tsquery('where'::text))
(7 rows)

testdb=# select id from testdb.reference where document_tsvector @@ plainto_tsquery('where') order by id limit 50;
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
 id
----
(0 rows)

I would indeed have expected the first 50 rows ordered by id; trivial to
extract.

-- 
Jesper