Text search selectivity improvements (was Re: Google Summer of Code 2008) - Mailing list pgsql-hackers
From | Jan Urbański |
---|---|
Subject | Text search selectivity improvements (was Re: Google Summer of Code 2008) |
Date | |
Msg-id | 47E06FE9.8050703@students.mimuw.edu.pl Whole thread Raw |
In response to | Re: Google Summer of Code 2008 (Oleg Bartunov <oleg@sai.msu.su>) |
List | pgsql-hackers |
OK, here's a more detailed description of the FTS selectivity improvement idea: === Write a typanalyze function for column type tsvector ==== The function would go through the tuples returned by the BlockSampler and compute the number of times each distinct lexeme appears inside the tsvectors of those tuples. It would then store the most common lexemes, along with their frequencies in pg_statistics. This will likely require adding a new STATISTIC_KIND_* constant, as with tsvector statistics we won't store the most common values of the tsvector column based on some kind of equality operator, but rather the most common lexemes appearing in those tsvectors. The frequencies will be the fraction of total rows, that contain a particular lexeme in the tsvector column being analyzed. Most frequent lexemes would be stored as text[] in stavalues1 and their frequencies as real[] in stanumbers1. XXXs: - Will looking for most common lexemes in just a few sample rows be enough to do useful selectivity estimation? Maybe the minimum number of rows returned by the sampler should be raised for this kind of stat-gathering. Or maybe it should be documented that it is advisable to SET STATISTICS for a tsvector column to something fairly large if you are to get good results from the planner. - There are typically very few or none at all deletes in tables storing indexed documents. This means that when whe're regularly sampling rows and computing most common lexemes, maybe we shouldn't throw away the previous results? Maybe it would be smart to "merge" previous results with the freshly obtained? If assume no deletes were made between the ANALYZEs we could do the maths and do MCV and frequencies estimates based on that assumption and the previous results. - Right now there seems to be some duplicate code in compute_minimal_stats and compute_scalar_stats, maybe this could be cleaned up as a side effect. The custom typanalyze function would also need to estimate the number of nonnull entries and the average width of the column, perheaps these could be made into separate functions and called from all three places (compute_minimal_stats, compute_scalar_stats, tsvector_typanalyze). - Maybe there are other interesting statistics we could collect for tsvectors, something more fancy than just most common lexemes? === Write a selectivity estimation function for the @@ operator === The function would look at the tsquery and the statistics gathered by the function described earlier and return a selectivity estimation based on them. For example, given SELECT * FROM documents WHERE doc_vector @@ to_tsquery('dog') if the lexeme 'dog' appears among the MCV of doc_vector and has a frequency of 0.7, we would get a 0.7 returned rows estimation. Of course this is a very simple example, it'll be much harder than this. First, the function would have to walk the TSQuery and take it's structure in consideration. For example SELECT * FROM documents WHERE doc_vector @@ to_tsquery('!dog') would have to return a 0.3 estimation (or something more subtle than just 1 - 0.7?). Same goes for other modifiers like &, |. If no lexemes from the tsquery are among MCV, we return an arbitrary 0.001, as it is done currently for all queries. === Deploy these functions === This could at first be deployed as a contrib module, that would define tsvector_typanalyze (or maybe ts_typanalyze, to be consistent with other ts_* functions) and tsvectorsel and update pg_operator and pg_type so tsvector would be ANALYZed and @@ restricted with the new method. So much for the idea, but I might very well have missed some crucial things that'd have to be done in order to pull this off. Comments, suggestions, criticism? Regards, Jan -- Jan Urbanski GPG key ID: E583D7D2 ouden estin
pgsql-hackers by date: