Text search selectivity improvements (was Re: Google Summer of Code 2008) - Mailing list pgsql-hackers

From Jan Urbański
Subject Text search selectivity improvements (was Re: Google Summer of Code 2008)
Date
Msg-id 47E06FE9.8050703@students.mimuw.edu.pl
Whole thread Raw
In response to Re: Google Summer of Code 2008  (Oleg Bartunov <oleg@sai.msu.su>)
List pgsql-hackers
OK, here's a more detailed description of the FTS selectivity
improvement idea:


=== Write a typanalyze function for column type tsvector ====

The function would go through the tuples returned by the BlockSampler
and compute the number of times each distinct lexeme appears inside the
tsvectors of those tuples. It would then store the most common lexemes,
along with their frequencies in pg_statistics.
This will likely require adding a new STATISTIC_KIND_* constant, as with
tsvector statistics we won't store the most common values of the
tsvector column based on some kind of equality operator, but rather the
most common lexemes appearing in those tsvectors.
The frequencies will be the fraction of total rows, that contain a
particular lexeme in the tsvector column being analyzed.
Most frequent lexemes would be stored as text[] in stavalues1 and their
frequencies as real[] in stanumbers1.

XXXs:
- Will looking for most common lexemes in just a few sample rows be
enough to do useful selectivity estimation? Maybe the minimum number of
rows returned by the sampler should be raised for this kind of
stat-gathering. Or maybe it should be documented that it is advisable to
SET STATISTICS for a tsvector column to something fairly large if you
are to get good results from the planner.
- There are typically very few or none at all deletes in tables storing
indexed documents. This means that when whe're regularly sampling rows
and computing most common lexemes, maybe we shouldn't throw away the
previous results? Maybe it would be smart to "merge" previous results
with the freshly obtained? If assume no deletes were made between the
ANALYZEs we could do the maths and do MCV and frequencies estimates
based on that assumption and the previous results.
- Right now there seems to be some duplicate code in
compute_minimal_stats and compute_scalar_stats, maybe this could be
cleaned up as a side effect. The custom typanalyze function would also
need to estimate the number of nonnull entries and the average width of
the column, perheaps these could be made into separate functions and
called from all three places (compute_minimal_stats,
compute_scalar_stats, tsvector_typanalyze).
- Maybe there are other interesting statistics we could collect for
tsvectors, something more fancy than just most common lexemes?

=== Write a selectivity estimation function for the @@ operator ===

The function would look at the tsquery and the statistics gathered by
the function described earlier and return a selectivity estimation based
on them.

For example, given
SELECT * FROM documents WHERE doc_vector @@ to_tsquery('dog')
if the lexeme 'dog' appears among the MCV of doc_vector and has a
frequency of 0.7, we would get a 0.7 returned rows estimation.

Of course this is a very simple example, it'll be much harder than this.
First, the function would have to walk the TSQuery and take it's
structure in consideration. For example
SELECT * FROM documents WHERE doc_vector @@ to_tsquery('!dog')
would have to return a 0.3 estimation (or something more subtle than
just 1 - 0.7?). Same goes for other modifiers like &, |.

If no lexemes from the tsquery are among MCV, we return an arbitrary
0.001, as it is done currently for all queries.

=== Deploy these functions ===

This could at first be deployed as a contrib module, that would define
tsvector_typanalyze (or maybe ts_typanalyze, to be consistent with other
ts_* functions) and tsvectorsel and update pg_operator and pg_type so
tsvector would be ANALYZed and @@ restricted with the new method.

So much for the idea, but I might very well have missed some crucial
things that'd have to be done in order to pull this off. Comments,
suggestions, criticism?

Regards,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: [COMMITTERS] pgsql: Enable probes to work with Mac OS X Leopard and other OSes that
Next
From: Artem Yazkov
Date:
Subject: fast count(*) through statistics collector