Re: GIN improvements part2: fast scan - Mailing list pgsql-hackers

From: Alexander Korotkov
Subject: Re: GIN improvements part2: fast scan
Msg-id: CAPpHfds6xKm8CtfhG8zhFtD7WXTNW3RZCgdWiFfiwaQ0qcobLw@mail.gmail.com
In response to: Re: GIN improvements part2: fast scan (Heikki Linnakangas <hlinnakangas@vmware.com>)
List: pgsql-hackers
On Thu, Jan 30, 2014 at 8:38 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 01/30/2014 01:53 AM, Tomas Vondra wrote:
(3) A file with explain plans for 4 queries suffering a ~2x slowdown,
     under both 9.4 master and Heikki's patches, is available
     here:

       http://www.fuzzy.cz/tmp/gin/queries.txt

     All the queries have 6 common words, and the explain plans look
     just fine to me - exactly like the plans for other queries.

     Two things now caught my eye. First, some of these queries actually
     have words repeated - either exactly, like "database & database", or
     in negated form, like "!anything & anything". Second, while
     generating the queries, I use "dumb" frequency, where only exact
     matches count. I.e. "write != written" etc. But the actual number
     of hits may be much higher - for example "write" matches exactly
     just 5% of documents, but using @@ it matches more than 20%.

     I don't know if that's the actual cause though.

I tried these queries with the data set you posted here: http://www.postgresql.org/message-id/52E4141E.60609@fuzzy.cz. The reason for the slowdown is the extra consistent calls it causes. That's expected - the patch certainly does call consistent in situations where it didn't before, and if the "pre-consistent" checks are not able to eliminate many tuples, you lose.

So, what can we do to mitigate that? Some ideas:

1. Implement the catalog changes from Alexander's patch. That ought to remove the overhead, as you only need to call the consistent function once, not "both ways". OTOH, currently we only call the consistent function if there is only one unknown column. If, with the catalog changes, we always call the consistent function even when there are more unknown columns, we might end up calling it even more often.

2. Cache the result of the consistent function (see the sketch after this list).

3. Make the consistent function cheaper. (How? Magic?)

4. Use some kind of a heuristic, and stop doing the pre-consistent checks if they're not effective. Not sure what the heuristic would look like.
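
To illustrate idea 2, here's a minimal sketch of what such caching could look like: a small direct-mapped cache keyed on the check[] vector encoded as a bitmask, so it assumes at most 64 entries. All the names here (ConsistentCache, cached_consistent, consistent_fn) are hypothetical stand-ins, not the actual GIN API.

#include <stdbool.h>
#include <stdint.h>

#define CACHE_SLOTS 64          /* direct-mapped cache, power of two */

typedef struct ConsistentCache
{
    uint64_t    keys[CACHE_SLOTS];  /* bitmask of the check[] values */
    uint8_t     vals[CACHE_SLOTS];  /* 0 = empty, 1 = false, 2 = true */
} ConsistentCache;

/* encode the boolean check[] array as a bitmask (requires nentries <= 64) */
static uint64_t
encode_check(const bool *check, int nentries)
{
    uint64_t    key = 0;

    for (int i = 0; i < nentries; i++)
        if (check[i])
            key |= UINT64_C(1) << i;
    return key;
}

/*
 * Call the (expensive) consistent function, consulting the cache first.
 * consistent_fn stands in for the opclass consistent call.
 */
static bool
cached_consistent(ConsistentCache *cache,
                  bool (*consistent_fn) (const bool *check, int nentries),
                  const bool *check, int nentries)
{
    uint64_t    key = encode_check(check, nentries);
    int         slot = (int) (key % CACHE_SLOTS);
    bool        result;

    if (cache->vals[slot] != 0 && cache->keys[slot] == key)
        return cache->vals[slot] == 2;  /* cache hit */

    result = consistent_fn(check, nentries);
    cache->keys[slot] = key;            /* overwrite on collision */
    cache->vals[slot] = result ? 2 : 1;
    return result;
}

A real version would need to be reset per scan key, and would have to fall back to calling consistent directly when nentries > 64.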

I'm fiddling around with the following idea for such a heuristic:
1) Sort the entries in ascending order by their estimated number of TIDs. Let n be the number of entries.
2) Sequentially call the ternary consistent function with the first m entries set to GIN_FALSE and the remaining n-m set to GIN_MAYBE, increasing m until it returns GIN_FALSE.
3) Decide whether to use fast scan depending on the number of TIDs in the first m entries and the number of TIDs in the remaining n-m entries. Something like number_of_TIDs_in_first_m_entries * k < number_of_TIDs_in_rest_(n-m)_entries, where k is some coefficient.
For sure it's a rough method, but it should be fast. Without a natural implementation of the ternary consistent function the algorithm will be slightly different: only values of m close to n should be checked.
I'm going to implement this heuristic against the latest version of your patch.
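
In rough C, the idea might look something like this (EntryInfo, GinTernaryValue and the consistent callback are simplified stand-ins here, not the actual GIN code, and I assume n <= 64 to keep the sketch short):

#include <stdbool.h>
#include <stdlib.h>

typedef enum { GIN_FALSE, GIN_TRUE, GIN_MAYBE } GinTernaryValue;

typedef struct EntryInfo
{
    double      ntids;          /* estimated number of TIDs for this entry */
} EntryInfo;

/* ascending order by estimated TID count */
static int
cmp_entry_ntids(const void *a, const void *b)
{
    double      da = ((const EntryInfo *) a)->ntids;
    double      db = ((const EntryInfo *) b)->ntids;

    return (da > db) - (da < db);
}

/*
 * Decide whether fast scan is likely to pay off.  "consistent" stands in
 * for the opclass ternary consistent function; k is the coefficient from
 * the inequality above.
 */
static bool
use_fast_scan(EntryInfo *entries, int n,
              GinTernaryValue (*consistent) (const GinTernaryValue *vals, int n),
              double k)
{
    GinTernaryValue vals[64];   /* sketch only: assume n <= 64 */
    double      first_m_tids = 0.0;
    double      rest_tids = 0.0;
    int         m;

    /* 1) sort entries ascending by estimated TID count */
    qsort(entries, n, sizeof(EntryInfo), cmp_entry_ntids);

    /* 2) grow m until dropping the first m entries makes a match impossible */
    for (m = 1; m <= n; m++)
    {
        for (int i = 0; i < n; i++)
            vals[i] = (i < m) ? GIN_FALSE : GIN_MAYBE;
        if (consistent(vals, n) == GIN_FALSE)
            break;
    }
    if (m > n)
        return false;           /* never returned GIN_FALSE: nothing to skip */

    /* 3) compare TID counts of the first m entries against the rest */
    for (int i = 0; i < n; i++)
    {
        if (i < m)
            first_m_tids += entries[i].ntids;
        else
            rest_tids += entries[i].ntids;
    }
    return first_m_tids * k < rest_tids;
}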

------
With best regards,
Alexander Korotkov.
