Re: GIN improvements part2: fast scan - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: GIN improvements part2: fast scan
Date
Msg-id CAPpHfdu18R9AG8-a_fB-eD2vpWtaydidM9TyRVFVY67R=SP7og@mail.gmail.com
Whole thread Raw
In response to Re: GIN improvements part2: fast scan  (Alexander Korotkov <aekorotkov@gmail.com>)
Responses Re: GIN improvements part2: fast scan  (Tomas Vondra <tv@fuzzy.cz>)
Re: GIN improvements part2: fast scan  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers

On Thu, Feb 27, 2014 at 1:07 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Thu, Feb 20, 2014 at 1:48 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 02/09/2014 12:11 PM, Alexander Korotkov wrote:
I've rebased catalog changes with last master. Patch is attached. I've
rerun my test suite with both last master ('committed') and attached
patch ('ternary-consistent').

Thanks!


          method         |       sum
------------------------+------------------
  committed              | 143491.715000001
  fast-scan-11           | 126916.111999999
  fast-scan-light        |       137321.211
  fast-scan-light-heikki | 138168.028000001
  master                 |       446976.288
  ternary-consistent     |       125923.514

I explain regression in last master by change of MAX_MAYBE_ENTRIES from 8
to 4.

Yeah, probably. I set MAX_MAYBE_ENTRIES to 8 in initial versions to make sure we get similar behavior in Tomas' tests that used 6 search terms. But I always felt that it was too large for real queries, once we have the catalog changes, that's why I lowered to 4 when committing. If an opclass benefits greatly from fast scan, it should provide the ternary consistent function, and not rely on the shim implementation.


I'm not sure about decision to reserve separate procedure number for
ternary consistent. Probably, it would be better to add ginConfig method.
It would be useful for post 9.4 improvements.

Hmm, it might be useful for an opclass to provide both, a boolean and ternary consistent function, if the boolean version is significantly more efficient when all the arguments are TRUE/FALSE. OTOH, you could also do a quick check through the array to see if there are any MAYBE arguments, within the consistent function. But I'm inclined to keep the possibility to provide both versions. As long as we support the boolean version at all, there's not much difference in terms of the amount of code to support having them both for the same opclass.

A ginConfig could be useful for many other things, but I don't think it's worth adding it now.


What's the difference between returning GIN_MAYBE and GIN_TRUE+recheck? We discussed that earlier, but didn't reach any conclusion. That needs to be clarified in the docs. One possibility is to document that they're equivalent. Another is to forbid one of them. Yet another is to assign a different meaning to each.

I've been thinking that it might be useful to define them so that a MAYBE result from the tri-consistent function means that it cannot decide if you have a match or not, because some of the inputs were MAYBE. And TRUE+recheck means that even if all the MAYBE inputs were passed as TRUE or FALSE, the result would be the same, TRUE+recheck. The practical difference would be that if the tri-consistent function returns TRUE+recheck, ginget.c wouldn't need to bother fetching the other entries, it could just return the entry with recheck=true immediately. While with MAYBE result, it would fetch the other entries and call tri-consistent again. ginget.c doesn't currently use the tri-consistent function that way - it always fetches all the entries for a potential match before calling tri-consistent, but it could. I had it do that in some of the patch versions, but Tomas' testing showed that it was a big loss on some queries, because the consistent function was called much more often. Still, something like that might be sensible in the future, so it might be good to distinguish those cases in the API now. Note that ginarrayproc is already using the return values like that: in GinContainedStrategy, it always returns TRUE+recheck regardless of the inputs, but in other cases it uses GIN_MAYBE.

Next revision of patch is attached.

In this version opclass should provide at least one consistent function: binary or ternary. It's expected to achieve best performance when opclass provide both of them. However, tests shows opposite :( I've to recheck it carefully.

However, it's not!
This revision of patch completes my test case in 123330 ms. This is slightly faster than previous revision. 

------
With best regards,
Alexander Korotkov.

pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: jsonb and nested hstore
Next
From: Dimitri Fontaine
Date:
Subject: Re: extension_control_path