Re: GIN improvements part2: fast scan - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: GIN improvements part2: fast scan
Date
Msg-id CAPpHfduJZiJAiR7qfOt2mH6kNbuBoQMbYJb9+7LgEb5pv53eHA@mail.gmail.com
Whole thread
In response to Re: GIN improvements part2: fast scan  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: GIN improvements part2: fast scan
List pgsql-hackers
On Wed, Mar 12, 2014 at 8:29 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 03/12/2014 12:09 AM, Tomas Vondra wrote:
Hi all,

a quick question that just occured to me - do you plan to tweak the cost
estimation fot GIN indexes, in this patch?

IMHO it would be appropriate, given the improvements and gains, but it
seems to me gincostestimate() was not touched by this patch.

Good point. We have done two major changes to GIN in this release cycle: changed the data page format and made it possible to skip items without fetching all the keys ("fast scan"). gincostestimate doesn't know about either change.

Adjusting gincostestimate for the more compact data page format seems easy. When I hacked on that, I assumed all along that gincostestimate doesn't need to be changed as the index will just be smaller, which will be taken into account automatically. But now that I look at gincostestimate, it assumes that the size of one item on a posting tree page is a constant 6 bytes (SizeOfIptrData), which is no longer true. I'll go fix that.

Adjusting for the effects of skipping is harder. gincostestimate needs to do the same preparation steps as startScanKey: sort the query keys by frequency, and call consistent function to split the keys intao "required" and "additional" sets. And then model that the "additional" entries only need to be fetched when the other keys match. That's doable in principle, but requires a bunch of extra code.

Alexander, any thoughts on that? It's getting awfully late to add new code for that, but it sure would be nice somehow take fast scan into account.

Preparation we do in startScanKey requires knowledge of estimate size of posting lists/trees. We do this estimate by traversal to leaf pages. I think gincostestimate is expected to be way more cheap. So, we probably need so more rough estimate there, don't we?

------
With best regards,
Alexander Korotkov.

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: The case against multixact GUCs
Next
From: Robert Haas
Date:
Subject: Re: COPY table FROM STDIN doesn't show count tag