Re: GIN improvements part 1: additional information - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: GIN improvements part 1: additional information
Date
Msg-id CAPpHfdu2nXxxNCmtMBBiS+3CRdgF8AHCLt2w9Z4s2kKisJSPuA@mail.gmail.com
Whole thread Raw
In response to Re: GIN improvements part 1: additional information  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: GIN improvements part 1: additional information
List pgsql-hackers
On Tue, Dec 10, 2013 at 12:26 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 12/09/2013 11:34 AM, Alexander Korotkov wrote:
On Mon, Dec 9, 2013 at 1:18 PM, Heikki Linnakangas
<hlinnakangas@vmware.com>wrote:

Even if we use varbyte encoding, I wonder if it would be better to treat
block + offset number as a single 48-bit integer, rather than encode them
separately. That would allow the delta of two items on the same page to be
stored as a single byte, rather than two bytes. Naturally it would be a
loss on other values, but would be nice to see some kind of an analysis on
that. I suspect it might make the code simpler, too.

Yeah, I had that idea, but I thought it's not a better option. Will try to
do some analysis.

The more I think about that, the more convinced I am that it's a good idea. I don't think it will ever compress worse than the current approach of treating block and offset numbers separately, and, although I haven't actually tested it, I doubt it's any slower. About the same amount of arithmetic is required in both versions.

Attached is a version that does that. Plus some other minor cleanup.

(we should still investigate using a completely different algorithm, though)

I've thought about different algorithms little more. General problem I see is online update. We need it while it is typically not covered by researches at all. We already have to invent small index in the end of page. Different encoding methods adds more challenges. In general, methods can be classified in two groups:
1) Values aren't aligned by bytes (gamma-codes, PFOR etc.)
2) Multiple values are packed together in small group (simple-9, simple-18)
For the first group of methods when inserting in the middle of the page we would have to do not byte-aligned shift of right part of values. I don't know how expensive is this shift but I expect that it would be much slower than memmove.
When values are packed into small groups, we have to either insert inefficiently encoded value or re-encode whole right part of values.

The option I see is to encode bins between item indexes separately. This still might be slower and require much more complicated maintenance. And also this much complicates further work on storing additional information in GIN.

Any other thoughts?

------
With best regards,
Alexander Korotkov. 

pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: should we add a XLogRecPtr/LSN SQL type?
Next
From: Tom Lane
Date:
Subject: Re: should we add a XLogRecPtr/LSN SQL type?