Re: GIN improvements part 1: additional information - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: GIN improvements part 1: additional information
Msg-id 524DBAEA.9080908@vmware.com
In response to Re: GIN improvements part 1: additional information  (Bruce Momjian <bruce@momjian.us>)
Responses Re: GIN improvements part 1: additional information
List pgsql-hackers
On 23.09.2013 18:35, Bruce Momjian wrote:
> On Sun, Sep 15, 2013 at 01:14:45PM +0400, Alexander Korotkov wrote:
>> On Sat, Jun 29, 2013 at 12:56 PM, Heikki Linnakangas<hlinnakangas@vmware.com>
>> wrote:
>>
>>      There's a few open questions:
>>
>>      1. How are we going to handle pg_upgrade? It would be nice to be able to
>>      read the old page format, or convert on-the-fly. OTOH, if it gets too
>>      complicated, might not be worth it. The indexes are much smaller with the
>>      patch, so anyone using GIN probably wants to rebuild them anyway, sooner or
>>      later. Still, I'd like to give it a shot.
>
> We have broken pg_upgrade index compatibility in the past.
> Specifically, hash and GIN index binary format changed from PG 8.3 to
> 8.4.  I handled it by invalidating the indexes and providing a
> post-upgrade script to REINDEX all the changed indexes.  The user
> message is:
>
>        Your installation contains hash and/or GIN indexes.  These indexes have
>        different internal formats between your old and new clusters, so they
>        must be reindexed with the REINDEX command.  The file:
>
>        ...
>
>        when executed by psql by the database superuser will recreate all invalid
>         indexes; until then, none of these indexes will be used.
>
> It would be very easy to do this from a pg_upgrade perspective.
> However, I know there has been complaints from others about making
> pg_upgrade more restrictive.
>
> In this specific case, even if you write code to read the old file
> format, we might want to create the REINDEX script to allow _optional_
> reindexing to shrink the index files.
>
> If we do require the REINDEX, --check will clearly warn the user that
> this will be required.

It seems we've all but decided that we'll require reindexing GIN indexes 
in 9.4. Let's take the opportunity to change some other annoyances with 
the current GIN on-disk format:

1. There's no explicit "page id" field in the opaque struct, like there
is in other index types. Such a field exists for the benefit of
debugging tools like pg_filedump. So far, we've managed to tell GIN
pages apart from other index types by the fact that the special size of
GIN pages is 8 and it's not using all the high-order bits in the last
byte on the page. But an explicit page id field would be nice, so let's
add that.

2. I'd like to change the way "incomplete splits" are handled.
Currently, WAL recovery keeps track of incomplete splits, and fixes any
that remain at the end of recovery. That concept is slightly broken;
it's not guaranteed that after you've split a leaf page, for example,
you will succeed in inserting the downlink into its parent. You might,
e.g., run out of disk space. To fix that, I'd like to add a flag to the
page header to indicate whether the split has been completed, i.e.
whether the page's downlink has been inserted into the parent, and fix
incomplete splits lazily on the next insert. I made a similar change to
GiST back in 9.1. (Strictly speaking this doesn't require changing the
on-disk format, though.)

3. I noticed that the GIN b-trees, the main key entry tree and the
posting trees, use a slightly different arrangement of the downlink than
our regular nbtree code does. In nbtree, the downlink for a page is the
*low* key of that page, i.e. if the downlink is 10, all the items on that
child page must be >= 10. But in GIN, we store the *high* key in the
downlink, i.e. all the items on the child page must be <= 10. That makes
inserting new downlinks at a page split slightly more complicated. For
example, when splitting a page containing keys 1-10 into 1-5 and
5-10, you need to insert a new downlink with key 10 for the new right
page, and also update the existing downlink to 5. The nbtree code
doesn't require updating existing entries.
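
To make the difference in point 3 concrete, here is a toy sketch in Python (not PostgreSQL code). Parent, split_low_key and split_high_key are hypothetical names made up for illustration; the parent page is modeled as a sorted list of (key, child) pairs, and the right page is assumed to hold keys 6-10 for simplicity. It only counts how many parent entries each convention has to touch on a split.

```python
from dataclasses import dataclass, field


@dataclass
class Parent:
    """Toy internal page: a sorted list of (key, child_name) downlinks."""
    downlinks: list = field(default_factory=list)
    inserts: int = 0   # new downlinks added
    updates: int = 0   # existing downlinks rewritten

    def insert(self, key, child):
        self.downlinks.append((key, child))
        self.downlinks.sort()
        self.inserts += 1

    def update_key(self, child, new_key):
        self.downlinks = sorted(
            (new_key if c == child else k, c) for k, c in self.downlinks
        )
        self.updates += 1


def split_low_key(parent, right, right_keys):
    # nbtree-style assumption: the downlink is the lowest key on the child,
    # so the left page's existing downlink stays valid after the split.
    parent.insert(min(right_keys), right)


def split_high_key(parent, left, right, left_keys, right_keys):
    # GIN-style assumption: the downlink is the highest key on the child.
    # The old downlink still points at the left page but now overstates
    # its upper bound, so it is lowered; the right page gets a new entry.
    parent.update_key(left, max(left_keys))
    parent.insert(max(right_keys), right)


left_keys = list(range(1, 6))    # keys kept on the original (left) page "A"
right_keys = list(range(6, 11))  # keys moved to the new right page "B"

low = Parent(downlinks=[(1, "A")])    # low-key parent before the split
split_low_key(low, "B", right_keys)

high = Parent(downlinks=[(10, "A")])  # high-key parent before the split
split_high_key(high, "A", "B", left_keys, right_keys)

print("low-key :", low.downlinks, "inserts:", low.inserts, "updates:", low.updates)
print("high-key:", high.downlinks, "inserts:", high.inserts, "updates:", high.updates)
```

In this model the low-key parent needs a single insert, while the high-key parent needs an insert plus an update of the existing entry, which is the extra step described above.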

Anything else?

- Heikki


