Re: GIN improvements part 1: additional information - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: GIN improvements part 1: additional information
Date
Msg-id CAPpHfduXkYv4uUHHFrCntNQD=JyB=xQXtutXeY_KYs6CBLJN0w@mail.gmail.com
Whole thread Raw
In response to Re: GIN improvements part 1: additional information  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: GIN improvements part 1: additional information
List pgsql-hackers
On Thu, Oct 10, 2013 at 3:57 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 09.10.2013 02:04, Tomas Vondra wrote:
On 8.10.2013 21:59, Heikki Linnakangas wrote:
On 08.10.2013 17:47, Alexander Korotkov wrote:
Hi, Tomas!

On Sun, Oct 6, 2013 at 3:58 AM, Tomas Vondra<tv@fuzzy.cz>   wrote:

I've attempted to rerun the benchmarks tests I did a few weeks ago, but
   I got repeated crashes when loading the data (into a table with
tsvector+gin index).

Right before a crash, theres this message in the log:

     PANIC:  not enough space in leaf page!


Thanks for testing. Heikki's version of patch don't works for me too on
even much more simplier examples. I can try to get it working if he
answer
my question about GinDataLeafPageGetPostingList* macros.

The new macros in that patch version were quite botched. Here's a new
attempt.

Nope, still the same errors :-(

PANIC:  not enough space in leaf page!
LOG:  server process (PID 29722) was terminated by signal 6: Aborted
DETAIL:  Failed process was running: autovacuum: ANALYZE public.messages

I've continued hacking away at the patch, here's yet another version. I've done a lot of cleanup and refactoring to make the code more readable (I hope). I'm not sure what caused the panic you saw, but it's probably fixed now.  Let me know if not.

Some notable changes since my last patch version:

* I changed the ginbtree interface so that isEnoughSpace method is no more. Instead, placeToPage returns false without doing anything if the item doesn't fit. It was slightly simpler this way when working with the compressed posting lists, as placeToPage had to calculate tuple sizes anyway to decide how many items fit on the page.

* I rewrote incrUpdateItemIndexes(). It now operates in two stages: 1. adjust the pageOffsets and 2. split the bin. I found it more readable this way.

* I changed the WAL format of insert records so that there is a separate struct for data-leaf, data-internal, and entry pages. The information recorded for each of those was so different that it was confusing to cram them all into the same struct.


I'm going to step back a bit from the details of the patch now. I think there's enough work for you, Alexander, until the next commitfest:

* Try to make the code also work with the old page format, so that you don't need to REINDEX after pg_upgrade.

* Documentation. The new compressed posting list format needs to be explained somewhere. At the top of ginpostinglist.c would be a good place. The README should be improved too.

* Fix whatever I broke (sorry)

Are we committed to go ahead with this patch in 9.4 timeframe, in one form or another? If we are, then I'd like to start committing refactoring patches that just move code around in preparation for the Patch, to make the review of the core of the patch easier. For example, merging the isEnoughSpace and placeToPage methods in the ginbtree interface. Without the patch, it's unnecessary code churn - it won't do any harm but it won't do any good either. I'm pretty confident that this patch will land in the 9.4 timeframe, so barring objections I'll start committing such refactorings.

Attached version of patch is debugged. I would like to note that number of bugs was low and it wasn't very hard to debug. I've rerun tests on it. You can see that numbers are improved as the result of your refactoring.

         event         |     period
-----------------------+-----------------
 index_build           | 00:01:45.416822
 index_build_recovery  | 00:00:06
 index_update          | 00:05:17.263606
 index_update_recovery | 00:01:22
 search_new            | 00:24:07.706482
 search_updated        | 00:26:25.364494
(6 rows)

     label      | blocks_mark
----------------+-------------
 search_new     |   847587636
 search_updated |   881210486
(2 rows)

     label     |   size
---------------+-----------
 new           | 419299328
 after_updates | 715915264
(2 rows)

Beside simple bugs, there was a subtle bug in incremental item indexes update. I've added some more comments including ASCII picture about how item indexes are shifted.

Now, I'm trying to implement support of old page format. Then we can decide which approach to use.

------
With best regards,
Alexander Korotkov.
Attachment

pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Re: Cube extension point support // GSoC'13
Next
From: Alexander Korotkov
Date:
Subject: Re: Cube extension point support // GSoC'13