Re: Proposal: speeding up GIN build with parallel workers - Mailing list pgsql-hackers

From: Heikki Linnakangas
Subject: Re: Proposal: speeding up GIN build with parallel workers
Date:
Msg-id: 0a0df4f6-66bb-aeaf-c7a7-60d6d4abaabb@iki.fi
In response to: Re: Proposal: speeding up GIN build with parallel workers (Jeff Janes <jeff.janes@gmail.com>)
Responses: Re: Proposal: speeding up GIN build with parallel workers (Michael Paquier <michael.paquier@gmail.com>)
List: pgsql-hackers
On 01/17/2016 10:03 PM, Jeff Janes wrote:
> On Fri, Jan 15, 2016 at 3:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Fri, Jan 15, 2016 at 2:38 PM, Constantin S. Pan <kvapen@gmail.com> wrote:
>>> I have a draft implementation which divides the whole process between
>>> N parallel workers, see the patch attached. Instead of a full scan of
>>> the relation, I give each worker a range of blocks to read.
>>
>> I am currently working on a patch that allows B-Tree index builds to
>> be performed in parallel. I think I'm a week or two away from posting
>> it.
>>
>> Even without parallelism, wouldn't it be better if GIN indexes were
>> built using tuplesort? I know way way less about the gin am than the
>> nbtree am, but I imagine that a prominent cost for GIN index builds is
>> constructing the main B-Tree (the one that's constructed over key
>> values) itself. Couldn't tuplesort.c be adapted to cover this case?
>> That would be much faster in general, particularly with the recent
>> addition of abbreviated keys, while also leaving a clear path forward
>> to performing the build in parallel.
>
> I think it would take a lot of changes to tuplesort to make this an
> almost-always win.
>
> In the general case each GIN key occurs in many tuples, and the
> in-memory rbtree is good at compressing the tid list for each key to
> maximize the amount of data that can be in memory (although perhaps it
> could be even better, as it doesn't use varbyte encoding on the tid
> list the way the posting lists on disk do--it could do so in the bulk
> build case, where it receives the tids in order, but not feasibly in
> the pending-list clean-up case, where it doesn't get them in order).
>
> When I was testing building an index on a column of text identifiers
> where there were a couple million identifiers occurring a few dozen
> times each, building it as a gin index (using btree_gin so that plain
> text columns could be indexed) was quite a bit faster than building it
> as a regular btree index using tuplesort.  I didn't really
> investigate this difference, but I assume it was due to the better
> memory usage leading to less IO.
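
As an aside, here's a toy sketch of the varbyte delta encoding Jeff
mentions, to show why receiving the tids in order matters. This is
hypothetical illustration code, not the actual on-disk posting list
code; the point is just that ascending tids become small deltas that
usually fit in a byte or two:

/*
 * Toy sketch of varbyte delta encoding for an ascending TID list.
 * Each TID is stored as a delta from its predecessor, 7 bits per
 * byte, with the high bit marking continuation.
 */
#include <stdint.h>
#include <stdio.h>

static int
encode_varbyte(uint64_t delta, unsigned char *out)
{
    int n = 0;

    do
    {
        out[n] = delta & 0x7F;
        delta >>= 7;
        if (delta > 0)
            out[n] |= 0x80;     /* continuation bit */
        n++;
    } while (delta > 0);
    return n;
}

int
main(void)
{
    /* sorted "TIDs" as plain 64-bit ordinals, for simplicity */
    uint64_t tids[] = {1000, 1002, 1003, 1500, 90000};
    uint64_t prev = 0;
    unsigned char buf[10];
    int total = 0;

    for (int i = 0; i < 5; i++)
    {
        total += encode_varbyte(tids[i] - prev, buf);
        prev = tids[i];
    }
    /* 5 TIDs fit in a handful of bytes instead of 5 * 8 = 40 */
    printf("encoded %d bytes\n", total);
    return 0;
}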

I've been wondering about this too for a long time: using tuplesort to 
build GIN indexes. Surely the highly-optimized quicksort+merge code in 
tuplesort.c is faster than maintaining a red-black tree? Or so I 
thought.

I wrote a quick prototype that uses tuplesort.c for the GIN index 
build, and tested it with a 500 MB table of integer arrays, so that the 
sort fit completely in memory. It's a lot slower than the current code. 
It turns out that eliminating the duplicates early is really, really 
important.

So we want to keep the red-black tree to eliminate the duplicates 
early. Or we could add that capability to tuplesort.c, which might 
speed up Sort+Unique and aggregates in general, but that's a big 
effort.
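
To make the idea concrete, here's a toy sketch of duplicate
elimination over sorted (key, tid) pairs. This is hypothetical
illustration code, nothing like the real tuplesort.c interface: after
sorting, equal keys are adjacent, so one linear pass collapses them
into per-key tid lists.

/*
 * Hypothetical sketch of duplicate elimination over sorted
 * (key, tid) pairs, as a dedup-aware tuplesort might do it.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    int32_t  key;
    uint64_t tid;
} GinEntry;

static int
entry_cmp(const void *a, const void *b)
{
    const GinEntry *ea = a;
    const GinEntry *eb = b;

    if (ea->key != eb->key)
        return (ea->key > eb->key) - (ea->key < eb->key);
    return (ea->tid > eb->tid) - (ea->tid < eb->tid);
}

int
main(void)
{
    GinEntry entries[] = {
        {42, 3}, {7, 1}, {42, 1}, {7, 9}, {42, 8}
    };
    int n = 5;

    qsort(entries, n, sizeof(GinEntry), entry_cmp);

    /* one pass: emit each distinct key with its sorted tid list */
    for (int i = 0; i < n;)
    {
        int j = i;

        printf("key %d:", entries[i].key);
        while (j < n && entries[j].key == entries[i].key)
            printf(" %llu", (unsigned long long) entries[j++].tid);
        printf("\n");
        i = j;
    }
    return 0;
}

In a real dedup-aware tuplesort, the collapsing could happen already 
while writing out sorted runs, so the duplicates never reach the tapes.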

However, I still wonder if we shouldn't use a merge approach, like 
tuplesort.c does, when the tree doesn't fit in memory. Currently, when 
the tree is full, we flush it out to the index by inserting all the 
entries. That can get quite expensive, I/O-wise, and it generates more 
WAL than writing each page only once would.

If we flushed the tree to a tape instead, then we could perhaps use the 
machinery that Peter's parallel B-tree patch is adding to tuplesort.c 
to merge the tapes. I'm not sure whether that works out, but I think 
it's worth some experimentation.
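
As a rough illustration of what that merge step could look like, here
is a hypothetical two-way merge of sorted runs of (key, tid) entries,
as if each run had been flushed from a full rbtree. The real thing
would be a k-way merge over tuplesort.c tapes holding compressed
posting lists; this is just the skeleton:

/*
 * Hypothetical sketch of the tape-merge idea: two sorted runs of
 * (key, tid) entries are merged so that each key's postings come
 * out together.
 */
#include <stdint.h>
#include <stdio.h>

typedef struct
{
    int32_t  key;
    uint64_t tid;
} RunEntry;

static void
emit(RunEntry e)
{
    printf("key %d tid %llu\n", e.key, (unsigned long long) e.tid);
}

int
main(void)
{
    /* two sorted runs, as if flushed from two full rbtrees */
    RunEntry run1[] = {{1, 10}, {5, 11}, {9, 12}};
    RunEntry run2[] = {{1, 20}, {2, 21}, {9, 22}};
    int n1 = 3, n2 = 3, i = 0, j = 0;

    while (i < n1 && j < n2)
    {
        if (run1[i].key <= run2[j].key)
            emit(run1[i++]);
        else
            emit(run2[j++]);
    }
    while (i < n1)
        emit(run1[i++]);
    while (j < n2)
        emit(run2[j++]);
    return 0;
}

The attraction is that each key's postings come out of the merge 
together, so each index page could be written just once, with 
correspondingly less WAL.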

> But I do think that this, with aggregation, would be worth it despite
> the amount of work involved.  In addition to automatically benefiting
> from parallel sorts and any other future sort improvements, I think it
> would also generate better indexes.  I have a theory that a problem
> with very large gin indexes is that repeatedly building
> maintenance_work_mem worth of rbtree entries and then merging them to
> disk yields highly fragmented btrees, in which logically adjacent
> key-space does not occupy physically nearby leaf pages, which can then
> cause problems either with access that follows a pattern (like range
> scans, except gin indexes can't really do those anyway) or further
> bulk operations.

Yeah, there's that too.

- Heikki