
Re: Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers

From	Heikki Linnakangas
Subject	Re: Parallel tuplesort (for parallel B-Tree index creation)
Date	September 8, 2016 00:36:29
Msg-id	74573e0f-6267-4483-2b45-438a08e2c4cf@iki.fi
In response to	Re: Parallel tuplesort (for parallel B-Tree index creation) (Peter Geoghegan <pg@heroku.com>)
Responses	Re: Parallel tuplesort (for parallel B-Tree index creation) (Peter Geoghegan <pg@heroku.com>)
List	pgsql-hackers


On 09/07/2016 09:17 AM, Peter Geoghegan wrote:
> On Tue, Sep 6, 2016 at 11:09 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> The big picture here is that you can't only USEMEM() for tapes as the
>>> need arises for new tapes as new runs are created. You'll just run a
>>> massive availMem deficit, that you have no way of paying back, because
>>> you can't "liquidate assets to pay off your creditors" (e.g., release
>>> a bit of the memtuples memory). The fact is that memtuples growth
>>> doesn't work that way. The memtuples array never shrinks.
>>
>>
>> Hmm. But memtuples is empty, just after we have built the initial runs. Why
>> couldn't we shrink, i.e. free and reallocate, it?
>
> After we've built the initial runs, we do in fact give a FREEMEM()
> refund to those tapes that were not used within beginmerge(), as I
> mentioned just now (with a high workMem, this is often the great
> majority of many thousands of logical tapes -- that's how you get to
> wasting 8% of 5GB of maintenance_work_mem).

Peter and I chatted over IM about this. Let me try to summarize the
problems, and my plan:

1. When we start to build the initial runs, we currently reserve memory 
for tape buffers, maxTapes * TAPE_BUFFER_OVERHEAD. But we only actually 
need the buffers for tapes that are really used. We "refund" the buffers
for the unused tapes after we've built the initial runs, but that budget
is still wasted while we build the initial runs. We never actually
allocated the memory, but we could have used it for other things.
Peter's solution to this was to put a cap on maxTapes.

2. My observation is that during the build-runs phase, you only actually 
need those tape buffers for the one tape you're currently writing to. 
When you switch to a different tape, you could flush and free the 
buffers for the old tape. So reserving maxTapes * TAPE_BUFFER_OVERHEAD
is excessive; 1 * TAPE_BUFFER_OVERHEAD would be enough. logtape.c
doesn't have an interface for doing that today, but it wouldn't be hard 
to add.

3. If we do that, we'll still have to reserve the tape buffers for all 
the tapes that we use during merge. So after we've built the initial 
runs, we'll need to reserve memory for those buffers. That might require 
shrinking memtuples. But that's OK: after building the initial runs, 
memtuples is empty, so we can shrink it.
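
To make the accounting in points 1-3 concrete, here is a rough sketch. It
is not code from tuplesort.c or from any patch: the function names are
hypothetical, the TAPE_BUFFER_OVERHEAD value is only an assumed
placeholder, and work_mem here is just a byte budget passed in by the
caller.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed placeholder value; the real constant lives in tuplesort.c. */
#define TAPE_BUFFER_OVERHEAD ((size_t) 3 * 8192)

/* Point 1: today, buffers are reserved for every possible tape up front. */
static size_t
reserve_build_current(size_t max_tapes)
{
    return max_tapes * TAPE_BUFFER_OVERHEAD;
}

/* Point 2: proposed; only the tape currently being written needs buffers. */
static size_t
reserve_build_proposed(void)
{
    return 1 * TAPE_BUFFER_OVERHEAD;
}

/* Point 3: the merge needs buffers for every tape that was actually used. */
static size_t
reserve_merge(size_t used_tapes, size_t max_tapes)
{
    assert(used_tapes <= max_tapes);
    return used_tapes * TAPE_BUFFER_OVERHEAD;
}

/*
 * Point 3, continued: memtuples is empty once the initial runs are built,
 * so it can be shrunk to whatever is left after the merge reservation.
 */
static size_t
memtuples_budget_for_merge(size_t work_mem, size_t used_tapes,
                           size_t max_tapes)
{
    size_t need = reserve_merge(used_tapes, max_tapes);

    return (need >= work_mem) ? 0 : work_mem - need;
}
```

The difference between reserve_build_current(maxTapes) and
reserve_build_proposed() is the memory that would become available to
memtuples while building the initial runs.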

- Heikki
