Re: Relation extension scalability - Mailing list pgsql-hackers

From	Tom Lane
Subject	Re: Relation extension scalability
Date	March 29, 2015 22:21:54
Msg-id	27689.1427656904@sss.pgh.pa.us
In response to	Relation extension scalability (Andres Freund <andres@2ndquadrant.com>)
Responses	Re: Relation extension scalability (Andres Freund <andres@2ndquadrant.com>)
List	pgsql-hackers

Andres Freund <andres@2ndquadrant.com> writes:
> As a quick recap, relation extension basically works like:
> 1) We lock the relation for extension
> 2) ReadBuffer*(P_NEW) is being called, to extend the relation
> 3) smgrnblocks() is used to find the new target block
> 4) We search for a victim buffer (via BufferAlloc()) to put the new
>    block into
> 5) If dirty the victim buffer is cleaned
> 6) The relation is extended using smgrextend()
> 7) The page is initialized

> The problems come from 4) and 5) potentially each taking a fair
> while.

Right, so basically we want to get those steps out of the exclusive lock
scope.
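
To make the lock-scope point concrete, here is a toy C sketch. It is not
PostgreSQL code: ToyRelation and every toy_* function are hypothetical
stand-ins, and the "lock" is only a flag so that the example can count how
many slow steps ran while it was held. It also glosses over the fact that
a real buffer has to be tagged with the new block number, which is only
known once the lock is taken.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names are PostgreSQL APIs. */
typedef struct ToyRelation
{
    bool     extension_locked;   /* stand-in for the relation extension lock */
    unsigned nblocks;            /* current relation length in blocks */
    unsigned slow_steps_locked;  /* slow steps that ran while the lock was held */
} ToyRelation;

static void toy_lock(ToyRelation *rel)
{
    assert(!rel->extension_locked);
    rel->extension_locked = true;
}

static void toy_unlock(ToyRelation *rel)
{
    assert(rel->extension_locked);
    rel->extension_locked = false;
}

/* Stand-in for steps 4) and 5): pick a victim buffer, clean it if dirty. */
static int toy_acquire_victim_buffer(ToyRelation *rel)
{
    if (rel->extension_locked)
        rel->slow_steps_locked++;   /* record that slow work happened under the lock */
    return 0;                       /* pretend buffer id */
}

/* Ordering from the recap: the victim search happens inside the lock. */
static unsigned toy_extend_as_recapped(ToyRelation *rel)
{
    toy_lock(rel);                            /* 1) */
    unsigned blkno = rel->nblocks;            /* 3) find the new target block */
    (void) toy_acquire_victim_buffer(rel);    /* 4) + 5) slow, under the lock */
    rel->nblocks = blkno + 1;                 /* 6) extend */
    toy_unlock(rel);
    return blkno;                             /* 7) page init omitted */
}

/* Assumed reordering: do the slow steps first, lock only to extend. */
static unsigned toy_extend_reordered(ToyRelation *rel)
{
    (void) toy_acquire_victim_buffer(rel);    /* 4) + 5) outside the lock */
    toy_lock(rel);
    unsigned blkno = rel->nblocks;            /* 3) */
    rel->nblocks = blkno + 1;                 /* 6) */
    toy_unlock(rel);
    return blkno;
}
```

Calling toy_extend_as_recapped() on a zeroed ToyRelation bumps
slow_steps_locked; calling toy_extend_reordered() afterwards leaves the
counter alone while still handing out the next block number.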

> There's two things that seem to make sense to me:

> First, decouple relation extension from ReadBuffer*, i.e. remove P_NEW
> and introduce a bufmgr function specifically for extension.

I think that removing P_NEW is likely to require a fair amount of
refactoring of calling code, so I'm not thrilled with doing that.
On the other hand, perhaps all that code would have to be touched
anyway to modify the scope over which the extension lock is held.

> Secondly I think we could maybe remove the requirement of needing an
> extension lock altogether. It's primarily required because we're
> worried that somebody else can come along, read the page, and initialize
> it before us. ISTM that could be resolved by *not* writing any data via
> smgrextend()/mdextend().

I'm afraid this would break stuff rather thoroughly, in particular
handling of out-of-disk-space cases.  And I really don't see how you get
consistent behavior at all for multiple concurrent callers if there's no
locking.

One idea that might help is to change smgrextend's API so that it doesn't
need a buffer to write from, but just has an API of "add a prezeroed block
on-disk and tell me the number of the block you added".  On the other
hand, that would then require reading in the block after allocating a
buffer to hold it (I don't think you can safely assume otherwise) so the
added read step might eat any savings.
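
A rough illustration of that API shape, as a toy in-memory model in C.
Everything here is an assumption made for the example: ToyStorage,
toy_extend_zeroed() and toy_read_block() are hypothetical names, not smgr
functions, and the fixed-size array merely stands in for a disk that can
fill up.

```c
#include <assert.h>
#include <string.h>

#define TOY_BLCKSZ     8192   /* toy block size */
#define TOY_MAX_BLOCKS 16     /* toy "disk" capacity, to model running out of space */

/* Toy stand-in for on-disk relation storage; not a PostgreSQL structure. */
typedef struct ToyStorage
{
    unsigned char blocks[TOY_MAX_BLOCKS][TOY_BLCKSZ];
    unsigned      nblocks;
} ToyStorage;

/*
 * Hypothetical API: append one prezeroed block and report its number.
 * No caller-supplied buffer is needed.  Returns -1 when the toy disk is
 * full, so the out-of-space case still surfaces at extension time.
 */
static int toy_extend_zeroed(ToyStorage *st)
{
    if (st->nblocks >= TOY_MAX_BLOCKS)
        return -1;
    memset(st->blocks[st->nblocks], 0, TOY_BLCKSZ);
    return (int) st->nblocks++;
}

/*
 * The extra step this design implies: once a buffer has been allocated,
 * the new block has to be read into it rather than assumed to be zeros.
 */
static void toy_read_block(const ToyStorage *st, int blkno, unsigned char *buf)
{
    memcpy(buf, st->blocks[blkno], TOY_BLCKSZ);
}
```

A caller would first ask toy_extend_zeroed() for a block number, then
allocate a buffer at its leisure, then call toy_read_block(); that last
call is the added read whose cost is in question.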

			regards, tom lane
