Re: Relation extension scalability - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Relation extension scalability
Date
Msg-id 27689.1427656904@sss.pgh.pa.us
Whole thread Raw
In response to Relation extension scalability  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: Relation extension scalability  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
Andres Freund <andres@2ndquadrant.com> writes:
> As a quick recap, relation extension basically works like:
> 1) We lock the relation for extension
> 2) ReadBuffer*(P_NEW) is being called, to extend the relation
> 3) smgrnblocks() is used to find the new target block
> 4) We search for a victim buffer (via BufferAlloc()) to put the new
>    block into
> 5) If dirty the victim buffer is cleaned
> 6) The relation is extended using smgrextend()
> 7) The page is initialized

> The problems come from 4) and 5) potentially each taking a fair
> while.

Right, so basically we want to get those steps out of the exclusive lock
scope.

> There's two things that seem to make sense to me:

> First, decouple relation extension from ReadBuffer*, i.e. remove P_NEW
> and introduce a bufmgr function specifically for extension.

I think that removing P_NEW is likely to require a fair amount of
refactoring of calling code, so I'm not thrilled with doing that.
On the other hand, perhaps all that code would have to be touched
anyway to modify the scope over which the extension lock is held.

> Secondly I think we could maybe remove the requirement of needing an
> extension lock alltogether. It's primarily required because we're
> worried that somebody else can come along, read the page, and initialize
> it before us. ISTM that could be resolved by *not* writing any data via
> smgrextend()/mdextend().

I'm afraid this would break stuff rather thoroughly, in particular
handling of out-of-disk-space cases.  And I really don't see how you get
consistent behavior at all for multiple concurrent callers if there's no
locking.

One idea that might help is to change smgrextend's API so that it doesn't
need a buffer to write from, but just has an API of "add a prezeroed block
on-disk and tell me the number of the block you added".  On the other
hand, that would then require reading in the block after allocating a
buffer to hold it (I don't think you can safely assume otherwise) so the
added read step might eat any savings.
        regards, tom lane



pgsql-hackers by date:

Previous
From: Pavel Stehule
Date:
Subject: Re: proposal: row_to_array function
Next
From: James Cloos
Date:
Subject: Re: Rounding to even for numeric data type