Re: Extent Locks - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Extent Locks
Msg-id CA+Tgmoaw1kWArh1oEKWWZsQKAn1z6k59jZwrpd+pMbqn4P_X5Q@mail.gmail.com
In response to Re: Extent Locks  (Stephen Frost <sfrost@snowman.net>)
Responses Re: Extent Locks  (Craig Ringer <craig@2ndquadrant.com>)
List pgsql-hackers
On Thu, May 16, 2013 at 9:36 PM, Stephen Frost <sfrost@snowman.net> wrote:
>> What we do have is
>> a relation extension lock, but the size of the segment on disk has
>> nothing to do with that: there's only one for the whole relation, and
>> you hold it when adding a block to the relation.
>
> Yes, which is farrr too small.  I'm certainly aware that the segments on
> disk are dealt with in the storage layer- currently.  My proposal was to
> consider how we might change that, a bit, to allow improved throughput
> when there are multiple writers.
>
> Consider this, for example- when we block on the relation extension
> lock, rather than sit and wait or continue to compete with the other
> threads, simply tell the storage layer to give us a dedicated file to
> work with.  Once we're ready to commit, move that file into place as the
> next segment (through some command to the storage layer), using an
> atomic command to ensure that it either works and doesn't overwrite
> anything, or fails and we try again by moving the segment number up.
>
> We would need to work out, at the storage layer, how to handle cases
> where the file is less than 1G and realize that we should just skip over
> those blocks on disk as being known-to-be-empty.  Those blocks would
> also be then put on the free space map and used for later processes
> which need to find somewhere to put new data, etc.
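
For concreteness, the atomic move-into-place step described above could
lean on link(2), which fails with EEXIST if the target already exists,
giving the either-succeeds-or-retry behavior Stephen wants.  A rough
sketch only; the function and path handling are made up, not md.c's
actual API:

    /*
     * Sketch: atomically publish a privately-built file as relation
     * segment "segno".  link(2) fails with EEXIST if another backend
     * got there first, in which case the caller retries with segno + 1.
     */
    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    static int
    publish_segment(const char *privfile, const char *relpath, int segno)
    {
        char    segpath[1024];

        if (segno == 0)
            snprintf(segpath, sizeof(segpath), "%s", relpath);
        else
            snprintf(segpath, sizeof(segpath), "%s.%d", relpath, segno);

        if (link(privfile, segpath) == 0)
        {
            unlink(privfile);   /* success: drop the private name */
            return segno;
        }
        if (errno == EEXIST)
            return -1;          /* lost the race; caller tries segno + 1 */
        return -2;              /* real error */
    }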

I think it's pretty unrealistic to suppose that this can be made to
work.  The most obvious problem is that a sequential scan is coded to
assume that every block between 0 and the last block in the relation
is worth reading, and changing that would (a) be a lot of work and (b)
render the term "sequential" scan a misnomer, a choice that I think
would have more bad consequences than good.  I suspect there are
slightly less obvious problems that would turn out to be highly
intractable.  The assumption that block numbers are dense is probably
embedded in the system in a lot of subtle ways; if we start trying to
change that, I think we're dooming ourselves to an unending series of
crocks trying to undo the mess we've created.
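
To illustrate the first point, a sequential scan is essentially shaped
like this (a simplified sketch, not the actual heapam loop):

    /*
     * The assumption a sequential scan bakes in: every block number
     * from 0 to nblocks - 1 is a real, readable page.  There is no
     * notion of "skip this range of known-empty blocks".
     */
    BlockNumber nblocks = RelationGetNumberOfBlocks(rel);

    for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
    {
        Buffer      buf = ReadBuffer(rel, blkno);

        /* ... examine every tuple on the page ... */
        ReleaseBuffer(buf);
    }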

Also, I think that's really a red herring anyway.  Relation extension
per se is not slow - we can grow a file by adding zero bytes at a
pretty good clip, and don't really gain anything at the database level
by spreading the growth across multiple files.  The problem is the
relation extension LOCK, and I think that's where we should be
focusing our attention.  I'm pretty confident we can find a way to
take the pressure off the lock without actually changing anything at
all at the storage layer.  As a thought experiment, suppose for example
that we have a background process that knows, by magic, how many new
blocks will be needed in each relation.  And it knows this just enough
in advance to have time to extend each such relation by the requisite
number of blocks and add those blocks to the free space map.  Since
only that process ever needs a relation extension lock, there is no
longer any contention for any such lock.  Problem solved!
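
In PostgreSQL terms, that thought experiment reduces to something like
the sketch below.  It's not working code - page initialization and
WAL-logging are elided, and choosing nblocks well is exactly the part
that needs the magic - but the primitives it uses are the real ones:

    /*
     * Sketch of bulk pre-extension: take the extension lock once, add
     * nblocks pages, and advertise them in the free space map so other
     * backends can find them without ever touching the lock.
     */
    static void
    preextend_relation(Relation rel, int nblocks)
    {
        LockRelationForExtension(rel, ExclusiveLock);

        for (int i = 0; i < nblocks; i++)
        {
            Buffer      buf = ReadBufferExtended(rel, MAIN_FORKNUM, P_NEW,
                                                 RBM_NORMAL, NULL);
            BlockNumber blkno = BufferGetBlockNumber(buf);

            ReleaseBuffer(buf);

            /* advertise the new page so other backends can find it */
            RecordPageWithFreeSpace(rel, blkno,
                                    BLCKSZ - SizeOfPageHeaderData);
        }

        UnlockRelationForExtension(rel, ExclusiveLock);
        FreeSpaceMapVacuum(rel);    /* bubble new entries up the FSM tree */
    }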

Actually, I'm not convinced that a background process is the right
approach at all, and of course there's no actual magic that lets us
foresee exact extension needs.  But I still feel like that thought
experiment indicates that there must be a solution here just by
rejiggering the locking, and maybe with a bit of modest
pre-extension.  The mediocre results of my last couple tries must
indicate that I
wasn't entirely successful in getting the backends out of each others'
way, but I tend to think that's just an indication that I don't
understand exactly what's happening in the contention scenarios yet,
rather than a fundamental difficulty with the approach.

>> This does help...
>> but at least in my tests, extending by 2 blocks instead of 1 was the
>> big winner, and after that you didn't get much further relief.
>
> How many concurrent writers did you have and what kind of filesystem was
> backing this?  Was it a temp filesystem where writes are essentially to
> memory, causing this relation extension lock to be much more
> contentious?

10.  ext4.  No.

>> Another thing I tried was pre-extending the relation to the estimated
>> final size.  That worked a lot better, and might be worth doing (e.g.
>> ALTER TABLE zorp SET MINIMUM SIZE 1GB) but a less manual solution
>> would be preferable if we can come up with one.
>
> Slightly confused here- above you said that '2' was way better than '1',
> but you implied that "more than 2 wasn't really much better"- yet "wayyy
> more than 2 is much better"?  Did I follow that right?  I can certainly
> understand such a case, just want to understand it and make sure it's
> what you meant.  What "small-number" options did you try?

If I took 30 seconds to pre-extend the relation before writing any
data into it, then writing the data went pretty much exactly 10 times
faster with 10 writers than with 1.  But small on-the-fly
pre-extensions during the write didn't work as well.  I don't remember
exactly what formulas I tried, but I do remember that the few I tried
were not really any better than "always pre-extend by 1 extra block";
and that alone eliminated about half the contention, but then I
couldn't do better.  I wonder if I need to use LWLockAcquireOrWait().
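
For reference, the LWLockAcquireOrWait() shape would be roughly this,
pretending for a moment that the extension lock were an LWLock - it's a
heavyweight lock today, and RelExtensionLock and extend_one_block() are
made-up names, so treat this as the shape of the idea, not a patch:

    /*
     * A backend that had to wait re-checks the FSM instead of extending
     * again itself, since whoever held the lock probably just added
     * usable pages.
     */
    for (;;)
    {
        BlockNumber blkno = GetPageWithFreeSpace(rel, spaceNeeded);

        if (blkno != InvalidBlockNumber)
            return blkno;       /* someone else's extension sufficed */

        if (LWLockAcquireOrWait(RelExtensionLock, LW_EXCLUSIVE))
        {
            /* got the lock: extend once, then let waiters use the FSM */
            blkno = extend_one_block(rel);
            LWLockRelease(RelExtensionLock);
            return blkno;
        }

        /* waited without acquiring: loop around and re-check the FSM */
    }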

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


