Re: Extent Locks - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: Extent Locks
Msg-id 20130517045005.GB4361@tamriel.snowman.net
In response to Re: Extent Locks  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
* Robert Haas (robertmhaas@gmail.com) wrote:
> On Thu, May 16, 2013 at 11:55 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > You don't change that.  However, when a seq scan asks the storage layer
> > for blocks that it knows don't actually exist, it can simply skip over
> > them or return "empty" records or something equivilant...  Yes, that's
> > hand-wavy, but I also think it's doable.
>
> And slow.  And it will involve locking and shared memory data
> structures of its own, to keep track of which blocks actually exist at
> the storage layer.  I suspect the results would be more kinds of locks
> than we have at present, not less.

I'm not sure that I see that- we already have to figure out where the
"end" of a relation is...  We would just do something similar for each
component of the relation, and if it gets extended after you've passed
it, too bad, because it'd be just like someone adding tuples into pages
you've already seqscan'd after you're done with them.

Perhaps we'd have to actually do some kind of locking on those
non-existent pages somewhere, but I'm not entirely sure where.
Figuring out where and what kind of locks we take out for seqscans today
at the page level (or lower) would be the way to determine that; tbh, I'm
not entirely sure what we do there because I tended to figure we didn't
do much of anything..
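
If memory serves, what we do today is roughly this (paraphrasing
initscan()/heapgetpage() from memory, so don't hold me to the details):

    /* initscan(): the scan's notion of "the end" is fixed up front */
    scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);

    /*
     * heapgetpage(): per page, just a pin plus a share lock on the
     * buffer contents- no heavyweight per-page locks at all.
     */
    buffer = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM, page,
                                RBM_NORMAL, scan->rs_strategy);
    LockBuffer(buffer, BUFFER_LOCK_SHARE);
    /* ... collect the visible tuples ... */
    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

so there isn't much in the way of locking there to get in the way.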

> > That's true when the file is on a single filesystem and a single set of
> > drives.  Make them be split across multiple filesystems/volumes where
> > you get more drives involved...
>
> I'd be interested to hear how fast dd if=/dev/zero of=somefile is on
> your machine compared to a single-threaded COPY into a relation.

I can certainly do a bit of that profiling, but it's not an uncommon
issue..  Stefan ran into the exact same problem not two months ago, where
it became very clear that it was faster to load data in a single thread
than to parallelize it at all, due to the extension locking, for a
single relation.  If you move to multiple relations, where you don't
have that contention, things look much better, of course.

> Dividing those two numbers gives us the level of concurrency at which
> the speed at which we can extend the relation becomes the bottleneck.

Well, we'd want to consider a parallel case, right?  That's where the
bottleneck of this lock really shows itself- otherwise you're just
measuring the overhead of COPY parsing the data.  Even with COPY, I've
managed to saturate a 2Gbps FC link (roughly 200MB/s of usable
bandwidth) between the server and the drives over on the SAN with PG-
when running 10 threads to 10 different relations on 10 different
tablespaces which go to 10 different drive pairs on the SAN (I can push
> 50MB/s to each drive pair when writing to only one drive pair, but it
drops below that due to the 2Gbps fabric when I'm writing to all of
them at once).

Try getting anywhere close to that with any kind of system you want when
parallelizing writes into a single relation and you'll see where this
lock really kills performance.
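
For anyone following along, the serialization point is in
RelationGetBufferForTuple()- every backend which can't find a usable
page in the FSM ends up doing, more or less (again paraphrasing hio.c
from memory):

    /* relation needs to be extended- everyone funnels through here */
    if (needLock)
        LockRelationForExtension(relation, ExclusiveLock);

    buffer = ReadBuffer(relation, P_NEW);   /* extends by ONE block */
    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

    if (needLock)
        UnlockRelationForExtension(relation, ExclusiveLock);

One block per trip through that lock, which is why piling more writers
onto a single relation buys you so little.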

> On the system I tested, I think it was in the multiple tens until the
> kernel cache filled up ... and then it dropped way off.  But I don't
> have access to a high-end storage system.

The above tests were done using simple, large and relatively "slow"
(spinning metal) drives- but they can sustain 50-100MB/s of sequential
writes without too much trouble.  When you add all those up, either
through tablespaces and partitions or with a RAID10 setup, you can get
quite a bit of overall throughput- enough to easily make COPY itself the
bottleneck due to CPU utilization.  That makes you want to parallelize
it, but then you hit this lock and performance goes into the toilet.

> One sadly relevant detail is that the relation was unlogged.  Even so,
> yes, it's fantastic.

That's not *ideal*, but it's also not the end of the world, since we can
create an unlogged table and have it visible to multiple clients,
allowing for parallel writes.  We don't have any unlogged tables, but I
can't recall if the above performance runs included a truncate (in the
same transaction) before the COPY for each of the individual partitions
or not..  I don't *think* they did, but I'm not 100% sure.  We do have
wal_level set to minimal on these particular systems.

> >> I wonder if I need to use LWLockAcquireOrWait().
> >
> > I'm not seeing how/why that might help?
>
> Thinking about it more, my guess is that backend A grabs the relation
> extension lock.  Before it actually extends the relation, backends B,
> C, D, and E all notice that no free pages are available and queue for
> the lock.  Backend A pre-extends the relation by some number of pages
> and then extends it by one more page for its own use.  It then
> releases the relation extension lock.  At this point, however,
> backends B, C, D, and E are already committed to extending the
> relation, even though some or all of them could now satisfy their need
> for free pages from the fsm.  If they used LWLockAcquireOrWait(), then
> they'd all wake up when A released the lock.  One of them would have
> the lock, and the rest could go retry the fsm and requeue on the lock
> if that failed.

Hmmm, yes, that could help then.  I'm surprised to hear that it was ever
set up that way, to be honest..
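
Something along these lines, maybe?  (hand-waving, and pretending for
the moment that the extension lock were an lwlock rather than the
heavyweight lock it is today- "RelationExtensionLock" below is made up)

    for (;;)
    {
        /* Did someone else's pre-extension leave us a usable page? */
        targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
        if (targetBlock != InvalidBlockNumber)
            break;              /* no need to extend at all */

        /*
         * LWLockAcquireOrWait() returns true if we got the lock and
         * false if we merely waited for it to be released.  In the
         * latter case whoever held it most likely just extended the
         * relation, so loop around and re-check the FSM rather than
         * blindly extending again ourselves.
         */
        if (LWLockAcquireOrWait(RelationExtensionLock, LW_EXCLUSIVE))
        {
            /* we hold the lock- go extend the relation ourselves */
            break;
        }
    }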

> But as it is, what I bet is happening is that they each take the lock
> in turn and each extend the relation in turn.  Then on the next block
> they write they all find free pages in the fsm, because they all
> pre-extended the relation, but when those free pages are used up, they
> all queue up on the lock again, practically at the same instant,
> because the fsm becomes empty at the same time for all of them.

Ouch, yea, that could get quite painful.
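
One thought- if whoever does get the extension lock extended by more
than one block and dropped the extras into the FSM before letting go,
the herd waking up behind it would at least find something there.
Completely untested hand-waving (glossing over WAL-logging the new
pages, how to pick extend_by, and a dozen other details), but something
like:

    LockRelationForExtension(relation, ExclusiveLock);
    for (i = 0; i < extend_by; i++)
    {
        Buffer      buf = ReadBuffer(relation, P_NEW);
        BlockNumber blkno = BufferGetBlockNumber(buf);

        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        PageInit(BufferGetPage(buf), BufferGetPageSize(buf), 0);
        MarkBufferDirty(buf);
        UnlockReleaseBuffer(buf);

        /* keep the last block for ourselves, hand the rest to the FSM */
        if (i < extend_by - 1)
            RecordPageWithFreeSpace(relation, blkno,
                                    BLCKSZ - SizeOfPageHeaderData);
    }
    UnlockRelationForExtension(relation, ExclusiveLock);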

> I should play around with this a bit more...

That sounds like a fantastic idea... ;)
Thanks!
    Stephen
