Thread: Extent Locks

Extent Locks

From
Stephen Frost
Date:
All,
 Starting a new thread to avoid hijacking Heikki's original, but..

* Heikki Linnakangas (hlinnakangas@vmware.com) wrote:
> Truncating a heap at the end of vacuum, to release unused space back to
> the OS, currently requires taking an AccessExclusiveLock. Although
> it's only held for a short duration, it can be enough to cause a
> hiccup in query processing while it's held. Also, if there is a
> continuous stream of queries on the table, autovacuum never succeeds
> in acquiring the lock, and thus the table never gets truncated.
 Extent locking suffers from very similar problems and we really need
 to improve this situation.  With today's fast i/o systems, and massive
 numbers of CPUs in a single system, it's absolutely trivial to have a
 whole slew of processes trying to add data to a single relation and
 that access getting nearly serialized due to everyone waiting on the
 extent lock.

 Perhaps one really simple approach would be to increase the size of
 the extent which is created in relation to the size of the relation.
 I've no clue what level of effort is involved there but I'm hoping
 such an approach would help.  I've long thought that it'd be very neat
 if we could simply give each bulk-inserter process their own 1G chunk
 to insert directly into w/o having to talk to anyone else.  The
 creation of the specific 1G piece could, hopefully, be made atomic
 easily (either thanks to the OS or with our own locking), etc, etc.

 I'm sure it's many bricks shy of a load, but I wanted to raise the
 issue, again, as I've seen it happening on yet another high-volume
 write-intensive system.
 
     Thanks,
    Stephen

Re: Extent Locks

From
Robert Haas
Date:
On Wed, May 15, 2013 at 8:54 PM, Stephen Frost <sfrost@snowman.net> wrote:
>   Starting a new thread to avoid hijacking Heikki's original, but..
>
> * Heikki Linnakangas (hlinnakangas@vmware.com) wrote:
>> Truncating a heap at the end of vacuum, to release unused space back to
>> the OS, currently requires taking an AccessExclusiveLock. Although
>> it's only held for a short duration, it can be enough to cause a
>> hiccup in query processing while it's held. Also, if there is a
>> continuous stream of queries on the table, autovacuum never succeeds
>> in acquiring the lock, and thus the table never gets truncated.
>
>   Extent locking suffers from very similar problems and we really need
>   to improve this situation.  With today's fast i/o systems, and massive
>   numbers of CPUs in a single system, it's absolutely trivial to have a
>   whole slew of processes trying to add data to a single relation and
>   that access getting nearly serialized due to everyone waiting on the
>   extent lock.
>
>   Perhaps one really simple approach would be to increase the size of
>   the extent which is created in relation to the size of the relation.
>   I've no clue what level of effort is involved there but I'm hoping
>   such an approach would help.  I've long thought that it'd be very neat
>   if we could simply give each bulk-inserter process their own 1G chunk
>   to insert directly into w/o having to talk to anyone else.  The
>   creation of the specific 1G piece could, hopefully, be made atomic
>   easily (either thanks to the OS or with our own locking), etc, etc.
>
>   I'm sure it's many bricks shy of a load, but I wanted to raise the
>   issue, again, as I've seen it happening on yet another high-volume
>   write-intensive system.

I think you might be confused, or else I'm confused, because I don't
believe we have any such thing as an extent lock.  What we do have is
a relation extension lock, but the size of the segment on disk has
nothing to do with that: there's only one for the whole relation, and
you hold it when adding a block to the relation.  The organization of
blocks into 1GB segments happens at a much lower level of the system,
and is completely disconnected from the locking subsystem.  So
changing the segment size wouldn't help with this problem, and would
actually be quite difficult to do, because everything in the system
except at the very lowermost layer just knows about block numbers and
has no idea what "extent" the block is in.

But that having been said, it just so happens that I was recently
playing around with ways of trying to fix the relation extension
bottleneck.  One thing I tried was: every time a particular backend
extends the relation, it extends the relation by more than 1 block at
a time before releasing the relation extension lock.  Then, other
backends can find those blocks in the free space map instead of having
to grab the relation extension lock, so the number of acquire/release
cycles on the relation extension lock goes down.  This does help...
but at least in my tests, extending by 2 blocks instead of 1 was the
big winner, and after that you didn't get much further relief.
Another thing I tried was pre-extending the relation to the estimated
final size.  That worked a lot better, and might be worth doing (e.g.
ALTER TABLE zorp SET MINIMUM SIZE 1GB) but a less manual solution
would be preferable if we can come up with one.

After that, I ran out of time for investigation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Extent Locks

From
Stephen Frost
Date:
Robert,
 For not understanding me, we seem to be in violent agreement. ;)

* Robert Haas (robertmhaas@gmail.com) wrote:
> I think you might be confused, or else I'm confused, because I don't
> believe we have any such thing as an extent lock.

The relation extension lock is what I was referring to.  Apologies for
any confusion there.

> What we do have is
> a relation extension lock, but the size of the segment on disk has
> nothing to do with that: there's only one for the whole relation, and
> you hold it when adding a block to the relation.

Yes, which is farrr too small.  I'm certainly aware that the segments on
disk are dealt with in the storage layer- currently.  My proposal was to
consider how we might change that, a bit, to allow improved throughput
when there are multiple writers.

Consider this, for example- when we block on the relation extension
lock, rather than sit and wait or continue to compete with the other
threads, simply tell the storage layer to give us a dedicated file to
work with.  Once we're ready to commit, move that file into place as the
next segment (through some command to the storage layer), using an
atomic command to ensure that it either works and doesn't overwrite
anything, or fails and we try again by moving the segment number up.

We would need to work out, at the storage layer, how to handle cases
where the file is less than 1G and realize that we should just skip over
those blocks on disk as being known-to-be-empty.  Those blocks would
also be then put on the free space map and used for later processes
which need to find somewhere to put new data, etc.

> But that having been said, it just so happens that I was recently
> playing around with ways of trying to fix the relation extension
> bottleneck.  One thing I tried was: every time a particular backend
> extends the relation, it extends the relation by more than 1 block at
> a time before releasing the relation extension lock.

Right, exactly.  One idea that I was discussing w/ Greg was to do this
using some log(relation-size) approach or similar.

> This does help...
> but at least in my tests, extending by 2 blocks instead of 1 was the
> big winner, and after that you didn't get much further relief.

How many concurrent writers did you have and what kind of filesystem was
backing this?  Was it a temp filesystem where writes are essentially to
memory, causing this relation extension lock to be much more
contentious?

> Another thing I tried was pre-extending the relation to the estimated
> final size.  That worked a lot better, and might be worth doing (e.g.
> ALTER TABLE zorp SET MINIMUM SIZE 1GB) but a less manual solution
> would be preferable if we can come up with one.

Slightly confused here- above you said that '2' was way better than '1',
but you implied that "more than 2 wasn't really much better"- yet "wayyy
more than 2 is much better"?  Did I follow that right?  I can certainly
understand such a case, just want to understand it and make sure it's
what you meant.  What "small-number" options did you try?

> After that, I ran out of time for investigation.

Too bad!  Thanks much for the work in this area, it'd really help if we
could improve this for our data warehouse, in particular, users.
Thanks!
    Stephen

Re: Extent Locks

From
Stephen Frost
Date:
* Robert Haas (robertmhaas@gmail.com) wrote:
> I think it's pretty unrealistic to suppose that this can be made to
> work.  The most obvious problem is that a sequential scan is coded to
> assume that every block between 0 and the last block in the relation
> is worth reading,

You don't change that.  However, when a seq scan asks the storage layer
for blocks that it knows don't actually exist, it can simply skip over
them or return "empty" records or something equivalent...  Yes, that's
hand-wavy, but I also think it's doable.

> I suspect there are
> slightly less obvious problems that would turn out to be highly
> intractable.

Entirely possible. :)

> The assumption that block numbers are dense is probably
> embedded in the system in a lot of subtle ways; if we start trying to
> change that, I think we're dooming ourselves to an unending series of crocks
> trying to undo the mess we've created.

Perhaps.

> Also, I think that's really a red herring anyway.  Relation extension
> per se is not slow - we can grow a file by adding zero bytes at a
> pretty good clip, and don't really gain anything at the database level
> by spreading the growth across multiple files.

That's true when the file is on a single filesystem and a single set of
drives.  Make them be split across multiple filesystems/volumes where
you get more drives involved...

> The problem is the
> relation extension LOCK, and I think that's where we should be
> focusing our attention.  I'm pretty confident we can find a way to
> take the pressure off the lock without actually changing anything at
> all at the storage layer.

That would certainly be very neat and if possible might render my idea
moot, which I would be more than happy with.

> As a thought experiment, suppose for example
> that we have a background process that knows, by magic, how many new
> blocks will be needed in each relation.  And it knows this just enough
> in advance to have time to extend each such relation by the requisite
> number of blocks and add those blocks to the free space map.  Since
> only that process ever needs a relation extension lock, there is no
> longer any contention for any such lock.  Problem solved!

Sounds cute, but perhaps a bit too cute to be realistic (that's
certainly been my opinion when suggested by others, which it has been,
in the past).

> Actually, I'm not convinced that a background process is the right
> approach at all, and of course there's no actual magic that lets us
> foresee exact extension needs.  But I still feel like that thought
> experiment indicates that there must be a solution here just by
> rejiggering the locking, and maybe with a bit of modest pre-extension.
>  The mediocre results of my last couple tries must indicate that I
> wasn't entirely successful in getting the backends out of each others'
> way, but I tend to think that's just an indication that I don't
> understand exactly what's happening in the contention scenarios yet,
> rather than a fundamental difficulty with the approach.

Perhaps.

> > How many concurrent writers did you have and what kind of filesystem was
> > backing this?  Was it a temp filesystem where writes are essentially to
> > memory, causing this relation extension lock to be much more
> > contentious?
>
> 10.  ext4.  No.

Ok.

> If I took 30 seconds to pre-extend the relation before writing any
> data into it, then writing the data went pretty much exactly 10 times
> faster with 10 writers than with 1.

That's rather fantastic..

> But small on-the-fly
> pre-extensions during the write didn't work as well.  I don't remember
> exactly what formulas I tried, but I do remember that the few I tried
> were not really any better than "always pre-extend by 1 extra block";
> and that alone eliminated about half the contention, but then I
> couldn't do better.

That seems quite odd to me- I would have thought extending by more than
2 blocks would have helped with the contention.  Still, it sounds like
extending requires a fair bit of writing, and that sucks in its own
right because we're just going to rewrite that- is that correct?  If so,
I like the proposal even more...

> I wonder if I need to use LWLockAcquireOrWait().

I'm not seeing how/why that might help?
Thanks,
    Stephen

Re: Extent Locks

From
Robert Haas
Date:
On Thu, May 16, 2013 at 9:36 PM, Stephen Frost <sfrost@snowman.net> wrote:
>> What we do have is
>> a relation extension lock, but the size of the segment on disk has
>> nothing to do with that: there's only one for the whole relation, and
>> you hold it when adding a block to the relation.
>
> Yes, which is farrr too small.  I'm certainly aware that the segments on
> disk are dealt with in the storage layer- currently.  My proposal was to
> consider how we might change that, a bit, to allow improved throughput
> when there are multiple writers.
>
> Consider this, for example- when we block on the relation extension
> lock, rather than sit and wait or continue to compete with the other
> threads, simply tell the storage layer to give us a dedicated file to
> work with.  Once we're ready to commit, move that file into place as the
> next segment (through some command to the storage layer), using an
> atomic command to ensure that it either works and doesn't overwrite
> anything, or fails and we try again by moving the segment number up.
>
> We would need to work out, at the storage layer, how to handle cases
> where the file is less than 1G and realize that we should just skip over
> those blocks on disk as being known-to-be-empty.  Those blocks would
> also be then put on the free space map and used for later processes
> which need to find somewhere to put new data, etc.

I think it's pretty unrealistic to suppose that this can be made to
work.  The most obvious problem is that a sequential scan is coded to
assume that every block between 0 and the last block in the relation
is worth reading, and changing that would (a) be a lot of work and (b)
render the term "sequential" scan a misnomer, a choice that I think
would have more bad consequences than good.  I suspect there are
slightly less obvious problems that would turn out to be highly
intractable.  The assumption that block numbers are dense is probably
embedded in the system in a lot of subtle ways; if we start trying to
change that, I think we're dooming ourselves to an unending series of crocks
trying to undo the mess we've created.

Also, I think that's really a red herring anyway.  Relation extension
per se is not slow - we can grow a file by adding zero bytes at a
pretty good clip, and don't really gain anything at the database level
by spreading the growth across multiple files.  The problem is the
relation extension LOCK, and I think that's where we should be
focusing our attention.  I'm pretty confident we can find a way to
take the pressure off the lock without actually changing anything at
all at the storage layer.  As a thought experiment, suppose for example
that we have a background process that knows, by magic, how many new
blocks will be needed in each relation.  And it knows this just enough
in advance to have time to extend each such relation by the requisite
number of blocks and add those blocks to the free space map.  Since
only that process ever needs a relation extension lock, there is no
longer any contention for any such lock.  Problem solved!

Actually, I'm not convinced that a background process is the right
approach at all, and of course there's no actual magic that lets us
foresee exact extension needs.  But I still feel like that thought
experiment indicates that there must be a solution here just by
rejiggering the locking, and maybe with a bit of modest pre-extension.
The mediocre results of my last couple tries must indicate that I
wasn't entirely successful in getting the backends out of each others'
way, but I tend to think that's just an indication that I don't
understand exactly what's happening in the contention scenarios yet,
rather than a fundamental difficulty with the approach.

>> This does help...
>> but at least in my tests, extending by 2 blocks instead of 1 was the
>> big winner, and after that you didn't get much further relief.
>
> How many concurrent writers did you have and what kind of filesystem was
> backing this?  Was it a temp filesystem where writes are essentially to
> memory, causing this relation extension lock to be much more
> contentious?

10.  ext4.  No.

>> Another thing I tried was pre-extending the relation to the estimated
>> final size.  That worked a lot better, and might be worth doing (e.g.
>> ALTER TABLE zorp SET MINIMUM SIZE 1GB) but a less manual solution
>> would be preferable if we can come up with one.
>
> Slightly confused here- above you said that '2' was way better than '1',
> but you implied that "more than 2 wasn't really much better"- yet "wayyy
> more than 2 is much better"?  Did I follow that right?  I can certainly
> understand such a case, just want to understand it and make sure it's
> what you meant.  What "small-number" options did you try?

If I took 30 seconds to pre-extend the relation before writing any
data into it, then writing the data went pretty much exactly 10 times
faster with 10 writers than with 1.  But small on-the-fly
pre-extensions during the write didn't work as well.  I don't remember
exactly what formulas I tried, but I do remember that the few I tried
were not really any better than "always pre-extend by 1 extra block";
and that alone eliminated about half the contention, but then I
couldn't do better.  I wonder if I need to use LWLockAcquireOrWait().

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Extent Locks

From
Robert Haas
Date:
On Thu, May 16, 2013 at 11:55 PM, Stephen Frost <sfrost@snowman.net> wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
>> I think it's pretty unrealistic to suppose that this can be made to
>> work.  The most obvious problem is that a sequential scan is coded to
>> assume that every block between 0 and the last block in the relation
>> is worth reading,
>
> You don't change that.  However, when a seq scan asks the storage layer
> for blocks that it knows don't actually exist, it can simply skip over
> them or return "empty" records or something equivalent...  Yes, that's
> hand-wavy, but I also think it's doable.

And slow.  And it will involve locking and shared memory data
structures of its own, to keep track of which blocks actually exist at
the storage layer.  I suspect the results would be more kinds of locks
than we have at present, not less.

>> Also, I think that's really a red herring anyway.  Relation extension
>> per se is not slow - we can grow a file by adding zero bytes at a
>> pretty good clip, and don't really gain anything at the database level
>> by spreading the growth across multiple files.
>
> That's true when the file is on a single filesystem and a single set of
> drives.  Make them be split across multiple filesystems/volumes where
> you get more drives involved...

I'd be interested to hear how fast dd if=/dev/zero of=somefile is on
your machine compared to a single-threaded COPY into a relation.
Dividing those two numbers gives us the level of concurrency at which
the speed at which we can extend the relation becomes the bottleneck.
On the system I tested, I think it was in the multiple tens until the
kernel cache filled up ... and then it dropped way off.  But I don't
have access to a high-end storage system.

>> If I took 30 seconds to pre-extend the relation before writing any
>> data into it, then writing the data went pretty much exactly 10 times
>> faster with 10 writers than with 1.
>
> That's rather fantastic..

One sadly relevant detail is that the relation was unlogged.  Even so,
yes, it's fantastic.

>> But small on-the-fly
>> pre-extensions during the write didn't work as well.  I don't remember
>> exactly what formulas I tried, but I do remember that the few I tried
>> were not really any better than "always pre-extend by 1 extra block";
>> and that alone eliminated about half the contention, but then I
>> couldn't do better.
>
> That seems quite odd to me- I would have thought extending by more than
> 2 blocks would have helped with the contention.  Still, it sounds like
> extending requires a fair bit of writing, and that sucks in its own
> right because we're just going to rewrite that- is that correct?  If so,
> I like the proposal even more...
>
>> I wonder if I need to use LWLockAcquireOrWait().
>
> I'm not seeing how/why that might help?

Thinking about it more, my guess is that backend A grabs the relation
extension lock.  Before it actually extends the relation, backends B,
C, D, and E all notice that no free pages are available and queue for
the lock.  Backend A pre-extends the relation by some number of pages
and then extends it by a second page for its own use.  It then
releases the relation extension lock.  At this point, however,
backends B, C, D, and E are already committed to extending the
relation, even though some or all of them could now satisfy their need
for free pages from the fsm.  If they used LWLockAcquireOrWait(), then
they'd all wake up when A released the lock.  One of them would have
the lock, and the rest could go retry the fsm and requeue on the lock
if that failed.

But as it is, what I bet is happening is that they each take the lock
in turn and each extend the relation in turn.  Then on the next block
they write they all find free pages in the fsm, because they all
pre-extended the relation, but when those free pages are used up, they
all queue up on the lock again, practically at the same instant,
because the fsm becomes empty at the same time for all of them.

I should play around with this a bit more...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Extent Locks

From
Stephen Frost
Date:
* Robert Haas (robertmhaas@gmail.com) wrote:
> On Thu, May 16, 2013 at 11:55 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > You don't change that.  However, when a seq scan asks the storage layer
> > for blocks that it knows don't actually exist, it can simply skip over
> > them or return "empty" records or something equivalent...  Yes, that's
> > hand-wavy, but I also think it's doable.
>
> And slow.  And it will involve locking and shared memory data
> structures of its own, to keep track of which blocks actually exist at
> the storage layer.  I suspect the results would be more kinds of locks
> than we have at present, not less.

I'm not sure that I see that- we already have to figure out where the
"end" of a relation is...  We would just do something similar for each
component of the relation, and if it gets extended after you've passed it,
too bad, because it'd be just like someone adding blocks into pages
you've already seqscan'd after you're done with them.

Perhaps we'd have to actually do some kind of locking on those
non-existent pages, somewhere, but I'm not entirely sure where?
Figuring out where and what kind of locks we take out for seqscans today
on a page-level (or lower) would be the way to determine that; tbh, I'm
not entirely sure what we do there because I tended to figure we didn't
do much of anything..

> > That's true when the file is on a single filesystem and a single set of
> > drives.  Make them be split across multiple filesystems/volumes where
> > you get more drives involved...
>
> I'd be interested to hear how fast dd if=/dev/zero of=somefile is on
> your machine compared to a single-threaded COPY into a relation.

I can certainly do a bit of that profiling, but it's not an uncommon
issue..  Stefan ran into the exact same problem not 2 months ago, where
it became very clear that it was faster to run a single-thread loading
data than to parallelize it at all, due to the extension locking, for a
single relation.  If you move to multiple relations, where you won't
have the contention, things look much better, of course.

> Dividing those two numbers gives us the level of concurrency at which
> the speed at which we can extend the relation becomes the bottleneck.

Well, we'd want to consider a parallel case, right?  That's where the
bottleneck of this lock really shows itself- otherwise you're just
measuring the overhead from COPY parsing data.  Even with COPY, I've
managed to saturate a 2Gbps FC link between the server and the drives
over on the SAN with PG- when running 10 threads to 10 different
relations on 10 different tablespaces which go to 10 different drive
pairs on the SAN (I can push > 50MB/s to each drive pair, when writing
to only one drive pair, but it drops below that due to the 2Gbps fabric
when I'm writing to all of them at once).

Try getting anywhere close to that with any kind of system you want when
parallelizing writes into a single relation and you'll see where this
lock really kills performance.

> On the system I tested, I think it was in the multiple tens until the
> kernel cache filled up ... and then it dropped way off.  But I don't
> have access to a high-end storage system.

The above tests were done using simple, large and relatively "slow"
(spinning metal) drives- but they can sustain 50-100MB/s of seq writes
without too much trouble.  When you add all those up, either through
tablespaces and partitions or with a RAID10 setup, you can get quite a
bit of overall throughput- enough to easily make COPY be your bottleneck
due to the CPU utilization, making you want to parallelize it, but then
you hit this lock and performance goes into the toilet.

> One sadly relevant detail is that the relation was unlogged.  Even so,
> yes, it's fantastic.

That's not *ideal*, but it's also not the end of the world, since we can
create an unlogged table and have it visible to multiple clients,
allowing for parallel writes.  We don't have any unlogged tables, but I
can't recall if the above performance runs included a truncate (in the
same transaction) before COPY for each of the individual partitions or
not..  I don't *think* it did, but not 100% sure.  We do have wal level
set to minimum on these particular systems.

> >> I wonder if I need to use LWLockAcquireOrWait().
> >
> > I'm not seeing how/why that might help?
>
> Thinking about it more, my guess is that backend A grabs the relation
> extension lock.  Before it actually extends the relation, backends B,
> C, D, and E all notice that no free pages are available and queue for
> the lock.  Backend A pre-extends the relation by some number of pages
> and then extends it by a second page for its own use.  It then
> releases the relation extension lock.  At this point, however,
> backends B, C, D, and E are already committed to extending the
> relation, even though some or all of them could now satisfy their need
> for free pages from the fsm.  If they used LWLockAcquireOrWait(), then
> they'd all wake up when A released the lock.  One of them would have
> the lock, and the rest could go retry the fsm and requeue on the lock
> if that failed.

Hmmm, yes, that could help then.  I'm surprised to hear that it was ever
set up that way, to be honest..

> But as it is, what I bet is happening is that they each take the lock
> in turn and each extend the relation in turn.  Then on the next block
> they write they all find free pages in the fsm, because they all
> pre-extended the relation, but when those free pages are used up, they
> all queue up on the lock again, practically at the same instant,
> because the fsm becomes empty at the same time for all of them.

Ouch, yea, that could get quite painful.

> I should play around with this a bit more...

That sounds like a fantastic idea... ;)
Thanks!
    Stephen

Re: Extent Locks

From
Josh Berkus
Date:
Robert,

> But I still feel like that thought
> experiment indicates that there must be a solution here just by
> rejiggering the locking, and maybe with a bit of modest pre-extension.
>  The mediocre results of my last couple tries must indicate that I
> wasn't entirely successful in getting the backends out of each others'
> way, but I tend to think that's just an indication that I don't
> understand exactly what's happening in the contention scenarios yet,
> rather than a fundamental difficulty with the approach.

Well, our practice of extending relations 8K-at-a-time is suboptimal on
quite a number of storage platforms.  It leads to increased file
fragmentation, and increases write sizes on SSDs which have a default
128K block size.  Also, on a large bulk load we spend way too much time
extending the relation.

My suggestion would be to have a storage parameter which defined the new
extent size for growing the table, and allocate that much free space in
the form of empty pages whenever we need new pages.  The default would
be 1MB, but users could adjust it to anywhere between 8K and 1GB.

We'd still need an extent lock to add the 1MB (or whatever), but there's
a 128X difference between allocating 8K and 1MB.

The drawback to this is whatever size we choose is liable to be wrong
for some users.  Users who currently have a lot of 16K tables would see
their databases grow alarmingly.  But a default of 8K or 16K or 32K
wouldn't improve the current behavior except for the very advanced users
who know how to tinker with storage parameters.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Extent Locks

From
Craig Ringer
Date:
On 05/17/2013 11:38 AM, Robert Haas wrote:
> maybe with a bit of modest pre-extension.
When it comes to pre-extension, is it realistic to get a count of
backends waiting on the lock and extend the relation by (say) 2x the
number of waiting backends?

Getting a list of lock waiters is always a racey proposition, but in
this case we don't need an accurate count, only an estimate, and the
count can only grow between getting the count and completing the
relation extension. Assuming it's even remotely feasible to get a count
of lock waiters at all.

If there are lots of procs waiting to extend the relation a fair chunk
could be preallocated with posix_fallocate on supported platforms.

If it's possible this would avoid the need to attempt any
recency-of-last-extension based preallocation with the associated
problem of how to store and access the last-extended time efficiently,
while still hopefully reducing contention on the relation extension lock
and without delaying the backend doing the extension too much more.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Extent Locks

From
Craig Ringer
Date:
On 05/18/2013 03:15 AM, Josh Berkus wrote:
> The drawback to this is whatever size we choose is liable to be wrong
> for some users. Users who currently have a lot of 16K tables would see
> their databases grow alarmingly. 

This only becomes a problem for tables that're tiny, right? If your
table is already 20MB you don't care if it grows to 20.1MB or 21MB next
time it's extended.

What about applying the relation extent size only *after* an extent's
worth of blocks have been allocated in small blocks, per current
behaviour? So their 32k tables stay 32k, but once they step over the 1MB
barrier (or whatever) in table size the allocation mode switches to
bulk-allocating large extents? Or just set a size threshold after
which extent-sized preallocation is used?

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: Extent Locks

From
Stephen Frost
Date:
* Craig Ringer (craig@2ndquadrant.com) wrote:
> On 05/17/2013 11:38 AM, Robert Haas wrote:
> > maybe with a bit of modest pre-extension.

> When it comes to pre-extension, is it realistic to get a count of
> backends waiting on the lock and extend the relation by (say) 2x the
> number of waiting backends?

Having the process which has the lock do more work before releasing it,
and having the other processes realize that there is room available
after blocking on the lock (instead of trying to extend the relation
themselves), might help.  One concern that came up in Ottawa was
autovacuum coming along, discovering empty pages at the end of the
relation, and deciding to truncate it.  I'm not convinced that would
happen, due to the locks involved, but if we actually extend the
relation by enough that the individual processes can continue writing
for a while before another extension is needed, then perhaps it could.

On the other hand, I do feel like people are worried about
over-extending a relation and wasting disk space- but with the way that
vacuum can clean up pages at the end, that would only be a temporary
situation anyway.

> If it's possible this would avoid the need to attempt any
> recency-of-last-extension based preallocation with the associated
> problem of how to store and access the last-extended time efficiently,
> while still hopefully reducing contention on the relation extension lock
> and without delaying the backend doing the extension too much more.

I do like the idea of getting an idea of how many blocks are being asked
for, based on how many other backends are trying to write, but I've been
thinking a simple algorithm might also work well, eg:

alloc_size = 1 page
extend_time = 0
while (writing)
    if (blocked and extend_time < 5s)
        alloc_size *= 2
    extend_start_time = now()
    extend(alloc_size)
    extend_time = now() - extend_start_time

Thanks,
Stephen

Re: Extent Locks

From
Jaime Casanova
Date:
On Tue, May 28, 2013 at 7:36 AM, Stephen Frost <sfrost@snowman.net> wrote:
>
> On the other hand, I do feel like people are worried about
> over-extending a relation and wasting disk space- but with the way that
> vacuum can clean up pages at the end, that would only be a temporary
> situation anyway.
>

Hi,

Maybe I'm wrong, but this seems easily solved by an
autovacuum_no_truncate_empty_pages or an autovacuum_empty_pages_limit
GUC/reloption.
To clarify the second one: autovacuum would allow empty pages up to
that limit, and only remove the excess beyond it.

We can also think in GUC/reloption for next_extend_blocks so formula
is needed, or of course the automated calculation that has been
proposed

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566         Cell: +593 987171157



Re: Extent Locks

From
Jaime Casanova
Date:
On Tue, May 28, 2013 at 8:38 AM, Jaime Casanova <jaime@2ndquadrant.com> wrote:
>
> We can also think in GUC/reloption for next_extend_blocks so formula
> is needed, or of course the automated calculation that has been
> proposed
>

s/so formula is needed/so *no* formula is needed

btw, we can also use a next_extend_blocks GUC/reloption as a limit for
autovacuum so it will allow that many empty pages at the end of the table

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566         Cell: +593 987171157



Re: Extent Locks

From
Stephen Frost
Date:
* Jaime Casanova (jaime@2ndquadrant.com) wrote:
> btw, we can also use a next_extend_blocks GUC/reloption as a limit for
> autovacuum so it will allow that many empty pages at the end of the table

I'm really not, at all, excited about adding in GUCs for this.  We just
need to realize when the only available space in the relation is at the
end and people are writing to it, and avoid truncating those pages off
the end, if we don't already have locks that prevent vacuum from doing
so.  I'd want to see where it's actually happening before stressing
over it terribly much.
Thanks,
    Stephen

Re: Extent Locks

From
Merlin Moncure
Date:
On Tue, May 28, 2013 at 9:07 AM, Stephen Frost <sfrost@snowman.net> wrote:
> * Jaime Casanova (jaime@2ndquadrant.com) wrote:
>> btw, we can also use a next_extend_blocks GUC/reloption as a limit for
>> autovacuum so it will allow that many empty pages at the end of the table
>
> I'm really not, at all, excited about adding in GUCs for this.  We just
> need to realize when the only available space in the relation is at the
> end and people are writing to it, and avoid truncating those pages off
> the end, if we don't already have locks that prevent vacuum from doing
> so.  I'd want to see where it's actually happening before stressing
> over it terribly much.

+1   autovacuum configuration is already much too complex as it
is...we should be removing/consolidating options, not adding them.

merlin



Re: Extent Locks

From
Andres Freund
Date:
On 2013-05-28 10:07:06 -0400, Stephen Frost wrote:
> * Jaime Casanova (jaime@2ndquadrant.com) wrote:
> > btw, we can also use a next_extend_blocks GUC/reloption as a limit for
> > autovacuum so it will allow that many empty pages at the end of the table
> 
> I'm really not, at all, excited about adding in GUCs for this.

But I thought you were in favor of doing complex stuff like mapping
segments filled somewhere else into place :P

But I agree. This needs to work without much manual intervention. I
think we just need to make autovacuum truncate only if it finds more
free space than whatever amount we might have added at that relation
size plus some slop.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Extent Locks

From
Stephen Frost
Date:
* Andres Freund (andres@2ndquadrant.com) wrote:
> On 2013-05-28 10:07:06 -0400, Stephen Frost wrote:
> > I'm really not, at all, excited about adding in GUCs for this.
>
> But I thought you were in favor of doing complex stuff like mapping
> segments filled somewhere else into place :P

That wouldn't require a GUC.. ;)

> But I agree. This needs to work without much manual intervention. I
> think we just need to make autovacuum truncate only if it finds more
> free space than whatever amount we might have added at that relation
> size plus some slop.

Agreed.
Thanks,
    Stephen

Re: Extent Locks

From
Jaime Casanova
Date:
On Tue, May 28, 2013 at 10:53 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>
> But I agree. This needs to work without much manual intervention. I
> think we just need to make autovacuum truncate only if it finds more
> free space than whatever amount we might have added at that relation
> size plus some slop.
>

And how do you decide the amount of that "slop"?

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566         Cell: +593 987171157



Re: Extent Locks

From
Stephen Frost
Date:
* Jaime Casanova (jaime@2ndquadrant.com) wrote:
> On Tue, May 28, 2013 at 10:53 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > But I agree. This needs to work without much manual intervention. I
> > think we just need to make autovacuum truncate only if it finds more
> > free space than whatever amount we might have added at that relation
> > size plus some slop.
>
> And how do you decide the amount of that "slop"?

How about % of table size?
Thanks,
    Stephen