Thread: extending relations more efficiently

extending relations more efficiently

From
Robert Haas
Date:
We've previously discussed the possible desirability of extending
relations in larger increments, rather than one block at a time, for
performance reasons.  I attempted to determine how much performance we
could possibly buy this way, and found that, as far as I can see, the
answer is, basically, none.  I wrote a test program which does writes
until it reaches 1GB, and times how long the writes take in aggregate.
 Then it performs a single fdatasync at the end and times that as
well.  On some of the machines it is slightly faster in the aggregate
to extend in larger chunks, but the magnitude of the change is little
enough that, at least to me, it seems entirely not worth bothering
with.  Some results are below.  Now, one thing that this test doesn't
help much with is the theory that it's better to extend a file in
larger chunks because the file will become less fragmented on disk.  I
don't really know how to factor that effect into the test - any ideas?

I also considered two other methods of extending a file.  First, there
is ftruncate().  It's really fast.  Unfortunately, it's unsuitable for
our purposes because it will cheerfully leave holes in the file, and
part of the reason for our current implementation is to make sure that
there are no holes, so that later writes to the file can't fail for
lack of disk space.  So that's no good.  Second, and more
interestingly, there is a function called posix_fallocate().  It is
present on Linux but not on MacOS X; I haven't checked any other
platforms.  It claims that it will extend a file out to a particular
size, forcing disk blocks to be allocated so that later writes won't
fail.  Testing (more details below) shows that posix_fallocate() is
quite efficient for large chunks.  For example, extending a file to
1GB in size 64 blocks at a time (that is, 256kB at a time) took only
~60 ms and the subsequent fdatasync took almost no time at all,
whereas zero-filling the file out 1GB using write() took 600-700 ms
and the subsequent fdatasync took another 4-5 seconds.  That seems
like a pretty sizable win, and it's not too hard to imagine that it
could be even better when the I/O subsystem is busy.  Unfortunately,
using posix_fallocate() for 8kB chunks seems to be significantly less
efficient than our current method - I'm guessing that it actually
writes the updated metadata back to disk, where write() does not (this
makes one wonder how safe it is to count on write to have the behavior
we need here in the first place).  So in this case it seems we would
probably want to do it in larger chunks.  (We could possibly also use
it when creating new WAL files, to extend all the way out to 16MB in
one shot, at a considerable savings in I/O.)

Any thoughts about where to go from here would be much appreciated.
Test results follow.

Some results from the IBM POWER7 box (ext4):

write 1 8K blocks at a time: write=782.408 fdatasync=4400.984
write 2 8K blocks at a time: write=560.569 fdatasync=4389.413
write 4 8K blocks at a time: write=479.647 fdatasync=4290.753
write 8 8K blocks at a time: write=627.038 fdatasync=4292.920
write 16 8K blocks at a time: write=619.882 fdatasync=4288.984
write 32 8K blocks at a time: write=613.037 fdatasync=4289.069
write 64 8K blocks at a time: write=608.669 fdatasync=4594.534
write 64 8K blocks at a time: write=608.475 fdatasync=4342.934
write 32 8K blocks at a time: write=612.506 fdatasync=4297.969
write 16 8K blocks at a time: write=621.387 fdatasync=4430.693
write 8 8K blocks at a time: write=629.576 fdatasync=4296.472
write 4 8K blocks at a time: write=674.419 fdatasync=4359.290
write 2 8K blocks at a time: write=652.029 fdatasync=4327.876
write 1 8K blocks at a time: write=800.973 fdatasync=4472.197

Some results from Nate Boley's 64-core box (xfs):

write 1 8K blocks at a time: write=1284.834 fdatasync=3538.361
write 2 8K blocks at a time: write=1176.082 fdatasync=3498.968
write 4 8K blocks at a time: write=1115.419 fdatasync=3634.673
write 8 8K blocks at a time: write=1088.404 fdatasync=3670.018
write 16 8K blocks at a time: write=1082.480 fdatasync=3778.763
write 32 8K blocks at a time: write=1075.875 fdatasync=3757.716
write 64 8K blocks at a time: write=1076.076 fdatasync=3996.997

Some results from Nate Boley's 32-core box (xfs):

write 1 8K blocks at a time: write=968.351 fdatasync=6013.304
write 2 8K blocks at a time: write=902.288 fdatasync=6810.980
write 4 8K blocks at a time: write=900.520 fdatasync=4886.449
write 8 8K blocks at a time: write=889.970 fdatasync=6096.856
write 16 8K blocks at a time: write=882.891 fdatasync=8136.211
write 32 8K blocks at a time: write=892.914 fdatasync=10898.796
write 64 8K blocks at a time: write=917.326 fdatasync=11223.696

And finally, from the IBM POWER7 machine, a few posix_fallocate results:

posix_fallocate 1 8K blocks at a time: write=3021.177 fdatasync=0.029
posix_fallocate 2 8K blocks at a time: write=1124.638 fdatasync=0.029
posix_fallocate 4 8K blocks at a time: write=9290.490 fdatasync=0.028
posix_fallocate 8 8K blocks at a time: write=477.831 fdatasync=0.029
posix_fallocate 16 8K blocks at a time: write=425.341 fdatasync=0.032
posix_fallocate 32 8K blocks at a time: write=122.499 fdatasync=0.034
posix_fallocate 64 8K blocks at a time: write=60.789 fdatasync=0.023
posix_fallocate 64 8K blocks at a time: write=102.867 fdatasync=0.029
posix_fallocate 32 8K blocks at a time: write=107.753 fdatasync=0.032
posix_fallocate 16 8K blocks at a time: write=900.674 fdatasync=0.031
posix_fallocate 8 8K blocks at a time: write=690.407 fdatasync=0.030
posix_fallocate 4 8K blocks at a time: write=550.454 fdatasync=0.035
posix_fallocate 2 8K blocks at a time: write=3447.778 fdatasync=0.031
posix_fallocate 1 8K blocks at a time: write=8753.767 fdatasync=0.030
posix_fallocate 64 8K blocks at a time: write=42.779 fdatasync=0.029
posix_fallocate 32 8K blocks at a time: write=110.344 fdatasync=0.031
posix_fallocate 16 8K blocks at a time: write=181.700 fdatasync=0.030
posix_fallocate 8 8K blocks at a time: write=1599.181 fdatasync=0.032
posix_fallocate 4 8K blocks at a time: write=1076.495 fdatasync=0.029
posix_fallocate 2 8K blocks at a time: write=17192.049 fdatasync=0.028
posix_fallocate 1 8K blocks at a time: write=6244.441 fdatasync=0.028

Test program is attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: extending relations more efficiently

From
Simon Riggs
Date:
On Tue, May 1, 2012 at 3:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> We've previously discussed the possible desirability of extending
> relations in larger increments, rather than one block at a time, for
> performance reasons.  I attempted to determine how much performance we
> could possibly buy this way, and found that, as far as I can see, the
> answer is, basically, none.

Fair enough, but my understanding was that tests showed that the
extension lock was a bottleneck, so doing extensions in larger chunks
should reduce the time we spend waiting for a lock and thus improve
performance. So while your results here show no gain, there is gain to
be had elsewhere as a result.

The algorithm needs to be adaptive, so we don't waste space on smaller
tables or lower loads.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: extending relations more efficiently

From
Andres Freund
Date:
On Tuesday, May 01, 2012 04:08:27 PM Robert Haas wrote:
> We've previously discussed the possible desirability of extending
> relations in larger increments, rather than one block at a time, for
> performance reasons.  I attempted to determine how much performance we
> could possibly buy this way, and found that, as far as I can see, the
> answer is, basically, none.  I wrote a test program which does writes
> until it reaches 1GB, and times how long the writes take in aggregate.
>  Then it performs a single fdatasync at the end and times that as
> well.  On some of the machines it is slightly faster in the aggregate
> to extend in larger chunks, but the magnitude of the change is little
> enough that, at least to me, it seems entirely not worth bothering
> with.  Some results are below.  Now, one thing that this test doesn't
> help much with is the theory that it's better to extend a file in
> larger chunks because the file will become less fragmented on disk.  I
> don't really know how to factor that effect into the test - any ideas?
I think that test disregards the fact that we're holding an exclusive lock 
during the file extension, which happens rather frequently if you have multiple 
COPYs or similar running at the same time. Extending in bigger chunks reduces 
the frequency of taking that lock.

I really would *love* to see improvements in that kind of workload.

> I also considered two other methods of extending a file.  First, there
> is ftruncate().  It's really fast.  Unfortunately, it's unsuitable for
> our purposes because it will cheerfully leave holes in the file, and
> part of the reason for our current implementation is to make sure that
> there are no holes, so that later writes to the file can't fail for
> lack of disk space.  So that's no good.  Second, and more
> interestingly, there is a function called posix_fallocate().  It is
> present on Linux but not on MacOS X; I haven't checked any other
> platforms.  It claims that it will extend a file out to a particular
> size, forcing disk blocks to be allocated so that later writes won't
> fail.  Testing (more details below) shows that posix_fallocate() is
> quite efficient for large chunks.  For example, extending a file to
> 1GB in size 64 blocks at a time (that is, 256kB at a time) took only
> ~60 ms and the subsequent fdatasync took almost no time at all,
> whereas zero-filling the file out 1GB using write() took 600-700 ms
> and the subsequent fdatasync took another 4-5 seconds.  That seems
> like a pretty sizable win, and it's not too hard to imagine that it
> could be even better when the I/O subsystem is busy.  Unfortunately,
> using posix_fallocate() for 8kB chunks seems to be significantly less
> efficient than our current method - I'm guessing that it actually
> writes the updated metadata back to disk, where write() does not (this
> makes one wonder how safe it is to count on write to have the behavior
> we need here in the first place).
Currently the write() doesn't need to be crashsafe because it will be repeated 
on crash-recovery and a checkpoint will fsync the file.

I don't really see why it would need to be competitive in the 8kB case. What 
reason would there be to keep extending in increments that small?

There is the question whether this should be done in the background though, so 
the relation extension lock is never hit in anything time-critical...

Andres


Re: extending relations more efficiently

From
Robert Haas
Date:
On Tue, May 1, 2012 at 10:22 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Fair enough, but my understanding was that tests showed that the
> extension lock was a bottleneck, so doing extensions in larger chunks
> should reduce the time we spend waiting for a lock and thus improve
> performance. So while your results here show no gain, there is gain to
> be had elsewhere as a result.

Maybe, but I'm skeptical.  There are a few cases - such as the problem
I fixed with SInvalReadLock - where the actual overhead of taking and
releasing the lock is the problem.  ProcArrayLock has some
as-yet-unfixed issues in this area as well.  But most of our locking
bottlenecks are caused by doing too much work while holding the lock,
not by acquiring and releasing it too frequently.  A single lock
manager partition can cope with upwards of 30k acquire/release cycles
per second, which would amount to >240MB/sec of file extension.  It
seems very unlikely that we're hitting our head against that ceiling,
although maybe you'd like to suggest a test case.   Rather, I suspect
that it's just plain taking too long to perform the actual extension.
If we could extend by 8 blocks in only 4 times the time it takes to
extend by 1 block, then obviously there would be a win available, but
the test results suggest that isn't the case.

<...thinks...>

Maybe the solution here isn't extending in larger chunks, but allowing
several extensions to proceed in parallel.  It seems likely that a big
chunk of the system call time is being used up waiting for write() to
copy data from user space to kernel space, and there's no reason
several backends couldn't do that in parallel.  I think it would be
sufficient to ensure that nobody starts using block N until the
initial writes of all blocks < N have completed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: extending relations more efficiently

From
Robert Haas
Date:
On Tue, May 1, 2012 at 10:31 AM, Andres Freund <andres@anarazel.de> wrote:
>> efficient than our current method - I'm guessing that it actually
>> writes the updated metadata back to disk, where write() does not (this
>> makes one wonder how safe it is to count on write to have the behavior
>> we need here in the first place).
> Currently the write() doesn't need to be crashsafe because it will be repeated
> on crash-recovery and a checkpoint will fsync the file.

That's not what I'm worried about.  If the write() succeeds and then a
subsequent close() on the filehandle reports an ENOSPC condition that
means the write didn't really write after all, I am concerned that we
might not handle that cleanly.

> I don't really see why it would need to be competitive in the 8kB case. What
> reason would there be to keep extending in increments that small?

In previous discussions, the concern has been that holding the
relation extension lock across a multi-block extension would cause
latency spikes for both the process doing the extensions and any other
concurrent processes that need the lock.  Obviously if it were
possible to extend by 64kB in the same time it takes to extend by 8kB
that would be awesome, but if it takes eight times longer then things
don't look so good.

> There is the question whether this should be done in the background though, so
> the relation extension lock is never hit in anything time-critical...

Yeah, although I'm fuzzy on how and whether that can be made to work,
which is not to say that it can't.

It might also be interesting to provide a mechanism to pre-extend a
relation to a certain number of blocks, though if we did that we'd
have to make sure that autovac got the memo not to truncate those
pages away again.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: extending relations more efficiently

From
Andres Freund
Date:
On Tuesday, May 01, 2012 05:06:11 PM Robert Haas wrote:
> On Tue, May 1, 2012 at 10:31 AM, Andres Freund <andres@anarazel.de> wrote:
> >> efficient than our current method - I'm guessing that it actually
> >> writes the updated metadata back to disk, where write() does not (this
> >> makes one wonder how safe it is to count on write to have the behavior
> >> we need here in the first place).
> > 
> > Currently the write() doesn't need to be crashsafe because it will be
> > repeated on crash-recovery and a checkpoint will fsync the file.
> 
> That's not what I'm worried about.  If the write() succeeds and then a
> subsequent close() on the filehandle reports an ENOSPC condition that
> means the write didn't really write after all, I am concerned that we
> might not handle that cleanly.
Hm. While write() might not write its state to disk, I don't think that 
implies that the *in memory* state is inconsistent.
POSIX doesn't allow ENOSPC for close() as far as I can see.

> > I don't really see why it would need to be competitive in the 8kB case.
> > What reason would there be to keep extending in increments that small?
> In previous discussions, the concern has been that holding the
> relation extension lock across a multi-block extension would cause
> latency spikes for both the process doing the extensions and any other
> concurrent processes that need the lock.  Obviously if it were
> possible to extend by 64kB in the same time it takes to extend by 8kB
> that would be awesome, but if it takes eight times longer then things
> don't look so good.
Yes, sure.

> > There is the question whether this should be done in the background
> > though, so the relation extension lock is never hit in anything
> > time-critical...
> Yeah, although I'm fuzzy on how and whether that can be made to work,
> which is not to say that it can't.
The biggest problem I see is knowing when to trigger the extension of which 
file without scanning files all the time.

Using some limited-size shm queue of {reltblspc, relfilenode} entries for 
to-be-extended files, plus a latch, is the first thing I can think of. Every 
time a backend initializes a page with offset % EXTEND_SIZE == 0 it adds that 
table to the queue. The background writer extends the file by EXTEND_SIZE * 2 
if necessary. If the queue overflows, all files are checked. Or the backends 
extend themselves again...
EXTEND_SIZE should probably scale with the table size up to 64MB or so...

> It might also be interesting to provide a mechanism to pre-extend a
> relation to a certain number of blocks, though if we did that we'd
> have to make sure that autovac got the memo not to truncate those
> pages away again.
Hm. I have to say I don't really see a big need to do this if the size of 
preallocation is adaptive to the file size. Sounds like it would add too much 
complication for little benefit.

Andres


Re: extending relations more efficiently

From
Robert Haas
Date:
On Tue, May 1, 2012 at 11:42 AM, Andres Freund <andres@anarazel.de> wrote:
>> > There is the question whether this should be done in the background
>> > though, so the relation extension lock is never hit in anything
>> > time-critical...
>> Yeah, although I'm fuzzy on how and whether that can be made to work,
>> which is not to say that it can't.
> The biggest problem I see is knowing when to trigger the extension of which
> file without scanning files all the time.
>
> Using some limited-size shm queue of {reltblspc, relfilenode} entries for
> to-be-extended files, plus a latch, is the first thing I can think of.

Perhaps.  An in-memory cache of file sizes would also let us eliminate
a pile of lseek system calls, but the trick is that the
synchronization can't be anything simple like a spinlock - Linux does
it that way in the kernel in versions <= 3.2, and it's catastrophic
for short read-only transactions.

I think the first thing we need here is a good test case, so we're
clear on what we're trying to solve.  I was just hoping to make file
extension *faster* and what you and Simon are talking about is making
it scale better in the face of heavy parallelism; obviously it would
be nice to do both things, but they are different problems.  Any old
bulk-loading test will benefit from a raw performance improvement, but
to test a scalability improvement we would need some kind of test case
involving parallel bulk loads, or some other kind of parallel activity
that causes rapid table growth.  That's not something I've frequently
run into, but I'd be willing to put a bit of time into it if we can
nail down what we're talking about.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: extending relations more efficiently

From
Christopher Browne
Date:
On Tue, May 1, 2012 at 10:22 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Tue, May 1, 2012 at 3:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>> We've previously discussed the possible desirability of extending
>> relations in larger increments, rather than one block at a time, for
>> performance reasons.  I attempted to determine how much performance we
>> could possibly buy this way, and found that, as far as I can see, the
>> answer is, basically, none.
>
> Fair enough, but my understanding was that tests showed that the
> extension lock was a bottleneck, so doing extensions in larger chunks
> should reduce the time we spend waiting for a lock and thus improve
> performance. So while your results here show no gain, there is gain to
> be had elsewhere as a result.

Try to make sure that one of the scenarios involves there being
multiple writers.

One of the relevant cases is where there are multiple writers, where
they may wind up fighting over the last page in the table.  If free
space is in that one page, they might serialize on the request for
access to that page.

Extending by several pages at a time may help *one* of the writers,
but if that adds some I/O work that takes place while other
connections are blocked, waiting, then this might worsen things
somewhat for the other writers.  If this reduces the frequency of that
sort of collision, it might nonetheless be a win.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: extending relations more efficiently

From
Tom Lane
Date:
Christopher Browne <cbbrowne@gmail.com> writes:
> Try to make sure that one of the scenarios involves there being
> multiple writers.

> One of the relevant cases is where there are multiple writers, where
> they may wind up fighting over the last page in the table.  If free
> space is in that one page, they might serialize on the request for
> access to that page.

The system is supposed to be designed to avoid that; we try to hand out
pages with free space to different backends.  One of the advantages of
the current page-at-a-time setup is that when there is *no* free space
according to the FSM, each such backend will create and fill its own
page independently.  They do serialize on the extension lock, but the
intervening tuple additions are parallelized.  We have to be careful to
not make that case worse in a multi-page-extension scheme.
        regards, tom lane


Re: extending relations more efficiently

From
Alvaro Herrera
Date:
Excerpts from Tom Lane's message of Tue May 01 13:19:48 -0400 2012:
> Christopher Browne <cbbrowne@gmail.com> writes:
> > Try to make sure that one of the scenarios involves there being
> > multiple writers.
>
> > One of the relevant cases is where there are multiple writers, where
> > they may wind up fighting over the last page in the table.  If free
> > space is in that one page, they might serialize on the request for
> > access to that page.
>
> The system is supposed to be designed to avoid that; we try to hand out
> pages with free space to different backends.  One of the advantages of
> the current page-at-a-time setup is that when there is *no* free space
> according to the FSM, each such backend will create and fill its own
> page independently.  They do serialize on the extension lock, but the
> intervening tuple additions are parallelized.  We have to be careful to
> not make that case worse in a multi-page-extension scheme.

This didn't work all that well for toast tables that have lots of very
large tuples inserted concurrently, at least with older releases.  Each
backend would acquire the extension lock many times per tuple inserted.
The result was really high contention on the extension lock.

I haven't tested it recently to see if things have already improved,
though.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: extending relations more efficiently

From
Robert Haas
Date:
On Tue, May 1, 2012 at 3:01 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
>> The system is supposed to be designed to avoid that; we try to hand out
>> pages with free space to different backends.  One of the advantages of
>> the current page-at-a-time setup is that when there is *no* free space
>> according to the FSM, each such backend will create and fill its own
>> page independently.  They do serialize on the extension lock, but the
>> intervening tuple additions are parallelized.  We have to be careful to
>> not make that case worse in a multi-page-extension scheme.
>
> This didn't work all that well for toast tables that have lots of very
> large tuples inserted concurrently, at least with older releases.  Each
> backend would acquire the extension lock many times per tuple inserted.
> The result was really high contention on the extension lock.

Hmm, this sounds like a good test case - a small number of SQL
statements triggering a large amount of relation extension work.
Using an unlogged table would probably make it easier to see the
relation-extension contention, too, since you'd get WALInsertLock
mostly out of the way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: extending relations more efficiently

From
Stephen Frost
Date:
* Robert Haas (robertmhaas@gmail.com) wrote:
> I think the first thing we need here is a good test case, so we're
> clear on what we're trying to solve.  I was just hoping to make file
> extension *faster* and what you and Simon are talking about is making
> it scale better in the face of heavy parallelism; obviously it would
> be nice to do both things, but they are different problems.  Any old
> bulk-loading test will benefit from a raw performance improvement, but
> to test a scalability improvement we would need some kind of test case
> involving parallel bulk loads, or some other kind of parallel activity
> that causes rapid table growth.  That's not something I've frequently
> run into, but I'd be willing to put a bit of time into it if we can
> nail down what we're talking about.

Try loading the TIGER 2011 data into a single table, where you load each
county (or perhaps state) in a separate, parallel, process.  That's what
I was doing when I hit this lock full-force and bitched about it on this
list.

I'd be happy to help construct that case, as well as test any changes in
this area which might help address it (on a 10G SSD-backed SAN..).
Thanks,
    Stephen

Re: extending relations more efficiently

From
Robert Haas
Date:
On Tue, May 1, 2012 at 3:48 PM, Stephen Frost <sfrost@snowman.net> wrote:
> Try loading the TIGER 2011 data into a single table, where you load each
> county (or perhaps state) in a separate, parallel, process.  That's what
> I was doing when I hit this lock full-force and bitched about it on this
> list.
>
> I'd be happy to help construct that case, as well as test any changes in
> this area which might help address it (on a 10G SSD-backed SAN..).

Hmm, can you ship me one of those 10G SSD-backed SANs?  That'd be awesome.

In all seriousness, this is not a great test case unless you can
provide some scripts to make it easy to run it in a reproducible
fashion.  Can you?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: extending relations more efficiently

From
Stephen Frost
Date:
Robert,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Tue, May 1, 2012 at 3:48 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > I'd be happy to help construct that case, as well as test any changes in
> > this area which might help address it (on a 10G SSD-backed SAN..).
>
> Hmm, can you ship me one of those 10G SSD-backed SANs?  That'd be awesome.

Yeah...  Not cheap. :)

> In all seriousness, this is not a great test case unless you can
> provide some scripts to make it easy to run it in a reproducible
> fashion.  Can you?

Yeah, sure, I'll do that.  The PostGIS folks have scripts, but they're
kind of ugly, tbh..  I'll give you what I used.
Thanks,
    Stephen

Re: extending relations more efficiently

From
Jeroen Vermeulen
Date:
On 2012-05-01 22:06, Robert Haas wrote:

> It might also be interesting to provide a mechanism to pre-extend a
> relation to a certain number of blocks, though if we did that we'd
> have to make sure that autovac got the memo not to truncate those
> pages away again.

Good point.  And just to check before skipping over it, do we know that 
autovacuum not leaving enough slack space is not a significant cause of 
the bottlenecks in the first place?


Jeroen


Re: extending relations more efficiently

From
Robert Haas
Date:
On Wed, May 2, 2012 at 7:16 AM, Jeroen Vermeulen <jtv@xs4all.nl> wrote:
> On 2012-05-01 22:06, Robert Haas wrote:
>> It might also be interesting to provide a mechanism to pre-extend a
>> relation to a certain number of blocks, though if we did that we'd
>> have to make sure that autovac got the memo not to truncate those
>> pages away again.
>
> Good point.  And just to check before skipping over it, do we know that
> autovacuum not leaving enough slack space is not a significant cause of the
> bottlenecks in the first place?

I'm not sure exactly what you mean by this: autovacuum doesn't need
any slack space.  Regular DML operations can certainly benefit from
slack space, both within each page and overall within the relation.
But there's no evidence that vacuum is doing too good a job cleaning
up the mess, forcing the relation to be re-extended.  Rather, the
usual (and frequent) complaint is that vacuum is leaving way too much
slack space - i.e. bloat.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: extending relations more efficiently

From
Alvaro Herrera
Date:
Excerpts from Robert Haas's message of Wed May 02 08:14:36 -0400 2012:
> On Wed, May 2, 2012 at 7:16 AM, Jeroen Vermeulen <jtv@xs4all.nl> wrote:
> > On 2012-05-01 22:06, Robert Haas wrote:
> >> It might also be interesting to provide a mechanism to pre-extend a
> >> relation to a certain number of blocks, though if we did that we'd
> >> have to make sure that autovac got the memo not to truncate those
> >> pages away again.
> >
> > Good point.  And just to check before skipping over it, do we know that
> > autovacuum not leaving enough slack space is not a significant cause of the
> > bottlenecks in the first place?
>
> I'm not sure exactly what you mean by this: autovacuum doesn't need
> any slack space.  Regular DML operations can certainly benefit from
> slack space, both within each page and overall within the relation.
> But there's no evidence that vacuum is doing too good a job cleaning
> up the mess, forcing the relation to be re-extended.  Rather, the
> usual (and frequent) complaint is that vacuum is leaving way too much
> slack space - i.e. bloat.

Hm.  I see those two things as different -- to me, bloat is unremoved
dead tuples, whereas slack space would be free space that can be reused
by new tuples.  Slack space is useful as it avoids relation extension;
bloat is not.

I wonder, though, if we should set a less-than-100 fillfactor for heap
relations.  Just like default_statistic_target, it seems that the
default value should be a conservative tradeoff between two extremes.
This doesn't help extension for bulk insertions a lot, of course, but
it'd be useful for tables where HOT updates happen with some regularity.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: extending relations more efficiently

From
Robert Haas
Date:
On Wed, May 2, 2012 at 12:26 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Excerpts from Robert Haas's message of Wed May 02 08:14:36 -0400 2012:
>> On Wed, May 2, 2012 at 7:16 AM, Jeroen Vermeulen <jtv@xs4all.nl> wrote:
>> > On 2012-05-01 22:06, Robert Haas wrote:
>> >> It might also be interesting to provide a mechanism to pre-extend a
>> >> relation to a certain number of blocks, though if we did that we'd
>> >> have to make sure that autovac got the memo not to truncate those
>> >> pages away again.
>> >
>> > Good point.  And just to check before skipping over it, do we know that
>> > autovacuum not leaving enough slack space is not a significant cause of the
>> > bottlenecks in the first place?
>>
>> I'm not sure exactly what you mean by this: autovacuum doesn't need
>> any slack space.  Regular DML operations can certainly benefit from
>> slack space, both within each page and overall within the relation.
>> But there's no evidence that vacuum is doing too good a job cleaning
>> up the mess, forcing the relation to be re-extended.  Rather, the
>> usual (and frequent) complaint is that vacuum is leaving way too much
>> slack space - i.e. bloat.
>
> Hm.  I see those two things as different -- to me, bloat is unremoved
> dead tuples, whereas slack space would be free space that can be reused
> by new tuples.  Slack space is useful as it avoids relation extension;
> bloat is not.

I guess I think of bloat as including both unremoved dead tuples and
unwanted internal free space.  If you create a giant table, delete 9
out of every 10 tuples, and vacuum, the table is still "bloated", IMV.

> I wonder, though, if we should set a less-than-100 fillfactor for heap
> relations.  Just like default_statistics_target, it seems that the
> default value should be a conservative tradeoff between two extremes.
> This doesn't help extension for bulk insertions a lot, of course, but
> it'd be useful for tables where HOT updates happen with some regularity.

Perhaps, but in theory that should be self-correcting: the data should
spread itself onto the number of pages where HOT pruning is able to
prevent further relation extension.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: extending relations more efficiently

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, May 2, 2012 at 12:26 PM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
>> Hm.  I see those two things as different -- to me, bloat is unremoved
>> dead tuples, whereas slack space would be free space that can be reused
>> by new tuples.  Slack space is useful as it avoids relation extension;
>> bloat is not.

> I guess I think of bloat as including both unremoved dead tuples and
> unwanted internal free space.  If you create a giant table, delete 9
> out of every 10 tuples, and vacuum, the table is still "bloated", IMV.

The difficulty is to tell the difference between useless free space and
useful free space.  If there's a reasonable probability of putting new
data into a given chunk of free space in the near future, it's not
bloat.
        regards, tom lane


Re: extending relations more efficiently

From
Alvaro Herrera
Date:
Excerpts from Robert Haas's message of mié may 02 12:37:35 -0400 2012:
>
> On Wed, May 2, 2012 at 12:26 PM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
> > Excerpts from Robert Haas's message of mié may 02 08:14:36 -0400 2012:
> >> On Wed, May 2, 2012 at 7:16 AM, Jeroen Vermeulen <jtv@xs4all.nl> wrote:
> >> > On 2012-05-01 22:06, Robert Haas wrote:
> >> >> It might also be interesting to provide a mechanism to pre-extend a
> >> >> relation to a certain number of blocks, though if we did that we'd
> >> >> have to make sure that autovac got the memo not to truncate those
> >> >> pages away again.
> >> >
> >> > Good point.  And just to check before skipping over it, do we know that
> >> > autovacuum not leaving enough slack space is not a significant cause of the
> >> > bottlenecks in the first place?
> >>
> >> I'm not sure exactly what you mean by this: autovacuum doesn't need
> >> any slack space.  Regular DML operations can certainly benefit from
> >> slack space, both within each page and overall within the relation.
> >> But there's no evidence that vacuum is doing too good a job cleaning
> >> up the mess, forcing the relation to be re-extended.  Rather, the
> >> usual (and frequent) complaint is that vacuum is leaving way too much
> >> slack space - i.e. bloat.
> >
> > Hm.  I see those two things as different -- to me, bloat is unremoved
> > dead tuples, whereas slack space would be free space that can be reused
> > by new tuples.  Slack space is useful as it avoids relation extension;
> > bloat is not.
>
> I guess I think of bloat as including both unremoved dead tuples and
> unwanted internal free space.  If you create a giant table, delete 9
> out of every 10 tuples, and vacuum, the table is still "bloated", IMV.

Agreed.  Perhaps to solve this issue what we need is a way to migrate
tuples from later pages into earlier ones.  (This was one of the points,
never resolved, that we discussed during the VACUUM FULL rework.)

> > I wonder, though, if we should set a less-than-100 fillfactor for heap
> > relations.  Just like default_statistics_target, it seems that the
> > default value should be a conservative tradeoff between two extremes.
> > This doesn't help extension for bulk insertions a lot, of course, but
> > it'd be useful for tables where HOT updates happen with some regularity.
>
> Perhaps, but in theory that should be self-correcting: the data should
> spread itself onto the number of pages where HOT pruning is able to
> prevent further relation extension.

True.  I gather you consider that the cases where it doesn't happen due
to particular conditions are the ones that need manual tweaking.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: extending relations more efficiently

From
Robert Haas
Date:
On Wed, May 2, 2012 at 12:46 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Agreed.  Perhaps to solve this issue what we need is a way to migrate
> tuples from later pages into earlier ones.  (This was one of the points,
> never resolved, that we discussed during the VACUUM FULL rework.)

Yeah, I agree.  And frankly, we need to find a way to make it work
without taking AccessExclusiveLock on the relation.  Having to run
VACUUM FULL is killing actual users and scaring off potential ones.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: extending relations more efficiently

From
Alvaro Herrera
Date:
Excerpts from Robert Haas's message of mié may 02 12:55:17 -0400 2012:
>
> On Wed, May 2, 2012 at 12:46 PM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
> > Agreed.  Perhaps to solve this issue what we need is a way to migrate
> > tuples from later pages into earlier ones.  (This was one of the points,
> > never resolved, that we discussed during the VACUUM FULL rework.)
>
> Yeah, I agree.  And frankly, we need to find a way to make it work
> without taking AccessExclusiveLock on the relation.  Having to run
> VACUUM FULL is killing actual users and scaring off potential ones.

And ideally without bloating the indexes while at it.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: extending relations more efficiently

From
Stephen Frost
Date:
Robert,

* Stephen Frost (sfrost@snowman.net) wrote:
> > In all seriousness, this is not a great test case unless you can
> > provide some scripts to make it easy to run it in a reproducible
> > fashion.  Can you?
>
> Yeah, sure, I'll do that.  The PostGIS folks have scripts, but they're
> kind of ugly, tbh..  I'll give you what I used.

Alright, I made it dirt simple.  Get a DB set up, get PostGIS installed
into it (or not, if you don't really want to..) and then download this:

http://tamriel.snowman.net/~sfrost/testcase.tar.gz
(Note: it's ~3.7G)

It'll untar into a 'testcase' directory.  If you have PostGIS
installed already, just run testcase/create_linearwater.sql first, then
run all the other .sql files in there in parallel by passing them to
psql, eg:

for file in tl*.sql; do
    psql -d mydb -f "$file" &
done

It shouldn't take long to see stuff start blocking on that extension
lock, since the table starts out empty.

If you don't want to install PostGIS, just change the
create_linearwater.sql script to have a column at the end named
'the_geom' and nuke the 'AddGeometryColumn' call.  If you run into
trouble getting PG to exhibit the expected behavior, you might try
combining sets of files, like so:

for state in `ls tl_2011_* | cut -f3 -d_ | cut -c1,2 | sort -u`; do
    cat tl_2011_${state}*.sql | psql -d mydb &
done

Which will reduce the number of processes to just the number of states
and territories (eg: 56), while increasing the amount of work each is
doing.

Thanks,

    Stephen

Re: extending relations more efficiently

From
Jeff Davis
Date:
On Tue, 2012-05-01 at 10:08 -0400, Robert Haas wrote:
> We've previously discussed the possible desirability of extending
> relations in larger increments, rather than one block at a time, for
> performance reasons.  I attempted to determine how much performance we
> could possibly buy this way, and found that, as far as I can see, the
> answer is, basically, none.

Another point here is that with checksums, we will want to make sure
that zero pages can be treated as corrupt. That will probably involve
using the WAL for extension operations, and we'll want to mitigate that
cost somehow. Extending in larger chunks would probably be necessary.

There are some challenges there, but I think it's worth pursuing.

Regards,
    Jeff Davis
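[Editor's note: Robert's measurements earlier in this thread used posix_fallocate() to reserve space in larger chunks. A minimal standalone sketch of that approach follows; the helper name extend_by_chunk and the hard-coded 8kB block size are illustrative assumptions, not taken from any posted patch.]

```c
#include <fcntl.h>      /* posix_fallocate */
#include <sys/stat.h>   /* fstat */

#define BLCKSZ 8192     /* PostgreSQL's default block size */

/*
 * Reserve 'nblocks' new blocks at the current end of the file.
 * Unlike ftruncate(), posix_fallocate() actually allocates the disk
 * space, so later writes into the region cannot fail with ENOSPC.
 * Returns 0 on success, nonzero on failure (posix_fallocate returns
 * an error number directly rather than setting errno).
 */
static int
extend_by_chunk(int fd, int nblocks)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    return posix_fallocate(fd, st.st_size, (off_t) nblocks * BLCKSZ);
}
```

Calling this with nblocks = 64 reserves the 64-block chunk size Robert benchmarked. Note posix_fallocate() is available on Linux but not on macOS, so a real patch would need a fallback path; and per Jeff's point, once zero pages are treated as corrupt, each such extension would presumably also need to be WAL-logged.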