Detrimental performance impact of ringbuffers on performance

From: Andres Freund

Hi,

While benchmarking on hydra
(cf. http://archives.postgresql.org/message-id/20160406104352.5bn3ehkcsceja65c%40alap3.anarazel.de),
which has quite slow IO, I was once more annoyed by how incredibly long
the vacuum at the end of a pgbench -i takes.

The issue is that, even for an entirely shared_buffers resident scale,
essentially no data is cached in shared buffers. The COPY to load data
uses a 16MB ringbuffer. Then vacuum uses a 256KB ringbuffer. Which means
that copy immediately writes and evicts all data. Then vacuum reads &
writes the data in small chunks; again evicting nearly all buffers. Then
the creation of the primary key has to read that data *again*.

That's fairly idiotic.

While it's not easy to fix this in the general case (we introduced
those ringbuffers for a reason, after all), I think we should at least
add a special case for loads where shared_buffers isn't fully used
yet.  Why not skip using buffers from the ringbuffer if there are
buffers on the freelist? If we add buffers gathered from there to the
ring, we should have few cases that regress.
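
Something like the below, perhaps (entirely untested sketch against
freelist.c; StrategyGetBuffer() already falls through to the freelist
when GetBufferFromRing() returns NULL, and then adds the resulting
buffer to the ring via AddBufferToRing()):

static BufferDesc *
GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
{
    /*
     * Sketch: while there are never-used buffers on the freelist,
     * don't recycle a ring member; the caller then takes a buffer
     * from the freelist and puts it into the ring.  The unlocked
     * peek is racy, but losing the race just means we recycle a
     * ring buffer as before.
     */
    if (StrategyControl->firstFreeBuffer >= 0)
        return NULL;

    /* ... existing logic: advance the ring, check the buffer ... */
}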

Additionally, maybe we ought to increase the ringbuffer sizes again one
of these days? 256KB for VACUUM is pretty damn low.

Greetings,

Andres Freund



Re: Detrimental performance impact of ringbuffers on performance

From: Robert Haas

On Wed, Apr 6, 2016 at 6:57 AM, Andres Freund <andres@anarazel.de> wrote:
> While benchmarking on hydra
> (cf. http://archives.postgresql.org/message-id/20160406104352.5bn3ehkcsceja65c%40alap3.anarazel.de),
> which has quite slow IO, I was once more annoyed by how incredibly long
> the vacuum at the end of a pgbench -i takes.
>
> The issue is that, even for an entirely shared_buffers resident scale,
> essentially no data is cached in shared buffers. The COPY to load data
> uses a 16MB ringbuffer. Then vacuum uses a 256KB ringbuffer. Which means
> that copy immediately writes and evicts all data. Then vacuum reads &
> writes the data in small chunks; again evicting nearly all buffers. Then
> the creation of the primary key has to read that data *again*.
>
> That's fairly idiotic.
>
> While it's not easy to fix this in the general case (we introduced
> those ringbuffers for a reason, after all), I think we should at least
> add a special case for loads where shared_buffers isn't fully used
> yet.  Why not skip using buffers from the ringbuffer if there are
> buffers on the freelist? If we add buffers gathered from there to the
> ring, we should have few cases that regress.

That does not seem like a good idea from here.  One of the ideas I
still want to explore at some point is having a background process
identify the buffers that are just about to be evicted and stick them
on the freelist so that the backends don't have to run the clock sweep
themselves on a potentially huge number of buffers, at perhaps
substantial CPU cost.  Amit's last attempt at this didn't really pan
out, but I'm not convinced that the approach is without merit.
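
To be concrete, I'm thinking of something shaped like the following,
run from the background writer; every name here is invented, nothing
like it exists today:

/*
 * Hypothetical bgwriter helper: keep the freelist stocked so that
 * backends rarely have to run the clock sweep themselves.
 */
static void
BgWriterFillFreelist(void)
{
    while (StrategyFreelistSize() < freelist_fill_target)   /* invented */
    {
        BufferDesc *buf = StrategySweepNextBuffer();        /* invented */
        uint32      buf_state = LockBufHdr(buf);

        /* only unpinned, not-recently-used buffers are candidates */
        if (BUF_STATE_GET_REFCOUNT(buf_state) == 0 &&
            BUF_STATE_GET_USAGECOUNT(buf_state) == 0)
            StrategyPutOnFreelist(buf);                     /* invented */

        UnlockBufHdr(buf, buf_state);
    }
}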

And, on the other hand, if we don't do something like that, it will be
quite an exceptional case to find anything on the free list.  Doing it
just to speed up developer benchmarking runs seems like the wrong
idea.

> Additionally, maybe we ought to increase the ringbuffer sizes again one
> of these days? 256KB for VACUUM is pretty damn low.

But all that does is force the backend to write to the operating
system, which is where the real buffering happens.  The bottom line
here, IMHO, is not that there's anything wrong with our ring buffer
implementation, but that if you run PostgreSQL on a system where the
I/O is hitting a 5.25" floppy (not to say 8") the performance may be
less than ideal.  I really appreciate IBM donating hydra - it's been
invaluable over the years for improving PostgreSQL performance - but I
sure wish they had donated a better I/O subsystem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Detrimental performance impact of ringbuffers on performance

From: Andres Freund

On 2016-04-12 14:29:10 -0400, Robert Haas wrote:
> On Wed, Apr 6, 2016 at 6:57 AM, Andres Freund <andres@anarazel.de> wrote:
> > While benchmarking on hydra
> > (cf. http://archives.postgresql.org/message-id/20160406104352.5bn3ehkcsceja65c%40alap3.anarazel.de),
> > which has quite slow IO, I was once more annoyed by how incredibly long
> > the vacuum at the end of a pgbench -i takes.
> >
> > The issue is that, even for an entirely shared_buffers resident scale,
> > essentially no data is cached in shared buffers. The COPY to load data
> > uses a 16MB ringbuffer. Then vacuum uses a 256KB ringbuffer. Which means
> > that copy immediately writes and evicts all data. Then vacuum reads &
> > writes the data in small chunks; again evicting nearly all buffers. Then
> > the creation of the primary key has to read that data *again*.
> >
> > That's fairly idiotic.
> >
> > While it's not easy to fix this in the general case (we introduced
> > those ringbuffers for a reason, after all), I think we should at least
> > add a special case for loads where shared_buffers isn't fully used
> > yet.  Why not skip using buffers from the ringbuffer if there are
> > buffers on the freelist? If we add buffers gathered from there to the
> > ring, we should have few cases that regress.
> 
> That does not seem like a good idea from here.  One of the ideas I
> still want to explore at some point is having a background process
> identify the buffers that are just about to be evicted and stick them
> on the freelist so that the backends don't have to run the clock sweep
> themselves on a potentially huge number of buffers, at perhaps
> substantial CPU cost.  Amit's last attempt at this didn't really pan
> out, but I'm not convinced that the approach is without merit.

FWIW, I've posted an implementation of this in the checkpoint flushing
thread; I saw quite substantial gains with it. It was just entirely
unrealistic to push that into 9.6.


> And, on the other hand, if we don't do something like that, it will be
> quite an exceptional case to find anything on the free list.  Doing it
> just to speed up developer benchmarking runs seems like the wrong
> idea.

I don't think it's just developer benchmarks. I've seen a number of
customer systems where significant portions of shared buffers were
unused due to this.

Unless you have an OLTP system, you can right now easily end up in a
situation where, after a restart, you'll never fill shared_buffers,
just because sequential scans for OLAP and COPY use ringbuffers. It
sure isn't perfect to address the problem only while there's free
space in s_b, but it sure is better than just continuing to have
significant portions of s_b unused.


> > Additionally, maybe we ought to increase the ringbuffer sizes again one
> > of these days? 256KB for VACUUM is pretty damn low.
> 
> But all that does is force the backend to write to the operating
> system, which is where the real buffering happens.

Relying on that has imo proven to be a pretty horrible idea.


> The bottom line
> here, IMHO, is not that there's anything wrong with our ring buffer
> implementation, but that if you run PostgreSQL on a system where the
> I/O is hitting a 5.25" floppy (not to say 8") the performance may be
> less than ideal.  I really appreciate IBM donating hydra - it's been
> invaluable over the years for improving PostgreSQL performance - but I
> sure wish they had donated a better I/O subsystem.

It's really not just hydra. I've seen the same problem on 24-disk
RAID-0 type installations. The small ringbuffer leads to reads/writes
being constantly interspersed, apparently defeating readahead.

Greetings,

Andres Freund



Re: Detrimental performance impact of ringbuffers on performance

From: Stephen Frost

Robert, Andres,

* Andres Freund (andres@anarazel.de) wrote:
> On 2016-04-12 14:29:10 -0400, Robert Haas wrote:
> > On Wed, Apr 6, 2016 at 6:57 AM, Andres Freund <andres@anarazel.de> wrote:
> > That does not seem like a good idea from here.  One of the ideas I
> > still want to explore at some point is having a background process
> > identify the buffers that are just about to be evicted and stick them
> > on the freelist so that the backends don't have to run the clock sweep
> > themselves on a potentially huge number of buffers, at perhaps
> > substantial CPU cost.  Amit's last attempt at this didn't really pan
> > out, but I'm not convinced that the approach is without merit.
>
> FWIW, I've posted an implementation of this in the checkpoint flushing
> thread; I saw quite substantial gains with it. It was just entirely
> unrealistic to push that into 9.6.

That is fantastic to hear and I certainly agree that we should be
working on that approach.

> > And, on the other hand, if we don't do something like that, it will be
> > quite an exceptional case to find anything on the free list.  Doing it
> > just to speed up developer benchmarking runs seems like the wrong
> > idea.
>
> I don't think it's just developer benchmarks. I've seen a number of
> customer systems where significant portions of shared buffers were
> unused due to this.

Ditto.

I agree that we should be smarter when we have a bunch of free
shared_buffers space and we're doing sequential work.  I don't think
we want to immediately grab all that free space for the sequential
work, but perhaps there's a reasonable heuristic we could use, such
as: if the free space available is twice what we expect our sequential
read to be, then go ahead and load it into shared buffers?
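
To make that heuristic concrete, the decision at the start of a bulk
read might look something like this (both names here are invented):

/* hypothetical: skip the ring when there's plenty of free space */
if (StrategyFreeBufferCount() >= 2 * estimated_scan_blocks)
    strategy = NULL;            /* use plain shared buffers */
else
    strategy = GetAccessStrategy(BAS_BULKREAD);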

The point here isn't to get rid of the ring buffers but rather to use
the shared buffer space when we have plenty of it and there isn't
contention for it.

Thanks!

Stephen

Re: Detrimental performance impact of ringbuffers on performance

From: Robert Haas

On Tue, Apr 12, 2016 at 2:38 PM, Andres Freund <andres@anarazel.de> wrote:
>> And, on the other hand, if we don't do something like that, it will be
>> quite an exceptional case to find anything on the free list.  Doing it
>> just to speed up developer benchmarking runs seems like the wrong
>> idea.
>
> I don't think it's just developer benchmarks. I've seen a number of
> customer systems where significant portions of shared buffers were
> unused due to this.
>
> Unless you have an OLTP system, you can right now easily end up in a
> situation where, after a restart, you'll never fill shared_buffers,
> just because sequential scans for OLAP and COPY use ringbuffers. It
> sure isn't perfect to address the problem only while there's free
> space in s_b, but it sure is better than just continuing to have
> significant portions of s_b unused.

You will eventually, because each scan will pick a new ring buffer,
and gradually more and more of the relation will get cached.  But it
can take a while.

I'd be more inclined to try to fix this by prewarming the buffers that
were in shared_buffers at shutdown.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Detrimental performance impact of ringbuffers on performance

From: Andres Freund

On 2016-04-13 06:57:15 -0400, Robert Haas wrote:
> You will eventually, because each scan will pick a new ring buffer,
> and gradually more and more of the relation will get cached.  But it
> can take a while.

You really don't need much new data to make that an unobtainable goal
... :/


> I'd be more inclined to try to fix this by prewarming the buffers that
> were in shared_buffers at shutdown.

That doesn't solve the problem of not reacting to actual new data? It's
not that uncommon to regularly load new data with COPY and drop old
partitions, just to keep the workload memory resident...

Andres



Re: Detrimental performance impact of ringbuffers on performance

From: Amit Kapila

On Wed, Apr 13, 2016 at 12:08 AM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-04-12 14:29:10 -0400, Robert Haas wrote:
> > On Wed, Apr 6, 2016 at 6:57 AM, Andres Freund <andres@anarazel.de> wrote:
> > > While benchmarking on hydra
> > > (cf. http://archives.postgresql.org/message-id/20160406104352.5bn3ehkcsceja65c%40alap3.anarazel.de),
> > > which has quite slow IO, I was once more annoyed by how incredibly long
> > > the vacuum at the end of a pgbench -i takes.
> > >
> > > The issue is that, even for an entirely shared_buffers resident scale,
> > > essentially no data is cached in shared buffers. The COPY to load data
> > > uses a 16MB ringbuffer. Then vacuum uses a 256KB ringbuffer. Which means
> > > that copy immediately writes and evicts all data. Then vacuum reads &
> > > writes the data in small chunks; again evicting nearly all buffers. Then
> > > the creation of the primary key has to read that data *again*.
> > >
> > > That's fairly idiotic.
> > >
> > > While it's not easy to fix this in the general case (we introduced
> > > those ringbuffers for a reason, after all), I think we should at least
> > > add a special case for loads where shared_buffers isn't fully used
> > > yet.  Why not skip using buffers from the ringbuffer if there are
> > > buffers on the freelist? If we add buffers gathered from there to the
> > > ring, we should have few cases that regress.
> >
> > That does not seem like a good idea from here.  One of the ideas I
> > still want to explore at some point is having a background process
> > identify the buffers that are just about to be evicted and stick them
> > on the freelist so that the backends don't have to run the clock sweep
> > themselves on a potentially huge number of buffers, at perhaps
> > substantial CPU cost.  Amit's last attempt at this didn't really pan
> > out, but I'm not convinced that the approach is without merit.
>

Yeah, and IIRC I observed that there was a lot of contention in the
dynahash table (when data doesn't fit in shared buffers), due to which
the improvement didn't show a measurable gain in terms of TPS.  Now
that we have reduced the contention (spinlocks) in the dynahash tables
in 9.6, it might be interesting to run the tests again.

> FWIW, I've posted an implementation of this in the checkpoint flushing
> thread; I saw quite substantial gains with it. It was just entirely
> unrealistic to push that into 9.6.
>

Sounds good.  I remember that last time you mentioned such an idea
could benefit the bulk-load case when data doesn't fit in shared
buffers; is that the same case where you saw a benefit, or did other
cases, like the read-only and read-write tests, benefit as well?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Detrimental performance impact of ringbuffers on performance

From: Peter Geoghegan

On Tue, Apr 12, 2016 at 11:38 AM, Andres Freund <andres@anarazel.de> wrote:
>> And, on the other hand, if we don't do something like that, it will be
>> quite an exceptional case to find anything on the free list.  Doing it
>> just to speed up developer benchmarking runs seems like the wrong
>> idea.
>
> I don't think it's just developer benchmarks. I've seen a number of
> customer systems where significant portions of shared buffers were
> unused due to this.
>
> Unless you have an OLTP system, you can right now easily end up in a
> situation where, after a restart, you'll never fill shared_buffers,
> just because sequential scans for OLAP and COPY use ringbuffers. It
> sure isn't perfect to address the problem only while there's free
> space in s_b, but it sure is better than just continuing to have
> significant portions of s_b unused.

I agree that the ringbuffer heuristics are rather unhelpful in many
real-world scenarios. This is definitely a real problem that we should
try to solve soon.

An adaptive strategy based on actual cache pressure in the recent past
would be better. Maybe that would be as simple as not using a
ringbuffer while shared_buffers hasn't been used up yet. That might
not be good enough, but it would probably still be better than what
we have.

Separately, I agree that 256KB is way too low for VACUUM these days.
There is a comment in the buffer directory README about that being
"small enough to fit in L2 cache". I'm pretty sure that that's still
true at least one time over with the latest Raspberry Pi model, so it
should be revisited.

-- 
Peter Geoghegan



Re: Detrimental performance impact of ringbuffers on performance

From: Jeff Janes

On Tue, Apr 12, 2016 at 11:38 AM, Andres Freund <andres@anarazel.de> wrote:
>
>> The bottom line
>> here, IMHO, is not that there's anything wrong with our ring buffer
>> implementation, but that if you run PostgreSQL on a system where the
>> I/O is hitting a 5.25" floppy (not to say 8") the performance may be
>> less than ideal.  I really appreciate IBM donating hydra - it's been
>> invaluable over the years for improving PostgreSQL performance - but I
>> sure wish they had donated a better I/O subsystem.

When I had this problem some years ago, I traced it down to the fact
that you have to sync the WAL before you can evict a dirty page.  If
your vacuum is doing a meaningful amount of cleaning, you encounter a
dirty page with a not-already-synced LSN about once per trip around
the ring buffer.  That really destroys your vacuuming performance with
a 256kB ring if your fsync actually has to reach spinning disk.  What
I ended up doing was hacking it so that it used a BAS_BULKWRITE
strategy when vacuum was being run with a zero vacuum cost delay.
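
From memory, the hack amounted to something like this where vacuum
sets up its strategy (not the exact code I used):

/* use the 16MB bulk-write ring instead of the 256kB vacuum ring
 * when cost-based throttling is off anyway */
BufferAccessStrategy bstrategy;

if (VacuumCostDelay == 0)
    bstrategy = GetAccessStrategy(BAS_BULKWRITE);
else
    bstrategy = GetAccessStrategy(BAS_VACUUM);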

> It's really not just hydra. I've seen the same problem on 24 disk raid-0
> type installations. The small ringbuffer leads to reads/writes being
> constantly interspersed, apparently defeating readahead.

Was there a BBU on that?  I would think slow fsyncs are more likely
than defeated readahead.  On the other hand, I don't hear about too
many 24-disk RAIDs without a BBU.



Re: Detrimental performance impact of ringbuffers on performance

From: Amit Kapila

On Thu, Apr 14, 2016 at 10:22 AM, Peter Geoghegan <pg@heroku.com> wrote:
>
> On Tue, Apr 12, 2016 at 11:38 AM, Andres Freund <andres@anarazel.de> wrote:
> >> And, on the other hand, if we don't do something like that, it will be
> >> quite an exceptional case to find anything on the free list.  Doing it
> >> just to speed up developer benchmarking runs seems like the wrong
> >> idea.
> >
> > I don't think it's just developer benchmarks. I've seen a number of
> > customer systems where significant portions of shared buffers were
> > unused due to this.
> >
> > Unless you have an OLTP system, you can right now easily end up in a
> > situation where, after a restart, you'll never fill shared_buffers,
> > just because sequential scans for OLAP and COPY use ringbuffers. It
> > sure isn't perfect to address the problem only while there's free
> > space in s_b, but it sure is better than just continuing to have
> > significant portions of s_b unused.
>
> I agree that the ringbuffer heuristics are rather unhelpful in many
> real-world scenarios. This is definitely a real problem that we should
> try to solve soon.
>
> An adaptive strategy based on actual cache pressure in the recent past
> would be better. Maybe that would be as simple as not using a
> ringbuffer while shared_buffers hasn't been used up yet. That might
> not be good enough, but it would probably still be better than what
> we have.
>

I think such a strategy could be helpful in certain cases, but I'm not
sure it would be beneficial every time.  There could be cases where we
extend ring buffers to use unused buffers in the shared buffer pool
for bulk-processing workloads, and immediately after that there is
demand for buffers from other statements.  Not sure, but I think an
idea of different kinds of buffer pools could help with some such
cases.  The different kinds of buffer pools could be:

- ring buffers;
- extended ring buffers (relations associated with such buffer pools
  can bypass ring buffers and use unused shared buffers);
- retain/keep buffers (relations that are frequently accessed can be
  associated with this kind of buffer pool, where buffers can stay for
  a longer time);
- a default buffer pool (all relations are associated with it by
  default, with the same behaviour as today).

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Detrimental performance impact of ringbuffers on performance

From: Bruce Momjian

On Wed, Apr  6, 2016 at 12:57:16PM +0200, Andres Freund wrote:
> Hi,
> 
> While benchmarking on hydra
> (cf. http://archives.postgresql.org/message-id/20160406104352.5bn3ehkcsceja65c%40alap3.anarazel.de),
> which has quite slow IO, I was once more annoyed by how incredibly long
> the vacuum at the end of a pgbench -i takes.
> 
> The issue is that, even for an entirely shared_buffers resident scale,
> essentially no data is cached in shared buffers. The COPY to load data
> uses a 16MB ringbuffer. Then vacuum uses a 256KB ringbuffer. Which means
> that copy immediately writes and evicts all data. Then vacuum reads &
> writes the data in small chunks; again evicting nearly all buffers. Then
> the creation of the primary key has to read that data *again*.
> 
> That's fairly idiotic.
> 
> While it's not easy to fix this in the general case (we introduced
> those ringbuffers for a reason, after all), I think we should at least
> add a special case for loads where shared_buffers isn't fully used
> yet.  Why not skip using buffers from the ringbuffer if there are
> buffers on the freelist? If we add buffers gathered from there to the
> ring, we should have few cases that regress.
> 
> Additionally, maybe we ought to increase the ringbuffer sizes again one
> of these days? 256KB for VACUUM is pretty damn low.

Is this a TODO?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +



Re: Detrimental performance impact of ringbuffers on performance

From: Robert Haas

On Fri, Apr 29, 2016 at 7:08 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Wed, Apr  6, 2016 at 12:57:16PM +0200, Andres Freund wrote:
>> While benchmarking on hydra
>> (cf. http://archives.postgresql.org/message-id/20160406104352.5bn3ehkcsceja65c%40alap3.anarazel.de),
>> which has quite slow IO, I was once more annoyed by how incredibly long
>> the vacuum at the end of a pgbench -i takes.
>>
>> The issue is that, even for an entirely shared_buffers resident scale,
>> essentially no data is cached in shared buffers. The COPY to load data
>> uses a 16MB ringbuffer. Then vacuum uses a 256KB ringbuffer. Which means
>> that copy immediately writes and evicts all data. Then vacuum reads &
>> writes the data in small chunks; again evicting nearly all buffers. Then
>> the creation of the primary key has to read that data *again*.
>>
>> That's fairly idiotic.
>>
>> While it's not easy to fix this in the general case (we introduced
>> those ringbuffers for a reason, after all), I think we should at least
>> add a special case for loads where shared_buffers isn't fully used
>> yet.  Why not skip using buffers from the ringbuffer if there are
>> buffers on the freelist? If we add buffers gathered from there to the
>> ring, we should have few cases that regress.
>>
>> Additionally, maybe we ought to increase the ringbuffer sizes again one
>> of these days? 256KB for VACUUM is pretty damn low.
>
> Is this a TODO?

I think we are in agreement that some changes may be needed, but I
don't think we necessarily know what the changes are.  So you could
say something like "improve VACUUM ring buffer logic", for example,
but I think something specific like "increase size of the VACUUM ring
buffer" will just encourage someone to do it as a beginner project,
which it really isn't.  Maybe others disagree, but I don't think this
is a slam-dunk where we can just change the behavior in 10 minutes and
expect to have winners but no losers.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Detrimental performance impact of ringbuffers on performance

From: Andres Freund

Hi,

On 2016-04-06 12:57:16 +0200, Andres Freund wrote:
> While benchmarking on hydra
> (cf. http://archives.postgresql.org/message-id/20160406104352.5bn3ehkcsceja65c%40alap3.anarazel.de),
> which has quite slow IO, I was once more annoyed by how incredibly long
> the vacuum at the end of a pgbench -i takes.
> 
> The issue is that, even for an entirely shared_buffers resident scale,
> essentially no data is cached in shared buffers. The COPY to load data
> uses a 16MB ringbuffer. Then vacuum uses a 256KB ringbuffer. Which means
> that copy immediately writes and evicts all data. Then vacuum reads &
> writes the data in small chunks; again evicting nearly all buffers. Then
> the creation of the primary key has to read that data *again*.
> 
> That's fairly idiotic.
> 
> While it's not easy to fix this in the general case (we introduced
> those ringbuffers for a reason, after all), I think we should at least
> add a special case for loads where shared_buffers isn't fully used
> yet.  Why not skip using buffers from the ringbuffer if there are
> buffers on the freelist? If we add buffers gathered from there to the
> ring, we should have few cases that regress.
> 
> Additionally, maybe we ought to increase the ringbuffer sizes again one
> of these days? 256KB for VACUUM is pretty damn low.

Just to attach some numbers for this. On my laptop, with a pretty fast
disk (as in ~550MB/s read + write, limited by SATA, not the disk), I get
these results.

I initialized a cluster with pgbench -q -i -s 1000, and VACUUM FREEZEd
pgbench_accounts. I ensured that there were enough WAL files
pre-allocated that neither of the tests ran into having to allocate
WAL files.

I first benchmarked master, and then in a second run neutered
GetAccessStrategy() by returning NULL in the BAS_BULKWRITE and
BAS_VACUUM cases.
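
I.e. roughly this in freelist.c:

BufferAccessStrategy
GetAccessStrategy(BufferAccessStrategyType btype)
{
    int         ring_size;

    switch (btype)
    {
        case BAS_NORMAL:
            return NULL;
        case BAS_BULKREAD:
            ring_size = 256 * 1024 / BLCKSZ;
            break;
        case BAS_BULKWRITE:
        case BAS_VACUUM:
            /* neutered for this run: no ring at all, use the normal
             * shared_buffers replacement logic */
            return NULL;
        default:
            elog(ERROR, "unrecognized buffer access strategy: %d",
                 (int) btype);
            return NULL;        /* keep compiler quiet */
    }

    /* ... allocate and return the strategy as before ... */
}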

master:

postgres[949][1]=# CREATE TABLE pgbench_accounts_copy AS SELECT * FROM pgbench_accounts ;
SELECT 100000000
Time: 199803.198 ms (03:19.803)
postgres[949][1]=# VACUUM VERBOSE pgbench_accounts_copy;
INFO:  00000: vacuuming "public.pgbench_accounts_copy"
LOCATION:  lazy_scan_heap, vacuumlazy.c:535
INFO:  00000: "pgbench_accounts_copy": found 0 removable, 100000000 nonremovable row versions in 1639345 out of 1639345
pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 4888968
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 13.31 s, system: 12.82 s, elapsed: 57.86 s.
LOCATION:  lazy_scan_heap, vacuumlazy.c:1500
VACUUM
Time: 57890.969 ms (00:57.891)
postgres[949][1]=# VACUUM FREEZE VERBOSE pgbench_accounts_copy;
INFO:  00000: aggressively vacuuming "public.pgbench_accounts_copy"
LOCATION:  lazy_scan_heap, vacuumlazy.c:530
INFO:  00000: "pgbench_accounts_copy": found 0 removable, 100000000 nonremovable row versions in 1639345 out of 1639345
pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 4888968
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 25.21 s, system: 33.45 s, elapsed: 185.76 s.
LOCATION:  lazy_scan_heap, vacuumlazy.c:1500
Time: 185786.829 ms (03:05.787)

So 199803.198 + 57890.969 + 185786.829 ms


no-copy/vacuum-ringbuffers:

postgres[5372][1]=# CREATE TABLE pgbench_accounts_copy AS SELECT * FROM pgbench_accounts ;
SELECT 100000000
Time: 143109.959 ms (02:23.110)
postgres[5372][1]=# VACUUM VERBOSE pgbench_accounts_copy;
INFO:  00000: vacuuming "public.pgbench_accounts_copy"
LOCATION:  lazy_scan_heap, vacuumlazy.c:535
INFO:  00000: "pgbench_accounts_copy": found 0 removable, 100000000 nonremovable row versions in 1639345 out of 1639345
pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 4888971
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 8.43 s, system: 0.01 s, elapsed: 8.49 s.
LOCATION:  lazy_scan_heap, vacuumlazy.c:1500
VACUUM
Time: 8504.410 ms (00:08.504)
postgres[5372][1]=# VACUUM FREEZE VERBOSE pgbench_accounts_copy;
INFO:  00000: aggressively vacuuming "public.pgbench_accounts_copy"
LOCATION:  lazy_scan_heap, vacuumlazy.c:530
INFO:  00000: "pgbench_accounts_copy": found 0 removable, 100000000 nonremovable row versions in 1639345 out of 1639345
pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 4888971
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 9.07 s, system: 0.78 s, elapsed: 14.22 s.
LOCATION:  lazy_scan_heap, vacuumlazy.c:1500
VACUUM
Time: 14235.619 ms (00:14.236)

So 143109.959 + 8504.410 + 14235.619 ms.


The relative improvements are:
CREATE TABLE AS: 199803.198 -> 143109.959: 39% improvement
VACUUM: 57890.969 -> 8504.410: 580% improvement
VACUUM FREEZE: 1205% improvement

And even if you were to argue - which I don't find entirely convincing -
that the checkpoint's time should be added afterwards, that's *still*
*much* faster:

postgres[5372][1]=# CHECKPOINT ;
Time: 33592.877 ms (00:33.593)


We probably can't remove the ringbuffer concept from these places, but I
think we should allow users to disable them. Forcing bulk-loads, vacuum,
analytics queries to go to the OS/disk, just because of a heuristic that
can't be disabled, yielding massive slowdowns, really sucks.


Small aside: It really sucks that we right now force each relation to
essentially be written twice, even leaving hint bits and freezing
aside. First we fill it with zeroes (the smgrextend() call in
ReadBuffer_common()), and then later with the actual contents.

Greetings,

Andres Freund



Re: [HACKERS] Detrimental performance impact of ringbuffers on performance

From: Robert Haas

On Tue, May 7, 2019 at 4:16 PM Andres Freund <andres@anarazel.de> wrote:
> Just to attach some numbers for this. On my laptop, with a pretty fast
> disk (as in ~550MB/s read + write, limited by SATA, not the disk), I get
> these results.
>
>  [ results showing ring buffers massively hurting performance ]

Links to some previous discussions:

http://postgr.es/m/8737e9bddb82501da1134f021bf4929a@postgrespro.ru
http://postgr.es/m/CAMkU=1yV=Zq8sHviv5Nwajv5woWOvZb7bx45rgDvtxs4P6W1Pw@mail.gmail.com

> We probably can't remove the ringbuffer concept from these places, but I
> think we should allow users to disable them. Forcing bulk-loads, vacuum,
> analytics queries to go to the OS/disk, just because of a heuristic that
> can't be disabled, yielding massive slowdowns, really sucks.

The discussions to which I linked above seem to suggest that one of
the big issues is that the ring buffer must be large enough that WAL
flush for a buffer can complete before we go all the way around the
ring and get back to the same buffer.  It doesn't seem unlikely that
the size necessary for that to be true has changed over the years, or
even that it's different on different hardware.  When I did some
benchmarking in this area many years ago, I found that, as you
increase the ring buffer size, performance improves for a while and
then more or less levels off at a certain point.  And at that point
performance is not much worse than it would be with no ring buffer,
but you maintain some protection against cache-trashing.  Your
scenario assumes that the system has no concurrent activity which will
suffer as a result of blowing out the cache, but in general that's
probably not true.

It seems to me that it might be time to bite the bullet and add GUCs
for the ring buffer sizes.  Then, we could make the default sizes big
enough that on normal-ish hardware the performance penalty is not too
severe (like, it's measured as a percentage rather than a multiple),
and we could make a 0 value disable the ring buffer altogether.
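
As a sketch, with invented names and defaults, the guc.c side could
look about like this (with a matching "int vacuum_ring_buffer_size"
declared elsewhere, and a tweak in GetAccessStrategy() to return NULL
when the relevant GUC is 0):

/* hypothetical entry in ConfigureNamesInt[] */
{
    {"vacuum_ring_buffer_size", PGC_USERSET, RESOURCES_MEM,
        gettext_noop("Size of VACUUM's ring buffer; 0 disables it."),
        NULL,
        GUC_UNIT_KB
    },
    &vacuum_ring_buffer_size,
    256, 0, MAX_KILOBYTES,
    NULL, NULL, NULL
},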

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Detrimental performance impact of ringbuffers on performance

From: Tomas Vondra

On Wed, May 08, 2019 at 10:08:03AM -0400, Robert Haas wrote:
>On Tue, May 7, 2019 at 4:16 PM Andres Freund <andres@anarazel.de> wrote:
>> Just to attach some numbers for this. On my laptop, with a pretty fast
>> disk (as in ~550MB/s read + write, limited by SATA, not the disk), I get
>> these results.
>>
>>  [ results showing ring buffers massively hurting performance ]
>
>Links to some previous discussions:
>
>http://postgr.es/m/8737e9bddb82501da1134f021bf4929a@postgrespro.ru
>http://postgr.es/m/CAMkU=1yV=Zq8sHviv5Nwajv5woWOvZb7bx45rgDvtxs4P6W1Pw@mail.gmail.com
>
>> We probably can't remove the ringbuffer concept from these places, but I
>> think we should allow users to disable them. Forcing bulk-loads, vacuum,
>> analytics queries to go to the OS/disk, just because of a heuristic that
>> can't be disabled, yielding massive slowdowns, really sucks.
>
>The discussions to which I linked above seem to suggest that one of
>the big issues is that the ring buffer must be large enough that WAL
>flush for a buffer can complete before we go all the way around the
>ring and get back to the same buffer.  It doesn't seem unlikely that
>the size necessary for that to be true has changed over the years, or
>even that it's different on different hardware.  When I did some
>benchmarking in this area many years ago, I found that, as you
>increase the ring buffer size, performance improves for a while and
>then more or less levels off at a certain point.  And at that point
>performance is not much worse than it would be with no ring buffer,
>but you maintain some protection against cache-trashing.  Your
>scenario assumes that the system has no concurrent activity which will
>suffer as a result of blowing out the cache, but in general that's
>probably not true.
>
>It seems to me that it might be time to bite the bullet and add GUCs
>for the ring buffer sizes.  Then, we could make the default sizes big
>enough that on normal-ish hardware the performance penalty is not too
>severe (like, it's measured as a percentage rather than a multiple),
>and we could make a 0 value disable the ring buffer altogether.
>

IMO adding such a GUC would be useful for testing, which is something
we should probably do anyway, and then based on the results we could
either keep the GUC, modify the default somehow, or do nothing.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [HACKERS] Detrimental performance impact of ringbuffers on performance

From: Andrey Borodin

> On 8 May 2019, at 1:16, Andres Freund <andres@anarazel.de> wrote:
>
> We probably can't remove the ringbuffer concept from these places, but I
> think we should allow users to disable them. Forcing bulk-loads, vacuum,
> analytics queries to go to the OS/disk, just because of a heuristic that
> can't be disabled, yielding massive slowdowns, really sucks.

If we get a scan-resistant shared buffers eviction strategy [0], we
will not need ring buffers at all.
Are there any other reasons to have these rings?

Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/flat/89A121E3-B593-4D65-98D9-BBC210B87268%40yandex-team.ru


Re: [HACKERS] Detrimental performance impact of ringbuffers on performance

From: Andres Freund

Hi,

On 2019-05-08 10:08:03 -0400, Robert Haas wrote:
> On Tue, May 7, 2019 at 4:16 PM Andres Freund <andres@anarazel.de> wrote:
> > Just to attach some numbers for this. On my laptop, with a pretty fast
> > disk (as in ~550MB/s read + write, limited by SATA, not the disk), I get
> > these results.
> >
> >  [ results showing ring buffers massively hurting performance ]
>
> Links to some previous discussions:
>
> http://postgr.es/m/8737e9bddb82501da1134f021bf4929a@postgrespro.ru
> http://postgr.es/m/CAMkU=1yV=Zq8sHviv5Nwajv5woWOvZb7bx45rgDvtxs4P6W1Pw@mail.gmail.com
>
> > We probably can't remove the ringbuffer concept from these places, but I
> > think we should allow users to disable them. Forcing bulk-loads, vacuum,
> > analytics queries to go to the OS/disk, just because of a heuristic that
> > can't be disabled, yielding massive slowdowns, really sucks.
>
> The discussions to which I linked above seem to suggest that one of
> the big issues is that the ring buffer must be large enough that WAL
> flush for a buffer can complete before we go all the way around the
> ring and get back to the same buffer.

That is some of the problem, true. But even on unlogged tables the
ringbuffers cause quite a massive performance deterioration. Without
the ringbuffers we write twice the size of the relation (once with
zeroes for the file extension, once with the actual data). With the
ringbuffer we do so two or three additional times (hint bits + normal
vacuum, then freezing).

On a test-cluster that replaced the smgrextend() for heap with
posix_fallocate() (to avoid the unnecessary write), I measured the
performance of CTAS UNLOGGED SELECT * FROM pgbench_accounts_scale_1000
with and without ringbuffers:

With ringbuffers:
CREATE UNLOGGED TABLE AS: Time: 67808.643 ms (01:07.809)
VACUUM: Time: 53020.848 ms (00:53.021)
VACUUM FREEZE: Time: 55809.247 ms (00:55.809)

Without ringbuffers:
CREATE UNLOGGED TABLE AS: Time: 45981.237 ms (00:45.981)
VACUUM: Time: 23386.818 ms (00:23.387)
VACUUM FREEZE: Time: 5892.204 ms (00:05.892)
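
(The smgrextend() replacement was roughly the following in mdextend(),
in place of writing out a zero-filled page; sketch only, and note that
posix_fallocate() returns the error code rather than setting errno:)

    int     rc;

    rc = posix_fallocate(FileGetRawDesc(v->mdfd_vfd),
                         (off_t) blocknum * BLCKSZ, BLCKSZ);
    if (rc != 0)
    {
        errno = rc;
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not extend file \"%s\": %m",
                        FilePathName(v->mdfd_vfd))));
    }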



> It doesn't seem unlikely that
> the size necessary for that to be true has changed over the years, or
> even that it's different on different hardware.  When I did some
> benchmarking in this area many years ago, I found that, as you
> increase the ring buffer size, performance improves for a while and
> then more or less levels off at a certain point.  And at that point
> performance is not much worse than it would be with no ring buffer,
> but you maintain some protection against cache-trashing.  Your
> scenario assumes that the system has no concurrent activity which will
> suffer as a result of blowing out the cache, but in general that's
> probably not true.

Well, I noted that I'm not proposing to actually just rip out the
ringbuffers.

But I also don't think it's just a question of concurrent activity. It's
a question of having concurrent activity *and* workloads that are
smaller than shared buffers.

Given current memory sizes a *lot* of workloads fit entirely in shared
buffers - but for vacuum, seqscans (including copy), it's basically
impossible to ever take advantage of that memory, unless your workload
otherwise forces it into s_b entirely (or you manually load the data
into s_b).


> It seems to me that it might be time to bite the bullet and add GUCs
> for the ring buffer sizes.  Then, we could make the default sizes big
> enough that on normal-ish hardware the performance penalty is not too
> severe (like, it's measured as a percentage rather than a multiple),
> and we could make a 0 value disable the ring buffer altogether.

Yea, it'd be considerably better than today. It'd importantly allow us
to more easily benchmark a lot of this.

I think it might make sense to have a VACUUM option for disabling the
ringbuffer too, especially for cases where vacuuming is urgent.


I think what we ought to do to fix this issue in a bit more principled
manner (afterwards) is:

1) For ringbuffer'ed scans, if there are unused buffers, use them,
   instead of recycling a buffer from the ring. If so, replace the
   previous member of the ring with the previously unused one.  When
   doing so, just reduce the usagecount by one (unless already zero), so
   it readily can be replaced.

   I think we should do so even when the to-be-replaced ringbuffer entry
   is currently dirty. But even if we couldn't agree on that, it'd
   already be a significant improvement if we only did this for clean buffers.

   That'd fix a good chunk of the "my shared buffers is never actually
   used" type issues. I personally think it's indefensible that
   we don't do that today.
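   (A rough sketch of this follows after the list.)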

2) When a valid buffer in the ringbuffer is dirty when about to be
   replaced, instead of doing the FlushBuffer ourselves (and thus
   waiting for an XLogFlush in many cases), put it into a separate
   ringbuffer/queue that's processed by bgwriter. And have that then
   invalidate the buffer and put it on the freelist (unless usagecount
   was bumped since, of course).

   That'd fix the issue that we're slowed down by constantly doing
   XLogFlush() for fairly small chunks of WAL.

3) When, for a ringbuffer scan, there are no unused buffers, but buffers
   with a zero-usagecount, use them too without evicting the previous
   ringbuffer entry. But do so without advancing the normal clock sweep
   (i.e. decrementing usagecounts). That allows buffer contents to be
   slowly replaced with data accessed during ringbuffer scans.
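
For (1), roughly (untested; StrategyTakeFreeBuffer() is made up, the
other names exist in freelist.c / buf_internals.h):

/* in GetBufferFromRing(), before recycling the current ring slot */
if (StrategyControl->firstFreeBuffer >= 0)
{
    BufferDesc *buf = StrategyTakeFreeBuffer();     /* made up */

    if (buf != NULL)
    {
        uint32      local_buf_state = LockBufHdr(buf);

        /* the fresh buffer replaces the old ring member */
        strategy->buffers[strategy->current] =
            BufferDescriptorGetBuffer(buf);

        /* keep it cheap to replace later */
        if (BUF_STATE_GET_USAGECOUNT(local_buf_state) > 0)
            local_buf_state -= BUF_USAGECOUNT_ONE;

        /* like the normal path, return with the header lock held */
        *buf_state = local_buf_state;
        return buf;
    }
}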


Regards,

Andres



Re: [HACKERS] Detrimental performance impact of ringbuffers on performance

From: Andres Freund

Hi,

On 2019-05-08 21:35:06 +0500, Andrey Borodin wrote:
> > On 8 May 2019, at 1:16, Andres Freund <andres@anarazel.de> wrote:
> > 
> > We probably can't remove the ringbuffer concept from these places, but I
> > think we should allow users to disable them. Forcing bulk-loads, vacuum,
> > analytics queries to go to the OS/disk, just because of a heuristic that
> > can't be disabled, yielding massive slowdowns, really sucks.
> 
> If we get a scan-resistant shared buffers eviction strategy [0], we
> will not need ring buffers at all.

For me that's a fairly big if, fwiw. But it'd be cool.


> Are there any other reasons to have these rings?

Currently they also limit the amount of dirty data added to the
system. I don't think that's a generally good property (e.g. because
it'll cause a lot of writes that will have to happen again later), but e.g. for
initial data loads with COPY FREEZE it's helpful. It slows down the
backend(s) causing the work (i.e. doing COPY), rather than other
backends (e.g. because they need to evict the buffers, therefore first
needing to clean them).

Greetings,

Andres Freund