Thread: Detrimental performance impact of ringbuffers on performance
Hi, While benchmarking on hydra (cf. http://archives.postgresql.org/message-id/20160406104352.5bn3ehkcsceja65c%40alap3.anarazel.de), which has quite slow IO, I was once more annoyed by how incredibly long the vacuum at the end of a pgbench -i takes. The issue is that, even for a scale that fits entirely in shared_buffers, essentially no data is cached in shared buffers. The COPY to load data uses a 16MB ringbuffer. Then vacuum uses a 256KB ringbuffer. Which means that copy immediately writes and evicts all data. Then vacuum reads & writes the data in small chunks, again evicting nearly all buffers. Then the creation of the primary key index has to read that data *again*. That's fairly idiotic. While it's not easy to fix this in the general case (we introduced those ringbuffers for a reason, after all), I think we should at least add a special case for loads where shared_buffers isn't fully used yet. Why not skip using buffers from the ringbuffer if there are buffers on the freelist? If we add buffers gathered from there to the ring, we should have few cases that regress. Additionally, maybe we ought to increase the ringbuffer sizes again one of these days? 256kB for VACUUM is pretty damn low. Greetings, Andres Freund
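As a rough sketch of the special case proposed above (illustration only, not actual PostgreSQL code): GetBufferFromRing(), AddBufferToRing() and StrategyControl do exist in src/backend/storage/buffer/freelist.c, but GetBufferFromFreelist() and the exact integration point are assumed here for the sake of the example.

/*
 * Sketch (not actual PostgreSQL code): prefer never-used buffers from the
 * freelist over recycling a ring slot, so bulk loads can populate an
 * otherwise-empty shared_buffers.  Locking is omitted for brevity.
 */
static BufferDesc *
StrategyGetBufferPreferFreelist(BufferAccessStrategy strategy, uint32 *buf_state)
{
	BufferDesc *buf;

	/* If shared_buffers still has never-used buffers, take one of those. */
	if (StrategyControl->firstFreeBuffer >= 0)
	{
		buf = GetBufferFromFreelist(buf_state);	/* assumed helper */
		if (buf != NULL)
		{
			/* Remember it in the ring so a later pass can reuse it. */
			AddBufferToRing(strategy, buf);
			return buf;
		}
	}

	/* Freelist empty: fall back to the existing ring behaviour. */
	return GetBufferFromRing(strategy, buf_state);
}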
On Wed, Apr 6, 2016 at 6:57 AM, Andres Freund <andres@anarazel.de> wrote: > While benchmarking on hydra > (c.f. http://archives.postgresql.org/message-id/20160406104352.5bn3ehkcsceja65c%40alap3.anarazel.de), > which has quite slow IO, I was once more annoyed by how incredibly long > the vacuum at the the end of a pgbench -i takes. > > The issue is that, even for an entirely shared_buffers resident scale, > essentially no data is cached in shared buffers. The COPY to load data > uses a 16MB ringbuffer. Then vacuum uses a 256KB ringbuffer. Which means > that copy immediately writes and evicts all data. Then vacuum reads & > writes the data in small chunks; again evicting nearly all buffers. Then > the creation of the ringbuffer has to read that data *again*. > > That's fairly idiotic. > > While it's not easy to fix this in the general case, we introduced those > ringbuffers for a reason after all, I think we at least should add a > special case for loads where shared_buffers isn't fully used yet. Why > not skip using buffers from the ringbuffer if there's buffers on the > freelist? If we add buffers gathered from there to the ringlist, we > should have few cases that regress. That does not seem like a good idea from here. One of the ideas I still want to explore at some point is having a background process identify the buffers that are just about to be evicted and stick them on the freelist so that the backends don't have to run the clock sweep themselves on a potentially huge number of buffers, at perhaps substantial CPU cost. Amit's last attempt at this didn't really pan out, but I'm not convinced that the approach is without merit. And, on the other hand, if we don't do something like that, it will be quite an exceptional case to find anything on the free list. Doing it just to speed up developer benchmarking runs seems like the wrong idea. > Additionally, maybe we ought to increase the ringbuffer sizes again one > of these days? 256kb for VACUUM is pretty damn low. But all that does is force the backend to write to the operating system, which is where the real buffering happens. The bottom line here, IMHO, is not that there's anything wrong with our ring buffer implementation, but that if you run PostgreSQL on a system where the I/O is hitting a 5.25" floppy (not to say 8") the performance may be less than ideal. I really appreciate IBM donating hydra - it's been invaluable over the years for improving PostgreSQL performance - but I sure wish they had donated a better I/O subsystem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
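Loosely sketched, the background-process idea described above might look like the following. StrategySyncStart(), GetBufferDescriptor(), LockBufHdr() and StrategyFreeBuffer() are real buffer-manager routines, but the loop itself is an illustration, not code from any actual patch, and it glosses over the need to drop each buffer's mapping entry before freeing it.

/*
 * Sketch only: a background process walks a window of buffers just ahead
 * of the clock-sweep hand and pushes the ones that would be evicted
 * anyway (unpinned, usage count zero, clean) onto the freelist, so that
 * foreground backends rarely have to run the sweep themselves.
 * A real version would also have to invalidate the buffer's mapping
 * before freeing it; that step is omitted here.
 */
static void
RefillFreelistAheadOfSweep(int window_size)
{
	int			next = StrategySyncStart(NULL, NULL);	/* current clock hand */

	for (int i = 0; i < window_size; i++)
	{
		BufferDesc *buf = GetBufferDescriptor((next + i) % NBuffers);
		uint32		buf_state = LockBufHdr(buf);

		if (BUF_STATE_GET_REFCOUNT(buf_state) == 0 &&
			BUF_STATE_GET_USAGECOUNT(buf_state) == 0 &&
			!(buf_state & BM_DIRTY))
		{
			UnlockBufHdr(buf, buf_state);
			StrategyFreeBuffer(buf);	/* put it on the freelist */
		}
		else
			UnlockBufHdr(buf, buf_state);
	}
}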
On 2016-04-12 14:29:10 -0400, Robert Haas wrote: > On Wed, Apr 6, 2016 at 6:57 AM, Andres Freund <andres@anarazel.de> wrote: > > While benchmarking on hydra > > (c.f. http://archives.postgresql.org/message-id/20160406104352.5bn3ehkcsceja65c%40alap3.anarazel.de), > > which has quite slow IO, I was once more annoyed by how incredibly long > > the vacuum at the the end of a pgbench -i takes. > > > > The issue is that, even for an entirely shared_buffers resident scale, > > essentially no data is cached in shared buffers. The COPY to load data > > uses a 16MB ringbuffer. Then vacuum uses a 256KB ringbuffer. Which means > > that copy immediately writes and evicts all data. Then vacuum reads & > > writes the data in small chunks; again evicting nearly all buffers. Then > > the creation of the ringbuffer has to read that data *again*. > > > > That's fairly idiotic. > > > > While it's not easy to fix this in the general case, we introduced those > > ringbuffers for a reason after all, I think we at least should add a > > special case for loads where shared_buffers isn't fully used yet. Why > > not skip using buffers from the ringbuffer if there's buffers on the > > freelist? If we add buffers gathered from there to the ringlist, we > > should have few cases that regress. > > That does not seem like a good idea from here. One of the ideas I > still want to explore at some point is having a background process > identify the buffers that are just about to be evicted and stick them > on the freelist so that the backends don't have to run the clock sweep > themselves on a potentially huge number of buffers, at perhaps > substantial CPU cost. Amit's last attempt at this didn't really pan > out, but I'm not convinced that the approach is without merit. FWIW, I've posted an implementation of this in the checkpoint flushing thread; I saw quite substantial gains with it. It was just entirely unrealistic to push that into 9.6. > And, on the other hand, if we don't do something like that, it will be > quite an exceptional case to find anything on the free list. Doing it > just to speed up developer benchmarking runs seems like the wrong > idea. I don't think it's just developer benchmarks. I've seen a number of customer systems where significant portions of shared buffers were unused due to this. Unless you have an OLTP system, you can right now easily end up in a situation where, after a restart, you'll never fill shared_buffers. Just because sequential scans for OLAP and COPY use ringbuffers. It sure isn't perfect to address the problem while there's free space in s_b, but it sure is better than to just continue to have significant portions of s_b unused. > > Additionally, maybe we ought to increase the ringbuffer sizes again one > > of these days? 256kb for VACUUM is pretty damn low. > > But all that does is force the backend to write to the operating > system, which is where the real buffering happens. Relying on that has imo proven to be a pretty horrible idea. > The bottom line > here, IMHO, is not that there's anything wrong with our ring buffer > implementation, but that if you run PostgreSQL on a system where the > I/O is hitting a 5.25" floppy (not to say 8") the performance may be > less than ideal. I really appreciate IBM donating hydra - it's been > invaluable over the years for improving PostgreSQL performance - but I > sure wish they had donated a better I/O subsystem. It's really not just hydra. I've seen the same problem on 24 disk raid-0 type installations. 
The small ringbuffer leads to reads/writes being constantly interspersed, apparently defeating readahead. Greetings, Andres Freund
Robert, Andres, * Andres Freund (andres@anarazel.de) wrote: > On 2016-04-12 14:29:10 -0400, Robert Haas wrote: > > On Wed, Apr 6, 2016 at 6:57 AM, Andres Freund <andres@anarazel.de> wrote: > > That does not seem like a good idea from here. One of the ideas I > > still want to explore at some point is having a background process > > identify the buffers that are just about to be evicted and stick them > > on the freelist so that the backends don't have to run the clock sweep > > themselves on a potentially huge number of buffers, at perhaps > > substantial CPU cost. Amit's last attempt at this didn't really pan > > out, but I'm not convinced that the approach is without merit. > > FWIW, I've posted an implementation of this in the checkpoint flushing > thread; I saw quite substantial gains with it. It was just entirely > unrealistic to push that into 9.6. That is fantastic to hear and I certainly agree that we should be working on that approach. > > And, on the other hand, if we don't do something like that, it will be > > quite an exceptional case to find anything on the free list. Doing it > > just to speed up developer benchmarking runs seems like the wrong > > idea. > > I don't think it's just developer benchmarks. I've seen a number of > customer systems where significant portions of shared buffers were > unused due to this. Ditto. I agree that we should be smarter when we have a bunch of free shared_buffers space and we're doing sequential work. I don't think we want to immediately grab all that free space for the sequential work but perhaps there's a reasonable heuristic we could use- such as if the free space available is twice what we expect our sequential read to be, then go ahead and load it into shared buffers? The point here isn't to get rid of the ring buffers but rather to use the shared buffer space when we have plenty of it and there isn't contention for it. Thanks! Stephen
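The heuristic suggested above might amount to something as small as this; every name in the snippet is hypothetical, and the free-buffer count and scan-size estimate would have to come from the strategy layer and the caller respectively.

/*
 * Purely illustrative: bypass the ring buffer only when the number of
 * free (never-used) shared buffers is at least twice the expected size
 * of the sequential read.  Both helpers are hypothetical.
 */
static bool
ShouldBypassRingBuffer(BlockNumber expected_scan_blocks)
{
	int64		free_buffers = CountFreeSharedBuffers();	/* hypothetical */

	return free_buffers >= 2 * (int64) expected_scan_blocks;
}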
On Tue, Apr 12, 2016 at 2:38 PM, Andres Freund <andres@anarazel.de> wrote: >> And, on the other hand, if we don't do something like that, it will be >> quite an exceptional case to find anything on the free list. Doing it >> just to speed up developer benchmarking runs seems like the wrong >> idea. > > I don't think it's just developer benchmarks. I've seen a number of > customer systems where significant portions of shared buffers were > unused due to this. > > Unless you have an OLTP system, you can right now easily end up in a > situation where, after a restart, you'll never fill shared_buffers. > Just because sequential scans for OLAP and COPY use ringbuffers. It sure > isn't perfect to address the problem while there's free space in s_b, > but it sure is better than to just continue to have significant portions > of s_b unused. You will eventually, because each scan will pick a new ring buffer, and gradually more and more of the relation will get cached. But it can take a while. I'd be more inclined to try to fix this by prewarming the buffers that were in shared_buffers at shutdown. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016-04-13 06:57:15 -0400, Robert Haas wrote: > You will eventually, because each scan will pick a new ring buffer, > and gradually more and more of the relation will get cached. But it > can take a while. You really don't need much new data to make that an unobtainable goal ... :/ > I'd be more inclined to try to fix this by prewarming the buffers that > were in shared_buffers at shutdown. That doesn't solve the problem of not reacting to actual new data? It's not that uncommon to regularly load new data with copy and drop old partitions, just to keep the workload memory resident... Andres
On Wed, Apr 13, 2016 at 12:08 AM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-04-12 14:29:10 -0400, Robert Haas wrote:
> > On Wed, Apr 6, 2016 at 6:57 AM, Andres Freund <andres@anarazel.de> wrote:
> > > While benchmarking on hydra
> > > (c.f. http://archives.postgresql.org/message-id/20160406104352.5bn3ehkcsceja65c%40alap3.anarazel.de),
> > > which has quite slow IO, I was once more annoyed by how incredibly long
> > > the vacuum at the the end of a pgbench -i takes.
> > >
> > > The issue is that, even for an entirely shared_buffers resident scale,
> > > essentially no data is cached in shared buffers. The COPY to load data
> > > uses a 16MB ringbuffer. Then vacuum uses a 256KB ringbuffer. Which means
> > > that copy immediately writes and evicts all data. Then vacuum reads &
> > > writes the data in small chunks; again evicting nearly all buffers. Then
> > > the creation of the ringbuffer has to read that data *again*.
> > >
> > > That's fairly idiotic.
> > >
> > > While it's not easy to fix this in the general case, we introduced those
> > > ringbuffers for a reason after all, I think we at least should add a
> > > special case for loads where shared_buffers isn't fully used yet. Why
> > > not skip using buffers from the ringbuffer if there's buffers on the
> > > freelist? If we add buffers gathered from there to the ringlist, we
> > > should have few cases that regress.
> >
> > That does not seem like a good idea from here. One of the ideas I
> > still want to explore at some point is having a background process
> > identify the buffers that are just about to be evicted and stick them
> > on the freelist so that the backends don't have to run the clock sweep
> > themselves on a potentially huge number of buffers, at perhaps
> > substantial CPU cost. Amit's last attempt at this didn't really pan
> > out, but I'm not convinced that the approach is without merit.
>
Yeah, and IIRC I observed that there was a lot of contention in the dynahash table (when the data doesn't fit in shared buffers), due to which the improvement didn't show a measurable gain in terms of TPS. Now that we have reduced the contention (spinlocks) in the dynahash tables in 9.6, it might be interesting to run the tests again.
> FWIW, I've posted an implementation of this in the checkpoint flushing
> thread; I saw quite substantial gains with it. It was just entirely
> unrealistic to push that into 9.6.
>
Sounds good. I remember you mentioned last time that such an idea could benefit the bulk-load case when the data doesn't fit in shared buffers. Is that the same case where you saw the benefit, or did other cases like the read-only and read-write tests benefit as well?
On Tue, Apr 12, 2016 at 11:38 AM, Andres Freund <andres@anarazel.de> wrote: >> And, on the other hand, if we don't do something like that, it will be >> quite an exceptional case to find anything on the free list. Doing it >> just to speed up developer benchmarking runs seems like the wrong >> idea. > > I don't think it's just developer benchmarks. I've seen a number of > customer systems where significant portions of shared buffers were > unused due to this. > > Unless you have an OLTP system, you can right now easily end up in a > situation where, after a restart, you'll never fill shared_buffers. > Just because sequential scans for OLAP and COPY use ringbuffers. It sure > isn't perfect to address the problem while there's free space in s_b, > but it sure is better than to just continue to have significant portions > of s_b unused. I agree that the ringbuffer heuristics are rather unhelpful in many real-world scenarios. This is definitely a real problem that we should try to solve soon. An adaptive strategy based on actual cache pressure in the recent past would be better. Maybe that would be as simple as not using a ringbuffer based on simply not having used up all of shared_buffers yet. That might not be good enough, but it would probably still be better than what we have. Separately, I agree that 256KB is way too low for VACUUM these days. There is a comment in the buffer directory README about that being "small enough to fit in L2 cache". I'm pretty sure that that's still true at least one time over with the latest Raspberry Pi model, so it should be revisited. -- Peter Geoghegan
On Tue, Apr 12, 2016 at 11:38 AM, Andres Freund <andres@anarazel.de> wrote: > >> The bottom line >> here, IMHO, is not that there's anything wrong with our ring buffer >> implementation, but that if you run PostgreSQL on a system where the >> I/O is hitting a 5.25" floppy (not to say 8") the performance may be >> less than ideal. I really appreciate IBM donating hydra - it's been >> invaluable over the years for improving PostgreSQL performance - but I >> sure wish they had donated a better I/O subsystem. When I had this problem some years ago, I traced it down to the fact that you have to sync the WAL before you can evict a dirty page. If your vacuum is doing a meaningful amount of cleaning, you encounter a dirty page with a not-already-synced LSN about once per trip around the ring buffer. That really destroys your vacuuming performance with a 256kB ring if your fsync actually has to reach spinning disk. What I ended up doing was hacking it so that it used BAS_BULKWRITE when the vacuum was being run with a zero vacuum cost delay. > It's really not just hydra. I've seen the same problem on 24 disk raid-0 > type installations. The small ringbuffer leads to reads/writes being > constantly interspersed, apparently defeating readahead. Was there a BBU on that? I would think slow fsyncs are more likely than defeated readahead. On the other hand, I don't hear about too many 24-disk RAIDs without a BBU.
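The workaround described above amounts to roughly the following choice of buffer access strategy for VACUUM. GetAccessStrategy(), BAS_VACUUM, BAS_BULKWRITE and VacuumCostDelay are the real names; the wrapper function and its call site are simplified for illustration.

/*
 * Sketch of the hack described above: when cost-based vacuum delay is
 * disabled (i.e. the DBA wants vacuum to finish as fast as possible),
 * use the larger 16MB bulk-write ring instead of the 256kB vacuum ring.
 * ChooseVacuumStrategy() is a hypothetical wrapper, not a real function.
 */
static BufferAccessStrategy
ChooseVacuumStrategy(void)
{
	if (VacuumCostDelay == 0)
		return GetAccessStrategy(BAS_BULKWRITE);	/* 16MB ring */

	return GetAccessStrategy(BAS_VACUUM);			/* 256kB ring */
}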
On Thu, Apr 14, 2016 at 10:22 AM, Peter Geoghegan <pg@heroku.com> wrote:
>
> On Tue, Apr 12, 2016 at 11:38 AM, Andres Freund <andres@anarazel.de> wrote:
> >> And, on the other hand, if we don't do something like that, it will be
> >> quite an exceptional case to find anything on the free list. Doing it
> >> just to speed up developer benchmarking runs seems like the wrong
> >> idea.
> >
> > I don't think it's just developer benchmarks. I've seen a number of
> > customer systems where significant portions of shared buffers were
> > unused due to this.
> >
> > Unless you have an OLTP system, you can right now easily end up in a
> > situation where, after a restart, you'll never fill shared_buffers.
> > Just because sequential scans for OLAP and COPY use ringbuffers. It sure
> > isn't perfect to address the problem while there's free space in s_b,
> > but it sure is better than to just continue to have significant portions
> > of s_b unused.
>
> I agree that the ringbuffer heuristics are rather unhelpful in many
> real-world scenarios. This is definitely a real problem that we should
> try to solve soon.
>
> An adaptive strategy based on actual cache pressure in the recent past
> would be better. Maybe that would be as simple as not using a
> ringbuffer based on simply not having used up all of shared_buffers
> yet. That might not be good enough, but it would probably still be
> better than what we have.
>
I think such a strategy could be helpful in certain cases, but I'm not sure it would be beneficial every time. There could be cases where we extend ring buffers to use unused buffers in the shared buffer pool for bulk-processing workloads, and immediately afterwards there is demand for buffers from other statements. I'm not sure, but I think an idea of different kinds of buffer pools could help in some such cases. The different kinds could be: ring buffers; extended ring buffers (relations associated with such a pool can bypass ring buffers and use unused shared buffers); retain or keep buffers (frequently accessed relations can be associated with this kind of pool, where buffers can stay for a longer time); and a default buffer pool (all relations are associated with it by default, and its behaviour is the same as today's).
On Wed, Apr 6, 2016 at 12:57:16PM +0200, Andres Freund wrote: > Hi, > > While benchmarking on hydra > (c.f. http://archives.postgresql.org/message-id/20160406104352.5bn3ehkcsceja65c%40alap3.anarazel.de), > which has quite slow IO, I was once more annoyed by how incredibly long > the vacuum at the the end of a pgbench -i takes. > > The issue is that, even for an entirely shared_buffers resident scale, > essentially no data is cached in shared buffers. The COPY to load data > uses a 16MB ringbuffer. Then vacuum uses a 256KB ringbuffer. Which means > that copy immediately writes and evicts all data. Then vacuum reads & > writes the data in small chunks; again evicting nearly all buffers. Then > the creation of the ringbuffer has to read that data *again*. > > That's fairly idiotic. > > While it's not easy to fix this in the general case, we introduced those > ringbuffers for a reason after all, I think we at least should add a > special case for loads where shared_buffers isn't fully used yet. Why > not skip using buffers from the ringbuffer if there's buffers on the > freelist? If we add buffers gathered from there to the ringlist, we > should have few cases that regress. > > Additionally, maybe we ought to increase the ringbuffer sizes again one > of these days? 256kb for VACUUM is pretty damn low. Is this a TODO? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Fri, Apr 29, 2016 at 7:08 AM, Bruce Momjian <bruce@momjian.us> wrote: > On Wed, Apr 6, 2016 at 12:57:16PM +0200, Andres Freund wrote: >> While benchmarking on hydra >> (c.f. http://archives.postgresql.org/message-id/20160406104352.5bn3ehkcsceja65c%40alap3.anarazel.de), >> which has quite slow IO, I was once more annoyed by how incredibly long >> the vacuum at the the end of a pgbench -i takes. >> >> The issue is that, even for an entirely shared_buffers resident scale, >> essentially no data is cached in shared buffers. The COPY to load data >> uses a 16MB ringbuffer. Then vacuum uses a 256KB ringbuffer. Which means >> that copy immediately writes and evicts all data. Then vacuum reads & >> writes the data in small chunks; again evicting nearly all buffers. Then >> the creation of the ringbuffer has to read that data *again*. >> >> That's fairly idiotic. >> >> While it's not easy to fix this in the general case, we introduced those >> ringbuffers for a reason after all, I think we at least should add a >> special case for loads where shared_buffers isn't fully used yet. Why >> not skip using buffers from the ringbuffer if there's buffers on the >> freelist? If we add buffers gathered from there to the ringlist, we >> should have few cases that regress. >> >> Additionally, maybe we ought to increase the ringbuffer sizes again one >> of these days? 256kb for VACUUM is pretty damn low. > > Is this a TODO? I think we are in agreement that some changes may be needed, but I don't think we necessarily know what the changes are. So you could say something like "improve VACUUM ring buffer logic", for example, but I think something specific like "increase size of the VACUUM ring buffer" will just encourage someone to do it as a beginner project, which it really isn't. Maybe others disagree, but I don't think this is a slam-dunk where we can just change the behavior in 10 minutes and expect to have winners but no losers. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2016-04-06 12:57:16 +0200, Andres Freund wrote: > While benchmarking on hydra > (c.f. http://archives.postgresql.org/message-id/20160406104352.5bn3ehkcsceja65c%40alap3.anarazel.de), > which has quite slow IO, I was once more annoyed by how incredibly long > the vacuum at the the end of a pgbench -i takes. > > The issue is that, even for an entirely shared_buffers resident scale, > essentially no data is cached in shared buffers. The COPY to load data > uses a 16MB ringbuffer. Then vacuum uses a 256KB ringbuffer. Which means > that copy immediately writes and evicts all data. Then vacuum reads & > writes the data in small chunks; again evicting nearly all buffers. Then > the creation of the ringbuffer has to read that data *again*. > > That's fairly idiotic. > > While it's not easy to fix this in the general case, we introduced those > ringbuffers for a reason after all, I think we at least should add a > special case for loads where shared_buffers isn't fully used yet. Why > not skip using buffers from the ringbuffer if there's buffers on the > freelist? If we add buffers gathered from there to the ringlist, we > should have few cases that regress. > > Additionally, maybe we ought to increase the ringbuffer sizes again one > of these days? 256kb for VACUUM is pretty damn low.

Just to attach some numbers for this. On my laptop, with a pretty fast disk (as in ~550MB/s read + write, limited by SATA, not the disk), I get these results. I initialized a cluster with pgbench -q -i -s 1000, and VACUUM FREEZEd pgbench_accounts. I ensured that enough WAL files were pre-allocated that neither of the tests ran into having to allocate WAL files. I first benchmarked master, and then in a second run neutered GetAccessStrategy() by returning NULL in the BAS_BULKWRITE and BAS_VACUUM cases.

master:

postgres[949][1]=# CREATE TABLE pgbench_accounts_copy AS SELECT * FROM pgbench_accounts ;
SELECT 100000000
Time: 199803.198 ms (03:19.803)

postgres[949][1]=# VACUUM VERBOSE pgbench_accounts_copy;
INFO: 00000: vacuuming "public.pgbench_accounts_copy"
LOCATION: lazy_scan_heap, vacuumlazy.c:535
INFO: 00000: "pgbench_accounts_copy": found 0 removable, 100000000 nonremovable row versions in 1639345 out of 1639345 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 4888968
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 13.31 s, system: 12.82 s, elapsed: 57.86 s.
LOCATION: lazy_scan_heap, vacuumlazy.c:1500
VACUUM
Time: 57890.969 ms (00:57.891)

postgres[949][1]=# VACUUM FREEZE VERBOSE pgbench_accounts_copy;
INFO: 00000: aggressively vacuuming "public.pgbench_accounts_copy"
LOCATION: lazy_scan_heap, vacuumlazy.c:530
INFO: 00000: "pgbench_accounts_copy": found 0 removable, 100000000 nonremovable row versions in 1639345 out of 1639345 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 4888968
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 25.21 s, system: 33.45 s, elapsed: 185.76 s.
LOCATION: lazy_scan_heap, vacuumlazy.c:1500
VACUUM
Time: 185786.829 ms (03:05.787)

So 199803.198 + 57890.969 + 185786.829 ms.

No COPY/VACUUM ringbuffers (neutered GetAccessStrategy()):

postgres[5372][1]=# CREATE TABLE pgbench_accounts_copy AS SELECT * FROM pgbench_accounts ;
SELECT 100000000
Time: 143109.959 ms (02:23.110)

postgres[5372][1]=# VACUUM VERBOSE pgbench_accounts_copy;
INFO: 00000: vacuuming "public.pgbench_accounts_copy"
LOCATION: lazy_scan_heap, vacuumlazy.c:535
INFO: 00000: "pgbench_accounts_copy": found 0 removable, 100000000 nonremovable row versions in 1639345 out of 1639345 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 4888971
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 8.43 s, system: 0.01 s, elapsed: 8.49 s.
LOCATION: lazy_scan_heap, vacuumlazy.c:1500
VACUUM
Time: 8504.410 ms (00:08.504)

postgres[5372][1]=# VACUUM FREEZE VERBOSE pgbench_accounts_copy;
INFO: 00000: aggressively vacuuming "public.pgbench_accounts_copy"
LOCATION: lazy_scan_heap, vacuumlazy.c:530
INFO: 00000: "pgbench_accounts_copy": found 0 removable, 100000000 nonremovable row versions in 1639345 out of 1639345 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 4888971
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 9.07 s, system: 0.78 s, elapsed: 14.22 s.
LOCATION: lazy_scan_heap, vacuumlazy.c:1500
VACUUM
Time: 14235.619 ms (00:14.236)

So 143109.959 + 8504.410 + 14235.619 ms.

The relative improvements are:
CREATE TABLE AS: 199803.198 ms -> 143109.959 ms: 39% improvement
VACUUM: 57890.969 ms -> 8504.410 ms: 580% improvement
VACUUM FREEZE: 185786.829 ms -> 14235.619 ms: 1205% improvement

And even if you were to argue - which I don't find entirely convincing - that the checkpoint's time should be added afterwards, that's *still* *much* faster:

postgres[5372][1]=# CHECKPOINT ;
Time: 33592.877 ms (00:33.593)

We probably can't remove the ringbuffer concept from these places, but I think we should allow users to disable them. Forcing bulk loads, vacuum, and analytics queries to go to the OS/disk, just because of a heuristic that can't be disabled, yielding massive slowdowns, really sucks.

Small aside: it really sucks that right now we force each relation to essentially be written twice, even leaving hint bits and freezing aside: once when we fill it with zeroes (the smgrextend() call in ReadBuffer_common()), and then again with the actual contents.

Greetings, Andres Freund
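For reference, the "neutered" GetAccessStrategy() described above comes down to a change along these lines in freelist.c. This is reconstructed from the description, not the actual diff, and details of the function differ between versions.

/*
 * Reconstruction (not the actual patch) of the change described above:
 * returning NULL for BAS_BULKWRITE and BAS_VACUUM means "no strategy",
 * so COPY and VACUUM fall back to normal shared_buffers replacement.
 */
BufferAccessStrategy
GetAccessStrategy(BufferAccessStrategyType btype)
{
	BufferAccessStrategy strategy;
	int			ring_size;

	switch (btype)
	{
		case BAS_NORMAL:
			return NULL;		/* never uses a ring */
		case BAS_BULKREAD:
			ring_size = 256 * 1024 / BLCKSZ;
			break;
		case BAS_BULKWRITE:		/* normally a 16MB ring */
		case BAS_VACUUM:		/* normally a 256kB ring */
			return NULL;		/* neutered for the benchmark: no ring */
		default:
			elog(ERROR, "unrecognized buffer access strategy: %d",
				 (int) btype);
			return NULL;		/* keep the compiler quiet */
	}

	/* Unchanged: allocate and initialize the ring for the remaining case. */
	ring_size = Min(NBuffers / 8, ring_size);
	strategy = (BufferAccessStrategy)
		palloc0(offsetof(BufferAccessStrategyData, buffers) +
				ring_size * sizeof(Buffer));
	strategy->btype = btype;
	strategy->ring_size = ring_size;

	return strategy;
}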
On Tue, May 7, 2019 at 4:16 PM Andres Freund <andres@anarazel.de> wrote: > Just to attach some numbers for this. On my laptop, with a pretty fast > disk (as in ~550MB/s read + write, limited by SATA, not the disk), I get > these results. > > [ results showing ring buffers massively hurting performance ] Links to some previous discussions: http://postgr.es/m/8737e9bddb82501da1134f021bf4929a@postgrespro.ru http://postgr.es/m/CAMkU=1yV=Zq8sHviv5Nwajv5woWOvZb7bx45rgDvtxs4P6W1Pw@mail.gmail.com > We probably can't remove the ringbuffer concept from these places, but I > think we should allow users to disable them. Forcing bulk-loads, vacuum, > analytics queries to go to the OS/disk, just because of a heuristic that > can't be disabled, yielding massive slowdowns, really sucks. The discussions to which I linked above seem to suggest that one of the big issues is that the ring buffer must be large enough that WAL flush for a buffer can complete before we go all the way around the ring and get back to the same buffer. It doesn't seem unlikely that the size necessary for that to be true has changed over the years, or even that it's different on different hardware. When I did some benchmarking in this area many years ago, I found that, as you increase the ring buffer size, performance improves for a while and then more or less levels off at a certain point. And at that point performance is not much worse than it would be with no ring buffer, but you maintain some protection against cache-trashing. Your scenario assumes that the system has no concurrent activity which will suffer as a result of blowing out the cache, but in general that's probably not true. It seems to me that it might be time to bite the bullet and add GUCs for the ring buffer sizes. Then, we could make the default sizes big enough that on normal-ish hardware the performance penalty is not too severe (like, it's measured as a percentage rather than a multiple), and we could make a 0 value disable the ring buffer altogether. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
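Such GUCs might look roughly like the following; the names, defaults and limits are invented here for illustration, with 0 meaning "no ring buffer" as suggested above, and the entry format follows guc.c's ConfigureNamesInt table.

/* Hypothetical variables; sizes in kB, 0 disables the ring entirely. */
int			vacuum_ring_buffer_size = 256;
int			bulk_write_ring_buffer_size = 16384;

/* Hypothetical entries for guc.c's ConfigureNamesInt[] table: */
{
	{"vacuum_ring_buffer_size", PGC_USERSET, RESOURCES_MEM,
		gettext_noop("Size of the ring buffer used by VACUUM, or 0 to disable it."),
		NULL,
		GUC_UNIT_KB
	},
	&vacuum_ring_buffer_size,
	256, 0, MAX_KILOBYTES,
	NULL, NULL, NULL
},
{
	{"bulk_write_ring_buffer_size", PGC_USERSET, RESOURCES_MEM,
		gettext_noop("Size of the ring buffer used by bulk writes such as COPY, or 0 to disable it."),
		NULL,
		GUC_UNIT_KB
	},
	&bulk_write_ring_buffer_size,
	16384, 0, MAX_KILOBYTES,
	NULL, NULL, NULL
},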
On Wed, May 08, 2019 at 10:08:03AM -0400, Robert Haas wrote: >On Tue, May 7, 2019 at 4:16 PM Andres Freund <andres@anarazel.de> wrote: >> Just to attach some numbers for this. On my laptop, with a pretty fast >> disk (as in ~550MB/s read + write, limited by SATA, not the disk), I get >> these results. >> >> [ results showing ring buffers massively hurting performance ] > >Links to some previous discussions: > >http://postgr.es/m/8737e9bddb82501da1134f021bf4929a@postgrespro.ru >http://postgr.es/m/CAMkU=1yV=Zq8sHviv5Nwajv5woWOvZb7bx45rgDvtxs4P6W1Pw@mail.gmail.com > >> We probably can't remove the ringbuffer concept from these places, but I >> think we should allow users to disable them. Forcing bulk-loads, vacuum, >> analytics queries to go to the OS/disk, just because of a heuristic that >> can't be disabled, yielding massive slowdowns, really sucks. > >The discussions to which I linked above seem to suggest that one of >the big issues is that the ring buffer must be large enough that WAL >flush for a buffer can complete before we go all the way around the >ring and get back to the same buffer. It doesn't seem unlikely that >the size necessary for that to be true has changed over the years, or >even that it's different on different hardware. When I did some >benchmarking in this area many years ago, I found that there as you >increase the ring buffer size, performance improves for a while and >then more or less levels off at a certain point. And at that point >performance is not much worse than it would be with no ring buffer, >but you maintain some protection against cache-trashing. Your >scenario assumes that the system has no concurrent activity which will >suffer as a result of blowing out the cache, but in general that's >probably not true. > >It seems to me that it might be time to bite the bullet and add GUCs >for the ring buffer sizes. Then, we could make the default sizes big >enough that on normal-ish hardware the performance penalty is not too >severe (like, it's measured as a percentage rather than a multiple), >and we could make a 0 value disable the ring buffer altogether. > IMO adding such GUC would be useful for testing, which is something we should probably do anyway, and then based on the results we could either keep the GUC, modify the default somehow, or do nothing. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> On 8 May 2019, at 01:16, Andres Freund <andres@anarazel.de> wrote: > > We probably can't remove the ringbuffer concept from these places, but I > think we should allow users to disable them. Forcing bulk-loads, vacuum, > analytics queries to go to the OS/disk, just because of a heuristic that > can't be disabled, yielding massive slowdowns, really sucks. If we get a scan-resistant shared-buffers eviction strategy [0], we won't need ring buffers at all. Are there any other reasons to have these rings? Best regards, Andrey Borodin. [0] https://www.postgresql.org/message-id/flat/89A121E3-B593-4D65-98D9-BBC210B87268%40yandex-team.ru
Hi, On 2019-05-08 10:08:03 -0400, Robert Haas wrote: > On Tue, May 7, 2019 at 4:16 PM Andres Freund <andres@anarazel.de> wrote: > > Just to attach some numbers for this. On my laptop, with a pretty fast > > disk (as in ~550MB/s read + write, limited by SATA, not the disk), I get > > these results. > > > > [ results showing ring buffers massively hurting performance ] > > Links to some previous discussions: > > http://postgr.es/m/8737e9bddb82501da1134f021bf4929a@postgrespro.ru > http://postgr.es/m/CAMkU=1yV=Zq8sHviv5Nwajv5woWOvZb7bx45rgDvtxs4P6W1Pw@mail.gmail.com > > > We probably can't remove the ringbuffer concept from these places, but I > > think we should allow users to disable them. Forcing bulk-loads, vacuum, > > analytics queries to go to the OS/disk, just because of a heuristic that > > can't be disabled, yielding massive slowdowns, really sucks. > > The discussions to which I linked above seem to suggest that one of > the big issues is that the ring buffer must be large enough that WAL > flush for a buffer can complete before we go all the way around the > ring and get back to the same buffer.

That is some of the problem, true. But even on unlogged tables the ringbuffers cause quite a massive performance deterioration. Without the ringbuffers we write twice the size of the relation (once with zeroes for the file extension, once with the actual data). With the ringbuffer we do so two or three additional times (hint bits + normal vacuum, then freezing).

On a test cluster that replaced the smgrextend() for heap with posix_fallocate() (to avoid the unnecessary write), I measured the performance of CTAS UNLOGGED SELECT * FROM pgbench_accounts_scale_1000 with and without ringbuffers:

With ringbuffers:
CREATE UNLOGGED TABLE AS: Time: 67808.643 ms (01:07.809)
VACUUM: Time: 53020.848 ms (00:53.021)
VACUUM FREEZE: Time: 55809.247 ms (00:55.809)

Without ringbuffers:
CREATE UNLOGGED TABLE AS: Time: 45981.237 ms (00:45.981)
VACUUM: Time: 23386.818 ms (00:23.387)
VACUUM FREEZE: Time: 5892.204 ms (00:05.892)

> It doesn't seem unlikely that > the size necessary for that to be true has changed over the years, or > even that it's different on different hardware. When I did some > benchmarking in this area many years ago, I found that there as you > increase the ring buffer size, performance improves for a while and > then more or less levels off at a certain point. And at that point > performance is not much worse than it would be with no ring buffer, > but you maintain some protection against cache-trashing. Your > scenario assumes that the system has no concurrent activity which will > suffer as a result of blowing out the cache, but in general that's > probably not true.

Well, I noted that I'm not proposing to actually just rip out the ringbuffers. But I also don't think it's just a question of concurrent activity. It's a question of having concurrent activity *and* workloads that are smaller than shared buffers. Given current memory sizes a *lot* of workloads fit entirely in shared buffers - but for vacuum, seqscans (including copy), it's basically impossible to ever take advantage of that memory, unless your workload otherwise forces it into s_b entirely (or you manually load the data into s_b).

> It seems to me that it might be time to bite the bullet and add GUCs
> for the ring buffer sizes. Then, we could make the default sizes big enough that on normal-ish hardware the performance penalty is not too severe (like, it's measured as a percentage rather than a multiple), and we could make a 0 value disable the ring buffer altogether.

Yea, it'd be considerably better than today. It'd importantly allow us to more easily benchmark a lot of this. I think it might make sense to have a VACUUM option for disabling the ringbuffer too, especially for cases where vacuuming is urgent.

I think what we ought to do to fix this issue in a more principled manner (afterwards) is:

1) For ringbuffer'ed scans, if there are unused buffers, use them instead of recycling a buffer from the ring. If so, replace the previous member of the ring with the previously unused one. When doing so, just reduce the usagecount by one (unless it is already zero), so it can readily be replaced. I think we should do so even when the to-be-replaced ringbuffer entry is currently dirty; but even if we couldn't agree on that, it'd already be a significant improvement if we only did this for clean buffers. That'd fix a good chunk of the "my shared buffers is never actually used" type issues. I personally think it's indefensible that we don't do that today.

2) When a valid buffer in the ringbuffer is dirty when about to be replaced, instead of doing the FlushBuffer ourselves (and thus waiting for an XLogFlush in many cases), put it into a separate ringbuffer/queue that's processed by bgwriter, and have that then invalidate the buffer and put it on the freelist (unless the usagecount was bumped since, of course). That'd fix the issue that we're slowed down by constantly doing XLogFlush() for fairly small chunks of WAL.

3) When, for a ringbuffer scan, there are no unused buffers but there are buffers with a zero usagecount, use those too without evicting the previous ringbuffer entry, but do so without advancing the normal clock sweep (i.e. without decrementing usagecounts). That allows buffer contents to be slowly replaced with data accessed during ringbuffer scans.

Regards, Andres
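Point 2 of that plan could be structured roughly as below. The queue, its size and both identifiers are hypothetical; a real version would live in shared memory and wake the bgwriter, and only the foreground-side hand-off is sketched.

/*
 * Hypothetical hand-off of dirty ring-buffer victims to the bgwriter,
 * so the scan doesn't have to do FlushBuffer()/XLogFlush() itself.
 * If the queue is full the caller flushes the buffer, exactly as today.
 */
#define RING_WRITEBACK_QUEUE_SIZE	128		/* invented for illustration */

typedef struct RingWritebackQueue
{
	slock_t		mutex;
	int			nitems;
	Buffer		items[RING_WRITEBACK_QUEUE_SIZE];
} RingWritebackQueue;

static RingWritebackQueue *RingWbQueue;		/* would live in shared memory */

static bool
HandOffDirtyRingBuffer(Buffer buf)
{
	bool		queued = false;

	SpinLockAcquire(&RingWbQueue->mutex);
	if (RingWbQueue->nitems < RING_WRITEBACK_QUEUE_SIZE)
	{
		RingWbQueue->items[RingWbQueue->nitems++] = buf;
		queued = true;
	}
	SpinLockRelease(&RingWbQueue->mutex);

	return queued;
}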
Hi, On 2019-05-08 21:35:06 +0500, Andrey Borodin wrote: > > On 8 May 2019, at 01:16, Andres Freund <andres@anarazel.de> wrote: > > > > We probably can't remove the ringbuffer concept from these places, but I > > think we should allow users to disable them. Forcing bulk-loads, vacuum, > > analytics queries to go to the OS/disk, just because of a heuristic that > > can't be disabled, yielding massive slowdowns, really sucks. > > If we will have scan-resistant shared buffers eviction strategy [0] - > we will not need ring buffers unconditionally. For me that's a fairly big if, fwiw. But it'd be cool. > Are there any other reasons to have these rings? Currently they also limit the amount of dirty data added to the system. I don't think that's a generally good property (e.g. because it'll cause a lot of writes that'll again happen later), but e.g. for initial data loads with COPY FREEZE it's helpful. It slows down the backend(s) causing the work (i.e. doing COPY), rather than other backends (e.g. because they need to evict the buffers, therefore first needing to clean them). Greetings, Andres Freund