Thread: incremental-checkopints
Hi Hackers: I found this page: https://pgsql-hackers.postgresql.narkive.com/cMxBwq65/incremental-checkopints

PostgreSQL has implemented no incremental checkpoints so far. When a checkpoint is triggered, the performance jitter of PostgreSQL is very noticeable. I think incremental checkpoints should be implemented as soon as possible.
Best wishes
Thomas wen
On 7/26/23 09:21, Thomas wen wrote:
> Hi Hackers: I found this page:
> https://pgsql-hackers.postgresql.narkive.com/cMxBwq65/incremental-checkopints
> PostgreSQL has implemented no incremental checkpoints so far. When a
> checkpoint is triggered, the performance jitter of PostgreSQL is very
> noticeable. I think incremental checkpoints should be implemented as
> soon as possible

Well, that thread is 12 years old, and no one followed up on that proposal. So it seems people have different priorities, working on other stuff that they consider more valuable ...

You can either work on this yourself and write a patch, or try to convince others it's worth working on. But you didn't provide any information that would demonstrate the jitter, or that incremental checkpoints would improve it.

For the record, the thread in our archives is:
https://www.postgresql.org/message-id/8a867f1ffea72091bf3cd6a49ba68a97.squirrel%40mail.go-link.net

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello

On 2023-Jul-26, Thomas wen wrote:
> Hi Hackers: I found this page:
> https://pgsql-hackers.postgresql.narkive.com/cMxBwq65/incremental-checkopints
> no incremental checkpoints have been implemented so far. When a
> checkpoint is triggered, the performance jitter of PostgreSQL is very
> noticeable. I think incremental checkpoints should be implemented as
> soon as possible

I think my first question is why do you think that is necessary; there are probably other tools to achieve better performance. For example, you may want to try making checkpoint_completion_target closer to 1, and the checkpoint interval longer (both checkpoint_timeout and max_wal_size). Also, changing shared_buffers may improve things. You can try adding more RAM to the machine.

Tuning the overall performance of a Postgres server is still black magic to some extent, but there are a few well-known things to play with, without having to write any patches.

--
Álvaro Herrera        PostgreSQL Developer — https://www.EnterpriseDB.com/
"Someone said that it is at least an order of magnitude more work to do
production software than a prototype. I think he is wrong by at least
an order of magnitude."                              (Brian Kernighan)
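[The GUCs mentioned above can be illustrated as a postgresql.conf fragment. The values are purely illustrative, not recommendations; suitable settings depend on workload, hardware, and recovery-time requirements.]

```
# Spread checkpoint writes over most of the interval between checkpoints
checkpoint_completion_target = 0.9   # default since v14; values closer to 1 spread I/O further

# Checkpoint less often, so FPI bursts happen less frequently
checkpoint_timeout = 30min           # default is 5min
max_wal_size = 16GB                  # default is 1GB; size it to the write rate

# A larger buffer pool means fewer dirty-page evictions between checkpoints
shared_buffers = 8GB                 # a common starting point is ~25% of RAM
```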
On Wed, 26 Jul 2023 at 14:41, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I think my first question is why do you think that is necessary; there
> are probably other tools to achieve better performance. For example,
> you may want to try making checkpoint_completion_target closer to 1, and
> the checkpoint interval longer (both checkpoint_timeout and
> max_wal_size). Also, changing shared_buffers may improve things. You
> can try adding more RAM to the machine.

Even with all those tuning options, a significant portion of a checkpoint's IO (up to 50%) originates from FPIs in the WAL, which (in general) will most often appear at the start of each checkpoint, because the first update to each page after a checkpoint needs an FPI.

If instead we WAL-logged only the pages we are about to write to disk (like MySQL's double-write buffer, but in WAL instead of a separate cyclical buffer file), then a checkpoint_completion_target close to 1 would probably solve the issue; but with "WAL-logged torn page protection at first update after checkpoint" we'll probably always have higher-than-average FPI load just after a new checkpoint.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)
On 7/26/23 15:16, Matthias van de Meent wrote:
> Even with all those tuning options, a significant portion of a
> checkpoint's IO (up to 50%) originates from FPIs in the WAL, which (in
> general) will most often appear at the start of each checkpoint,
> because the first update to each page after a checkpoint needs an FPI.

Yeah, FPIs are certainly expensive and can represent a huge part of the WAL produced. But how would incremental checkpoints make that step unnecessary?

> If instead we WAL-logged only the pages we are about to write to disk
> (like MySQL's double-write buffer, but in WAL instead of a separate
> cyclical buffer file), then a checkpoint_completion_target close to 1
> would probably solve the issue, but with "WAL-logged torn page
> protection at first update after checkpoint" we'll probably always
> have higher-than-average FPI load just after a new checkpoint.

So essentially, instead of WAL-logging the FPI on the first change, we'd only do that later, when actually writing out the page (either during a checkpoint or because of memory pressure)? How would you make sure there's enough WAL space until the next checkpoint? I mean, FPIs are a huge source of write amplification ...

Imagine the system has max_wal_size set to 1GB, and does 1M updates before writing 512MB of WAL and thus triggering a checkpoint. Now it needs to write FPIs for 1M updates - easily 8GB of WAL, maybe more with indexes. What then?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
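[The back-of-envelope arithmetic above can be checked directly, assuming the default 8 kB block size and, pessimistically, one deferred FPI per updated heap page:]

```python
# Back-of-envelope check of the write amplification described above.
# Assumes PostgreSQL's default 8 kB block size and the worst case of
# one full-page image per update (every update hits a distinct page);
# index pages would add roughly one more FPI each on top of this.

PAGE_SIZE = 8 * 1024          # bytes, the default block size
updates = 1_000_000

heap_fpi_bytes = updates * PAGE_SIZE
print(heap_fpi_bytes)             # 8,192,000,000 bytes of heap FPIs
print(heap_fpi_bytes / 2**30)     # ~7.6 GiB, i.e. "easily 8GB of WAL"
```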
Starting from incremental checkpoints is approaching the problem from the wrong end. What you actually want is atomic disk writes, which would allow turning off full_page_writes. Without this you really cannot do incremental checkpoints efficiently, as checkpoints are currently what is used to determine when "the first write to a page after checkpoint" happens, and thereby when a full page write is needed.

On Wed, Jul 26, 2023 at 8:58 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> Yeah, FPIs are certainly expensive and can represent a huge part of the
> WAL produced. But how would incremental checkpoints make that step
> unnecessary?
>
> Imagine the system has max_wal_size set to 1GB, and does 1M updates
> before writing 512MB of WAL and thus triggering a checkpoint. Now it
> needs to write FPIs for 1M updates - easily 8GB of WAL, maybe more with
> indexes. What then?
On Wed, 26 Jul 2023 at 20:58, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> So essentially instead of WAL-logging the FPI on the first change, we'd
> only do that later when actually writing out the page (either during a
> checkpoint or because of memory pressure)? How would you make sure
> there's enough WAL space until the next checkpoint? I mean, FPIs are a
> huge source of write amplification ...

You don't make sure that there's enough space for the modifications, but does it matter from a durability point of view? As long as the page isn't written to disk before the FPI, we can replay non-FPI (but fsynced) WAL on top of the old version of the page that we read from disk, instead of only trusting FPIs from WAL.

> Imagine the system has max_wal_size set to 1GB, and does 1M updates
> before writing 512MB of WAL and thus triggering a checkpoint. Now it
> needs to write FPIs for 1M updates - easily 8GB of WAL, maybe more with
> indexes. What then?

Then you ignore the max_wal_size GUC, as PostgreSQL so often already does. At least, it doesn't do what I expect it to do at face value - limit the size of the WAL directory to the given size.

But more reasonably, you'd keep track of the count of modified pages that are yet to be fully WAL-logged, and take that into account as a debt against the current WAL insert pointer when considering checkpoint distances and max_wal_size.

---

The main issue that I see with "WAL-logging the FPI only when you write the dirty page to disk" is that dirty page flushing also happens with buffer eviction in ReadBuffer(). This change in behaviour would add a WAL-insertion penalty to that write, and make it a very common occurrence that we'd have to write WAL + fsync the WAL before we can write the dirty page. It would thus add significant latency to the dirty-write mechanism, which is probably an unpopular change.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
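[The debt-accounting idea above could be sketched roughly as follows. This is a toy model, not PostgreSQL code; all names and the trigger threshold are made up for illustration.]

```python
# Toy model of the "FPI debt" idea: when deciding whether the next
# checkpoint is due, count not only the WAL already inserted since the
# last checkpoint, but also the full-page images still owed for pages
# that are dirty but not yet fully WAL-logged.

PAGE_SIZE = 8 * 1024  # bytes, PostgreSQL's default block size

def effective_wal_distance(wal_inserted: int, unlogged_dirty_pages: int) -> int:
    """WAL consumed since the last checkpoint plus the FPI debt that
    would still have to be paid before that checkpoint can complete."""
    fpi_debt = unlogged_dirty_pages * PAGE_SIZE
    return wal_inserted + fpi_debt

def checkpoint_due(wal_inserted: int, unlogged_dirty_pages: int,
                   max_wal_size: int) -> bool:
    # Trigger early enough that paying off the debt still fits in the
    # budget (the half-way trigger point here is purely illustrative).
    distance = effective_wal_distance(wal_inserted, unlogged_dirty_pages)
    return distance >= max_wal_size // 2

# 512 MB of ordinary WAL plus 100k pages of deferred FPIs pushes the
# effective distance past half of a 2 GiB max_wal_size:
print(checkpoint_due(512 * 2**20, 100_000, 2 * 2**30))   # True
print(checkpoint_due(512 * 2**20, 0, 2 * 2**30))         # False
```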
On Wed, Jul 26, 2023 at 9:54 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> Then you ignore the max_wal_size GUC, as PostgreSQL so often already
> does. At least, it doesn't do what I expect it to do at face value -
> limit the size of the WAL directory to the given size.

That would require stopping any new writes once WAL size == max_wal_size until the checkpoint is completed. I don't think anybody would want that.

> But more reasonably, you'd keep track of the count of modified pages
> that are yet to be fully WAL-logged, and take that into account as a
> debt against the current WAL insert pointer when considering
> checkpoint distances and max_wal_size.

I think Peter Geoghegan has worked on a somewhat similar approach to accounting for "accumulated work needed until some desired outcome", though I think it was on the VACUUM side of things.
On 7/26/23 21:53, Matthias van de Meent wrote:
> You don't make sure that there's enough space for the modifications,
> but does it matter from a durability point of view? As long as the
> page isn't written to disk before the FPI, we can replay non-FPI (but
> fsynced) WAL on top of the old version of the page that we read from
> disk, instead of only trusting FPIs from WAL.

It does not matter from a durability point of view, I think. But I was thinking more about how this affects the scheduling of checkpoints - how would you know when the next checkpoint is likely to happen, when you don't know how many FPIs you're going to write?

> Then you ignore the max_wal_size GUC, as PostgreSQL so often already
> does. At least, it doesn't do what I expect it to do at face value -
> limit the size of the WAL directory to the given size.

I agree the soft-limit nature of max_wal_size (i.e. best effort, not a strict limit) is not great. But just ignoring the limit altogether seems like a step in the wrong direction - we should try not to exceed it.

I wonder if we'd actually need / want to write the FPIs into WAL at all. AFAICS we only need the FPI until the page is written and flushed - from that moment on it shouldn't be possible to tear the page. So a small cyclic buffer separate from WAL would be better ...

> But more reasonably, you'd keep track of the count of modified pages
> that are yet to be fully WAL-logged, and take that into account as a
> debt against the current WAL insert pointer when considering
> checkpoint distances and max_wal_size.

Yeah, that might work. It'd likely be just estimates, but probably good enough for pacing the writes.

> The main issue that I see with "WAL-logging the FPI only when you
> write the dirty page to disk" is that dirty page flushing also happens
> with buffer eviction in ReadBuffer(). This change in behaviour would
> add a WAL-insertion penalty to that write, and make it a very common
> occurrence that we'd have to write WAL + fsync the WAL before we can
> write the dirty page. It would thus add significant latency to the
> dirty-write mechanism, which is probably an unpopular change.

Yeah, it certainly moves the latencies from one place to another.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
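[The small cyclic buffer floated above would behave roughly like MySQL's double-write buffer: a torn-page-safe write first stashes a full copy of the page in the ring, flushes it, and only then overwrites the page in place. A toy sketch, purely illustrative and not a proposal for an actual on-disk format:]

```python
# Toy sketch of a cyclic double-write buffer: before overwriting a page
# in place, stash a full copy in a small ring and flush the ring. After
# a crash, a page torn mid-write can be restored from its ring copy, so
# no FPI ever needs to enter the WAL stream.

from collections import deque

class DoubleWriteRing:
    def __init__(self, slots: int):
        self.ring = deque(maxlen=slots)   # oldest entries fall off the end

    def checkpoint_write(self, page_no: int, image: bytes, datafile: dict):
        self.ring.append((page_no, image))  # 1. copy the page into the ring
        # 2. (a real system would fsync the ring here, before step 3)
        datafile[page_no] = image           # 3. overwrite the page in place

    def recover(self, page_no: int):
        # After a crash, repair a torn page from its most recent ring copy.
        for no, image in reversed(self.ring):
            if no == page_no:
                return image
        return None

datafile = {}
ring = DoubleWriteRing(slots=4)
ring.checkpoint_write(7, b"v1" * 4096, datafile)
print(ring.recover(7) == b"v1" * 4096)   # the ring copy survives for repair
```

The entry is only needed until the in-place write is durably on disk, which is why a small, fixed-size ring suffices and old entries can simply be recycled.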