Thread: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter
One of the most useful bits of feedback on how well checkpoint I/O is going is the amount of time taken to sync files to disk. Right now the only way to get that is to parse the logs. The attached patch publishes the most useful three bits of data you could only get from log_checkpoints before out to pg_stat_bgwriter. Easiest to just show an example:

$ createdb pgbench
$ psql -c "select pg_stat_reset_shared('bgwriter')"
$ pgbench -i -s 10 pgbench
$ psql -c "checkpoint"
$ tail $PGLOG
LOG:  checkpoint complete: wrote 590 buffers (14.4%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.117 s, total=1.746 s; sync files=31, longest=0.104 s, average=0.003 s
$ psql -c "select write_ms,sync_ms,sync_files from pg_stat_bgwriter"
 write_ms | sync_ms | sync_files
----------+---------+------------
        2 |     117 |         31

Like the other checkpoint statistics, these are updated in one lump when the checkpoint finishes. This includes rough documentation, and the code isn't very complicated; a catversion bump is required. Some open concerns I still have about this:

-The total amount of time in the sync phase is not the same as the total of all sync wait times. I have published the total elapsed sync time. The total of the times spent waiting on individual sync calls is also available at the point where the above data is collected, and it could be published as well. I don't think that's necessary yet, since the two are almost identical right now. But any feature that staggers sync calls around might make that distinction important.

-The total checkpoint time is larger than the total of write+sync. You can see that here: 2 ms write + 117 ms sync = 119 ms, while the total checkpoint time is actually 1746 ms. While I don't normally find the total useful from a tuning perspective, it may be sensible to publish it just to further reduce any perceived need for log scraping here.

-None of the debug1 level information about individual sync calls can be published this way usefully.
I'd like that to come out somewhere eventually, but really what's needed there is a performance event logging history collector, not pg_stats. I'm content that this patch gets the big picture data out; it's sufficient to let you rule out checkpoint problems if they aren't happening. If you only have to drill into the logs when long times start showing up in this data, that's not so bad to live with.

-I can't tell for sure if this is working properly when log_checkpoints is off. This now collects checkpoint end time data in all cases, whereas before it ignored that work if log_checkpoints was off. I think I will have to fake the logging in this area, make it log some things even when log_checkpoints is off, to confirm it gives the right results. I haven't done that yet.

I would like to be able to tell people they need never turn on log_checkpoints if they monitor pg_stat_bgwriter instead, because that sets a good precedent for what direction the database is going. It would be nice for pg_stat_bgwriter's format to settle down for a long period too. In addition to my small list of concerns here, there seem to be an increasing number of write path feature patches floating around lately. That all works against this being stable. The main consumer of this data I use regularly is the pg_stat_bgwriter graphs that Munin produces, and my expectation is that myself and/or Magnus will make sure those are compatible with any changes made here for 9.2.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
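As a quick illustration of how the three new columns replace log scraping, here is a sketch (not part of the patch; checkpoint_summary() is a made-up helper) that rebuilds a log_checkpoints-style summary line from the view's values. Note the derived average can differ slightly from the logged one, which is consistent with the first concern above: the published number is total elapsed sync time, not the total of the individual sync wait times the log works from.

```python
# Hedged sketch: derive a log_checkpoints-style summary from the three
# proposed pg_stat_bgwriter columns. The column names write_ms, sync_ms
# and sync_files come from the example session above; the helper name
# is invented for this illustration.

def checkpoint_summary(write_ms, sync_ms, sync_files):
    """Format a summary resembling the 'checkpoint complete' log line."""
    avg_s = (sync_ms / 1000.0) / sync_files if sync_files else 0.0
    return ("write=%.3f s, sync=%.3f s; sync files=%d, average=%.3f s"
            % (write_ms / 1000.0, sync_ms / 1000.0, sync_files, avg_s))

# Using the values from the example session above:
summary = checkpoint_summary(2, 117, 31)
```

With the example's values this yields an average of 0.004 s per file versus the logged 0.003 s, close but not identical, for the reason hedged above.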
Attachment
On 01/16/2012 01:28 AM, Greg Smith wrote:
> -I can't tell for sure if this is working properly when
> log_checkpoints is off. This now collects checkpoint end time data in
> all cases, whereas before it ignored that work if log_checkpoints was off.

...and there's at least one spot I missed already located: inside of md.c. I'd forgotten how many places where timing calls are optimized out are floating around this code path.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter
From: Hitoshi Harada
On Sun, Jan 15, 2012 at 10:46 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 01/16/2012 01:28 AM, Greg Smith wrote:
>> -I can't tell for sure if this is working properly when log_checkpoints is
>> off. This now collects checkpoint end time data in all cases, whereas
>> before it ignored that work if log_checkpoints was off.
>
> ...and there's at least one I missed located already: inside of md.c. I'd
> forgotten how many spots where timing calls are optimized out are floating
> around this code path.

I think the CF app page for this patch has a missing link with a wrong message-id.

Thanks,
-- 
Hitoshi Harada
Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter
From: Alvaro Herrera
Excerpts from Hitoshi Harada's message of jue ene 19 07:08:52 -0300 2012:
> On Sun, Jan 15, 2012 at 10:46 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> > On 01/16/2012 01:28 AM, Greg Smith wrote:
> >> -I can't tell for sure if this is working properly when log_checkpoints is
> >> off. This now collects checkpoint end time data in all cases, whereas
> >> before it ignored that work if log_checkpoints was off.
> >
> > ...and there's at least one I missed located already: inside of md.c. I'd
> > forgotten how many spots where timing calls are optimized out are floating
> > around this code path.
>
> I think the CF app page for this patch has a missing link with a wrong message-id.

Fixed, thanks.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter
From: Robert Haas
On Mon, Jan 16, 2012 at 1:28 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> One of the most useful bits of feedback on how well checkpoint I/O is going
> is the amount of time taken to sync files to disk. Right now the only way
> to get that is to parse the logs. The attached patch publishes the most
> useful three bits of data you could only get from log_checkpoints before out
> to pg_stat_bgwriter.

It's not quite clear from your email, but I gather that the way this is intended to work is that these values increment every time we checkpoint?

Also, forgive me for asking this possibly-stupid question, but of what use is this information? I can't imagine why I'd care about a running total of the number of files fsync'd to disk. I also can't really imagine why I'd care about the length of the write phase, which surely will almost always be a function of checkpoint_completion_target and checkpoint_timeout unless I manage to overrun the number of checkpoint_segments I've allocated. The only number that really seems useful to me is the time spent syncing. I have a clear idea what to look for there: smaller numbers are better than bigger ones. For the rest I'm mystified.

And it doesn't seem like it's necessarily going to save me a whole lot either, because if it turns out that my sync phases are long, the first question out of my mouth is going to be "what percentage of my total sync time is accounted for by the longest sync?". And so right there I'm back to the logs. It's not clear how such information could be usefully exposed in pg_stat_bgwriter either, since you probably want to know only the last few values, not a total over all time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter
From: Simon Riggs
On Thu, Jan 19, 2012 at 3:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> And it doesn't seem like it's necessarily going to save me a whole
> lot either, because if it turns out that my sync phases are long, the
> first question out of my mouth is going to be "what percentage of my
> total sync time is accounted for by the longest sync?". And so right
> there I'm back to the logs.

It seems clear that the purpose of this is to quickly and easily decide whether there is a potential problem. If you decide there is a potential problem, you may wish to look at more detailed information. So there is a huge time saving from avoiding the need to look at the detail when it's unnecessary and possibly not even enabled.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 01/19/2012 10:52 AM, Robert Haas wrote:
> It's not quite clear from your email, but I gather that the way that
> this is intended to work is that these values increment every time we
> checkpoint?

Right--they get updated in the same atomic bump that moves up things like buffers_checkpoint.

> Also, forgive me for asking this possibly-stupid question, but of what
> use is this information? I can't imagine why I'd care about a running
> total of the number of files fsync'd to disk. I also can't really
> imagine why I'd care about the length of the write phase, which surely
> will almost always be a function of checkpoint_completion_target and
> checkpoint_timeout unless I manage to overrun the number of
> checkpoint_segments I've allocated. The only number that really seems
> useful to me is the time spent syncing. I have a clear idea what to
> look for there: smaller numbers are better than bigger ones. For the
> rest I'm mystified.

Priority #1 here is to reduce (but, admittedly, not always eliminate) the need for log file parsing of this particular area, so including all the major bits from the existing log message that can be published this way would include the write phase time. You mentioned one reason why the write phase time might be interesting; there could be others. One of the things expected here is that Munin will expand its graphing of values from pg_stat_bgwriter to include all these fields. Most of the time the graph of time spent in the write phase will be boring and useless. Making it easy for a look at a graph to spot those rare times when it isn't is one motivation for including it.

As for why to include the number of files being sync'd, one reason is again simply wanting to include everything that can easily be published. A second is that it helps support ideas like my "checkpoint sync pause" one; that's untunable in any reasonable way without some easy way of monitoring the number of files typically sync'd.

Sometimes when I'm investigating checkpoint spikes during sync, I wonder whether they happened because more files than usual were synced, or instead just because of more churn on a smaller number. Making this easy to graph pulls that data out to where I can compare it with disk I/O trends. And there's precedent now proving that an always-incrementing number in pg_stat_bgwriter can be turned into such a graph easily by monitoring tools.

> And it doesn't seem like it's necessarily going to save me a whole
> lot either, because if it turns out that my sync phases are long, the
> first question out of my mouth is going to be "what percentage of my
> total sync time is accounted for by the longest sync?". And so right
> there I'm back to the logs. It's not clear how such information could
> be usefully exposed in pg_stat_bgwriter either, since you probably
> want to know only the last few values, not a total over all time.

This isn't ideal yet. I mentioned how some future "performance event logging history collector" was really needed as a place to push longest sync times into, and we don't have it yet. This is the best thing to instrument that I'm sure is useful, and that I can stick onto the existing infrastructure. The idea is that this change makes it possible to trigger a "sync times are too long" alert out of a tool that's based solely on database queries. When that goes off, yes, you're possibly back to the logs again for more details about the longest individual sync time. But the rest of the time, what's hopefully the normal state of things, you can ignore the logs and just track the pg_stat_bgwriter numbers.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
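The graphing workflow described above boils down to differencing successive snapshots of the always-incrementing counters. A minimal sketch of that Munin-style consumption (the dict keys mirror the proposed column names; the snapshot values are invented):

```python
# Sketch only: monitoring tools poll pg_stat_bgwriter periodically and
# plot the per-interval increase of each cumulative counter. The keys
# follow the proposed columns; the sample numbers are made up.

def counter_delta(prev, curr):
    """Per-interval increase of each cumulative counter."""
    return {name: curr[name] - prev[name] for name in curr}

snap_a = {"sync_ms": 4000, "sync_files": 120, "buffers_checkpoint": 9000}
snap_b = {"sync_ms": 4950, "sync_files": 151, "buffers_checkpoint": 9590}

delta = counter_delta(snap_a, snap_b)
# delta["sync_files"] answers the question above: did a sync spike come
# with more files than usual, or the same number of files taking longer?
```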
On Sun, Jan 15, 2012 at 10:28 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> I would like to be able to tell people they need never turn on
> log_checkpoints if they monitor pg_stat_bgwriter instead, because that sets
> a good precedent for what direction the database is going. It would be nice
> for pg_stat_bgwriter's format to settle down for a long period too.

While on the topic of the format of pg_stat_bgwriter and it converging on stability: I'm finding the backend_writes column pretty unfortunate. The only use I know of for it is to determine if the bgwriter is lagging behind. Yet it doesn't serve even this purpose, because it lumps together the backend writes due to lagging background writes and the backend writes "by design" due to the use of a buffer access strategy during bulk inserts.

If I'm running a tightly controlled benchmark on an otherwise unused server and I know that no BAS is being used, then I can meaningfully use backend_writes. That is a pretty limiting limit. I think we should either create a separate column to segregate BAS backend_writes, or just not report them at all and report only the non-BAS ones into pg_stat_bgwriter.

I know you've argued against this before (http://archives.postgresql.org/pgsql-performance/2011-03/msg00340.php), but I think the assumption that all writes are absorbed by the OS without blocking is assuming an awful lot. If the server is not behaving well, then it very well may be blocking on writes. If we are confident that the OS can always efficiently cache writes, why bother with a background writer at all?

I think the argument that it is not important to know which strategy motivated a backend write could just as validly be reformulated as an argument that we don't need to know how many backend writes there were in the first place. Knowing only an aggregation of BAS-motivated backend writes and straggling-bgwriter-motivated backend writes just doesn't seem useful to me. Am I missing a use for it?

Not only is it probably not useful, it seems to be actively harmful, as I've seen several times that it leads people down false paths (including me several times, as about once a year I forget why it is useless and spend time rediscovering it). If we are not willing to separate the two types of backend writes, or just omit the BAS ones altogether, then I would suggest we drop the column entirely.

Would a patch to separate the backend writes into two columns be entertained for 9.3?

Cheers,
Jeff
Jeff Janes wrote:
> I'm finding the backend_writes column pretty unfortunate. The only
> use I know of for it is to determine if the bgwriter is lagging
> behind. Yet it doesn't serve even this purpose because it lumps
> together the backend writes due to lagging background writes, and the
> backend writes "by design" due to the use of a buffer access strategy
> during bulk inserts.
>
> If I'm running a tightly controlled benchmark on an otherwise unused
> server and I know that no BAS is being used, then I can meaningfully
> use backend_writes. That is a pretty limiting limit.

I don't think it's quite that bad in general; this presumes a moderate amount of BAS writes relative to other activity. But I understand your concern better now. I don't think the sorts of workloads you seem to have a lot of were considered very carefully before here.

> I think we should either create a separate column to segregate BAS
> backend_writes, or just not report them at all and report only the
> non-BAS ones into pg_stat_bgwriter.

We can't not report them. One of the goals of pg_stat_bgwriter is to account for all writes out of the buffer cache. If there are enough BAS writes on your system that lumping them together is a serious problem, having them go missing altogether would be even worse. And any whacking around of pg_stat_bgwriter might as well fix that too.

Do you think you could put together a quick test case that's similar to the ones you're seeing unfair accounting for here? This isn't quite buggy behavior, but a solid example I could test against showing it's a sketchy approach would be enough for me to incorporate a fix for it into this suggested redesign.
Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter
From: Robert Haas
On Sat, Jan 21, 2012 at 6:32 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> I'm finding the backend_writes column pretty unfortunate. The only
> use I know of for it is to determine if the bgwriter is lagging
> behind. Yet it doesn't serve even this purpose because it lumps
> together the backend writes due to lagging background writes, and the
> backend writes "by design" due to the use of a buffer access strategy
> during bulk inserts.

+1 for separating those.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
> On Sat, Jan 21, 2012 at 6:32 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> I'm finding the backend_writes column pretty unfortunate. The only
>> use I know of for it is to determine if the bgwriter is lagging
>> behind. Yet it doesn't serve even this purpose because it lumps
>> together the backend writes due to lagging background writes, and the
>> backend writes "by design" due to the use of a buffer access strategy
>> during bulk inserts.
>
> +1 for separating those.

I'm tied up with some on-site things until Friday; I can rev the patch with this in mind (and clear some of my overall review work pile) then.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter
From: Peter Geoghegan
On 16 January 2012 06:28, Greg Smith <greg@2ndquadrant.com> wrote:
> One of the most useful bits of feedback on how well checkpoint I/O is going
> is the amount of time taken to sync files to disk. Right now the only way
> to get that is to parse the logs. The attached patch publishes the most
> useful three bits of data you could only get from log_checkpoints before out
> to pg_stat_bgwriter. Easiest to just show an example:

This is a preliminary review of the patch. Attached is a revision of the patch that clears up some minor bit-rot and style issues, just as a convenience to everyone else who might like to try it out against master without repeating my work. There were OID collisions against master, and an upstream change in the pg_proc format with the introduction of the "proleakproof" column. Regression tests pass (as they must have in the prior revision, as the original authors modified "expected/rules.out"). Note that I have deliberately avoided changing the patch further than that, though I believe that further changes will be necessary.

One beef that I have with the variable name m_write_ms is that "ms" could equally well refer to microseconds or milliseconds, and these mistakes are very common. Therefore, you might think about making the fields of type "instr_time", which is consistent with some other such fields used within the statistics collector; this takes care of issues with overflows, portability and confusion over units, to as great an extent as is possible. I'm not even confident that this is the right thing though, because PgStat_StatFuncEntry, for example, *does* use PgStat_Counter as the type for time values, in its case apparently *microseconds* rather than milliseconds, even though the pg_stat_user_functions view displays values in milliseconds (it's incorrectly commented). Even the use of instr_time within some pgstat.h structs sees the value converted to microseconds for transmission to the collector.
It might be that since we don't really accumulate time values here, it's better to just directly represent the value as PgStat_Counter, but in that case I'd still clearly mark the value as being milliseconds. In any case, I would have the columns in milliseconds, but not ending with *_ms within pg_stat_bgwriter, for consistency with pg_stat_user_functions et al. There is no reason to cast to uint64 when accumulating milliseconds, as the lvalue is currently a PgStat_Counter.

Details of the existing comment bug: the docs show that pg_stat_*_functions accumulates time in units of milliseconds. However, a PgStat_FunctionEntry payload is commented as sending the time/self time to the stats collector in microseconds. So this comment, in the existing code, is actually wrong:

    PgStat_Counter f_time;      /* times in microseconds */
    PgStat_Counter f_time_self;
} PgStat_StatFuncEntry;

Can someone commit a fix for that please? It could easily result in a user-visible bug. Mixing units, and representing them as simple integers while totally eschewing type safety, is sort of a pet peeve of mine, but consistency with the existing code is more important.

My general impression of the patch's code quality is that it is more or less idiomatic and, as Greg said, simple; it generalises from the style of existing code in LogCheckpointEnd(), as well as existing fields within the PgStat_MsgBgWriter and PgStat_GlobalStats structs.

As for the substance of the patch, I am in general agreement that this is a good idea. Storing the statistics in pg_stat_bgwriter is a more flexible approach than is immediately apparent. Users can usefully calculate delta values, as part of the same process through which checkpoint tuning is performed by comparing output from "select now(), * from pg_stat_bgwriter" at different intervals. This also puts this information within easy reach of monitoring tools.
So, I'd ask Greg and/or Jaime to produce a revision of the patch with those concerns in mind, as well as fixing the md.c usage of log_checkpoints. That variable is currently used in these places only:

src/backend/access/transam/xlog.c
80:   bool log_checkpoints = false;
7694: * Note: because it is possible for log_checkpoints to change while a
7696: * log_checkpoints is currently off.
7835: if (log_checkpoints)
8035: if (log_checkpoints)
8229: * Note: because it is possible for log_checkpoints to change while a
8231: * log_checkpoints is currently off.
8236: if (log_checkpoints)
8301: if (log_checkpoints)
8305: ereport((log_checkpoints ? LOG : DEBUG2),

src/backend/storage/smgr/md.c
1097: if (log_checkpoints)
1105: if (log_checkpoints && (!INSTR_TIME_IS_ZERO(sync_start)))

I also agree with separating the backend writes due to lagging background writes from the backend writes "by design", as described by Jeff. I'm not inclined to worry too much about the usability issues surrounding whether the relative values of things like the total checkpoint time and the total write and sync phase times match up.

What do you intend to do about testing the log_checkpoints = 'off' case? Can I help in some way? It seems pretty obvious to me that the changes made are correct, apart from the fact that you omitted to modify md.c. In particular, there are no pre- or post-conditions involved in calling TimeCheckpointEnd() (formerly LogCheckpointEnd()). This is evident from the following comment at transam/xlog.c:7709:

 * Note: because it is possible for log_checkpoints to change while a
 * checkpoint proceeds, we always accumulate stats, even if
 * log_checkpoints is currently off.

It seems fairly silly to me to not expose this information if it's invariably being collected anyway.

Cheers
-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
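To make the milliseconds/microseconds confusion discussed above concrete, here is a toy sketch of the conversion in question: timing accumulated in microseconds internally, displayed as milliseconds by the view. The constant and helper names are invented for this illustration and do not exist in the PostgreSQL tree.

```python
# Toy illustration of the unit pitfall: a counter stored in
# microseconds must be divided by 1000 before display in a
# milliseconds column. Helper names are made up for this sketch.

US_PER_MS = 1000

def collector_to_view_ms(f_time_us):
    """Convert a stored microsecond total to displayed milliseconds."""
    return f_time_us / US_PER_MS

shown = collector_to_view_ms(117000)  # 117000 us is 117.0 ms
```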
Attachment
Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter
From: Robert Haas
On Wed, Feb 22, 2012 at 2:11 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> One beef that I have with the variable name m_write_ms is that "ms"
> could equally well refer to microseconds or milliseconds, and these
> mistakes are very common.

I would expect ms to mean milliseconds and us to mean microseconds.

> Details of the existing comment bug: the docs show that
> pg_stat_*_functions accumulates time in units of milliseconds.
> However, a PgStat_FunctionEntry payload is commented as sending the
> time/self time to the stats collector in microseconds. So this
> comment, in the existing code, is actually wrong:
>
>     PgStat_Counter f_time;      /* times in microseconds */
>     PgStat_Counter f_time_self;
> } PgStat_StatFuncEntry;

I think that the comment is not wrong. The code that populates it looks like this:

    m_ent->f_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_time);
    m_ent->f_time_self = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_time_self);

If that's not producing microseconds, those macros are badly misnamed. Note that the pg_stat_user_functions view divides the return value of the function by 1000, which is why the view output is in milliseconds.

> As for the substance of the patch, I am in general agreement that this
> is a good idea. Storing the statistics in pg_stat_bgwriter is a more
> flexible approach than is immediately apparent. Users can usefully
> calculate delta values, as part of the same process through which
> checkpoint tuning is performed by comparing output from "select now(),
> * from pg_stat_bgwriter" at different intervals. This also puts this
> information within easy reach of monitoring tools.

On further thought, I'll revise the position I took upthread and concede that write_ms and sync_ms are useful. Given values of the stats counters at time A and time B, you can calculate what percentage of that time was spent in the write phase of some checkpoint, the sync phase of some checkpoint, and not in any checkpoint.
But I'm still pretty sketchy on sync_files. I can't see how you can do anything useful with that.

> So, I'd ask Greg and/or Jaime to produce a revision of the patch with
> those concerns in mind, as well as fixing the md.c usage of
> log_checkpoints.

It seems we are still waiting for an update of this patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
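The time-A/time-B percentage calculation conceded above can be sketched in a few lines (a hypothetical helper over the proposed columns; the sample counter values are invented):

```python
# Sketch of the concession above: difference the write_ms/sync_ms
# counters between two samples and divide by the wall-clock interval
# to get the fraction of time spent in each checkpoint phase.
# Helper name and sample numbers are invented.

def phase_fractions(write_ms_a, sync_ms_a, write_ms_b, sync_ms_b,
                    interval_ms):
    """Fractions of the interval spent writing, syncing, and neither."""
    write = (write_ms_b - write_ms_a) / interval_ms
    sync = (sync_ms_b - sync_ms_a) / interval_ms
    return write, sync, 1.0 - write - sync

# Over a 60 s window: 30 s of write phase, 3 s of sync phase, the
# rest not in any checkpoint.
w, s, idle = phase_fractions(1000, 500, 31000, 3500, 60000)
```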
Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter
From: Peter Geoghegan
On 23 January 2012 02:08, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Jan 21, 2012 at 6:32 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> I'm finding the backend_writes column pretty unfortunate. The only
>> use I know of for it is to determine if the bgwriter is lagging
>> behind. Yet it doesn't serve even this purpose because it lumps
>> together the backend writes due to lagging background writes, and the
>> backend writes "by design" due to the use of a buffer access strategy
>> during bulk inserts.
>
> +1 for separating those.

I decided to have a go myself. The attached patch breaks out strategy allocations in pg_stat_bgwriter, but not strategy writes. My thinking is that this may serve to approximate non-BAS_NORMAL writes, with the considerable advantage of not requiring that I work backwards to figure out the strategy from some block when backend-local syncing (yes, syncing, not writing) a buffer, in order to work out which strategy object references the buffer. The bookkeeping that that would likely entail seems to make it infeasible.

Incidentally, it seems Postgres doesn't currently record backend writes when the buffer doesn't go on to be sync'd. That seems problematic to me, or at the very least a misrepresentation, since temporary tables will be written out by the backend, for example. Not sure if it's worth fixing, though I've added a comment to that effect at the site where backend_writes is bumped.

I have corrected the md.c bug. This previously would have prevented the sync_files (number of relation segments synced) value from being valid in non-log_checkpoints configurations.

I'm not currently confident that the strategy_alloc field is a very useful proxy for a strategy_backend_writes field.
I think that rather than bumping the strategy allocation count analogously to the way the overall count is bumped (within StrategyGetBuffer()), I should have bumped earlier, within BufferAlloc(), so that it'd count if the buffer was requested with a non-BAS_NORMAL strategy but was found in shared_buffers (so control never got as far as StrategyGetBuffer()). That might make the value more closely line up with the would-be value of a strategy_backend_writes column. What do you think?

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Attachment
Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter
From: Robert Haas
On Thu, Mar 22, 2012 at 6:07 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 23 January 2012 02:08, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sat, Jan 21, 2012 at 6:32 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> I'm finding the backend_writes column pretty unfortunate. The only
>>> use I know of for it is to determine if the bgwriter is lagging
>>> behind. Yet it doesn't serve even this purpose because it lumps
>>> together the backend writes due to lagging background writes, and the
>>> backend writes "by design" due to the use of a buffer access strategy
>>> during bulk inserts.
>>
>> +1 for separating those.
>
> I decided to have a go myself. The attached patch breaks out strategy
> allocations in pg_stat_bgwriter, but not strategy writes. My thinking
> is that this may serve to approximate non-BAS_NORMAL writes, with the
> considerable advantage of not requiring that I work backwards to
> figure out the strategy from some block when backend-local syncing
> (yes, syncing, not writing) a buffer, in order to work out which
> strategy object references the buffer. The bookkeeping that that
> would likely entail seems to make it infeasible.
>
> Incidentally, it seems Postgres doesn't currently record backend
> writes when the buffer doesn't go on to be sync'd. That seems
> problematic to me, or at the very least a misrepresentation, since
> temporary tables will be written out by the backend, for example. Not
> sure if it's worth fixing, though I've added a comment to that effect
> at the site where backend_writes is bumped.
>
> I have corrected the md.c bug. This previously would have prevented
> the sync_files (number of relation segments synced) value from being
> valid in non-log_checkpoints configurations.
>
> I'm not currently confident that the strategy_alloc field is a very
> useful proxy for a strategy_backend_writes field.
> I think that rather
> than bumping the strategy allocation count analogously to the way the
> overall count is bumped (within StrategyGetBuffer()), I should have
> bumped earlier, within BufferAlloc(), so that it'd count if the buffer
> was requested with a non-BAS_NORMAL strategy but was found in
> shared_buffers (so control never got as far as StrategyGetBuffer()).
> That might make the value more closely line up with the would-be value
> of a strategy_backend_writes column. What do you think?

Although I liked the idea of separating this out, I wasn't thinking we should do it as part of this patch. It seems like an independent project. At any rate, I strongly agree that counting the number of strategy allocations is not really a viable proxy for counting the number of backend writes. You can't know how many of those actually got dirtied.

Since any backend write is necessarily the result of that backend trying to allocate a buffer, I think maybe we should just count the number of times it was trying to allocate a buffer *using a BAS* vs. the number of times it was trying to allocate a buffer *not using a BAS*. That is, decide whether or not it's a "strategy write" not based on whether the buffer came in via a strategy allocation, but rather based on whether it's going out as a result of a strategy allocation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
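The accounting rule proposed here can be modelled in a few lines: tag each backend write by the strategy of the allocation that is evicting the dirty buffer, regardless of how the buffer originally came in. This is only a toy model; the counter names are hypothetical and none of this mirrors actual PostgreSQL code, though BAS_NORMAL, BAS_BULKWRITE and BAS_VACUUM are real BufferAccessStrategyType values.

```python
# Toy model of the suggested accounting: a backend write counts as a
# "strategy write" when the allocation *evicting* the dirty buffer
# uses a buffer access strategy, not when the buffer came in via one.
# The counter names are invented for this sketch.

def count_backend_writes(evictions):
    """evictions: (buffer_is_dirty, evicting_strategy) pairs seen by a
    backend while allocating buffers. Clean victims cost no write."""
    counts = {"backend_writes": 0, "strategy_backend_writes": 0}
    for dirty, strategy in evictions:
        if not dirty:
            continue  # clean buffers are reused without a write
        if strategy == "BAS_NORMAL":
            counts["backend_writes"] += 1            # bgwriter lagging
        else:
            counts["strategy_backend_writes"] += 1   # write "by design"
    return counts

sample = [(True, "BAS_BULKWRITE"), (True, "BAS_NORMAL"),
          (False, "BAS_NORMAL"), (True, "BAS_VACUUM")]
tally = count_backend_writes(sample)
```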
Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter
From: Peter Geoghegan
On 28 March 2012 15:23, Robert Haas <robertmhaas@gmail.com> wrote:
> At any rate, I strongly agree that counting the number of
> strategy allocations is not really a viable proxy for counting the
> number of backend writes. You can't know how many of those actually
> got dirtied.

Sure.

> Since any backend write is necessarily the result of that backend
> trying to allocate a buffer, I think maybe we should just count the
> number of times it was trying to allocate a buffer *using a BAS* vs.
> the number of times it was trying to allocate a buffer *not using a
> BAS*. That is, decide whether or not it's a "strategy write" not
> based on whether the buffer came in via a strategy allocation, but
> rather based on whether it's going out as a result of a strategy
> allocation.

I'm not quite sure I understand what you mean here. Are you suggesting
that I produce a revision that bumps a counter beside the FlushBuffer()
call in BufferAlloc(), as a dirty page is evicted/written, while
breaking the figure out into != BAS_NORMAL and == BAS_NORMAL figures?
Would both figures be presented as separate columns within
pg_stat_bgwriter?

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter
From: Robert Haas
On Sat, Mar 31, 2012 at 5:34 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
>> Since any backend write is necessarily the result of that backend
>> trying to allocate a buffer, I think maybe we should just count the
>> number of times it was trying to allocate a buffer *using a BAS* vs.
>> the number of times it was trying to allocate a buffer *not using a
>> BAS*. That is, decide whether or not it's a "strategy write" not
>> based on whether the buffer came in via a strategy allocation, but
>> rather based on whether it's going out as a result of a strategy
>> allocation.
>
> I'm not quite sure I understand what you mean here. Are you suggesting
> that I produce a revision that bumps a counter beside the FlushBuffer()
> call in BufferAlloc(), as a dirty page is evicted/written, while
> breaking the figure out into != BAS_NORMAL and == BAS_NORMAL figures?
> Would both figures be presented as separate columns within
> pg_stat_bgwriter?

Well, for 9.2, I am asking that you rip all that stuff out of the
patch altogether so we can focus on the stuff that was in the original
patch. For 9.3, yes, that's what I'm suggesting, although I'll defer
to you, Jeff, and others on whether it's a *good* suggestion.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter
From: Peter Geoghegan
On 3 April 2012 12:11, Robert Haas <robertmhaas@gmail.com> wrote:
> Well, for 9.2, I am asking that you rip all that stuff out of the
> patch altogether so we can focus on the stuff that was in the original
> patch.

Given that we're pushed for time, I'm inclined to agree that that is
the best course of action.

The attached revision does not attempt to do anything with strategy
writes/allocations.

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Attachment
On Wed, Mar 28, 2012 at 7:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Although I liked the idea of separating this out, I wasn't thinking we
> should do it as part of this patch. It seems like an independent
> project. At any rate, I strongly agree that counting the number of
> strategy allocations is not really a viable proxy for counting the
> number of backend writes. You can't know how many of those actually
> got dirtied.

I was thinking that it would be part of this patch, but only because
Greg mentioned the view format stabilizing. If it is going to be
frozen, I wanted this in it. But I guess stabilizing and fossilizing
are different things...

> Since any backend write is necessarily the result of that backend
> trying to allocate a buffer, I think maybe we should just count the
> number of times it was trying to allocate a buffer *using a BAS* vs.
> the number of times it was trying to allocate a buffer *not using a
> BAS*. That is, decide whether or not it's a "strategy write" not
> based on whether the buffer came in via a strategy allocation, but
> rather based on whether it's going out as a result of a strategy
> allocation.

Yes, exactly.

Cheers,

Jeff
Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter
From: Robert Haas
On Thu, Apr 5, 2012 at 9:05 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 3 April 2012 12:11, Robert Haas <robertmhaas@gmail.com> wrote:
>> Well, for 9.2, I am asking that you rip all that stuff out of the
>> patch altogether so we can focus on the stuff that was in the original
>> patch.
>
> Given that we're pushed for time, I'm inclined to agree that that is
> the best course of action.
>
> The attached revision does not attempt to do anything with strategy
> writes/allocations.

I have committed this after extensive revisions. I removed the
sync_files count, since no one ever explained what it was good for. I
renamed a bunch of things so that it was clear that these stats refer
to checkpoints rather than anything else. I moved the documentation to
what I believe to be the correct place. I fixed a few other assorted
things, too, and reverted a couple of hunks that didn't seem to me to
be adding anything.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company