Thread: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter

From: Greg Smith
One of the most useful bits of feedback on how well checkpoint I/O is
going is the amount of time taken to sync files to disk.  Right now the
only way to get that is to parse the logs.  The attached patch publishes
the most useful three bits of data you could only get from
log_checkpoints before out to pg_stat_bgwriter.  Easiest to just show an
example:

$ createdb pgbench
$ psql -c "select pg_stat_reset_shared('bgwriter')"
$ pgbench -i -s 10 pgbench
$ psql -c "checkpoint"

$ tail $PGLOG
LOG:  checkpoint complete: wrote 590 buffers (14.4%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.117 s,
total=1.746 s; sync files=31, longest=0.104 s, average=0.003 s

$ psql -c "select write_ms,sync_ms,sync_files from pg_stat_bgwriter"
 write_ms | sync_ms | sync_files
----------+---------+------------
         2 |     117 |         31

Like the other checkpoint statistics, these are updated in one lump when
the checkpoint finishes.
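For comparison, getting those same three numbers out of the server log
today means text scraping.  A minimal sketch of what a monitoring script
has to do (a hypothetical parser; the regex assumes the exact
log_checkpoints message format shown above):

```python
import re

# Fields of interest in a "checkpoint complete" log line, matching the
# log_checkpoints output format shown in the example above.
CHECKPOINT_RE = re.compile(
    r"write=(?P<write>[\d.]+) s, sync=(?P<sync>[\d.]+) s, "
    r"total=(?P<total>[\d.]+) s; sync files=(?P<files>\d+)"
)

def parse_checkpoint_line(line):
    """Return (write_ms, sync_ms, sync_files), or None if no match."""
    m = CHECKPOINT_RE.search(line)
    if m is None:
        return None
    return (round(float(m.group("write")) * 1000),
            round(float(m.group("sync")) * 1000),
            int(m.group("files")))

line = ("LOG:  checkpoint complete: wrote 590 buffers (14.4%); "
        "0 transaction log file(s) added, 0 removed, 0 recycled; "
        "write=0.002 s, sync=0.117 s, total=1.746 s; "
        "sync files=31, longest=0.104 s, average=0.003 s")
```

With the patch, the same triple comes straight out of a SELECT against
pg_stat_bgwriter, with no format-dependent parsing.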

This includes rough documentation and the code isn't very complicated; a
catversion bump is required.  Some open concerns I still have about this:

-The total amount of time in the sync phase is not the same as the total
of all sync wait times.  I have published the total elapsed sync time.
The total of the times spent waiting on individual sync calls is also
available at the point when the above data is collected, and it could be
published as well.  I don't think that's necessary yet, since they are
almost identical right now.  But any feature that staggers sync calls
around might make that distinction important.

-The total checkpoint time is larger than the total of write+sync.  You
can see that here:  2ms write + 117ms sync = 119ms; total checkpoint
time is actually 1746ms though.  While I don't normally find the total
useful from a tuning perspective, it may be sensible to publish it just
to further reduce any perceived need for log scraping here.

-None of the debug1 level information about individual sync calls can be
published this way usefully.  I'd like that to come out somewhere
eventually, but really what's needed there is a performance event
logging history collector, not pg_stats.  I'm content that this patch
gets the big picture data out; it's sufficient to let you rule out
checkpoint problems if they aren't happening.  If you only have to drill
into the logs when long times start showing up in this data, that's not
so bad to live with.

-I can't tell for sure if this is working properly when log_checkpoints
is off.  This now collects checkpoint end time data in all cases,
whereas before it ignored that work if log_checkpoints was off.  I think
I will have to fake the logging in this area, make it log some things
even when log_checkpoints is off, to confirm it gives the right
results.  I haven't done that yet.

I would like to be able to tell people they need never turn on
log_checkpoints if they monitor pg_stat_bgwriter instead, because that
sets a good precedent for what direction the database is going.  It
would be nice for pg_stat_bgwriter's format to settle down for a long
period too.  In addition to my small list of concerns here, there seem
to be an increasing number of write path feature patches floating around
here lately.  That all works against this being stable.  The main
consumer of this data I use regularly is the pg_stat_bgwriter graphs
that Munin produces, and my expectation is that myself and/or Magnus
will make sure it's compatible with any changes made here for 9.2.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


On 01/16/2012 01:28 AM, Greg Smith wrote:
> -I can't tell for sure if this is working properly when 
> log_checkpoints is off.  This now collects checkpoint end time data in 
> all cases, whereas before it ignored that work if log_checkpoints was off.

...and there's at least one I missed located already:  inside of md.c.  
I'd forgotten how many spots where timing calls are optimized out are 
floating around this code path.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter

From: Hitoshi Harada
On Sun, Jan 15, 2012 at 10:46 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 01/16/2012 01:28 AM, Greg Smith wrote:
>>
>> -I can't tell for sure if this is working properly when log_checkpoints is
>> off.  This now collects checkpoint end time data in all cases, whereas
>> before it ignored that work if log_checkpoints was off.
>
>
> ...and there's at least one I missed located already:  inside of md.c.  I'd
> forgotten how many spots where timing calls are optimized out are floating
> around this code path.

I think the CF app page for this patch has a missing link with a wrong message-id.

Thanks,
--
Hitoshi Harada


Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter

From: Alvaro Herrera
Excerpts from Hitoshi Harada's message of Thu Jan 19 07:08:52 -0300 2012:
> On Sun, Jan 15, 2012 at 10:46 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> > On 01/16/2012 01:28 AM, Greg Smith wrote:
> >>
> >> -I can't tell for sure if this is working properly when log_checkpoints is
> >> off.  This now collects checkpoint end time data in all cases, whereas
> >> before it ignored that work if log_checkpoints was off.
> >
> >
> > ...and there's at least one I missed located already:  inside of md.c.  I'd
> > forgotten how many spots where timing calls are optimized out are floating
> > around this code path.
>
> I think the CF app page for this patch has a missing link with a wrong message-id.

Fixed, thanks

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter

From: Robert Haas
On Mon, Jan 16, 2012 at 1:28 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> One of the most useful bits of feedback on how well checkpoint I/O is going
> is the amount of time taken to sync files to disk.  Right now the only way
> to get that is to parse the logs.  The attached patch publishes the most
> useful three bits of data you could only get from log_checkpoints before out
> to pg_stat_bgwriter.

It's not quite clear from your email, but I gather that the way that
this is intended to work is that these values increment every time we
checkpoint?

Also, forgive me for asking this possibly-stupid question, but of what
use is this information? I can't imagine why I'd care about a running
total of the number of files fsync'd to disk.  I also can't really
imagine why I'd care about the length of the write phase, which surely
will almost always be a function of checkpoint_completion_target and
checkpoint_timeout unless I manage to overrun the number of
checkpoint_segments I've allocated.  The only number that really seems
useful to me is the time spent syncing.  I have a clear idea what to
look for there: smaller numbers are better than bigger ones.  For the
rest I'm mystified.

And, it doesn't seem like it's necessarily going to save me a whole
lot either, because if it turns out that my sync phases are long, the
first question out of my mouth is going to be "what percentage of my
total sync time is accounted for by the longest sync?".  And so right
there I'm back to the logs.  It's not clear how such information could
be usefully exposed in pg_stat_bgwriter either, since you probably
want to know only the last few values, not a total over all time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter

From: Simon Riggs
On Thu, Jan 19, 2012 at 3:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> And, it doesn't seem like it's necessarily going to save me a whole
> lot either, because if it turns out that my sync phases are long, the
> first question out of my mouth is going to be "what percentage of my
> total sync time is accounted for by the longest sync?".  And so right
> there I'm back to the logs.

It seems clear that the purpose of this is to quickly and easily
decide whether there is a potential problem.

If you decide there is a potential problem, you may wish to look at
more detailed information.

So there is a huge time saving from avoiding the need to look at the
detail when it's unnecessary and possibly not even enabled.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


On 01/19/2012 10:52 AM, Robert Haas wrote:
> It's not quite clear from your email, but I gather that the way that
> this is intended to work is that these values increment every time we
> checkpoint?

Right--they get updated in the same atomic bump that moves up things
like buffers_checkpoint.

> Also, forgive me for asking this possibly-stupid question, but of what
> use is this information? I can't imagine why I'd care about a running
> total of the number of files fsync'd to disk.  I also can't really
> imagine why I'd care about the length of the write phase, which surely
> will almost always be a function of checkpoint_completion_target and
> checkpoint_timeout unless I manage to overrun the number of
> checkpoint_segments I've allocated.  The only number that really seems
> useful to me is the time spent syncing.  I have a clear idea what to
> look for there: smaller numbers are better than bigger ones.  For the
> rest I'm mystified.

Priority #1 here is to reduce (but, admittedly, not always eliminate) 
the need for log file parsing of this particular area, so including all 
the major bits from the existing log message that can be published this 
way would include the write phase time.  You mentioned one reason why 
the write phase time might be interesting; there could be others.  One 
of the things expected here is that Munin will expand its graphing of 
values from pg_stat_bgwriter to include all these fields.  Most of the 
time the graph of time spent in the write phase will be boring and 
useless.  Making it easy for a look at a graph to spot those rare times 
when it isn't is one motivation for including it.

As for why to include the number of files being sync'd, one reason is 
again simply wanting to include everything that can easily be 
published.  A second is that it helps support ideas like my "Checkpoint 
sync pause" one; that's untunable in any reasonable way without some 
easy way of monitoring the number of files typically sync'd.  Sometimes 
when I'm investigating checkpoint spikes during sync, I wonder whether 
they were because more files than usual were synced, or if it's instead 
just because of more churn on a smaller number.  Making this easy to 
graph pulls that data out to where I can compare it with disk I/O 
trends.  And there's precedent now proving that an always incrementing 
number in pg_stat_bgwriter can be turned into such a graph easily by 
monitoring tools.
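The graphing precedent mentioned here boils down to sampling the
cumulative counters periodically and plotting successive differences.
A sketch of that step (the sample dicts and their values are made up
for illustration; only the column names come from the patch):

```python
def counter_deltas(prev, curr):
    """Turn two cumulative pg_stat_bgwriter samples into per-interval
    deltas, the way Munin-style tools derive rates from counters."""
    return {k: curr[k] - prev[k] for k in curr if k in prev}

# Two hypothetical samples taken one graphing interval apart.
prev = {"sync_files": 310, "sync_ms": 5400, "buffers_checkpoint": 9000}
curr = {"sync_files": 341, "sync_ms": 5517, "buffers_checkpoint": 9590}
```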

> And, it doesn't seem like it's necessarily going to save me a whole
> lot either, because if it turns out that my sync phases are long, the
> first question out of my mouth is going to be "what percentage of my
> total sync time is accounted for by the longest sync?".  And so right
> there I'm back to the logs.  It's not clear how such information could
> be usefully exposed in pg_stat_bgwriter either, since you probably
> want to know only the last few values, not a total over all time.

This isn't ideal yet.  I mentioned how some future "performance event 
logging history collector" was really needed as a place to push longest 
sync times into, and we don't have it yet.  This is the best thing to 
instrument that I'm sure is useful, and that I can stick onto with the 
existing infrastructure.

The idea is that this change makes it possible to trigger a "sync times 
are too long" alert out of a tool that's based solely on database 
queries.  When that goes off, yes you're possibly back to the logs again 
for more details about the longest individual sync time.  But the rest 
of the time, what's hopefully the normal state of things, you can ignore 
the logs and just track the pg_stat_bgwriter numbers.
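A sketch of the query-driven alert described here.  The threshold, the
per-checkpoint averaging, and the combined "checkpoints" counter name
are illustrative assumptions (the view actually splits timed and
requested checkpoints); only sync_ms comes from the patch:

```python
def sync_time_alert(prev, curr, threshold_ms=1000):
    """Compare two pg_stat_bgwriter snapshots and flag an interval in
    which the average sync phase per completed checkpoint ran long."""
    checkpoints = curr["checkpoints"] - prev["checkpoints"]
    if checkpoints == 0:
        return False  # no checkpoint completed during this interval
    avg_sync_ms = (curr["sync_ms"] - prev["sync_ms"]) / checkpoints
    return avg_sync_ms > threshold_ms
```

When this fires, you drop down to the logs for the per-file detail;
the rest of the time the view alone is enough.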

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



On Sun, Jan 15, 2012 at 10:28 PM, Greg Smith <greg@2ndquadrant.com> wrote:

> I would like to be able to tell people they need never turn on
> log_checkpoints if they monitor pg_stat_bgwriter instead, because that sets
> a good precedent for what direction the database is going.  It would be nice
> for pg_stat_bgwriter's format to settle down for a long period too.

While on the topic of the format of pg_stat_bgwriter and it converging
on stability:

I'm finding the backend_writes column pretty unfortunate.  The only
use I know of for it is to determine if the bgwriter is lagging
behind.  Yet it doesn't serve even this purpose because it lumps
together the backend writes due to lagging background writes, and the
backend writes "by design" due to the use of a buffer access strategy
during bulk inserts.

If I'm running a tightly controlled benchmark on an otherwise unused
server and I know that no BAS is being used, then I can meaningfully
use backend_writes.  That is a pretty limiting limit.

I think we should either create a separate column to segregate BAS
backend_writes, or just don't report them at all and report only the
non-BAS ones into pg_stat_bgwriter.

I know you've argued against this before
(http://archives.postgresql.org/pgsql-performance/2011-03/msg00340.php),
but I think the assumption that all writes are absorbed by the OS
without blocking is assuming an awful lot.  If the server is not
behaving well, then it very well may be blocking on writes.  If we are
confident that the OS can always efficiently cache writes, why bother
with a background writer at all?

I think the argument that it is not important to know which strategy
motivated a backend write could just as validly be reformulated as an
argument that we don't need to know how many backend writes there were
in the first place.

Knowing only an aggregation of BAS-motivated backend writes and
straggling-bgwriter-motivated backend writes just doesn't seem useful
to me.  Am I missing a use for it?  Not only is it probably not
useful, it seems to be actively harmful as I've seen several times
where it leads people down false paths (including me several times, as
about once a year I forget why it is useless and spend time
rediscovering it).

If we are not willing to separate the two types of backend writes, or
just omit the BAS ones altogether, then I would suggest we drop the
column entirely.

Would a patch to separate the backend writes into two columns be
entertained for 9.3?

Cheers,

Jeff


Jeff Janes wrote:
> I'm finding the backend_writes column pretty unfortunate.  The only
> use I know of for it is to determine if the bgwriter is lagging
> behind.  Yet it doesn't serve even this purpose because it lumps
> together the backend writes due to lagging background writes, and the
> backend writes "by design" due to the use of a buffer access strategy
> during bulk inserts.
>
> If I'm running a tightly controlled benchmark on an otherwise unused
> server and I know that no BAS is being used, then I can meaningfully
> use backend_writes.  That is a pretty limiting limit.
>   

I don't think it's quite that bad in general; this presumes a moderate 
amount of BAS writes relative to other activity.  But I understand your 
concern better now.  I don't think the sorts of workloads you seem to 
have a lot of were considered very carefully here before.

> I think we should either create a separate column to segregate BAS
> backend_writes, or just don't report them at all and report only the
> non-BAS ones into pg_stat_bgwriter.
>   

We can't not report them.  One of the goals of pg_stat_bgwriter is to 
account for all writes out of the buffer cache.  If there are enough BAS
writes on your system that them being lumped together is a serious 
problem, having them go missing altogether would be even worse.  And any 
whacking around of pg_stat_bgwriter might as well fix that too.

Do you think you could put together a quick test case that's similar to 
the ones you're seeing unfair accounting for here?  This isn't quite 
buggy behavior, but a solid example I could test against showing it's a 
sketchy approach would be enough for me to incorporate a fix for it into 
this suggested redesign.



Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter

From: Robert Haas
On Sat, Jan 21, 2012 at 6:32 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> I'm finding the backend_writes column pretty unfortunate.  The only
> use I know of for it is to determine if the bgwriter is lagging
> behind.  Yet it doesn't serve even this purpose because it lumps
> together the backend writes due to lagging background writes, and the
> backend writes "by design" due to the use of a buffer access strategy
> during bulk inserts.

+1 for separating those.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas wrote:
> On Sat, Jan 21, 2012 at 6:32 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>   
>> I'm finding the backend_writes column pretty unfortunate.  The only
>> use I know of for it is to determine if the bgwriter is lagging
>> behind.  Yet it doesn't serve even this purpose because it lumps
>> together the backend writes due to lagging background writes, and the
>> backend writes "by design" due to the use of a buffer access strategy
>> during bulk inserts.
>>     
>
> +1 for separating those.
>   

I'm tied up with some on-site things until Friday, can rev the patch 
with this in mind (and clear some of my overall review work pile) then.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter

From: Peter Geoghegan
On 16 January 2012 06:28, Greg Smith <greg@2ndquadrant.com> wrote:
> One of the most useful bits of feedback on how well checkpoint I/O is going
> is the amount of time taken to sync files to disk.  Right now the only way
> to get that is to parse the logs.  The attached patch publishes the most
> useful three bits of data you could only get from log_checkpoints before out
> to pg_stat_bgwriter.  Easiest to just show an example:

This is a preliminary review of the patch. Attached is a revision of
the patch that clears up some minor bit-rot and style issues, just as
a convenience to everyone else who might like to try it out against
master who doesn't want to repeat my work. There were oid collisions
against master, and an upstream change in the pg_proc format with the
introduction of the "proleakproof" column. Regression tests pass (as
they must have in the prior revision, as the original authors modified
"expected/rules.out"). Note that I have deliberately avoided changing
the patch further than that, though I believe that further changes
will be necessary.

One beef that I have with the variable name m_write_ms is that "ms"
could equally well refer to microseconds or milliseconds, and these
mistakes are very common. Therefore, you might think about making the
fields of type "instr_time", which is consistent with some other such
fields used within the statistics collector - this takes care of
issues with overflows, portability and confusion over units, to as
great an extent as is possible. I'm not even confident that this is
the right thing though, because PgStat_StatFuncEntry, for example,
*does* use PgStat_Counter as the type for time values, in its case
apparently *microseconds* rather than milliseconds, even though the
pg_stat_user_functions view displays values in milliseconds (it's
incorrectly commented). Even the use of instr_time within some
pgstat.h structs sees the value converted to microseconds for
transmission to the collector. It might be that since we don't really
accumulate time values here, it's better to just directly represent
that value as PgStat_Counter, but in that case I'd still clearly mark
the value as being milliseconds. In any case, I would have the columns
in milliseconds, but not ending with *_ms within pg_stat_bgwriter,
for consistency with pg_stat_user_functions et al.

There is no reason to cast to uint64 when accumulating milliseconds,
as the lvalue is currently a PgStat_Counter.

Details of existing comment bug: The docs show that
pg_stat_*_functions accumulates time in units of milliseconds.
However, a PgStat_FunctionEntry payload is commented as sending the
time/self time to the stats collector in microseconds. So this
comment, in the existing code, is actually wrong:

    PgStat_Counter f_time;        /* times in microseconds */
    PgStat_Counter f_time_self;
} PgStat_StatFuncEntry;

Can someone commit a fix for that please? It could easily result in a
user-visible bug. Mixing units, and representing them as simple
integers and totally eschewing type safety is sort of a pet peeve of
mine, but consistency with the existing code is more important.

My general impression of the patch's code quality is that it is more
or less idiomatic, and, as Greg said, simple; it generalises from the
style of existing code in LogCheckpointEnd(), as well as existing
fields within the PgStat_MsgBgWriter and PgStat_GlobalStats structs.

As for the substance of the patch, I am in general agreement that this
is a good idea. Storing the statistics in pg_stat_bgwriter is a more
flexible approach than is immediately apparent.  Users can usefully
calculate delta values, as part of the same process through which
checkpoint tuning is performed by comparing output from "select now(),
* from pg_stat_bgwriter" at different intervals. This also puts this
information within easy reach of monitoring tools.

So, I'd ask Greg and/or Jaime to produce a revision of the patch with
those concerns in mind, as well as fixing the md.c usage of
log_checkpoints. That variable is currently used in these places only:

src/backend/access/transam/xlog.c
80:bool        log_checkpoints = false;
7694:     * Note: because it is possible for log_checkpoints to change while a
7696:     * log_checkpoints is currently off.
7835:    if (log_checkpoints)
8035:    if (log_checkpoints)
8229:     * Note: because it is possible for log_checkpoints to change while a
8231:     * log_checkpoints is currently off.
8236:    if (log_checkpoints)
8301:    if (log_checkpoints)
8305:    ereport((log_checkpoints ? LOG : DEBUG2),

src/backend/storage/smgr/md.c
1097:                if (log_checkpoints)
1105:                    if (log_checkpoints && (!INSTR_TIME_IS_ZERO(sync_start)))

I also agree with separating backend write times due to lagging
background writes, and the backend writes "by design", as described by
Jeff. I'm not inclined to worry too much about the usability issues
surrounding the relative values of things like the total checkpoint
time and total write and sync phase times matching up.

What do you intend to do about testing the log_checkpoints = 'off'
case? Can I help in some way? It seems pretty obvious to me that the
changes made are correct, apart from the fact that you omitted to
modify md.c.  In particular, there are no pre- or post-conditions
involved with calling TimeCheckpointEnd() (formerly LogCheckpointEnd()).
This is evident from the following comment at transam/xlog.c:7709:

 * Note: because it is possible for log_checkpoints to change while a
 * checkpoint proceeds, we always accumulate stats, even if
 * log_checkpoints is currently off.

It seems fairly silly to me to not expose this information if it's
invariably being collected anyway.

Cheers
--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

On Wed, Feb 22, 2012 at 2:11 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> One beef that I have with the variable name m_write_ms is that "ms"
> could equally well refer to microseconds or milliseconds, and these
> mistakes are very common.

I would expect ms to mean milliseconds and us to mean microseconds.

> Details of existing comment bug: The docs show that
> pg_stat_*_functions accumulates time in units of milliseconds.
> However, a PgStat_FunctionEntry payload is commented as sending the
> time/self time to the stats collector in microseconds. So this
> comment, in the existing code, is actually wrong:
>
>        PgStat_Counter f_time;          /* times in microseconds */
>        PgStat_Counter f_time_self;
> } PgStat_StatFuncEntry;

I think that the comment is not wrong.  The code that populates it
looks like this:
       m_ent->f_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_time);
       m_ent->f_time_self = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_time_self);

If that's not producing microseconds, those macros are badly misnamed.
Note that the pg_stat_user_functions view divides the return value of
the function by 1000, which is why the view output is in milliseconds.

> As for the substance of the patch, I am in general agreement that this
> is a good idea. Storing the statistics in pg_stat_bgwriter is a more
> flexible approach than is immediately apparent.  Users can usefully
> calculate delta values, as part of the same process through which
> checkpoint tuning is performed by comparing output from "select now(),
> * from pg_stat_bgwriter" at different intervals. This also puts this
> information within easy reach of monitoring tools.

On further thought, I'll revise the position I took upthread and
concede that write_ms and sync_ms are useful.  Given values of the
stats counters at time A and time B, you can calculate what percentage
of that time was spent in the write phase of some checkpoint, the sync
phase of some checkpoint, and not in any checkpoint.  But I'm still
pretty sketchy on sync_files.  I can't see how you can do anything
useful with that.
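The phase-percentage calculation described above can be sketched like
this, given two counter samples and the wall-clock time between them
(write_ms and sync_ms are the patch's columns; the rest is
illustrative):

```python
def checkpoint_time_breakdown(a, b, interval_ms):
    """Given cumulative write_ms/sync_ms values at two sample times,
    return the percentage of the interval spent in each checkpoint
    phase and the percentage spent in no checkpoint at all."""
    write = b["write_ms"] - a["write_ms"]
    sync = b["sync_ms"] - a["sync_ms"]
    return {
        "write_pct": 100.0 * write / interval_ms,
        "sync_pct": 100.0 * sync / interval_ms,
        "idle_pct": 100.0 * (interval_ms - write - sync) / interval_ms,
    }
```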

> So, I'd ask Greg and/or Jaime to produce a revision of the patch with
> those concerns in mind, as well as fixing the md.c usage of
> log_checkpoints.

It seems we are still waiting for an update of this patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter

From: Peter Geoghegan
On 23 January 2012 02:08, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Jan 21, 2012 at 6:32 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> I'm finding the backend_writes column pretty unfortunate.  The only
>> use I know of for it is to determine if the bgwriter is lagging
>> behind.  Yet it doesn't serve even this purpose because it lumps
>> together the backend writes due to lagging background writes, and the
>> backend writes "by design" due to the use of a buffer access strategy
>> during bulk inserts.
>
> +1 for separating those.

I decided to have a go myself. Attached patch breaks out strategy
allocations in pg_stat_bgwriter, but not strategy writes. My thinking
is that this may serve to approximate non-BAS_NORMAL writes, with the
considerable advantage of not requiring that I work backwards to
figure out strategy from some block when backend-local syncing (yes,
syncing, not writing) a buffer to work out which strategy object
references the buffer. The bookkeeping that that would likely entail
seems to make it infeasible.

Incidentally, it seems Postgres doesn't currently record backend
writes when the buffer doesn't go on to be sync'd. That seems
problematic to me, or at the very least a misrepresentation, since
temporary tables will be written out by the backend for example. Not
sure if it's worth fixing, though I've added a comment to that effect
at the site of where backend_writes is bumped.

I have corrected the md.c bug. This previously would have prevented
the sync_files (number of relation segments synced) value from being
valid in non-log_checkpoints configurations.

I'm not currently confident that the strategy_alloc field is a very
useful proxy for a strategy_backend_writes field. I think that rather
than bumping the strategy allocation count analogously to the way the
overall count is bumped (within StrategyGetBuffer()), I should have
bumped earlier within BufferAlloc() so that it'd count if the buffer
was requested with non-BAS_NORMAL strategy but was found in
shared_buffers (so control never got as far as StrategyGetBuffer() ).
That might make the value more closely line-up to the would-be value
of a strategy_backend_writes column. What do you think?

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

On Thu, Mar 22, 2012 at 6:07 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 23 January 2012 02:08, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sat, Jan 21, 2012 at 6:32 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> I'm finding the backend_writes column pretty unfortunate.  The only
>>> use I know of for it is to determine if the bgwriter is lagging
>>> behind.  Yet it doesn't serve even this purpose because it lumps
>>> together the backend writes due to lagging background writes, and the
>>> backend writes "by design" due to the use of a buffer access strategy
>>> during bulk inserts.
>>
>> +1 for separating those.
>
> I decided to have a go myself. Attached patch breaks out strategy
> allocations in pg_stat_bgwriter, but not strategy writes. My thinking
> is that this may serve to approximate non-BAS_NORMAL writes, with the
> considerable advantage of not requiring that I work backwards to
> figure out strategy from some block when backend-local syncing (yes,
> syncing, not writing) a buffer to work out which strategy object
> references the buffer. The bookkeeping that that would likely entail
> seems to make it infeasible.
>
> Incidentally, it seems Postgres doesn't currently record backend
> writes when the buffer doesn't go on to be sync'd. That seems
> problematic to me, or at the very least a misrepresentation, since
> temporary tables will be written out by the backend for example. Not
> sure if it's worth fixing, though I've added a comment to that effect
> at the site of where backend_writes is bumped.
>
> I have corrected the md.c bug. This previously would have prevented
> the sync_files (number of relation segments synced) value from being
> valid in non-log_checkpoints configurations.
>
> I'm not currently confident that the strategy_alloc field is a very
> useful proxy for a strategy_backend_writes field. I think that rather
> than bumping the strategy allocation count analogously to the way the
> overall count is bumped (within StrategyGetBuffer()), I should have
> bumped earlier within BufferAlloc() so that it'd count if the buffer
> was requested with non-BAS_NORMAL strategy but was found in
> shared_buffers (so control never got as far as StrategyGetBuffer() ).
> That might make the value more closely line-up to the would-be value
> of a strategy_backend_writes column. What do you think?

Although I liked the idea of separating this out, I wasn't thinking we
should do it as part of this patch.  It seems like an independent
project.  At any rate, I strongly agree that counting the number of
strategy allocations is not really a viable proxy for counting the
number of backend writes.  You can't know how many of those actually
got dirtied.

Since any backend write is necessarily the result of that backend
trying to allocate a buffer, I think maybe we should just count the
number of times it was trying to allocate a buffer *using a BAS*
vs. the number of times it was trying to allocate a buffer *not using
a BAS*.  That is, decide whether or not it's a "strategy write"
not based on whether the buffer came in via a strategy allocation, but
rather based on whether it's going out as a result of a strategy
allocation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter

From
Peter Geoghegan
Date:
On 28 March 2012 15:23, Robert Haas <robertmhaas@gmail.com> wrote:
> At any rate, I strongly agree that counting the number of
> strategy allocations is not really a viable proxy for counting the
> number of backend writes.  You can't know how many of those actually
> got dirtied.

Sure.

> Since any backend write is necessarily the result of that backend
> trying to allocate a buffer, I think maybe we should just count the
> number of times it was trying to allocate a buffer *using a BAS*
> vs. the number of times it was trying to allocate a buffer *not using
> a BAS*.  That is, decide whether or not it's a "strategy write"
> not based on whether the buffer came in via a strategy allocation, but
> rather based on whether it's going out as a result of a strategy
> allocation.

I'm not quite sure I understand what you mean here. Are you suggesting
that I produce a revision that bumps a counter beside FlushBuffer() in
BufferAlloc(), as a dirty page is evicted/written, breaking the figure
out into != BAS_NORMAL and == BAS_NORMAL counts? Would both
figures be presented as separate columns within pg_stat_bgwriter?
--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


On Sat, Mar 31, 2012 at 5:34 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
>> Since any backend write is necessarily the result of that backend
>> trying to allocate a buffer, I think maybe we should just count the
>> number of times it was trying to allocate a buffer *using a BAS*
>> vs. the number of times it was trying to allocate a buffer *not using
>> a BAS*.  That is, decide whether or not it's a "strategy write"
>> not based on whether the buffer came in via a strategy allocation, but
>> rather based on whether it's going out as a result of a strategy
>> allocation.
>
> I'm not quite sure I understand what you mean here. Are you suggesting
> that I produce a revision that bumps a counter beside FlushBuffer() in
> BufferAlloc(), as a dirty page is evicted/written, breaking the figure
> out into != BAS_NORMAL and == BAS_NORMAL counts? Would both
> figures be presented as separate columns within pg_stat_bgwriter?

Well, for 9.2, I am asking that you rip all that stuff out of the
patch altogether so we can focus on the stuff that was in the original
patch.

For 9.3, yes, that's what I'm suggesting, although I'll defer to you,
Jeff, and others on whether it's a *good* suggestion.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter

From
Peter Geoghegan
Date:
On 3 April 2012 12:11, Robert Haas <robertmhaas@gmail.com> wrote:
> Well, for 9.2, I am asking that you rip all that stuff out of the
> patch altogether so we can focus on the stuff that was in the original
> patch.

Given how we're pushed for time, I'm inclined to agree that that is
the best course of action.

Attached revision does not attempt to do anything with strategy
writes/allocations.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

On Wed, Mar 28, 2012 at 7:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> Although I liked the idea of separating this out, I wasn't thinking we
> should do it as part of this patch.  It seems like an independent
> project.  At any rate, I strongly agree that counting the number of
> strategy allocations is not really a viable proxy for counting the
> number of backend writes.  You can't know how many of those actually
> got dirtied.

I was thinking that it would be part of this patch, but only because
Greg mentioned the view format stabilizing.  If it is going to be
frozen, I wanted this in it.  But I guess stabilize and fossilize are
different things...

> Since any backend write is necessarily the result of that backend
> trying to allocate a buffer, I think maybe we should just count the
> number of times it was trying to allocate a buffer *using a BAS*
> vs. the number of times it was trying to allocate a buffer *not using
> a BAS*.  That is, decide whether or not it's a "strategy write"
> not based on whether the buffer came in via a strategy allocation, but
> rather based on whether it's going out as a result of a strategy
> allocation.

Yes, exactly.

Cheers,

Jeff


On Thu, Apr 5, 2012 at 9:05 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 3 April 2012 12:11, Robert Haas <robertmhaas@gmail.com> wrote:
>> Well, for 9.2, I am asking that you rip all that stuff out of the
>> patch altogether so we can focus on the stuff that was in the original
>> patch.
>
> Given how we're pushed for time, I'm inclined to agree that that is
> the best course of action.
>
> Attached revision does not attempt to do anything with strategy
> writes/allocations.

I have committed this after extensive revisions.  I removed the
sync_files count, since no one ever explained what that was good for.
I renamed a bunch of things so that it was clear that these stats were
referring to checkpoints rather than anything else.  I moved the
documentation to what I believe to be the correct place.  I fixed a
few other assorted things, too, and reverted a couple hunks that
didn't seem to me to be adding anything.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company