Re: Proposal: Log inability to lock pages during vacuum - Mailing list pgsql-hackers

From Jim Nasby
Subject Re: Proposal: Log inability to lock pages during vacuum
Msg-id 54616812.2000302@BlueTreble.com
In response to Re: Proposal: Log inability to lock pages during vacuum  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: Proposal: Log inability to lock pages during vacuum  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Proposal: Log inability to lock pages during vacuum  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
On 11/10/14, 12:56 PM, Andres Freund wrote:
> On 2014-11-10 12:37:29 -0600, Jim Nasby wrote:
>> On 11/10/14, 12:15 PM, Andres Freund wrote:
>>>>> If what we want is to quantify the extent of the issue, would it be more
>>>>> convenient to save counters to pgstat?  Vacuum already sends pgstat
>>>>> messages, so there's no additional traffic there.
>>> I'm pretty strongly against that one in isolation. They'd need to be
>>> stored somewhere and they'd need to be queryable somewhere with enough
>>> context to make sense.  To actually make sense of the numbers we'd also
>>> need to report all the other datapoints of vacuum in some form. That's
>>> quite a worthwhile project imo - but *much* *much* more work than this.
>>
>> We already report statistics on vacuums
>> (lazy_vacuum_rel()/pgstat_report_vacuum), so this would just be adding
>> 1 or 2 counters to that. Should we add the other counters from vacuum?
>> That would be significantly more data.
>
> At the very least it'd require:
> * The number of buffers skipped due to the vm
> * The number of buffers actually scanned
> * The number of full table in contrast to partial vacuums

If we're going to track full-scan vacuums separately, I think we'd need two separate scan counters. I think (for
pgstats) it'd make more sense to just count the initial failure to acquire the lock in a full scan in the 'skipped page'
counter. In terms of answering the question "how common is it not to get the lock", it's really the same event.

> I think it'd require a fair amount of thinking about which values are
> required to make sense of the number of skipped buffers due to not being
> able to acquire the cleanup lock.
>
> If you want to do this - and I sure don't want to stop you from it - you
> should look at it from a general perspective, not from the perspective
> of how skipped cleanup locks are logged.

Honestly, my desire at this point is just to see if there's actually a problem. Many people are asserting that this
should be a very rare occurrence, but there's no way to know.

Towards that simple end, I'm a bit torn. My preference would be to simply log, and throw a warning if it's over some
threshold. I believe that would give the best odds of getting feedback from users if this isn't as uncommon as we
think.
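
To make the "log, and warn over some threshold" idea concrete, here is a rough standalone sketch. It is not PostgreSQL code: the struct, the function names, and the ratio-based threshold are all made up for illustration, and the real counters would live wherever lazy vacuum keeps its per-relation stats. It counts an initial failure to get the cleanup lock the same way whether or not the vacuum later waits for the page.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-vacuum tally; these names are illustrative only. */
typedef struct VacLockTally
{
    long pages_visited;  /* pages vacuum tried to lock */
    long lock_misses;    /* pages where the first lock attempt failed */
} VacLockTally;

/* Record one page visit and whether the first lock attempt succeeded. */
void
tally_page(VacLockTally *t, bool got_lock_first_try)
{
    t->pages_visited++;
    if (!got_lock_first_try)
        t->lock_misses++;
}

/* True when the miss ratio is strictly above the (assumed) threshold. */
bool
lock_misses_exceed(const VacLockTally *t, double threshold)
{
    if (t->pages_visited == 0)
        return false;
    return (double) t->lock_misses / (double) t->pages_visited > threshold;
}

/* Emit one summary line, escalating to a warning above the threshold. */
void
report_tally(const VacLockTally *t, double threshold)
{
    fprintf(stderr,
            "%s: %ld of %ld pages missed the cleanup lock on first attempt\n",
            lock_misses_exceed(t, threshold) ? "WARNING" : "LOG",
            t->lock_misses, t->pages_visited);
}
```

Whether the threshold should be a ratio or an absolute page count is an open choice; a ratio is used here only because it is independent of table size.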

I agree that ideally this would be tracked as another stat, but from that standpoint I think there are other, much more
important metrics to track, and AFAIK the only reason we don't have them is that busy systems already push pgstats to
its limits. We should try and fix that, but that's a much bigger issue.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


