Re: Proposal: Log inability to lock pages during vacuum - Mailing list pgsql-hackers

From Jim Nasby
Subject Re: Proposal: Log inability to lock pages during vacuum
Date
Msg-id 54639A0C.3030000@BlueTreble.com
In response to Re: Proposal: Log inability to lock pages during vacuum  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: Proposal: Log inability to lock pages during vacuum  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
On 11/11/14, 2:01 AM, Andres Freund wrote:
> On 2014-11-10 19:36:18 -0600, Jim Nasby wrote:
>> On 11/10/14, 12:56 PM, Andres Freund wrote:
>>> On 2014-11-10 12:37:29 -0600, Jim Nasby wrote:
>>>> On 11/10/14, 12:15 PM, Andres Freund wrote:
>>>>>>> If what we want is to quantify the extent of the issue, would it be more
>>>>>>> convenient to save counters to pgstat?  Vacuum already sends pgstat
>>>>>>> messages, so there's no additional traffic there.
>>>>> I'm pretty strongly against that one in isolation. They'd need to be
>>>>> stored somewhere and they'd need to be queryable somewhere with enough
>>>>> context to make sense.  To actually make sense of the numbers we'd also
>>>>> need to report all the other datapoints of vacuum in some form. That's
>>>>> quite a worthwhile project imo - but *much* *much* more work than this.
>>>>
>>>> We already report statistics on vacuums
>>>> (lazy_vacuum_rel()/pgstat_report_vacuum), so this would just be adding
>>>> 1 or 2 counters to that. Should we add the other counters from vacuum?
>>>> That would be significantly more data.
>>>
>>> At the very least it'd require:
>>> * The number of buffers skipped due to the vm
>>> * The number of buffers actually scanned
>>> * The number of full-table vacuums in contrast to partial vacuums
>>
>> If we're going to track full scan vacuums separately, I think we'd
>> need two separate scan counters.
>
> Well, we already have the entire number of vacuums, so we'd have that.

I meant the number of pages scanned, but as I said below, I don't think that's really necessary.

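(For reference, the pgstat reporting I mentioned above is a single call at the end of lazy_vacuum_rel(), so the plumbing for 1 or 2 more counters is small. I'm quoting the signature from memory, so treat it as approximate; the last argument is the hypothetical addition:)

    pgstat_report_vacuum(RelationGetRelid(onerel),
                         onerel->rd_rel->relisshared,
                         new_rel_tuples,
                         vacrelstats->new_dead_tuples,
                         vacrelstats->lock_failed_pages); /* hypothetical new counter */
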
>> I think (for pgstats) it'd make more sense to just count initial
>> failure to acquire the lock in a full scan in the 'skipped page'
>> counter. In terms of answering the question "how common is it not to
>> get the lock", it's really the same event.
>
> It's absolutely not. You need to correlate the number of skipped pages
> to the number of vacuumed pages. If you have 100k skipped in 10 billion
> total scanned pages it's something entirely different than 100k in 200k
> pages.

If the goal here is to find out whether this even is a problem, then I think the critical question is not "did we vacuum" but "were we able to acquire the lock on the first try".

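(To illustrate where that first try happens -- this is a loose paraphrase of the relevant bit of lazy_scan_heap() in vacuumlazy.c, with a hypothetical counter marking the event we'd count:)

    buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
                             RBM_NORMAL, vac_strategy);

    if (!ConditionalLockBufferForCleanup(buf))
    {
        /* first attempt at the cleanup lock failed: the event in question */
        vacrelstats->lock_failed_pages++;    /* hypothetical counter */

        if (!scan_all)
        {
            /* partial vacuum: we just give up on this page */
            ReleaseBuffer(buf);
            continue;
        }
        /* full-table scan: we have to have the page, so wait for it */
        LockBufferForCleanup(buf);
    }
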
Obviously users will care much more about the vacuuming and not so much about the lock; but if this really is a non-issue, as most tend to believe, I don't think it's worth worrying about any of this (except perhaps putting dtrace/systemtap probes in).

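(The probe route would be tiny. The probe name below is made up, but the mechanism is the standard one: an entry in src/backend/utils/probes.d generates a TRACE_POSTGRESQL_* macro for the call site.)

    /* in src/backend/utils/probes.d (hypothetical probe name): */
    probe vacuum__page__lock__failed(Oid, BlockNumber);

    /* at the spot in vacuumlazy.c where the conditional lock fails: */
    TRACE_POSTGRESQL_VACUUM_PAGE_LOCK_FAILED(RelationGetRelid(onerel), blkno);
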
>> Honestly, my desire at this point is just to see if there's actually a
>> problem. Many people are asserting that this should be a very rare
>> occurrence, but there's no way to know.
>
> Ok.
>
>> Towards that simple end, I'm a bit torn. My preference would be to
>> simply log, and throw a warning if it's over some threshold. I believe
>> that would give the best odds of getting feedback from users if this
>> isn't as uncommon as we think.
>
> I'm strongly against a warning. We have absolutely no sane way of tuning
> that. We'll just create a pointless warning that people will get
> confused about and that they'll have to live with till the next release.

To clarify: I'm only suggesting we issue a warning if we have to skip some significant number of pages; say 5 or 0.01% of the table, whichever is greater. That's aimed directly at the goal of letting us know if this is actually a problem or not.

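(Concretely, something like this at the end of the heap scan -- the names are placeholders, and 0.01% of the table is just rel_pages / 10000:)

    /* warn only past a threshold: 5 pages or 0.01% of the table,
     * whichever is greater */
    BlockNumber threshold = Max(5, vacrelstats->rel_pages / 10000);

    if (vacrelstats->lock_failed_pages >= threshold)
        ereport(WARNING,
                (errmsg("could not acquire cleanup lock on %u of %u pages of \"%s\"",
                        vacrelstats->lock_failed_pages,
                        vacrelstats->rel_pages,
                        RelationGetRelationName(onerel))));
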
The reason I'm inclined to do the warning is that I don't think people will notice this otherwise. If this really isn't a problem then it won't matter; if it's a *big* problem then we'll at least know about it.

I'm thinking of an undocumented GUC to control the threshold, but I assume no one else would be on board with that?
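
(For what it's worth, the guc.c entry would be trivial -- everything here is invented except the flags, which are the standard way to keep a GUC out of postgresql.conf.sample and SHOW ALL:)

    /* hypothetical addition to ConfigureNamesInt[] in guc.c */
    {
        {"vacuum_lock_skip_warning_threshold", PGC_SUSET, UNGROUPED,
            gettext_noop("Number of unlockable pages before VACUUM logs a warning."),
            NULL,
            GUC_NOT_IN_SAMPLE | GUC_NO_SHOW_ALL
        },
        &vacuum_lock_skip_warning_threshold,
        5, 0, INT_MAX,
        NULL, NULL, NULL
    },
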
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


