Re: Less than ideal error reporting in pg_stat_statements - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Less than ideal error reporting in pg_stat_statements
Msg-id CAM3SWZRH8=6P8=eicmOFAY+WROV_5giezZc=LnoUUtFYRYG0pg@mail.gmail.com
In response to Re: Less than ideal error reporting in pg_stat_statements  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Less than ideal error reporting in pg_stat_statements  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Sun, Oct 4, 2015 at 9:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hm.  I'm unconvinced by the aspects of this that involve using
> mean_query_len as a filter on which texts will be accepted.  While that's
> not necessarily bad in the abstract, there are way too many implementation
> artifacts here that will result in random-seeming decisions about whether
> to normalize.

There are already plausible race conditions that can make query text
normalization not occur. I think that's much more likely in practice
to cause a failure to normalize than anything proposed here; I've
personally observed such things in the wild a few times already. Also,
note that mean_query_len is not used directly there -- I decided on
Max(ASSUMED_LENGTH_INIT, mean_query_len) instead. mean_query_len is
just for cases with very large query texts.

> * mean_query_len only gets recomputed in entry_dealloc(), which is only
> run if we exceed pgss_max, and gc_qtexts(), which is only run if we decide
> the query texts file is more than 50% bloat.  So there could be quite a
> long startup transient before the value gets off its initialization
> minimum, and I'm suspicious that there might be plausible use-cases where
> it never does.  So it's not so much "restrict to a multiple of the mean
> query len" as "restrict to some number that might once upon a time have
> had some relation to the mean query len, or maybe not".

ASSUMED_LENGTH_INIT * 5 is a pretty conservative lower bound, I'd say.
mean_query_len is only really used for cases where query texts are
much longer on average. So in order for that to be a problem, you'd
have to have what are, in an absolute sense, very large query texts. I
think I noticed no more than a handful of changes in the regression
tests, for example.
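
For illustration only, here is a minimal sketch of the threshold
described above: the mean is clamped from below with
Max(ASSUMED_LENGTH_INIT, mean_query_len) and then scaled by 5. The value
1024 for ASSUMED_LENGTH_INIT, the QUERY_LEN_FILTER_MULT name, and both
function names are assumptions made for this example, not excerpts from
the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed value for the initial mean query length; illustrative only. */
#define ASSUMED_LENGTH_INIT 1024

/* Hypothetical name for the "* 5" multiplier mentioned in this message. */
#define QUERY_LEN_FILTER_MULT 5

#define Max(x, y) ((x) > (y) ? (x) : (y))

/*
 * Size threshold for the filter: a multiple of the mean query length,
 * but never less than the same multiple of ASSUMED_LENGTH_INIT.
 */
size_t
query_len_threshold(size_t mean_query_len)
{
    return QUERY_LEN_FILTER_MULT *
        Max((size_t) ASSUMED_LENGTH_INIT, mean_query_len);
}

/* True if a query text of query_len bytes is within the threshold. */
bool
text_passes_filter(size_t query_len, size_t mean_query_len)
{
    return query_len <= query_len_threshold(mean_query_len);
}
```

Under these assumptions, a small or not-yet-recomputed mean_query_len
leaves the threshold at its floor of ASSUMED_LENGTH_INIT * 5;
mean_query_len only takes over once it exceeds ASSUMED_LENGTH_INIT.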

> * One could expect that after changing mean_query_len, the population of
> query texts would change character as a result of the filter behavior
> changing, so that convergence to stable behavior over the long haul is
> not exactly self-evident.

FWIW, I think that there is a feedback loop today, and that in problem
cases that was what allowed it to get out of hand.

> * As you've got it here, entry_dealloc() and gc_qtexts() don't compute
> mean_query_len the same way, because only one of them discriminates
> against sticky entries.  So the value would bounce about rather randomly
> based on which one had run last.

entry_dealloc() will naturally run far more frequently than
gc_qtexts(). That said, it would be better if they matched.
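
A rough sketch of why the mismatch matters, using one shared helper that
both call sites could use. The DemoEntry struct, the function name, the
"calls == 0 means sticky" test, and the fallback to ASSUMED_LENGTH_INIT
(assumed here to be 1024) are all assumptions made for this example, not
the actual pg_stat_statements code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed value for the initial mean query length; illustrative only. */
#define ASSUMED_LENGTH_INIT 1024

/* Hypothetical, simplified stand-in for a hash table entry. */
typedef struct DemoEntry
{
    long    calls;      /* assumed: 0 marks a sticky (never executed) entry */
    int     query_len;  /* length of the query text in bytes */
} DemoEntry;

/*
 * Compute the mean query text length over the entries. When skip_sticky
 * is true, sticky entries are left out of the average. Calling this one
 * helper with the same flag from both places keeps the results aligned.
 */
size_t
compute_mean_query_len(const DemoEntry *entries, int n_entries,
                       bool skip_sticky)
{
    size_t  totlen = 0;
    int     ncounted = 0;

    assert(n_entries >= 0);

    for (int i = 0; i < n_entries; i++)
    {
        if (skip_sticky && entries[i].calls == 0)
            continue;
        totlen += (size_t) entries[i].query_len;
        ncounted++;
    }

    /* Nothing to average: fall back to the assumed initial value. */
    if (ncounted == 0)
        return ASSUMED_LENGTH_INIT;

    return totlen / (size_t) ncounted;
}
```

With two executed entries of 100 and 300 bytes plus one sticky entry of
9900 bytes, the helper gives 200 when sticky entries are skipped and 3433
when they are counted, which is the kind of bounce between callers that
was objected to.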

> * I'm not exactly convinced that sticky entries should be ignored for
> this purpose anyway.

I think that data integration transactions that fail repeatedly are
strongly implicated here in practice. That is the motivation for this,
and also for the query size filter that you may take issue with as well.

> Taking a step back, ISTM the real issue you're fighting here is lots of
> orphaned sticky entries, but the patch doesn't do anything directly to fix
> that problem.  I wonder if we should do something like making
> entry_dealloc() and/or gc_qtexts() aggressively remove sticky entries,
> or at least those with "large" texts.

Sticky entries are (almost by definition) always aggressively removed.
The only direct way to be more aggressive that might work better would
be to give certain ones a lower usage_count to begin with, and I
hesitate to do that.

> I think the aspects of this patch that are reasonably uncontroversial are
> increasing the allowed malloc attempt size in gc_qtexts, flushing the
> query text file on malloc failure, fixing the missing cleanup steps after
> a gc failure, and making entry_dealloc's recomputation of mean_query_len
> sane (which I'll define for the moment as "the same as gc_qtexts would
> get").  Since we're hard against a release deadline, I propose to commit
> just those changes, and we can consider the idea of a query size filter
> and/or redefining mean_query_len at leisure.

I'm not clear on what you actually propose to do to "make
entry_dealloc's recomputation of mean_query_len sane". Based on your
separate remarks about entry_dealloc and its extra discrimination
against sticky entries when calculating mean query length, I think you
are talking about something distinct from what I've proposed. I can't
decide exactly what you mean, though: neither entry_dealloc nor
gc_qtexts cares about orphaned query texts in my patch (or in master).
Please clarify.

I'd be quite happy if you did everything listed, and also left the
extra discrimination against sticky entries within entry_dealloc in --
consider what happens when a huge malloc() ends up swapping with an
exclusive lock held, and consider that repeated, failed data
integration transactions are implicated in this in a big way when a
problem appears in the wild. A big part of the problem here was that
garbage collection did not run often enough.

In other words, I'd be fine with *not* doing the query size filter
thing for now, since that is something that seems like an extra
defense and not core to the problem. I was kind of ambivalent about
doing that part myself, actually.

-- 
Peter Geoghegan


