Re: Improve handling of pg_stat_statements handling of bind "IN" variables - Mailing list pgsql-hackers

From Dmitry Dolgov
Subject Re: Improve handling of pg_stat_statements handling of bind "IN" variables
Date
Msg-id CA+q6zcU3wKOyZEP2txLVML6ucyFgAuGdUe4rWLh_mH9+NiBn=A@mail.gmail.com
Whole thread Raw
In response to Re: Improve handling of pg_stat_statements handling of bind "IN" variables  (Pavel Trukhanov <pavel.trukhanov@gmail.com>)
Responses Re: Improve handling of pg_stat_statements handling of bind "IN" variables
List pgsql-hackers
> On Thu, Oct 3, 2019 at 3:33 AM Pavel Trukhanov <pavel.trukhanov@gmail.com> wrote:
>
>> On Wed, Jun 26, 2019 at 11:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Greg Stark <stark@mit.edu> writes:
>> > Actually thinking about this for two more seconds the question is what it
>> > would do with a query like
>> > WHERE col = ANY '1,2,3'::integer[]
>> > Or
>> > WHERE col = ANY ARRAY[1,2,3]
>> > Whichever the language binding that is failing to do parameterized queries
>> > is doing to emulate them.
>>
>> Yeah, one interesting question is whether this is actually modeling
>> what happens with real-world applications --- are they sending Consts,
>> or Params?
>>
>> I resolutely dislike the idea of marking arrays that came from IN
>> differently from other ones; that's just a hack and it's going to give
>> rise to unexplainable behavioral differences for logically-equivalent
>> queries.
>
> Thanks for your input.
>
> As for real-world applications – being a founder of a server monitoring saas
> (okmeter) I have access to stats on hundreds of postgres installations.
>
> It shows that IN with a variable number of params is ~7 times more used than
> ANY(array).

Hi,

I would like to do some archaeology and inquire about this thread, since
unfortunately there was no patch presented as far as I see.

IIUC the ideas suggested in this thread are evolving mostly about modifying
parser:

> On Fri, Jun 14, 2019 at 2:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I do not think you need new expression infrastructure. IMO what's going on
> here is that we're indulging in premature optimization in the parser. It
> would be better from a structural standpoint if the output of parse analysis
> were closer to what the user wrote, and then the business of separating Vars
> from Consts and reducing the Consts to an array were handled in the planner's
> expression preprocessing phase.
>
> So maybe what you should be thinking about is a preliminary patch that's
> mostly in the nature of refactoring, to move that processing to where it
> should be.
>
> Of course, life is never quite that simple; there are at least two
> issues you'd have to think about.
>
> * The parsing phase is responsible for determining the semantics of
> the query, in particular resolving the data types of the IN-list items
> and choosing the comparison operators that will be used.  The planner
> is not allowed to rethink that.  What I'm not clear about offhand is
> whether the existing coding in parse analysis might lead to different
> choices of data types/operators than a more straightforward approach
> does.  If so, we'd have to decide whether that's okay.
>
> * Postponing this work might make things slower overall, which wouldn't
> matter too much for short IN-lists, but you can bet that people who
> throw ten-thousand-entry IN-lists at us will notice.  So you'd need to
> keep an eye on efficiency and make sure you don't end up repeating
> similar processing over and over.

This puzzles me, since the original issue sounds like a "representation"
problem, when we want to calculate jumble hash in a way that obviously
repeating parameters or constants are hashed into one value. I see the point in
ideas like this:

>> One idea that comes to me after looking at the code involved is that
>> it might be an improvement across-the-board if transformAExprIn were to
>> reduce the generated ArrayExpr to an array Const immediately, in cases
>> where all the inputs are Consts.  That is going to happen anyway come
>> plan time, so it'd have zero impact on semantics or query performance.
>> Doing it earlier would cost nothing, and could even be a net win, by
>> reducing per-parse-node overhead in places like the rewriter.
>>
>> The advantage for the problem at hand is that a Const that's an array
>> of 2 elements is going to look the same as a Const that's any other
>> number of elements so far as pg_stat_statements is concerned.
>>
>> This doesn't help of course in cases where the values aren't all
>> Consts.  Since we eliminated Vars already, the main practical case
>> would be that they're Params, leading us back to the previous
>> question of whether apps are binding queries with different numbers
>> of parameter markers in an IN, and how hard pg_stat_statements should
>> try to fuzz that if they are.
>>
>> Then, to Greg's point, there's a question of whether transformArrayExpr
>> should do likewise, ie try to produce an array Const immediately.
>> I'm a bit less excited about that, but consistency suggests that
>> we should have it act the same as the IN case.

Interestingly enough, something similar was already mentioned in [1]. But no
one jumped into this, probably due to its relative complexity, lack of personal
time resources or not clear way to handle Params (I'm actually not sure about
the statistics for Consts vs Params myself and need to check this, but can
easily imagine both could be an often problem).

Another idea also was mentioned in [1]:

> I wonder whether we could improve this by arranging things so that both
> Consts and Params contribute zero to the jumble hash, and a list of these
> things also contributes zero, regardless of the length of the list.

Taking everything into account, is there anything particularly wrong about
approach of squashing down lists of constants/parameters in pg_stat_statements
itself? This sounds simpler, and judging from my experiments even preventing
jumbling of ArrayExpr and rte values constants of the same type with a position
index above some threshold will already help a lot in many cases that I
observe.

[1]:
https://www.postgresql.org/message-id/flat/CAM3SWZSpdPB3uErnXWMt3q74y0r%2B84ZsOt2U3HKKes_V7O%2B0Qg%40mail.gmail.com



pgsql-hackers by date:

Previous
From: David Steele
Date:
Subject: Re: OpenSSL randomness seeding
Next
From: Tom Lane
Date:
Subject: Re: pg_subscription.subslotname is wrongly marked NOT NULL