Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr) - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)
Date
Msg-id CAM3SWZSpdPB3uErnXWMt3q74y0r+84ZsOt2U3HKKes_V7O+0Qg@mail.gmail.com
Whole thread Raw
Responses Re: Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)  (Greg Stark <stark@mit.edu>)
Re: Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)  (Lukas Fittl <lukas@fittl.com>)
List pgsql-hackers
On Tue, Dec 10, 2013 at 1:30 AM, Peter Geoghegan <pg@heroku.com> wrote:
> pg_stat_statements' fingerprinting logic considers the following two
> statements as distinct:
>
> select 1 in (1, 2, 3);
> select 1 in (1, 2, 3, 4);
>
> This is because the ArrayExpr jumble case jumbles any ArrayExpr's list
> of elements recursively. In this case it's a list of Const nodes, and
> the fingerprinting logic jumbles those nodes indifferently.
>
> Somebody told me that they think that pg_stat_statements should not do
> that. This person felt that it would be preferable for such
> expressions to be normalized without regard to the number of distinct
> Const elements. I suppose that that would work by determing if the
> ArrayExpr elements list was a list of Const nodes and only const
> nodes. Iff that turned out to be the case, something else would be
> jumbled (something other than the list) that would essentially be a
> representation of "some list of zero or more (or maybe one or more)
> Const nodes with consttype of, in this example, 23". I think that this
> would make at least one person happy, because of course the two
> statements above would have their costs aggregated within a single
> pg_stat_statements entry.

Baron Schwartz recently gave a talk at PGConf Silicon Valley about the
proprietary query instrumentation tool, VividCortex. The slides are
available from:


http://info.citusdata.com/rs/235-CNE-301/images/Analyzing_PostgreSQL_Network_Traffic_with_vc-pgsql-sniffer_-_Baron_Schwartz.pdf

One specific justification he gave for not using pg_stat_statements was:

"Doesn’t merge bind vars in IN()" (See slide #11)

His theory is that you should allow a proprietary binary to run with
root permissions, a binary that sniffs the wire protocol, mostly
because pg_stat_statements has this limitation (all other
pg_stat_statements limitations listed are pretty unconvincing, IMV).
That doesn't seem like a great plan to me, but I think he has a point
about pg_stat_statements. It's about time that we fixed this -- it
isn't realistic to imagine that people are going to know to use an
array constant like "= any ('{1,2,3}')" -- even a major contributor to
Django that I talked to about this issue a couple of years ago didn't
know about that. It also isn't portable across database systems.

I wonder:

* How do other people feel about this? Personally, I've seen enough
problems of this kind in the field that "slippery slope" arguments
against this don't seem very compelling.

* How might altering the jumbling logic to make it recognize a
variable number of constants as equivalent work in practice? Perhaps
we should do something to flatten the representation based on which
powers of two the number of constants is between. There are still some
details to work out there, but that's my first idea. That seems like a
good compromise between the current behavior, and completely
disregarding the number of constants.

--
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Magnus Hagander
Date:
Subject: Re: New email address
Next
From: Amit Langote
Date:
Subject: Re: [PROPOSAL] VACUUM Progress Checker.