Re: Multivariate MCV stats can leak data to unprivileged users - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Multivariate MCV stats can leak data to unprivileged users
Date
Msg-id 20190613173745.pjmvxq7x4ceoosck@development
Whole thread Raw
In response to Re: Multivariate MCV stats can leak data to unprivileged users  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: Multivariate MCV stats can leak data to unprivileged users  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On Mon, Jun 10, 2019 at 02:32:04PM +0100, Dean Rasheed wrote:
>On Thu, 6 Jun 2019 at 21:33, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> Hi,
>>
>> Attached are three patches tweaking the stats - two were already posted
>> in this thread, the third one is just updating docs.
>>
>> 1) 0001 - split pg_statistic_ext into definition + data
>>
>> This is pretty much the patch Dean posted some time ago, rebased to
>> current master (fixing just minor pgindent bitrot).
>>
>> 2) 0002 - update sgml docs to reflect changes from 0001
>>
>> 3) 0003 - define pg_stats_ext view, similar to pg_stats
>>
>
>Seems reasonable on a quick read-through, except I spotted a bug in
>the view (my fault) -- the statistics_owner column should come from
>s.stxowner rather than c.relowner.
>
>
>> The question is whether we want to also redesign pg_statistic_ext_data
>> per Tom's proposal (more about that later), but I think we can treat
>> that as an additional step on top of 0001. So I propose we get those
>> changes committed, and then perhaps also switch the data table to the
>> EAV model.
>>
>> Barring objections, I'll do that early next week, after cleaning up
>> those patches a bit more.
>>
>> One thing I think we should fix is naming of the attributes in the 0001
>> patch. At the moment both catalogs use "stx" prefix - e.g. "stxkind" is
>> in pg_statistic_ext, and "stxmcv" is in pg_statistic_ext_data. We should
>> probably switch to "stxd" in the _data catalog. Opinions?
>>
>
>Yes, that makes sense. Especially when joining the 2 tables, since it
>makes it more obvious which table a given column is coming from in a
>join clause.
>

OK, attached are patches fixing the issues reported by you and John
Naylor, and squashing the parts into just two patches (catalog split and
pg_stats_ext). Barring objections, I'll push those tomorrow.

I've renamed columns in the _data catalog from 'stx' to 'stxd', which I
think is appropriate given the "data" in catalog name.

I'm wondering if we should change the examples in SGML docs (say, in
planstats.sgml) to use the new pg_stats_ext view, instead of querying the
catalogs directly. I've tried doing that, but I found the results less
readable than what we currently have (especially for the MCV list, where
it'd require matching elements in multiple arrays). So I've left this
unchanged for now.

>
>> Now, back to the proposal to split the _data catalog rows to EAV form,
>> with a new data type replacing the multiple types we have at the moment.
>> I've started hacking on it today, but the more I work on it the less
>> useful it seems to me.
>>
>> My understanding is that with that approach we'd replace the _data
>> catalog (which currently has one column per statistic type, with a
>> separate data type) with 1:M generic rows, with a generic data type.
>> That is, we'd replace this
>>
>>     CREATE TABLE pg_statistic_ext_data (
>>         stxoid OID,
>>         stxdependencies pg_dependencies,
>>         stxndistinct pg_ndistinct,
>>         stxmcv pg_mcv_list,
>>         ... histograms ...
>>     );
>>
>> with something like this:
>>
>>     CREATE TABLE pg_statistiex_ext_data (
>>         stxoid OID,
>>         stxkind CHAR,
>>         stxdata pg_statistic_ext_type
>>     );
>>
>> where pg_statistic_ext would store all existing statistic types. along
>> with a "flag" saying which value it actually stored (essentially a copy
>> of the stxkind column, which we however need to lookup a statistic of a
>> certain type, without having to detoast the statistic itself).
>>
>> As I mentioned before, I kinda dislike the fact that this obfuscates the
>> actual statistic type by hiding it behing the "wrapper" type.
>>
>> The other thing is that we have to deal with 1:M relationship every time
>> we (re)build the statistics, or when we need to access them. Now, it may
>> not be a huge amount of code, but it just seems unnecessary. It would
>> make sense if we planned to add large number of additional statistic
>> types, but that seems unlikely - I personally can think of maybe one new
>> statistic type, but that's about it.
>>
>> I'll continue working on it and I'll share the results early next week,
>> after playing with it a bit, but I think we should get the existing
>> patches committed and then continue discussing this as an additional
>> improvement.
>>
>
>I wonder ... would it be completely crazy to just use a JSON column to
>store the extended stats data?
>
>It wouldn't be as compact as your representation, but it would allow
>for future stats kinds without changing the catalog definitions, and
>it wouldn't obfuscate the stats types. You could keep the 1:1
>relationship, and have top-level JSON keys for each stats kind built,
>and you wouldn't need the pg_mcv_list_items() function because you
>could just put the MCV data in JSON arrays, which would be much more
>transparent, and would make the user-accessible view much simpler. One
>could also imagine writing regression tests that checked for specific
>expected MCV values like "stxdata->'mcv'->'frequency'->0".
>

You mean storing it as JSONB, I presume?

I've actually considered that at some point, but eventually concluded it's
not a good match. I mean, JSON(B) is pretty versatile and can be whacked
to store pretty much anything, but it has various limitations - e.g. it
does not support arbitrary data types, so we'd have to store a lot of
stuff as text (through input/output functions). That doesn't seem very
nice, I guess.

If we want JSONB output, that should not be difficult to generate. But I
guess your point was about generic storage format.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: upgrades in row-level locks can deadlock
Next
From: Tom Lane
Date:
Subject: Re: PG 11 JIT deform failure