Thread: Multivariate MCV stats can leak data to unprivileged users

Multivariate MCV stats can leak data to unprivileged users

From
Dean Rasheed
Date:
While working on 1aebfbea83c, I noticed that the new multivariate MCV
stats feature suffers from the same problem, and also the original
problems that were fixed in e2d4ef8de8 and earlier --- namely that a
user can see values in the MCV lists that they shouldn't see (values
from tables that they don't have privileges on).

I think there are 2 separate issues here:

1). The table pg_statistic_ext is accessible to anyone, so any user
can see the MCV lists of any table. I think we should give this the
same treatment as pg_statistic, and hide it behind a security barrier
view, revoking public access from the table.

2). The multivariate MCV stats planner code can be made to invoke
user-defined operators, so a user can create a leaky operator and use
it to reveal data values from the MCV lists even if they have no
permissions on the table.

Attached is a draft patch to fix (2), which hooks into
statext_is_compatible_clause().

I haven't thought much about (1). There are some questions about what
exactly the view should look like. Probably it should translate table
oids to names, like pg_stats does, but should it also translate column
indexes to names? That could get fiddly. Should it unpack MCV items?

I'll raise this as an open item for PG12.

Regards,
Dean

Attachment

Re: Multivariate MCV stats can leak data to unprivileged users

From
Tomas Vondra
Date:
On Fri, May 10, 2019 at 10:19:44AM +0100, Dean Rasheed wrote:
>While working on 1aebfbea83c, I noticed that the new multivariate MCV
>stats feature suffers from the same problem, and also the original
>problems that were fixed in e2d4ef8de8 and earlier --- namely that a
>user can see values in the MCV lists that they shouldn't see (values
>from tables that they don't have privileges on).
>
>I think there are 2 separate issues here:
>
>1). The table pg_statistic_ext is accessible to anyone, so any user
>can see the MCV lists of any table. I think we should give this the
>same treatment as pg_statistic, and hide it behind a security barrier
>view, revoking public access from the table.
>
>2). The multivariate MCV stats planner code can be made to invoke
>user-defined operators, so a user can create a leaky operator and use
>it to reveal data values from the MCV lists even if they have no
>permissions on the table.
>
>Attached is a draft patch to fix (2), which hooks into
>statext_is_compatible_clause().
>

I think that patch is good.

>I haven't thought much about (1). There are some questions about what
>exactly the view should look like. Probably it should translate table
>oids to names, like pg_stats does, but should it also translate column
>indexes to names? That could get fiddly. Should it unpack MCV items?
>

Yeah. I suggest we add a simple pg_stats_ext view, similar to pg_stats.
It would:

(1) translate the schema / relation / attribute names

  I don't see why translating column indexes to names would be fiddly.
  It's a matter of simple unnest + join, no? Or what issues you see?

(2) include values for ndistinct / dependencies, if built

  Those don't include any actual values, so this should be OK. You might
  make the argument that even this does leak a bit of information (e.g.
  when you can see values in one column, and you know there's a strong
  functional dependence, it tells you something about the other column
  which you may not see). But I think we kinda already leak information
  about that through estimates, so maybe that's not an issue.

(3) include MCV list only when user has access to all columns

  Essentially, if the user is missing 'select' privilege on at least one
  of the columns, there'll be NULL. Otherwise the MCV value.

The attached patch adds pg_stats_ext doing this. I don't claim it's the
best possible query backing the view, so any improvements are welcome.


I've been thinking we might somehow filter the statistics values, e.g. by
not showing values for attributes the user has no 'select' privilege on,
but that seems like a can of worms. It'd lead to MCV items that can't be
distinguished because the only difference was the removed attribute, and
so on. So I propose we simply show/hide the whole MCV list.

Likewise, I don't think it makes sense to expand the MCV list in this
view - that works for the single-dimensional case, because then the
list is expanded into two arrays (values + frequencies), which are easy
to process further. But for multivariate MCV lists that's much more
complex - we don't know how many attributes are there, for example.

So I suggest we just show the pg_mcv_list value as is, and leave it up
to the user to call the pg_mcv_list_items() function if needed.

This will also work for histograms, where expanding the value in the
pg_stats_ext would be even trickier.


-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Multivariate MCV stats can leak data to unprivileged users

From
Dean Rasheed
Date:
On Mon, 13 May 2019 at 23:36, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Yeah. I suggest we add a simple pg_stats_ext view, similar to pg_stats.
> It would:
>
> (1) translate the schema / relation / attribute names
>
>   I don't see why translating column indexes to names would be fiddly.
>   It's a matter of simple unnest + join, no? Or what issues you see?
>

Yeah, you're right. I thought it would be harder than that. One minor
thing -- I think we should have an explicit ORDER BY where we collect
the column names, to guarantee that they're listed in the right order.

> (2) include values for ndistinct / dependencies, if built
>
>   Those don't include any actual values, so this should be OK. You might
>   make the argument that even this does leak a bit of information (e.g.
>   when you can see values in one column, and you know there's a strong
>   functional dependence, it tells you something about the other column
>   which you may not see). But I think we kinda already leak information
>   about that through estimates, so maybe that's not an issue.
>

Hmm. For normal statistics, if the user has no permissions on the
table, they don't get to see any of these kinds of statistics, not
even things like n_distinct. I think we should do the same here --
i.e., if the user has no permissions on the table, don't let them see
anything. Such a user will not be able to run EXPLAIN on queries
against the table, so they won't get to see any estimates, and I don't
think they should get to see any extended statistics either.

> (3) include MCV list only when user has access to all columns
>
>   Essentially, if the user is missing 'select' privilege on at least one
>   of the columns, there'll be NULL. Otherwise the MCV value.
>

OK, that seems reasonable, except as I said above, I think that should
apply to all statistics, not just the MCV lists.

> The attached patch adds pg_stats_ext doing this. I don't claim it's the
> best possible query backing the view, so any improvements are welcome.
>
>
> I've been thinking we might somehow filter the statistics values, e.g. by
> not showing values for attributes the user has no 'select' privilege on,
> but that seems like a can of worms. It'd lead to MCV items that can't be
> distinguished because the only difference was the removed attribute, and
> so on. So I propose we simply show/hide the whole MCV list.
>

Agreed.

> Likewise, I don't think it makes sense to expand the MCV list in this
> view - that works for the single-dimensional case, because then the
> list is expanded into two arrays (values + frequencies), which are easy
> to process further. But for multivariate MCV lists that's much more
> complex - we don't know how many attributes are there, for example.
>
> So I suggest we just show the pg_mcv_list value as is, and leave it up
> to the user to call the pg_mcv_list_items() function if needed.
>

I think expanding the MCV lists is actually quite useful because then
you can see arrays of values, nulls, frequencies and base frequencies
in a reasonably readable form (it certainly looks better than a binary
dump), without needing to join to a function call, which is a bit
ugly, and unmemorable.

The results from the attached look quite reasonable at first glance.
It contains a few other changes as well:

1). It exposes the schema, name and owner of the statistics object as
well via the view, for completeness.

2). It changes a few column names -- traditionally these views strip
off the column name prefix from the underlying table, so I've
attempted to be consistent with other similar views.

3). I added array-valued columns for each of the MCV list components,
which makes it more like pg_stats.

4). I moved all the permission checks to the top-level WHERE clause,
so a user needs to have select permissions on all the columns
mentioned by the statistics, and the table mustn't have RLS in effect,
otherwise the user won't see the row for that statistics object.

5). Some columns from pg_statistic_ext have to be made visible for
psql \d to work. Basically, it needs to be able to query for the
existence of extended statistics, but it doesn't need to see the
actual statistical data. Of course, we could change psql to use the
view, but this way gives us better backwards compatibility with older
clients.

This is still going to break compatibility of any user code looking at
stxndistinct or stxdependencies from pg_statistic_ext, but at least it
doesn't break old versions of psql.

Note: doc and test updates still to do.

Regards,
Dean

Attachment

Re: Multivariate MCV stats can leak data to unprivileged users

From
Tomas Vondra
Date:
On Thu, May 16, 2019 at 02:28:03PM +0100, Dean Rasheed wrote:
>On Mon, 13 May 2019 at 23:36, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> Yeah. I suggest we add a simple pg_stats_ext view, similar to pg_stats.
>> It would:
>>
>> (1) translate the schema / relation / attribute names
>>
>>   I don't see why translating column indexes to names would be fiddly.
>>   It's a matter of simple unnest + join, no? Or what issues you see?
>>
>
>Yeah, you're right. I thought it would be harder than that. One minor
>thing -- I think we should have an explicit ORDER BY where we collect
>the column names, to guarantee that they're listed in the right order.
>

Good point.

>> (2) include values for ndistinct / dependencies, if built
>>
>>   Those don't include any actual values, so this should be OK. You might
>>   make the argument that even this does leak a bit of information (e.g.
>>   when you can see values in one column, and you know there's a strong
>>   functional dependence, it tells you something about the other column
>>   which you may not see). But I think we kinda already leak information
>>   about that through estimates, so maybe that's not an issue.
>>
>
>Hmm. For normal statistics, if the user has no permissions on the
>table, they don't get to see any of these kinds of statistics, not
>even things like n_distinct. I think we should do the same here --
>i.e., if the user has no permissions on the table, don't let them see
>anything. Such a user will not be able to run EXPLAIN on queries
>against the table, so they won't get to see any estimates, and I don't
>think they should get to see any extended statistics either.
>

OK, I haven't realized we don't show that even for normal stats.

>> (3) include MCV list only when user has access to all columns
>>
>>   Essentially, if the user is missing 'select' privilege on at least one
>>   of the columns, there'll be NULL. Otherwise the MCV value.
>>
>
>OK, that seems reasonable, except as I said above, I think that should
>apply to all statistics, not just the MCV lists.
>
>> The attached patch adds pg_stats_ext doing this. I don't claim it's the
>> best possible query backing the view, so any improvements are welcome.
>>
>>
>> I've been thinking we might somehow filter the statistics values, e.g. by
>> not showing values for attributes the user has no 'select' privilege on,
>> but that seems like a can of worms. It'd lead to MCV items that can't be
>> distinguished because the only difference was the removed attribute, and
>> so on. So I propose we simply show/hide the whole MCV list.
>>
>
>Agreed.
>
>> Likewise, I don't think it makes sense to expand the MCV list in this
>> view - that works for the single-dimensional case, because then the
>> list is expanded into two arrays (values + frequencies), which are easy
>> to process further. But for multivariate MCV lists that's much more
>> complex - we don't know how many attributes are there, for example.
>>
>> So I suggest we just show the pg_mcv_list value as is, and leave it up
>> to the user to call the pg_mcv_list_items() function if needed.
>>
>
>I think expanding the MCV lists is actually quite useful because then
>you can see arrays of values, nulls, frequencies and base frequencies
>in a reasonably readable form (it certainly looks better than a binary
>dump), without needing to join to a function call, which is a bit
>ugly, and unmemorable.
>

Hmmm, ok. I think my main worry here is that it may or may not work for
more complex types of extended stats that are likely to come in the
future. Although, maybe it can be made work even for that.

>The results from the attached look quite reasonable at first glance.
>It contains a few other changes as well:
>
>1). It exposes the schema, name and owner of the statistics object as
>well via the view, for completeness.
>
>2). It changes a few column names -- traditionally these views strip
>off the column name prefix from the underlying table, so I've
>attempted to be consistent with other similar views.
>
>3). I added array-valued columns for each of the MCV list components,
>which makes it more like pg_stats.
>
>4). I moved all the permission checks to the top-level WHERE clause,
>so a user needs to have select permissions on all the columns
>mentioned by the statistics, and the table mustn't have RLS in effect,
>otherwise the user won't see the row for that statistics object.
>
>5). Some columns from pg_statistic_ext have to be made visible for
>psql \d to work. Basically, it needs to be able to query for the
>existence of extended statistics, but it doesn't need to see the
>actual statistical data. Of course, we could change psql to use the
>view, but this way gives us better backwards compatibility with older
>clients.
>
>This is still going to break compatibility of any user code looking at
>stxndistinct or stxdependencies from pg_statistic_ext, but at least it
>doesn't break old versions of psql.
>
>Note: doc and test updates still to do.
>

Thanks. I'm travelling today/tomorrow, but I'll do my best to fill in the
missing bits ASAP.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Multivariate MCV stats can leak data to unprivileged users

From
Andres Freund
Date:
Hi,

On 2019-05-16 14:28:03 +0100, Dean Rasheed wrote:
> 5). Some columns from pg_statistic_ext have to be made visible for
> psql \d to work. Basically, it needs to be able to query for the
> existence of extended statistics, but it doesn't need to see the
> actual statistical data. Of course, we could change psql to use the
> view, but this way gives us better backwards compatibility with older
> clients.
> 
> This is still going to break compatibility of any user code looking at
> stxndistinct or stxdependencies from pg_statistic_ext, but at least it
> doesn't break old versions of psql.

Hm, it's not normally a goal to keep old psql working against new
postgres versions. And there's plenty other issues preventing a v11 psql
to work against 12. I'd not let this guide any design decisions.


Greetings,

Andres Freund



Re: Multivariate MCV stats can leak data to unprivileged users

From
Dean Rasheed
Date:
On Fri, 17 May 2019 at 21:29, Andres Freund <andres@anarazel.de> wrote:
>
> On 2019-05-16 14:28:03 +0100, Dean Rasheed wrote:
> > 5). Some columns from pg_statistic_ext have to be made visible for
> > psql \d to work. Basically, it needs to be able to query for the
> > existence of extended statistics, but it doesn't need to see the
> > actual statistical data. Of course, we could change psql to use the
> > view, but this way gives us better backwards compatibility with older
> > clients.
> >
> > This is still going to break compatibility of any user code looking at
> > stxndistinct or stxdependencies from pg_statistic_ext, but at least it
> > doesn't break old versions of psql.
>
> Hm, it's not normally a goal to keep old psql working against new
> postgres versions. And there's plenty other issues preventing a v11 psql
> to work against 12. I'd not let this guide any design decisions.
>

Ah good point. In fact running "\d some_table" from v11's psql against
a v12 database immediately falls over because of the removal of
relhasoids from pg_class, so this isn't a valid reason for retaining
access to any columns from pg_statistic_ext.

Regards,
Dean



Re: Multivariate MCV stats can leak data to unprivileged users

From
Dean Rasheed
Date:
On Sat, 18 May 2019 at 10:11, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>
> On Fri, 17 May 2019 at 21:29, Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2019-05-16 14:28:03 +0100, Dean Rasheed wrote:
> > > 5). Some columns from pg_statistic_ext have to be made visible for
> > > psql \d to work. Basically, it needs to be able to query for the
> > > existence of extended statistics, but it doesn't need to see the
> > > actual statistical data. Of course, we could change psql to use the
> > > view, but this way gives us better backwards compatibility with older
> > > clients.
> > >
> > > This is still going to break compatibility of any user code looking at
> > > stxndistinct or stxdependencies from pg_statistic_ext, but at least it
> > > doesn't break old versions of psql.
> >
> > Hm, it's not normally a goal to keep old psql working against new
> > postgres versions. And there's plenty other issues preventing a v11 psql
> > to work against 12. I'd not let this guide any design decisions.
> >
>
> Ah good point. In fact running "\d some_table" from v11's psql against
> a v12 database immediately falls over because of the removal of
> relhasoids from pg_class, so this isn't a valid reason for retaining
> access to any columns from pg_statistic_ext.
>

On the other hand, pg_dump relies on pg_statistic_ext to work out
which extended statistics objects to dump. If we were to change that
to use pg_stats_ext, then a user dumping a table with RLS using the
--enable-row-security flag wouldn't get any extended statistics
objects, which would be a somewhat surprising result.

That could be fixed by changing the view to return rows for every
extended statistics object, nulling out values in columns that the
user doesn't have permission to see, in a similar way to Tomas'
original patch. It would have to be modified to do the RLS check in
the same place as the privilege checks, rather than in the top-level
WHERE clause, and we'd probably also have to expose OIDs in addition
to names, because that's what clients like psql and pg_dump want. To
me, that feels quite messy though, so I think I'd still vote for
leaving the first few columns of pg_statistic_ext accessible to
public, and not have to change the clients to work differently from
v12 onwards.

Regards,
Dean



Re: Multivariate MCV stats can leak data to unprivileged users

From
Tom Lane
Date:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> On the other hand, pg_dump relies on pg_statistic_ext to work out
> which extended statistics objects to dump. If we were to change that
> to use pg_stats_ext, then a user dumping a table with RLS using the
> --enable-row-security flag wouldn't get any extended statistics
> objects, which would be a somewhat surprising result.

It seems like what we need here is to have a separation between the
*definition* of a stats object (which is what pg_dump needs access
to) and the current actual *data* in it.  I'd have expected that
keeping those in separate catalogs would be the thing to do, though
perhaps it's too late for that.

            regards, tom lane



Re: Multivariate MCV stats can leak data to unprivileged users

From
Dean Rasheed
Date:
On Sat, 18 May 2019 at 16:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> > On the other hand, pg_dump relies on pg_statistic_ext to work out
> > which extended statistics objects to dump. If we were to change that
> > to use pg_stats_ext, then a user dumping a table with RLS using the
> > --enable-row-security flag wouldn't get any extended statistics
> > objects, which would be a somewhat surprising result.
>
> It seems like what we need here is to have a separation between the
> *definition* of a stats object (which is what pg_dump needs access
> to) and the current actual *data* in it.  I'd have expected that
> keeping those in separate catalogs would be the thing to do, though
> perhaps it's too late for that.
>

Yeah, with the benefit of hindsight, that would have made sense, but
that seems like a pretty big change to be attempting at this stage.

Regards,
Dean



Re: Multivariate MCV stats can leak data to unprivileged users

From
Andres Freund
Date:
Hi,

On May 18, 2019 8:43:29 AM PDT, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>On Sat, 18 May 2019 at 16:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Dean Rasheed <dean.a.rasheed@gmail.com> writes:
>> > On the other hand, pg_dump relies on pg_statistic_ext to work out
>> > which extended statistics objects to dump. If we were to change
>that
>> > to use pg_stats_ext, then a user dumping a table with RLS using the
>> > --enable-row-security flag wouldn't get any extended statistics
>> > objects, which would be a somewhat surprising result.
>>
>> It seems like what we need here is to have a separation between the
>> *definition* of a stats object (which is what pg_dump needs access
>> to) and the current actual *data* in it.  I'd have expected that
>> keeping those in separate catalogs would be the thing to do, though
>> perhaps it's too late for that.
>>
>
>Yeah, with the benefit of hindsight, that would have made sense, but
>that seems like a pretty big change to be attempting at this stage.

Otoh, having a suboptimal catalog representation that we'll likely have to change in one of the next releases also
isn'tgreat. Seems likely that we'll need post beta1 catversion bumps anyway? 

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Multivariate MCV stats can leak data to unprivileged users

From
Tomas Vondra
Date:
On Sat, May 18, 2019 at 11:49:06AM -0700, Andres Freund wrote:
>Hi,
>
>On May 18, 2019 8:43:29 AM PDT, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>>On Sat, 18 May 2019 at 16:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>
>>> Dean Rasheed <dean.a.rasheed@gmail.com> writes:
>>> > On the other hand, pg_dump relies on pg_statistic_ext to work out
>>> > which extended statistics objects to dump. If we were to change
>>that
>>> > to use pg_stats_ext, then a user dumping a table with RLS using the
>>> > --enable-row-security flag wouldn't get any extended statistics
>>> > objects, which would be a somewhat surprising result.
>>>
>>> It seems like what we need here is to have a separation between the
>>> *definition* of a stats object (which is what pg_dump needs access
>>> to) and the current actual *data* in it.  I'd have expected that
>>> keeping those in separate catalogs would be the thing to do, though
>>> perhaps it's too late for that.
>>>
>>
>>Yeah, with the benefit of hindsight, that would have made sense, but
>>that seems like a pretty big change to be attempting at this stage.

>
>Otoh, having a suboptimal catalog representation that we'll likely have
>to change in one of the next releases also isn't great. Seems likely
>that we'll need post beta1 catversion bumps anyway?
>

But that's not an issue intruduced by PG12, it works like that even for
the extended statistics introduced in PG10.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Multivariate MCV stats can leak data to unprivileged users

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Sat, May 18, 2019 at 11:49:06AM -0700, Andres Freund wrote:
>>> On Sat, 18 May 2019 at 16:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> It seems like what we need here is to have a separation between the
>>>> *definition* of a stats object (which is what pg_dump needs access
>>>> to) and the current actual *data* in it.

>> Otoh, having a suboptimal catalog representation that we'll likely have
>> to change in one of the next releases also isn't great. Seems likely
>> that we'll need post beta1 catversion bumps anyway?

> But that's not an issue intruduced by PG12, it works like that even for
> the extended statistics introduced in PG10.

Yeah, but no time like the present to fix it if it's wrong ...

            regards, tom lane



Re: Multivariate MCV stats can leak data to unprivileged users

From
Tomas Vondra
Date:
On Sat, May 18, 2019 at 03:45:11PM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On Sat, May 18, 2019 at 11:49:06AM -0700, Andres Freund wrote:
>>>> On Sat, 18 May 2019 at 16:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>>> It seems like what we need here is to have a separation between the
>>>>> *definition* of a stats object (which is what pg_dump needs access
>>>>> to) and the current actual *data* in it.
>
>>> Otoh, having a suboptimal catalog representation that we'll likely have
>>> to change in one of the next releases also isn't great. Seems likely
>>> that we'll need post beta1 catversion bumps anyway?
>
>> But that's not an issue intruduced by PG12, it works like that even for
>> the extended statistics introduced in PG10.
>
>Yeah, but no time like the present to fix it if it's wrong ...
>

Sorry, not sure I understand. Are you saying we should try to rework
this before the beta1 release, or that we don't have time to do that?

I think we have four options - rework it before beta1, rework it after
beta1, rework it in PG13 and leave it as it is now.

If the pg_dump thing si the only issue, maybe there's a simple solution
that reworking all the catalogs. Not sure. Are there any other reasons
why the current catalog representation would be suboptimal, or do we
have some precedent of a catalog split this way? I can't think of any.

regards 

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Multivariate MCV stats can leak data to unprivileged users

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Sat, May 18, 2019 at 03:45:11PM -0400, Tom Lane wrote:
>> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>>> But that's not an issue intruduced by PG12, it works like that even for
>>> the extended statistics introduced in PG10.

>> Yeah, but no time like the present to fix it if it's wrong ...

> Sorry, not sure I understand. Are you saying we should try to rework
> this before the beta1 release, or that we don't have time to do that?

> I think we have four options - rework it before beta1, rework it after
> beta1, rework it in PG13 and leave it as it is now.

Yup, that's about what the options are.  I'm just voting against
"change it in v13".  If we're going to change it, then the fewer
major versions that have the bogus definition the better --- and
since we're changing that catalog in v12 anyway, users will see
fewer distinct behaviors if we do this change too.

It's very possibly too late to get it done before beta1,
unfortunately.  But as Andres noted, post-beta1 catversion
bumps are hardly unusual, so I do not think "rework after
beta1" is unacceptable.

            regards, tom lane



Re: Multivariate MCV stats can leak data to unprivileged users

From
Stephen Frost
Date:
Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> > On Sat, May 18, 2019 at 03:45:11PM -0400, Tom Lane wrote:
> >> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> >>> But that's not an issue intruduced by PG12, it works like that even for
> >>> the extended statistics introduced in PG10.
>
> >> Yeah, but no time like the present to fix it if it's wrong ...
>
> > Sorry, not sure I understand. Are you saying we should try to rework
> > this before the beta1 release, or that we don't have time to do that?
>
> > I think we have four options - rework it before beta1, rework it after
> > beta1, rework it in PG13 and leave it as it is now.
>
> Yup, that's about what the options are.  I'm just voting against
> "change it in v13".  If we're going to change it, then the fewer
> major versions that have the bogus definition the better --- and
> since we're changing that catalog in v12 anyway, users will see
> fewer distinct behaviors if we do this change too.
>
> It's very possibly too late to get it done before beta1,
> unfortunately.  But as Andres noted, post-beta1 catversion
> bumps are hardly unusual, so I do not think "rework after
> beta1" is unacceptable.

Agreed.

Thanks,

Stephen

Attachment

Re: Multivariate MCV stats can leak data to unprivileged users

From
Dean Rasheed
Date:
On Sun, 19 May 2019 at 00:48, Stephen Frost <sfrost@snowman.net> wrote:
>
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> > Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> >
> > > I think we have four options - rework it before beta1, rework it after
> > > beta1, rework it in PG13 and leave it as it is now.
> >
> > Yup, that's about what the options are.  I'm just voting against
> > "change it in v13".  If we're going to change it, then the fewer
> > major versions that have the bogus definition the better --- and
> > since we're changing that catalog in v12 anyway, users will see
> > fewer distinct behaviors if we do this change too.
> >
> > It's very possibly too late to get it done before beta1,
> > unfortunately.  But as Andres noted, post-beta1 catversion
> > bumps are hardly unusual, so I do not think "rework after
> > beta1" is unacceptable.
>
> Agreed.
>

Yes, that makes sense.

I think we shouldn't risk trying to get this into beta1, but let's try
to get it done as soon as possible after that.

Actually, it doesn't appear to be as big a change as I had feared. As
a starter for ten, here's a patch doing the basic split, moving the
extended stats data into a new catalog pg_statistic_ext_data (I'm not
particularly wedded to that name, it's just the first name that came
to mind).

With this patch the catalogs look like this:


\d pg_statistic_ext
            Table "pg_catalog.pg_statistic_ext"
    Column    |    Type    | Collation | Nullable | Default
--------------+------------+-----------+----------+---------
 oid          | oid        |           | not null |
 stxrelid     | oid        |           | not null |
 stxname      | name       |           | not null |
 stxnamespace | oid        |           | not null |
 stxowner     | oid        |           | not null |
 stxkeys      | int2vector |           | not null |
 stxkind      | "char"[]   |           | not null |
Indexes:
    "pg_statistic_ext_name_index" UNIQUE, btree (stxname, stxnamespace)
    "pg_statistic_ext_oid_index" UNIQUE, btree (oid)
    "pg_statistic_ext_relid_index" btree (stxrelid)


\d pg_statistic_ext_data
              Table "pg_catalog.pg_statistic_ext_data"
     Column      |      Type       | Collation | Nullable | Default
-----------------+-----------------+-----------+----------+---------
 stxoid          | oid             |           | not null |
 stxndistinct    | pg_ndistinct    | C         |          |
 stxdependencies | pg_dependencies | C         |          |
 stxmcv          | pg_mcv_list     | C         |          |
Indexes:
    "pg_statistic_ext_data_stxoid_index" UNIQUE, btree (stxoid)


I opted to create/remove pg_statistic_ext_data tuples at the same time
as the pg_statistic_ext tuples, in CreateStatistics() /
RemoveStatisticsById(), so then it's easier to see that they're in a
one-to-one relationship, and other code doesn't need to worry about
the data tuple not existing. The other option would be to defer
inserting the data tuple to ANALYZE.

I couldn't resist moving the code block that declares
pg_statistic_ext's indexes in indexing.h to the right place, assuming
that file is (mostly) sorted alphabetically by catalog name. This puts
the extended stats entries just after the normal stats entries which
seems preferable.

This is only a very rough first draft (e.g., no doc updates), but it
passes all the regression tests.

Regards,
Dean

Attachment

Re: Multivariate MCV stats can leak data to unprivileged users

From
Tom Lane
Date:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> I think we shouldn't risk trying to get this into beta1, but let's try
> to get it done as soon as possible after that.

Agreed.

> \d pg_statistic_ext
>             Table "pg_catalog.pg_statistic_ext"
>     Column    |    Type    | Collation | Nullable | Default
> --------------+------------+-----------+----------+---------
>  oid          | oid        |           | not null |
>  stxrelid     | oid        |           | not null |
>  stxname      | name       |           | not null |
>  stxnamespace | oid        |           | not null |
>  stxowner     | oid        |           | not null |
>  stxkeys      | int2vector |           | not null |
>  stxkind      | "char"[]   |           | not null |
> Indexes:
>     "pg_statistic_ext_name_index" UNIQUE, btree (stxname, stxnamespace)
>     "pg_statistic_ext_oid_index" UNIQUE, btree (oid)
>     "pg_statistic_ext_relid_index" btree (stxrelid)

Check.

> \d pg_statistic_ext_data
>               Table "pg_catalog.pg_statistic_ext_data"
>      Column      |      Type       | Collation | Nullable | Default
> -----------------+-----------------+-----------+----------+---------
>  stxoid          | oid             |           | not null |
>  stxndistinct    | pg_ndistinct    | C         |          |
>  stxdependencies | pg_dependencies | C         |          |
>  stxmcv          | pg_mcv_list     | C         |          |
> Indexes:
>     "pg_statistic_ext_data_stxoid_index" UNIQUE, btree (stxoid)

I wonder ... another way we could potentially do this is

create table pg_statistic_ext_data(
    stxoid oid,  -- OID of owning pg_statistic_ext entry
    stxkind char, -- what kind of data
    stxdata bytea -- the data, in some format or other
);

The advantage of this way is that we'd not have to rejigger the
catalog's rowtype every time we think of a new kind of extended
stats.  The disadvantage is that manual inspection of the contents
of an entry would become much harder, for lack of any convenient
output function.  However, this whole exercise is mostly to prevent
casual manual inspection anyway :-(, so I wonder how much we care
about that.

Also, I assume there's going to be a user-accessible view that shows
a join of these tables, but only those rows that correspond to columns
the current user can read all of.  Should we give that view the name
pg_statistic_ext for maximum backward compatibility?  I'm not sure.
pg_dump would probably prefer it if the view is what has a new name,
but other clients might like the other way.

            regards, tom lane



Re: Multivariate MCV stats can leak data to unprivileged users

From
Tom Lane
Date:
I wrote:
> I wonder ... another way we could potentially do this is

> create table pg_statistic_ext_data(
>     stxoid oid,  -- OID of owning pg_statistic_ext entry
>     stxkind char, -- what kind of data
>     stxdata bytea -- the data, in some format or other
> );

> The advantage of this way is that we'd not have to rejigger the
> catalog's rowtype every time we think of a new kind of extended
> stats.  The disadvantage is that manual inspection of the contents
> of an entry would become much harder, for lack of any convenient
> output function.

No, wait, scratch that.  We could fold the three existing types
pg_ndistinct, pg_dependencies, pg_mcv_list into one new type, say
"pg_stats_ext_data", where the actual storage would need to have an
ID field (so we'd waste a byte or two duplicating the externally
visible stxkind field inside stxdata).  The output function for this
type is just a switch over the existing code.  The big advantage of
this way compared to the current approach is that adding a new
ext-stats type requires *zero* work with adding new catalog entries.
Just add another switch case in pg_stats_ext_data_out() and you're
done.

            regards, tom lane



Re: Multivariate MCV stats can leak data to unprivileged users

From
Tomas Vondra
Date:
On Sun, May 19, 2019 at 10:28:43AM -0400, Tom Lane wrote:
>I wrote:
>> I wonder ... another way we could potentially do this is
>
>> create table pg_statistic_ext_data(
>>     stxoid oid,  -- OID of owning pg_statistic_ext entry
>>     stxkind char, -- what kind of data
>>     stxdata bytea -- the data, in some format or other
>> );
>
>> The advantage of this way is that we'd not have to rejigger the
>> catalog's rowtype every time we think of a new kind of extended
>> stats.  The disadvantage is that manual inspection of the contents
>> of an entry would become much harder, for lack of any convenient
>> output function.
>
>No, wait, scratch that.  We could fold the three existing types
>pg_ndistinct, pg_dependencies, pg_mcv_list into one new type, say
>"pg_stats_ext_data", where the actual storage would need to have an
>ID field (so we'd waste a byte or two duplicating the externally
>visible stxkind field inside stxdata).  The output function for this
>type is just a switch over the existing code.  The big advantage of
>this way compared to the current approach is that adding a new
>ext-stats type requires *zero* work with adding new catalog entries.
>Just add another switch case in pg_stats_ext_data_out() and you're
>done.
>

The annoying thing is that this undoes the protections provided by special
data types generated only in internally. It's not possible to generate
e.g. pg_mcv_list values in user code (except for C code, at which points
all bets are off anyway). By abandoning this and reverting to bytea anyone
could craft a bytea and claim it's a statistic value.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Multivariate MCV stats can leak data to unprivileged users

From
Dean Rasheed
Date:
On Sun, 19 May 2019 at 15:28, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > I wonder ... another way we could potentially do this is
>
> > create table pg_statistic_ext_data(
> >     stxoid oid,  -- OID of owning pg_statistic_ext entry
> >     stxkind char, -- what kind of data
> >     stxdata bytea -- the data, in some format or other
> > );
>
> > The advantage of this way is that we'd not have to rejigger the
> > catalog's rowtype every time we think of a new kind of extended
> > stats.  The disadvantage is that manual inspection of the contents
> > of an entry would become much harder, for lack of any convenient
> > output function.
>
> No, wait, scratch that.  We could fold the three existing types
> pg_ndistinct, pg_dependencies, pg_mcv_list into one new type, say
> "pg_stats_ext_data", where the actual storage would need to have an
> ID field (so we'd waste a byte or two duplicating the externally
> visible stxkind field inside stxdata).  The output function for this
> type is just a switch over the existing code.  The big advantage of
> this way compared to the current approach is that adding a new
> ext-stats type requires *zero* work with adding new catalog entries.
> Just add another switch case in pg_stats_ext_data_out() and you're
> done.
>

This feels a little over-engineered to me. Presumably there'd be a
compound key on (stxoid, stxkind) and we'd have to scan multiple rows
to get all the applicable stats, whereas currently they're all in one
row. And then the user-accessible view would probably need separate
sub-queries for each stats kind.

If the point is just to avoid adding columns to the catalog in future
releases, I'm not sure it's worth the added complexity. We know that
we will probably add histogram stats in a future release. I'm not sure
how many more kinds we'll end up adding, but it doesn't seem likely to
be a huge number. I think we'll add far more columns to other catalog
tables as we add new features to each release.

Regards,
Dean



Re: Multivariate MCV stats can leak data to unprivileged users

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Sun, May 19, 2019 at 10:28:43AM -0400, Tom Lane wrote:
>> No, wait, scratch that.  We could fold the three existing types
>> pg_ndistinct, pg_dependencies, pg_mcv_list into one new type, say
>> "pg_stats_ext_data", where the actual storage would need to have an
>> ID field (so we'd waste a byte or two duplicating the externally
>> visible stxkind field inside stxdata).  The output function for this
>> type is just a switch over the existing code.  The big advantage of
>> this way compared to the current approach is that adding a new
>> ext-stats type requires *zero* work with adding new catalog entries.
>> Just add another switch case in pg_stats_ext_data_out() and you're
>> done.

> The annoying thing is that this undoes the protections provided by special
> data types generated only in internally. It's not possible to generate
> e.g. pg_mcv_list values in user code (except for C code, at which points
> all bets are off anyway). By abandoning this and reverting to bytea anyone
> could craft a bytea and claim it's a statistic value.

That would have been true of the original proposal, but not of this
modified one.

            regards, tom lane



Re: Multivariate MCV stats can leak data to unprivileged users

From
Tomas Vondra
Date:
On Sun, May 19, 2019 at 02:14:54PM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On Sun, May 19, 2019 at 10:28:43AM -0400, Tom Lane wrote:
>>> No, wait, scratch that.  We could fold the three existing types
>>> pg_ndistinct, pg_dependencies, pg_mcv_list into one new type, say
>>> "pg_stats_ext_data", where the actual storage would need to have an
>>> ID field (so we'd waste a byte or two duplicating the externally
>>> visible stxkind field inside stxdata).  The output function for this
>>> type is just a switch over the existing code.  The big advantage of
>>> this way compared to the current approach is that adding a new
>>> ext-stats type requires *zero* work with adding new catalog entries.
>>> Just add another switch case in pg_stats_ext_data_out() and you're
>>> done.
>
>> The annoying thing is that this undoes the protections provided by special
>> data types generated only in internally. It's not possible to generate
>> e.g. pg_mcv_list values in user code (except for C code, at which points
>> all bets are off anyway). By abandoning this and reverting to bytea anyone
>> could craft a bytea and claim it's a statistic value.
>
>That would have been true of the original proposal, but not of this
>modified one.
>

Oh, right. It still has the disadvantage that it obfuscates the actual
data stored in the pg_stats_ext_data (or whatever would it be called),
so e.g. functions would have to do additional checks to make sure it
actually is the right statistic type. For example pg_mcv_list_items()
could not rely on receiving pg_mcv_list values, as per the signature,
but would have to check the value.

Of course, I don't expect to have too many such functions, but overall
this approach with a single type feels a bit too like EAV to my taste.

I think Dean is right we should not expect many more statistic types
than what we already have - a histogram, and perhaps one or two more. So
I agree with Dean the current design with separate statistic types is
not such a big issue.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Multivariate MCV stats can leak data to unprivileged users

From
Dean Rasheed
Date:
On Sun, 19 May 2019 at 23:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Oh, right. It still has the disadvantage that it obfuscates the actual
> data stored in the pg_stats_ext_data (or whatever would it be called),
> so e.g. functions would have to do additional checks to make sure it
> actually is the right statistic type. For example pg_mcv_list_items()
> could not rely on receiving pg_mcv_list values, as per the signature,
> but would have to check the value.
>

Yes. In fact, since the user-accessible view would want to expose
datatypes specific to the stats kinds rather than bytea or cstring
values, we would need SQL-callable conversion functions for each kind:

* to_pg_ndistinct(pg_extended_stats_ext_data) returns pg_ndistinct
* to_pg_dependencies(pg_extended_stats_ext_data) returns pg_dependencies
* to_pg_mcv(pg_extended_stats_ext_data) returns pg_mcv
* ...

and each of these would throw an error if it weren't given an extended
stats object of the right kind. Then to extract MCV data, you'd have
to do pg_mcv_list_items(to_pg_mcv(ext_data)), and presumably there'd
be something similar for histograms.

IMO, that's not a nice model, compared to just having columns of the
right types in the first place.

Also this model presupposes that all future stats kinds are most
conveniently represented in a single column, but maybe that won't be
the case. It's conceivable that a future stats kind would benefit from
splitting its data across multiple columns.


> Of course, I don't expect to have too many such functions, but overall
> this approach with a single type feels a bit too like EAV to my taste.
>

Yes, I think it is an EAV model. I think EAV models do have their
place, but I think that's largely where adding new columns is a common
operation and involves adding little to no extra code. I don't think
either of those is true for extended stats. What we've seen over the
last couple of years is that adding each new stats kind is a large
undertaking, involving lots of new code. That alone is going to limit
just how many ever get added, and compared to that effort, adding new
columns to the catalog is small fry.

Regards,
Dean



Re: Multivariate MCV stats can leak data to unprivileged users

From
Tom Lane
Date:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> On Sun, 19 May 2019 at 23:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> Oh, right. It still has the disadvantage that it obfuscates the actual
>> data stored in the pg_stats_ext_data (or whatever would it be called),
>> so e.g. functions would have to do additional checks to make sure it
>> actually is the right statistic type. For example pg_mcv_list_items()
>> could not rely on receiving pg_mcv_list values, as per the signature,
>> but would have to check the value.

> Yes. In fact, since the user-accessible view would want to expose
> datatypes specific to the stats kinds rather than bytea or cstring
> values, we would need SQL-callable conversion functions for each kind:

It seems like people are willfully misunderstanding my suggestion.
You'd only need *one* conversion function, which would look at the
embedded ID field and then emit the appropriate text representation.
I don't see a reason why we'd have the separate pg_ndistinct etc. types
any more at all.

> Also this model presupposes that all future stats kinds are most
> conveniently represented in a single column, but maybe that won't be
> the case. It's conceivable that a future stats kind would benefit from
> splitting its data across multiple columns.

Hm, that's possible I suppose, but it seems a little far-fetched.
You could equally well argue that pg_ndistinct etc. should have been
broken down into smaller types, but we didn't.

> Yes, I think it is an EAV model. I think EAV models do have their
> place, but I think that's largely where adding new columns is a common
> operation and involves adding little to no extra code. I don't think
> either of those is true for extended stats. What we've seen over the
> last couple of years is that adding each new stats kind is a large
> undertaking, involving lots of new code. That alone is going to limit
> just how many ever get added, and compared to that effort, adding new
> columns to the catalog is small fry.

I can't argue with that --- the make-work is just a small part of the
total.  But it's still make-work.

Anyway, it was just a suggestion, and if people don't like it that's
fine.  But I don't want it to be rejected on the basis of false
arguments.

            regards, tom lane



Re: Multivariate MCV stats can leak data to unprivileged users

From
Tomas Vondra
Date:
On Mon, May 20, 2019 at 09:32:24AM -0400, Tom Lane wrote:
>Dean Rasheed <dean.a.rasheed@gmail.com> writes:
>> On Sun, 19 May 2019 at 23:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>> Oh, right. It still has the disadvantage that it obfuscates the actual
>>> data stored in the pg_stats_ext_data (or whatever would it be called),
>>> so e.g. functions would have to do additional checks to make sure it
>>> actually is the right statistic type. For example pg_mcv_list_items()
>>> could not rely on receiving pg_mcv_list values, as per the signature,
>>> but would have to check the value.
>
>> Yes. In fact, since the user-accessible view would want to expose
>> datatypes specific to the stats kinds rather than bytea or cstring
>> values, we would need SQL-callable conversion functions for each kind:
>
>It seems like people are willfully misunderstanding my suggestion.
>You'd only need *one* conversion function, which would look at the
>embedded ID field and then emit the appropriate text representation.
>I don't see a reason why we'd have the separate pg_ndistinct etc. types
>any more at all.
>

That would however require having input functions, which we currently
don't have. Otherwise people could not process the statistic values using
functions like pg_mcv_list_items(). Which I think is useful.

Of course, we could add input functions, but there was a reasoning for not
having them (similarly to pg_node_tree). 

>> Also this model presupposes that all future stats kinds are most
>> conveniently represented in a single column, but maybe that won't be
>> the case. It's conceivable that a future stats kind would benefit from
>> splitting its data across multiple columns.
>
>Hm, that's possible I suppose, but it seems a little far-fetched.
>You could equally well argue that pg_ndistinct etc. should have been
>broken down into smaller types, but we didn't.
>

True. I can't rule out adding such "split" statistic type, but don't think
it's very likely. The extended statistic values tend to be complex and
easier to represent in a single value.

>> Yes, I think it is an EAV model. I think EAV models do have their
>> place, but I think that's largely where adding new columns is a common
>> operation and involves adding little to no extra code. I don't think
>> either of those is true for extended stats. What we've seen over the
>> last couple of years is that adding each new stats kind is a large
>> undertaking, involving lots of new code. That alone is going to limit
>> just how many ever get added, and compared to that effort, adding new
>> columns to the catalog is small fry.
>
>I can't argue with that --- the make-work is just a small part of the
>total.  But it's still make-work.
>
>Anyway, it was just a suggestion, and if people don't like it that's
>fine.  But I don't want it to be rejected on the basis of false
>arguments.
>

Sure.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Multivariate MCV stats can leak data to unprivileged users

From
Dean Rasheed
Date:
On Mon, 20 May 2019 at 14:32, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> > On Sun, 19 May 2019 at 23:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> >> Oh, right. It still has the disadvantage that it obfuscates the actual
> >> data stored in the pg_stats_ext_data (or whatever would it be called),
> >> so e.g. functions would have to do additional checks to make sure it
> >> actually is the right statistic type. For example pg_mcv_list_items()
> >> could not rely on receiving pg_mcv_list values, as per the signature,
> >> but would have to check the value.
>
> > Yes. In fact, since the user-accessible view would want to expose
> > datatypes specific to the stats kinds rather than bytea or cstring
> > values, we would need SQL-callable conversion functions for each kind:
>
> It seems like people are willfully misunderstanding my suggestion.

I'm more than capable of inadvertently misunderstanding, without the
need to willfully do so :-)

> You'd only need *one* conversion function, which would look at the
> embedded ID field and then emit the appropriate text representation.
> I don't see a reason why we'd have the separate pg_ndistinct etc. types
> any more at all.

Hmm, OK. So then would you also make the user-accessible view agnostic
about the kinds of stats supported in the same way, returning zero or
more rows per STATISTICS object, depending on how many kinds of stats
have been built? That would have the advantage of never needing to
change the view definition again, as more stats kinds are supported.

We'd need to change pg_mcv_list_items() to accept a pg_stats_ext_data
value rather than a pg_mcv value, and it would be the user's
responsibility to call it if they wanted to see the contents of the
MCV list (I was originally thinking that we'd include a call to
pg_mcv_list_items() in the view definition, so that it produced
friendlier looking output, since the default textual representation of
an MCV list is completely opaque, unlike the other stats kinds).
Actually, I can see another advantage to not including
pg_mcv_list_items() in the view definition -- in the future, we may
dream up a better version of pg_mcv_list_items(), like say one that
produced JSON, and then we'd regret using the current function.

> Anyway, it was just a suggestion, and if people don't like it that's
> fine.  But I don't want it to be rejected on the basis of false
> arguments.

To be clear, I'm not intentionally rejecting your idea. I'm merely
trying to fully understand the implications.

At this stage, perhaps it would be helpful to prototype something for
comparison.

Regards,
Dean



Re: Multivariate MCV stats can leak data to unprivileged users

From
Tomas Vondra
Date:
On Mon, May 20, 2019 at 04:09:24PM +0100, Dean Rasheed wrote:
>On Mon, 20 May 2019 at 14:32, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Dean Rasheed <dean.a.rasheed@gmail.com> writes:
>> > On Sun, 19 May 2019 at 23:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> >> Oh, right. It still has the disadvantage that it obfuscates the actual
>> >> data stored in the pg_stats_ext_data (or whatever would it be called),
>> >> so e.g. functions would have to do additional checks to make sure it
>> >> actually is the right statistic type. For example pg_mcv_list_items()
>> >> could not rely on receiving pg_mcv_list values, as per the signature,
>> >> but would have to check the value.
>>
>> > Yes. In fact, since the user-accessible view would want to expose
>> > datatypes specific to the stats kinds rather than bytea or cstring
>> > values, we would need SQL-callable conversion functions for each kind:
>>
>> It seems like people are willfully misunderstanding my suggestion.
>
>I'm more than capable of inadvertently misunderstanding, without the
>need to willfully do so :-)
>
>> You'd only need *one* conversion function, which would look at the
>> embedded ID field and then emit the appropriate text representation.
>> I don't see a reason why we'd have the separate pg_ndistinct etc. types
>> any more at all.
>
>Hmm, OK. So then would you also make the user-accessible view agnostic
>about the kinds of stats supported in the same way, returning zero or
>more rows per STATISTICS object, depending on how many kinds of stats
>have been built? That would have the advantage of never needing to
>change the view definition again, as more stats kinds are supported.
>

If I got Tom's proposal right, there'd be only one statistic value in
each pg_stats_ext_data value. It'd be a very thin wrapper, essentially
just the value itself + type flag. So for example if you did

  CREATE STATISTICS s (ndistinct, mcv, dependencies) ...

you'd get three rows in pg_statistic_ext_data (assuming all the stats
get actually built).

>We'd need to change pg_mcv_list_items() to accept a pg_stats_ext_data
>value rather than a pg_mcv value, and it would be the user's
>responsibility to call it if they wanted to see the contents of the
>MCV list (I was originally thinking that we'd include a call to
>pg_mcv_list_items() in the view definition, so that it produced
>friendlier looking output, since the default textual representation of
>an MCV list is completely opaque, unlike the other stats kinds).
>Actually, I can see another advantage to not including
>pg_mcv_list_items() in the view definition -- in the future, we may
>dream up a better version of pg_mcv_list_items(), like say one that
>produced JSON, and then we'd regret using the current function.
>

Yeah. As I said, it obfuscates the "actual" type of the stats value, so
we can no longer rely on the function machinery to verify the type. All
functions dealing with the "wrapper" type would have to verify it
actually contains the right statistic type.

>> Anyway, it was just a suggestion, and if people don't like it that's
>> fine.  But I don't want it to be rejected on the basis of false
>> arguments.
>
>To be clear, I'm not intentionally rejecting your idea. I'm merely
>trying to fully understand the implications.
>
>At this stage, perhaps it would be helpful to prototype something for
>comparison.
>

I'll look into that. I'll try to whip something up before pgcon, but I
can't guarantee that :-(


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Multivariate MCV stats can leak data to unprivileged users

From
Tomas Vondra
Date:
Hi,

Attached are three patches tweaking the stats - two were already posted
in this thread, the third one is just updating docs.

1) 0001 - split pg_statistic_ext into definition + data

This is pretty much the patch Dean posted some time ago, rebased to
current master (fixing just minor pgindent bitrot).

2) 0002 - update sgml docs to reflect changes from 0001

3) 0003 - define pg_stats_ext view, similar to pg_stats


The question is whether we want to also redesign pg_statistic_ext_data
per Tom's proposal (more about that later), but I think we can treat
that as an additional step on top of 0001. So I propose we get those
changes committed, and then perhaps also switch the data table to the
EAV model.

Barring objections, I'll do that early next week, after cleaning up
those patches a bit more.

One thing I think we should fix is naming of the attributes in the 0001
patch. At the moment both catalogs use "stx" prefix - e.g. "stxkind" is
in pg_statistic_ext, and "stxmcv" is in pg_statistic_ext_data. We should
probably switch to "stxd" in the _data catalog. Opinions?

Now, back to the proposal to split the _data catalog rows to EAV form,
with a new data type replacing the multiple types we have at the moment.
I've started hacking on it today, but the more I work on it the less
useful it seems to me.

My understanding is that with that approach we'd replace the _data
catalog (which currently has one column per statistic type, with a
separate data type) with 1:M generic rows, with a generic data type.
That is, we'd replace this

    CREATE TABLE pg_statistic_ext_data (
        stxoid OID,
        stxdependencies pg_dependencies,
        stxndistinct pg_ndistinct,
        stxmcv pg_mcv_list,
        ... histograms ...
    );

with something like this:

    CREATE TABLE pg_statistiex_ext_data (
        stxoid OID,
        stxkind CHAR,
        stxdata pg_statistic_ext_type
    );

where pg_statistic_ext would store all existing statistic types. along
with a "flag" saying which value it actually stored (essentially a copy
of the stxkind column, which we however need to lookup a statistic of a
certain type, without having to detoast the statistic itself).

As I mentioned before, I kinda dislike the fact that this obfuscates the
actual statistic type by hiding it behing the "wrapper" type.

The other thing is that we have to deal with 1:M relationship every time
we (re)build the statistics, or when we need to access them. Now, it may
not be a huge amount of code, but it just seems unnecessary. It would
make sense if we planned to add large number of additional statistic
types, but that seems unlikely - I personally can think of maybe one new
statistic type, but that's about it.

I'll continue working on it and I'll share the results early next week,
after playing with it a bit, but I think we should get the existing
patches committed and then continue discussing this as an additional
improvement.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Multivariate MCV stats can leak data to unprivileged users

From
Dean Rasheed
Date:
On Thu, 6 Jun 2019 at 21:33, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Hi,
>
> Attached are three patches tweaking the stats - two were already posted
> in this thread, the third one is just updating docs.
>
> 1) 0001 - split pg_statistic_ext into definition + data
>
> This is pretty much the patch Dean posted some time ago, rebased to
> current master (fixing just minor pgindent bitrot).
>
> 2) 0002 - update sgml docs to reflect changes from 0001
>
> 3) 0003 - define pg_stats_ext view, similar to pg_stats
>

Seems reasonable on a quick read-through, except I spotted a bug in
the view (my fault) -- the statistics_owner column should come from
s.stxowner rather than c.relowner.


> The question is whether we want to also redesign pg_statistic_ext_data
> per Tom's proposal (more about that later), but I think we can treat
> that as an additional step on top of 0001. So I propose we get those
> changes committed, and then perhaps also switch the data table to the
> EAV model.
>
> Barring objections, I'll do that early next week, after cleaning up
> those patches a bit more.
>
> One thing I think we should fix is naming of the attributes in the 0001
> patch. At the moment both catalogs use "stx" prefix - e.g. "stxkind" is
> in pg_statistic_ext, and "stxmcv" is in pg_statistic_ext_data. We should
> probably switch to "stxd" in the _data catalog. Opinions?
>

Yes, that makes sense. Especially when joining the 2 tables, since it
makes it more obvious which table a given column is coming from in a
join clause.


> Now, back to the proposal to split the _data catalog rows to EAV form,
> with a new data type replacing the multiple types we have at the moment.
> I've started hacking on it today, but the more I work on it the less
> useful it seems to me.
>
> My understanding is that with that approach we'd replace the _data
> catalog (which currently has one column per statistic type, with a
> separate data type) with 1:M generic rows, with a generic data type.
> That is, we'd replace this
>
>     CREATE TABLE pg_statistic_ext_data (
>         stxoid OID,
>         stxdependencies pg_dependencies,
>         stxndistinct pg_ndistinct,
>         stxmcv pg_mcv_list,
>         ... histograms ...
>     );
>
> with something like this:
>
>     CREATE TABLE pg_statistiex_ext_data (
>         stxoid OID,
>         stxkind CHAR,
>         stxdata pg_statistic_ext_type
>     );
>
> where pg_statistic_ext would store all existing statistic types. along
> with a "flag" saying which value it actually stored (essentially a copy
> of the stxkind column, which we however need to lookup a statistic of a
> certain type, without having to detoast the statistic itself).
>
> As I mentioned before, I kinda dislike the fact that this obfuscates the
> actual statistic type by hiding it behing the "wrapper" type.
>
> The other thing is that we have to deal with 1:M relationship every time
> we (re)build the statistics, or when we need to access them. Now, it may
> not be a huge amount of code, but it just seems unnecessary. It would
> make sense if we planned to add large number of additional statistic
> types, but that seems unlikely - I personally can think of maybe one new
> statistic type, but that's about it.
>
> I'll continue working on it and I'll share the results early next week,
> after playing with it a bit, but I think we should get the existing
> patches committed and then continue discussing this as an additional
> improvement.
>

I wonder ... would it be completely crazy to just use a JSON column to
store the extended stats data?

It wouldn't be as compact as your representation, but it would allow
for future stats kinds without changing the catalog definitions, and
it wouldn't obfuscate the stats types. You could keep the 1:1
relationship, and have top-level JSON keys for each stats kind built,
and you wouldn't need the pg_mcv_list_items() function because you
could just put the MCV data in JSON arrays, which would be much more
transparent, and would make the user-accessible view much simpler. One
could also imagine writing regression tests that checked for specific
expected MCV values like "stxdata->'mcv'->'frequency'->0".

Just a thought.

Regards,
Dean



Re: Multivariate MCV stats can leak data to unprivileged users

From
John Naylor
Date:
On Fri, Jun 7, 2019 at 4:33 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> 2) 0002 - update sgml docs to reflect changes from 0001

There is some copypasta here in the new section referring to the old catalog:

+ <sect1 id="catalog-pg-statistic-ext-data">
+  <title><structname>pg_statistic_ext_data</structname></title>
+
+  <indexterm zone="catalog-pg-statistic-ext">
+   <primary>pg_statistic_ext</primary>
+  </indexterm>
+
+  <para>
+   The catalog <structname>pg_statistic_ext</structname>
+   holds extended planner statistics.
+   Each row in this catalog corresponds to a <firstterm>statistics
object</firstterm>
+   created with <xref linkend="sql-createstatistics"/>.
+  </para>

And a minor stylistic nit -- it might be good to capitalize "JOIN" and
"ON" in the queries in the docs and tests.

> One thing I think we should fix is naming of the attributes in the 0001
> patch. At the moment both catalogs use "stx" prefix - e.g. "stxkind" is
> in pg_statistic_ext, and "stxmcv" is in pg_statistic_ext_data. We should
> probably switch to "stxd" in the _data catalog. Opinions?

That's probably a good idea.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Multivariate MCV stats can leak data to unprivileged users

From
Tomas Vondra
Date:
On Mon, Jun 10, 2019 at 02:32:04PM +0100, Dean Rasheed wrote:
>On Thu, 6 Jun 2019 at 21:33, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> Hi,
>>
>> Attached are three patches tweaking the stats - two were already posted
>> in this thread, the third one is just updating docs.
>>
>> 1) 0001 - split pg_statistic_ext into definition + data
>>
>> This is pretty much the patch Dean posted some time ago, rebased to
>> current master (fixing just minor pgindent bitrot).
>>
>> 2) 0002 - update sgml docs to reflect changes from 0001
>>
>> 3) 0003 - define pg_stats_ext view, similar to pg_stats
>>
>
>Seems reasonable on a quick read-through, except I spotted a bug in
>the view (my fault) -- the statistics_owner column should come from
>s.stxowner rather than c.relowner.
>
>
>> The question is whether we want to also redesign pg_statistic_ext_data
>> per Tom's proposal (more about that later), but I think we can treat
>> that as an additional step on top of 0001. So I propose we get those
>> changes committed, and then perhaps also switch the data table to the
>> EAV model.
>>
>> Barring objections, I'll do that early next week, after cleaning up
>> those patches a bit more.
>>
>> One thing I think we should fix is naming of the attributes in the 0001
>> patch. At the moment both catalogs use "stx" prefix - e.g. "stxkind" is
>> in pg_statistic_ext, and "stxmcv" is in pg_statistic_ext_data. We should
>> probably switch to "stxd" in the _data catalog. Opinions?
>>
>
>Yes, that makes sense. Especially when joining the 2 tables, since it
>makes it more obvious which table a given column is coming from in a
>join clause.
>

OK, attached are patches fixing the issues reported by you and John
Naylor, and squashing the parts into just two patches (catalog split and
pg_stats_ext). Barring objections, I'll push those tomorrow.

I've renamed columns in the _data catalog from 'stx' to 'stxd', which I
think is appropriate given the "data" in catalog name.

I'm wondering if we should change the examples in SGML docs (say, in
planstats.sgml) to use the new pg_stats_ext view, instead of querying the
catalogs directly. I've tried doing that, but I found the results less
readable than what we currently have (especially for the MCV list, where
it'd require matching elements in multiple arrays). So I've left this
unchanged for now.

>
>> Now, back to the proposal to split the _data catalog rows to EAV form,
>> with a new data type replacing the multiple types we have at the moment.
>> I've started hacking on it today, but the more I work on it the less
>> useful it seems to me.
>>
>> My understanding is that with that approach we'd replace the _data
>> catalog (which currently has one column per statistic type, with a
>> separate data type) with 1:M generic rows, with a generic data type.
>> That is, we'd replace this
>>
>>     CREATE TABLE pg_statistic_ext_data (
>>         stxoid OID,
>>         stxdependencies pg_dependencies,
>>         stxndistinct pg_ndistinct,
>>         stxmcv pg_mcv_list,
>>         ... histograms ...
>>     );
>>
>> with something like this:
>>
>>     CREATE TABLE pg_statistiex_ext_data (
>>         stxoid OID,
>>         stxkind CHAR,
>>         stxdata pg_statistic_ext_type
>>     );
>>
>> where pg_statistic_ext would store all existing statistic types. along
>> with a "flag" saying which value it actually stored (essentially a copy
>> of the stxkind column, which we however need to lookup a statistic of a
>> certain type, without having to detoast the statistic itself).
>>
>> As I mentioned before, I kinda dislike the fact that this obfuscates the
>> actual statistic type by hiding it behing the "wrapper" type.
>>
>> The other thing is that we have to deal with 1:M relationship every time
>> we (re)build the statistics, or when we need to access them. Now, it may
>> not be a huge amount of code, but it just seems unnecessary. It would
>> make sense if we planned to add large number of additional statistic
>> types, but that seems unlikely - I personally can think of maybe one new
>> statistic type, but that's about it.
>>
>> I'll continue working on it and I'll share the results early next week,
>> after playing with it a bit, but I think we should get the existing
>> patches committed and then continue discussing this as an additional
>> improvement.
>>
>
>I wonder ... would it be completely crazy to just use a JSON column to
>store the extended stats data?
>
>It wouldn't be as compact as your representation, but it would allow
>for future stats kinds without changing the catalog definitions, and
>it wouldn't obfuscate the stats types. You could keep the 1:1
>relationship, and have top-level JSON keys for each stats kind built,
>and you wouldn't need the pg_mcv_list_items() function because you
>could just put the MCV data in JSON arrays, which would be much more
>transparent, and would make the user-accessible view much simpler. One
>could also imagine writing regression tests that checked for specific
>expected MCV values like "stxdata->'mcv'->'frequency'->0".
>

You mean storing it as JSONB, I presume?

I've actually considered that at some point, but eventually concluded it's
not a good match. I mean, JSON(B) is pretty versatile and can be whacked
to store pretty much anything, but it has various limitations - e.g. it
does not support arbitrary data types, so we'd have to store a lot of
stuff as text (through input/output functions). That doesn't seem very
nice, I guess.

If we want JSONB output, that should not be difficult to generate. But I
guess your point was about generic storage format.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Attachment

Re: Multivariate MCV stats can leak data to unprivileged users

From
Tomas Vondra
Date:
On Thu, Jun 13, 2019 at 07:37:45PM +0200, Tomas Vondra wrote:
> ...
>
>OK, attached are patches fixing the issues reported by you and John
>Naylor, and squashing the parts into just two patches (catalog split and
>pg_stats_ext). Barring objections, I'll push those tomorrow.
>
>I've renamed columns in the _data catalog from 'stx' to 'stxd', which I
>think is appropriate given the "data" in catalog name.
>
>I'm wondering if we should change the examples in SGML docs (say, in
>planstats.sgml) to use the new pg_stats_ext view, instead of querying the
>catalogs directly. I've tried doing that, but I found the results less
>readable than what we currently have (especially for the MCV list, where
>it'd require matching elements in multiple arrays). So I've left this
>unchanged for now.
>

I've pushed those changes, after adding docs for the pg_stats_ext view.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Multivariate MCV stats can leak data to unprivileged users

From
Dean Rasheed
Date:
On Mon, 13 May 2019 at 23:36, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On Fri, May 10, 2019 at 10:19:44AM +0100, Dean Rasheed wrote:
> >While working on 1aebfbea83c, I noticed that the new multivariate MCV
> >stats feature suffers from the same problem, and also the original
> >problems that were fixed in e2d4ef8de8 and earlier --- namely that a
> >user can see values in the MCV lists that they shouldn't see (values
> >from tables that they don't have privileges on).
> >
> >I think there are 2 separate issues here:
> >
> >1). The table pg_statistic_ext is accessible to anyone, so any user
> >can see the MCV lists of any table. I think we should give this the
> >same treatment as pg_statistic, and hide it behind a security barrier
> >view, revoking public access from the table.
> >
> >2). The multivariate MCV stats planner code can be made to invoke
> >user-defined operators, so a user can create a leaky operator and use
> >it to reveal data values from the MCV lists even if they have no
> >permissions on the table.
> >
> >Attached is a draft patch to fix (2), which hooks into
> >statext_is_compatible_clause().
> >
>
> I think that patch is good.
>

I realised that we forgot to push this second part, so I've just done so.

Regards,
Dean



Re: Multivariate MCV stats can leak data to unprivileged users

From
Tomas Vondra
Date:
On Sun, Jun 23, 2019 at 06:56:53PM +0100, Dean Rasheed wrote:
>On Mon, 13 May 2019 at 23:36, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Fri, May 10, 2019 at 10:19:44AM +0100, Dean Rasheed wrote:
>> >While working on 1aebfbea83c, I noticed that the new multivariate MCV
>> >stats feature suffers from the same problem, and also the original
>> >problems that were fixed in e2d4ef8de8 and earlier --- namely that a
>> >user can see values in the MCV lists that they shouldn't see (values
>> >from tables that they don't have privileges on).
>> >
>> >I think there are 2 separate issues here:
>> >
>> >1). The table pg_statistic_ext is accessible to anyone, so any user
>> >can see the MCV lists of any table. I think we should give this the
>> >same treatment as pg_statistic, and hide it behind a security barrier
>> >view, revoking public access from the table.
>> >
>> >2). The multivariate MCV stats planner code can be made to invoke
>> >user-defined operators, so a user can create a leaky operator and use
>> >it to reveal data values from the MCV lists even if they have no
>> >permissions on the table.
>> >
>> >Attached is a draft patch to fix (2), which hooks into
>> >statext_is_compatible_clause().
>> >
>>
>> I think that patch is good.
>>
>
>I realised that we forgot to push this second part, so I've just done so.
>

Whoops! Too many patches in this thread. Thanks for noticing.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services