Thread: Multivariate MCV stats can leak data to unprivileged users
While working on 1aebfbea83c, I noticed that the new multivariate MCV stats feature suffers from the same problem, and also the original problems that were fixed in e2d4ef8de8 and earlier --- namely that a user can see values in the MCV lists that they shouldn't see (values from tables that they don't have privileges on). I think there are 2 separate issues here: 1). The table pg_statistic_ext is accessible to anyone, so any user can see the MCV lists of any table. I think we should give this the same treatment as pg_statistic, and hide it behind a security barrier view, revoking public access from the table. 2). The multivariate MCV stats planner code can be made to invoke user-defined operators, so a user can create a leaky operator and use it to reveal data values from the MCV lists even if they have no permissions on the table. Attached is a draft patch to fix (2), which hooks into statext_is_compatible_clause(). I haven't thought much about (1). There are some questions about what exactly the view should look like. Probably it should translate table oids to names, like pg_stats does, but should it also translate column indexes to names? That could get fiddly. Should it unpack MCV items? I'll raise this as an open item for PG12. Regards, Dean
Attachment
On Fri, May 10, 2019 at 10:19:44AM +0100, Dean Rasheed wrote: >While working on 1aebfbea83c, I noticed that the new multivariate MCV >stats feature suffers from the same problem, and also the original >problems that were fixed in e2d4ef8de8 and earlier --- namely that a >user can see values in the MCV lists that they shouldn't see (values >from tables that they don't have privileges on). > >I think there are 2 separate issues here: > >1). The table pg_statistic_ext is accessible to anyone, so any user >can see the MCV lists of any table. I think we should give this the >same treatment as pg_statistic, and hide it behind a security barrier >view, revoking public access from the table. > >2). The multivariate MCV stats planner code can be made to invoke >user-defined operators, so a user can create a leaky operator and use >it to reveal data values from the MCV lists even if they have no >permissions on the table. > >Attached is a draft patch to fix (2), which hooks into >statext_is_compatible_clause(). > I think that patch is good. >I haven't thought much about (1). There are some questions about what >exactly the view should look like. Probably it should translate table >oids to names, like pg_stats does, but should it also translate column >indexes to names? That could get fiddly. Should it unpack MCV items? > Yeah. I suggest we add a simple pg_stats_ext view, similar to pg_stats. It would: (1) translate the schema / relation / attribute names I don't see why translating column indexes to names would be fiddly. It's a matter of simple unnest + join, no? Or what issues you see? (2) include values for ndistinct / dependencies, if built Those don't include any actual values, so this should be OK. You might make the argument that even this does leak a bit of information (e.g. when you can see values in one column, and you know there's a strong functional dependence, it tells you something about the other column which you may not see). But I think we kinda already leak information about that through estimates, so maybe that's not an issue. (3) include MCV list only when user has access to all columns Essentially, if the user is missing 'select' privilege on at least one of the columns, there'll be NULL. Otherwise the MCV value. The attached patch adds pg_stats_ext doing this. I don't claim it's the best possible query backing the view, so any improvements are welcome. I've been thinking we might somehow filter the statistics values, e.g. by not showing values for attributes the user has no 'select' privilege on, but that seems like a can of worms. It'd lead to MCV items that can't be distinguished because the only difference was the removed attribute, and so on. So I propose we simply show/hide the whole MCV list. Likewise, I don't think it makes sense to expand the MCV list in this view - that works for the single-dimensional case, because then the list is expanded into two arrays (values + frequencies), which are easy to process further. But for multivariate MCV lists that's much more complex - we don't know how many attributes are there, for example. So I suggest we just show the pg_mcv_list value as is, and leave it up to the user to call the pg_mcv_list_items() function if needed. This will also work for histograms, where expanding the value in the pg_stats_ext would be even trickier. -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Mon, 13 May 2019 at 23:36, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > Yeah. I suggest we add a simple pg_stats_ext view, similar to pg_stats. > It would: > > (1) translate the schema / relation / attribute names > > I don't see why translating column indexes to names would be fiddly. > It's a matter of simple unnest + join, no? Or what issues you see? > Yeah, you're right. I thought it would be harder than that. One minor thing -- I think we should have an explicit ORDER BY where we collect the column names, to guarantee that they're listed in the right order. > (2) include values for ndistinct / dependencies, if built > > Those don't include any actual values, so this should be OK. You might > make the argument that even this does leak a bit of information (e.g. > when you can see values in one column, and you know there's a strong > functional dependence, it tells you something about the other column > which you may not see). But I think we kinda already leak information > about that through estimates, so maybe that's not an issue. > Hmm. For normal statistics, if the user has no permissions on the table, they don't get to see any of these kinds of statistics, not even things like n_distinct. I think we should do the same here -- i.e., if the user has no permissions on the table, don't let them see anything. Such a user will not be able to run EXPLAIN on queries against the table, so they won't get to see any estimates, and I don't think they should get to see any extended statistics either. > (3) include MCV list only when user has access to all columns > > Essentially, if the user is missing 'select' privilege on at least one > of the columns, there'll be NULL. Otherwise the MCV value. > OK, that seems reasonable, except as I said above, I think that should apply to all statistics, not just the MCV lists. > The attached patch adds pg_stats_ext doing this. I don't claim it's the > best possible query backing the view, so any improvements are welcome. > > > I've been thinking we might somehow filter the statistics values, e.g. by > not showing values for attributes the user has no 'select' privilege on, > but that seems like a can of worms. It'd lead to MCV items that can't be > distinguished because the only difference was the removed attribute, and > so on. So I propose we simply show/hide the whole MCV list. > Agreed. > Likewise, I don't think it makes sense to expand the MCV list in this > view - that works for the single-dimensional case, because then the > list is expanded into two arrays (values + frequencies), which are easy > to process further. But for multivariate MCV lists that's much more > complex - we don't know how many attributes are there, for example. > > So I suggest we just show the pg_mcv_list value as is, and leave it up > to the user to call the pg_mcv_list_items() function if needed. > I think expanding the MCV lists is actually quite useful because then you can see arrays of values, nulls, frequencies and base frequencies in a reasonably readable form (it certainly looks better than a binary dump), without needing to join to a function call, which is a bit ugly, and unmemorable. The results from the attached look quite reasonable at first glance. It contains a few other changes as well: 1). It exposes the schema, name and owner of the statistics object as well via the view, for completeness. 2). It changes a few column names -- traditionally these views strip off the column name prefix from the underlying table, so I've attempted to be consistent with other similar views. 3). I added array-valued columns for each of the MCV list components, which makes it more like pg_stats. 4). I moved all the permission checks to the top-level WHERE clause, so a user needs to have select permissions on all the columns mentioned by the statistics, and the table mustn't have RLS in effect, otherwise the user won't see the row for that statistics object. 5). Some columns from pg_statistic_ext have to be made visible for psql \d to work. Basically, it needs to be able to query for the existence of extended statistics, but it doesn't need to see the actual statistical data. Of course, we could change psql to use the view, but this way gives us better backwards compatibility with older clients. This is still going to break compatibility of any user code looking at stxndistinct or stxdependencies from pg_statistic_ext, but at least it doesn't break old versions of psql. Note: doc and test updates still to do. Regards, Dean
Attachment
On Thu, May 16, 2019 at 02:28:03PM +0100, Dean Rasheed wrote: >On Mon, 13 May 2019 at 23:36, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> >> Yeah. I suggest we add a simple pg_stats_ext view, similar to pg_stats. >> It would: >> >> (1) translate the schema / relation / attribute names >> >> I don't see why translating column indexes to names would be fiddly. >> It's a matter of simple unnest + join, no? Or what issues you see? >> > >Yeah, you're right. I thought it would be harder than that. One minor >thing -- I think we should have an explicit ORDER BY where we collect >the column names, to guarantee that they're listed in the right order. > Good point. >> (2) include values for ndistinct / dependencies, if built >> >> Those don't include any actual values, so this should be OK. You might >> make the argument that even this does leak a bit of information (e.g. >> when you can see values in one column, and you know there's a strong >> functional dependence, it tells you something about the other column >> which you may not see). But I think we kinda already leak information >> about that through estimates, so maybe that's not an issue. >> > >Hmm. For normal statistics, if the user has no permissions on the >table, they don't get to see any of these kinds of statistics, not >even things like n_distinct. I think we should do the same here -- >i.e., if the user has no permissions on the table, don't let them see >anything. Such a user will not be able to run EXPLAIN on queries >against the table, so they won't get to see any estimates, and I don't >think they should get to see any extended statistics either. > OK, I haven't realized we don't show that even for normal stats. >> (3) include MCV list only when user has access to all columns >> >> Essentially, if the user is missing 'select' privilege on at least one >> of the columns, there'll be NULL. Otherwise the MCV value. >> > >OK, that seems reasonable, except as I said above, I think that should >apply to all statistics, not just the MCV lists. > >> The attached patch adds pg_stats_ext doing this. I don't claim it's the >> best possible query backing the view, so any improvements are welcome. >> >> >> I've been thinking we might somehow filter the statistics values, e.g. by >> not showing values for attributes the user has no 'select' privilege on, >> but that seems like a can of worms. It'd lead to MCV items that can't be >> distinguished because the only difference was the removed attribute, and >> so on. So I propose we simply show/hide the whole MCV list. >> > >Agreed. > >> Likewise, I don't think it makes sense to expand the MCV list in this >> view - that works for the single-dimensional case, because then the >> list is expanded into two arrays (values + frequencies), which are easy >> to process further. But for multivariate MCV lists that's much more >> complex - we don't know how many attributes are there, for example. >> >> So I suggest we just show the pg_mcv_list value as is, and leave it up >> to the user to call the pg_mcv_list_items() function if needed. >> > >I think expanding the MCV lists is actually quite useful because then >you can see arrays of values, nulls, frequencies and base frequencies >in a reasonably readable form (it certainly looks better than a binary >dump), without needing to join to a function call, which is a bit >ugly, and unmemorable. > Hmmm, ok. I think my main worry here is that it may or may not work for more complex types of extended stats that are likely to come in the future. Although, maybe it can be made work even for that. >The results from the attached look quite reasonable at first glance. >It contains a few other changes as well: > >1). It exposes the schema, name and owner of the statistics object as >well via the view, for completeness. > >2). It changes a few column names -- traditionally these views strip >off the column name prefix from the underlying table, so I've >attempted to be consistent with other similar views. > >3). I added array-valued columns for each of the MCV list components, >which makes it more like pg_stats. > >4). I moved all the permission checks to the top-level WHERE clause, >so a user needs to have select permissions on all the columns >mentioned by the statistics, and the table mustn't have RLS in effect, >otherwise the user won't see the row for that statistics object. > >5). Some columns from pg_statistic_ext have to be made visible for >psql \d to work. Basically, it needs to be able to query for the >existence of extended statistics, but it doesn't need to see the >actual statistical data. Of course, we could change psql to use the >view, but this way gives us better backwards compatibility with older >clients. > >This is still going to break compatibility of any user code looking at >stxndistinct or stxdependencies from pg_statistic_ext, but at least it >doesn't break old versions of psql. > >Note: doc and test updates still to do. > Thanks. I'm travelling today/tomorrow, but I'll do my best to fill in the missing bits ASAP. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 2019-05-16 14:28:03 +0100, Dean Rasheed wrote: > 5). Some columns from pg_statistic_ext have to be made visible for > psql \d to work. Basically, it needs to be able to query for the > existence of extended statistics, but it doesn't need to see the > actual statistical data. Of course, we could change psql to use the > view, but this way gives us better backwards compatibility with older > clients. > > This is still going to break compatibility of any user code looking at > stxndistinct or stxdependencies from pg_statistic_ext, but at least it > doesn't break old versions of psql. Hm, it's not normally a goal to keep old psql working against new postgres versions. And there's plenty other issues preventing a v11 psql to work against 12. I'd not let this guide any design decisions. Greetings, Andres Freund
On Fri, 17 May 2019 at 21:29, Andres Freund <andres@anarazel.de> wrote: > > On 2019-05-16 14:28:03 +0100, Dean Rasheed wrote: > > 5). Some columns from pg_statistic_ext have to be made visible for > > psql \d to work. Basically, it needs to be able to query for the > > existence of extended statistics, but it doesn't need to see the > > actual statistical data. Of course, we could change psql to use the > > view, but this way gives us better backwards compatibility with older > > clients. > > > > This is still going to break compatibility of any user code looking at > > stxndistinct or stxdependencies from pg_statistic_ext, but at least it > > doesn't break old versions of psql. > > Hm, it's not normally a goal to keep old psql working against new > postgres versions. And there's plenty other issues preventing a v11 psql > to work against 12. I'd not let this guide any design decisions. > Ah good point. In fact running "\d some_table" from v11's psql against a v12 database immediately falls over because of the removal of relhasoids from pg_class, so this isn't a valid reason for retaining access to any columns from pg_statistic_ext. Regards, Dean
On Sat, 18 May 2019 at 10:11, Dean Rasheed <dean.a.rasheed@gmail.com> wrote: > > On Fri, 17 May 2019 at 21:29, Andres Freund <andres@anarazel.de> wrote: > > > > On 2019-05-16 14:28:03 +0100, Dean Rasheed wrote: > > > 5). Some columns from pg_statistic_ext have to be made visible for > > > psql \d to work. Basically, it needs to be able to query for the > > > existence of extended statistics, but it doesn't need to see the > > > actual statistical data. Of course, we could change psql to use the > > > view, but this way gives us better backwards compatibility with older > > > clients. > > > > > > This is still going to break compatibility of any user code looking at > > > stxndistinct or stxdependencies from pg_statistic_ext, but at least it > > > doesn't break old versions of psql. > > > > Hm, it's not normally a goal to keep old psql working against new > > postgres versions. And there's plenty other issues preventing a v11 psql > > to work against 12. I'd not let this guide any design decisions. > > > > Ah good point. In fact running "\d some_table" from v11's psql against > a v12 database immediately falls over because of the removal of > relhasoids from pg_class, so this isn't a valid reason for retaining > access to any columns from pg_statistic_ext. > On the other hand, pg_dump relies on pg_statistic_ext to work out which extended statistics objects to dump. If we were to change that to use pg_stats_ext, then a user dumping a table with RLS using the --enable-row-security flag wouldn't get any extended statistics objects, which would be a somewhat surprising result. That could be fixed by changing the view to return rows for every extended statistics object, nulling out values in columns that the user doesn't have permission to see, in a similar way to Tomas' original patch. It would have to be modified to do the RLS check in the same place as the privilege checks, rather than in the top-level WHERE clause, and we'd probably also have to expose OIDs in addition to names, because that's what clients like psql and pg_dump want. To me, that feels quite messy though, so I think I'd still vote for leaving the first few columns of pg_statistic_ext accessible to public, and not have to change the clients to work differently from v12 onwards. Regards, Dean
Dean Rasheed <dean.a.rasheed@gmail.com> writes: > On the other hand, pg_dump relies on pg_statistic_ext to work out > which extended statistics objects to dump. If we were to change that > to use pg_stats_ext, then a user dumping a table with RLS using the > --enable-row-security flag wouldn't get any extended statistics > objects, which would be a somewhat surprising result. It seems like what we need here is to have a separation between the *definition* of a stats object (which is what pg_dump needs access to) and the current actual *data* in it. I'd have expected that keeping those in separate catalogs would be the thing to do, though perhaps it's too late for that. regards, tom lane
On Sat, 18 May 2019 at 16:13, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Dean Rasheed <dean.a.rasheed@gmail.com> writes: > > On the other hand, pg_dump relies on pg_statistic_ext to work out > > which extended statistics objects to dump. If we were to change that > > to use pg_stats_ext, then a user dumping a table with RLS using the > > --enable-row-security flag wouldn't get any extended statistics > > objects, which would be a somewhat surprising result. > > It seems like what we need here is to have a separation between the > *definition* of a stats object (which is what pg_dump needs access > to) and the current actual *data* in it. I'd have expected that > keeping those in separate catalogs would be the thing to do, though > perhaps it's too late for that. > Yeah, with the benefit of hindsight, that would have made sense, but that seems like a pretty big change to be attempting at this stage. Regards, Dean
Hi, On May 18, 2019 8:43:29 AM PDT, Dean Rasheed <dean.a.rasheed@gmail.com> wrote: >On Sat, 18 May 2019 at 16:13, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >> Dean Rasheed <dean.a.rasheed@gmail.com> writes: >> > On the other hand, pg_dump relies on pg_statistic_ext to work out >> > which extended statistics objects to dump. If we were to change >that >> > to use pg_stats_ext, then a user dumping a table with RLS using the >> > --enable-row-security flag wouldn't get any extended statistics >> > objects, which would be a somewhat surprising result. >> >> It seems like what we need here is to have a separation between the >> *definition* of a stats object (which is what pg_dump needs access >> to) and the current actual *data* in it. I'd have expected that >> keeping those in separate catalogs would be the thing to do, though >> perhaps it's too late for that. >> > >Yeah, with the benefit of hindsight, that would have made sense, but >that seems like a pretty big change to be attempting at this stage. Otoh, having a suboptimal catalog representation that we'll likely have to change in one of the next releases also isn'tgreat. Seems likely that we'll need post beta1 catversion bumps anyway? Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Sat, May 18, 2019 at 11:49:06AM -0700, Andres Freund wrote: >Hi, > >On May 18, 2019 8:43:29 AM PDT, Dean Rasheed <dean.a.rasheed@gmail.com> wrote: >>On Sat, 18 May 2019 at 16:13, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> >>> Dean Rasheed <dean.a.rasheed@gmail.com> writes: >>> > On the other hand, pg_dump relies on pg_statistic_ext to work out >>> > which extended statistics objects to dump. If we were to change >>that >>> > to use pg_stats_ext, then a user dumping a table with RLS using the >>> > --enable-row-security flag wouldn't get any extended statistics >>> > objects, which would be a somewhat surprising result. >>> >>> It seems like what we need here is to have a separation between the >>> *definition* of a stats object (which is what pg_dump needs access >>> to) and the current actual *data* in it. I'd have expected that >>> keeping those in separate catalogs would be the thing to do, though >>> perhaps it's too late for that. >>> >> >>Yeah, with the benefit of hindsight, that would have made sense, but >>that seems like a pretty big change to be attempting at this stage. > >Otoh, having a suboptimal catalog representation that we'll likely have >to change in one of the next releases also isn't great. Seems likely >that we'll need post beta1 catversion bumps anyway? > But that's not an issue intruduced by PG12, it works like that even for the extended statistics introduced in PG10. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > On Sat, May 18, 2019 at 11:49:06AM -0700, Andres Freund wrote: >>> On Sat, 18 May 2019 at 16:13, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>> It seems like what we need here is to have a separation between the >>>> *definition* of a stats object (which is what pg_dump needs access >>>> to) and the current actual *data* in it. >> Otoh, having a suboptimal catalog representation that we'll likely have >> to change in one of the next releases also isn't great. Seems likely >> that we'll need post beta1 catversion bumps anyway? > But that's not an issue intruduced by PG12, it works like that even for > the extended statistics introduced in PG10. Yeah, but no time like the present to fix it if it's wrong ... regards, tom lane
On Sat, May 18, 2019 at 03:45:11PM -0400, Tom Lane wrote: >Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: >> On Sat, May 18, 2019 at 11:49:06AM -0700, Andres Freund wrote: >>>> On Sat, 18 May 2019 at 16:13, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>>> It seems like what we need here is to have a separation between the >>>>> *definition* of a stats object (which is what pg_dump needs access >>>>> to) and the current actual *data* in it. > >>> Otoh, having a suboptimal catalog representation that we'll likely have >>> to change in one of the next releases also isn't great. Seems likely >>> that we'll need post beta1 catversion bumps anyway? > >> But that's not an issue intruduced by PG12, it works like that even for >> the extended statistics introduced in PG10. > >Yeah, but no time like the present to fix it if it's wrong ... > Sorry, not sure I understand. Are you saying we should try to rework this before the beta1 release, or that we don't have time to do that? I think we have four options - rework it before beta1, rework it after beta1, rework it in PG13 and leave it as it is now. If the pg_dump thing si the only issue, maybe there's a simple solution that reworking all the catalogs. Not sure. Are there any other reasons why the current catalog representation would be suboptimal, or do we have some precedent of a catalog split this way? I can't think of any. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > On Sat, May 18, 2019 at 03:45:11PM -0400, Tom Lane wrote: >> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: >>> But that's not an issue intruduced by PG12, it works like that even for >>> the extended statistics introduced in PG10. >> Yeah, but no time like the present to fix it if it's wrong ... > Sorry, not sure I understand. Are you saying we should try to rework > this before the beta1 release, or that we don't have time to do that? > I think we have four options - rework it before beta1, rework it after > beta1, rework it in PG13 and leave it as it is now. Yup, that's about what the options are. I'm just voting against "change it in v13". If we're going to change it, then the fewer major versions that have the bogus definition the better --- and since we're changing that catalog in v12 anyway, users will see fewer distinct behaviors if we do this change too. It's very possibly too late to get it done before beta1, unfortunately. But as Andres noted, post-beta1 catversion bumps are hardly unusual, so I do not think "rework after beta1" is unacceptable. regards, tom lane
Greetings, * Tom Lane (tgl@sss.pgh.pa.us) wrote: > Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > > On Sat, May 18, 2019 at 03:45:11PM -0400, Tom Lane wrote: > >> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > >>> But that's not an issue intruduced by PG12, it works like that even for > >>> the extended statistics introduced in PG10. > > >> Yeah, but no time like the present to fix it if it's wrong ... > > > Sorry, not sure I understand. Are you saying we should try to rework > > this before the beta1 release, or that we don't have time to do that? > > > I think we have four options - rework it before beta1, rework it after > > beta1, rework it in PG13 and leave it as it is now. > > Yup, that's about what the options are. I'm just voting against > "change it in v13". If we're going to change it, then the fewer > major versions that have the bogus definition the better --- and > since we're changing that catalog in v12 anyway, users will see > fewer distinct behaviors if we do this change too. > > It's very possibly too late to get it done before beta1, > unfortunately. But as Andres noted, post-beta1 catversion > bumps are hardly unusual, so I do not think "rework after > beta1" is unacceptable. Agreed. Thanks, Stephen
Attachment
On Sun, 19 May 2019 at 00:48, Stephen Frost <sfrost@snowman.net> wrote: > > * Tom Lane (tgl@sss.pgh.pa.us) wrote: > > Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > > > > > I think we have four options - rework it before beta1, rework it after > > > beta1, rework it in PG13 and leave it as it is now. > > > > Yup, that's about what the options are. I'm just voting against > > "change it in v13". If we're going to change it, then the fewer > > major versions that have the bogus definition the better --- and > > since we're changing that catalog in v12 anyway, users will see > > fewer distinct behaviors if we do this change too. > > > > It's very possibly too late to get it done before beta1, > > unfortunately. But as Andres noted, post-beta1 catversion > > bumps are hardly unusual, so I do not think "rework after > > beta1" is unacceptable. > > Agreed. > Yes, that makes sense. I think we shouldn't risk trying to get this into beta1, but let's try to get it done as soon as possible after that. Actually, it doesn't appear to be as big a change as I had feared. As a starter for ten, here's a patch doing the basic split, moving the extended stats data into a new catalog pg_statistic_ext_data (I'm not particularly wedded to that name, it's just the first name that came to mind). With this patch the catalogs look like this: \d pg_statistic_ext Table "pg_catalog.pg_statistic_ext" Column | Type | Collation | Nullable | Default --------------+------------+-----------+----------+--------- oid | oid | | not null | stxrelid | oid | | not null | stxname | name | | not null | stxnamespace | oid | | not null | stxowner | oid | | not null | stxkeys | int2vector | | not null | stxkind | "char"[] | | not null | Indexes: "pg_statistic_ext_name_index" UNIQUE, btree (stxname, stxnamespace) "pg_statistic_ext_oid_index" UNIQUE, btree (oid) "pg_statistic_ext_relid_index" btree (stxrelid) \d pg_statistic_ext_data Table "pg_catalog.pg_statistic_ext_data" Column | Type | Collation | Nullable | Default -----------------+-----------------+-----------+----------+--------- stxoid | oid | | not null | stxndistinct | pg_ndistinct | C | | stxdependencies | pg_dependencies | C | | stxmcv | pg_mcv_list | C | | Indexes: "pg_statistic_ext_data_stxoid_index" UNIQUE, btree (stxoid) I opted to create/remove pg_statistic_ext_data tuples at the same time as the pg_statistic_ext tuples, in CreateStatistics() / RemoveStatisticsById(), so then it's easier to see that they're in a one-to-one relationship, and other code doesn't need to worry about the data tuple not existing. The other option would be to defer inserting the data tuple to ANALYZE. I couldn't resist moving the code block that declares pg_statistic_ext's indexes in indexing.h to the right place, assuming that file is (mostly) sorted alphabetically by catalog name. This puts the extended stats entries just after the normal stats entries which seems preferable. This is only a very rough first draft (e.g., no doc updates), but it passes all the regression tests. Regards, Dean
Attachment
Dean Rasheed <dean.a.rasheed@gmail.com> writes: > I think we shouldn't risk trying to get this into beta1, but let's try > to get it done as soon as possible after that. Agreed. > \d pg_statistic_ext > Table "pg_catalog.pg_statistic_ext" > Column | Type | Collation | Nullable | Default > --------------+------------+-----------+----------+--------- > oid | oid | | not null | > stxrelid | oid | | not null | > stxname | name | | not null | > stxnamespace | oid | | not null | > stxowner | oid | | not null | > stxkeys | int2vector | | not null | > stxkind | "char"[] | | not null | > Indexes: > "pg_statistic_ext_name_index" UNIQUE, btree (stxname, stxnamespace) > "pg_statistic_ext_oid_index" UNIQUE, btree (oid) > "pg_statistic_ext_relid_index" btree (stxrelid) Check. > \d pg_statistic_ext_data > Table "pg_catalog.pg_statistic_ext_data" > Column | Type | Collation | Nullable | Default > -----------------+-----------------+-----------+----------+--------- > stxoid | oid | | not null | > stxndistinct | pg_ndistinct | C | | > stxdependencies | pg_dependencies | C | | > stxmcv | pg_mcv_list | C | | > Indexes: > "pg_statistic_ext_data_stxoid_index" UNIQUE, btree (stxoid) I wonder ... another way we could potentially do this is create table pg_statistic_ext_data( stxoid oid, -- OID of owning pg_statistic_ext entry stxkind char, -- what kind of data stxdata bytea -- the data, in some format or other ); The advantage of this way is that we'd not have to rejigger the catalog's rowtype every time we think of a new kind of extended stats. The disadvantage is that manual inspection of the contents of an entry would become much harder, for lack of any convenient output function. However, this whole exercise is mostly to prevent casual manual inspection anyway :-(, so I wonder how much we care about that. Also, I assume there's going to be a user-accessible view that shows a join of these tables, but only those rows that correspond to columns the current user can read all of. Should we give that view the name pg_statistic_ext for maximum backward compatibility? I'm not sure. pg_dump would probably prefer it if the view is what has a new name, but other clients might like the other way. regards, tom lane
I wrote: > I wonder ... another way we could potentially do this is > create table pg_statistic_ext_data( > stxoid oid, -- OID of owning pg_statistic_ext entry > stxkind char, -- what kind of data > stxdata bytea -- the data, in some format or other > ); > The advantage of this way is that we'd not have to rejigger the > catalog's rowtype every time we think of a new kind of extended > stats. The disadvantage is that manual inspection of the contents > of an entry would become much harder, for lack of any convenient > output function. No, wait, scratch that. We could fold the three existing types pg_ndistinct, pg_dependencies, pg_mcv_list into one new type, say "pg_stats_ext_data", where the actual storage would need to have an ID field (so we'd waste a byte or two duplicating the externally visible stxkind field inside stxdata). The output function for this type is just a switch over the existing code. The big advantage of this way compared to the current approach is that adding a new ext-stats type requires *zero* work with adding new catalog entries. Just add another switch case in pg_stats_ext_data_out() and you're done. regards, tom lane
On Sun, May 19, 2019 at 10:28:43AM -0400, Tom Lane wrote: >I wrote: >> I wonder ... another way we could potentially do this is > >> create table pg_statistic_ext_data( >> stxoid oid, -- OID of owning pg_statistic_ext entry >> stxkind char, -- what kind of data >> stxdata bytea -- the data, in some format or other >> ); > >> The advantage of this way is that we'd not have to rejigger the >> catalog's rowtype every time we think of a new kind of extended >> stats. The disadvantage is that manual inspection of the contents >> of an entry would become much harder, for lack of any convenient >> output function. > >No, wait, scratch that. We could fold the three existing types >pg_ndistinct, pg_dependencies, pg_mcv_list into one new type, say >"pg_stats_ext_data", where the actual storage would need to have an >ID field (so we'd waste a byte or two duplicating the externally >visible stxkind field inside stxdata). The output function for this >type is just a switch over the existing code. The big advantage of >this way compared to the current approach is that adding a new >ext-stats type requires *zero* work with adding new catalog entries. >Just add another switch case in pg_stats_ext_data_out() and you're >done. > The annoying thing is that this undoes the protections provided by special data types generated only in internally. It's not possible to generate e.g. pg_mcv_list values in user code (except for C code, at which points all bets are off anyway). By abandoning this and reverting to bytea anyone could craft a bytea and claim it's a statistic value. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, 19 May 2019 at 15:28, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > I wonder ... another way we could potentially do this is > > > create table pg_statistic_ext_data( > > stxoid oid, -- OID of owning pg_statistic_ext entry > > stxkind char, -- what kind of data > > stxdata bytea -- the data, in some format or other > > ); > > > The advantage of this way is that we'd not have to rejigger the > > catalog's rowtype every time we think of a new kind of extended > > stats. The disadvantage is that manual inspection of the contents > > of an entry would become much harder, for lack of any convenient > > output function. > > No, wait, scratch that. We could fold the three existing types > pg_ndistinct, pg_dependencies, pg_mcv_list into one new type, say > "pg_stats_ext_data", where the actual storage would need to have an > ID field (so we'd waste a byte or two duplicating the externally > visible stxkind field inside stxdata). The output function for this > type is just a switch over the existing code. The big advantage of > this way compared to the current approach is that adding a new > ext-stats type requires *zero* work with adding new catalog entries. > Just add another switch case in pg_stats_ext_data_out() and you're > done. > This feels a little over-engineered to me. Presumably there'd be a compound key on (stxoid, stxkind) and we'd have to scan multiple rows to get all the applicable stats, whereas currently they're all in one row. And then the user-accessible view would probably need separate sub-queries for each stats kind. If the point is just to avoid adding columns to the catalog in future releases, I'm not sure it's worth the added complexity. We know that we will probably add histogram stats in a future release. I'm not sure how many more kinds we'll end up adding, but it doesn't seem likely to be a huge number. I think we'll add far more columns to other catalog tables as we add new features to each release. Regards, Dean
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > On Sun, May 19, 2019 at 10:28:43AM -0400, Tom Lane wrote: >> No, wait, scratch that. We could fold the three existing types >> pg_ndistinct, pg_dependencies, pg_mcv_list into one new type, say >> "pg_stats_ext_data", where the actual storage would need to have an >> ID field (so we'd waste a byte or two duplicating the externally >> visible stxkind field inside stxdata). The output function for this >> type is just a switch over the existing code. The big advantage of >> this way compared to the current approach is that adding a new >> ext-stats type requires *zero* work with adding new catalog entries. >> Just add another switch case in pg_stats_ext_data_out() and you're >> done. > The annoying thing is that this undoes the protections provided by special > data types generated only in internally. It's not possible to generate > e.g. pg_mcv_list values in user code (except for C code, at which points > all bets are off anyway). By abandoning this and reverting to bytea anyone > could craft a bytea and claim it's a statistic value. That would have been true of the original proposal, but not of this modified one. regards, tom lane
On Sun, May 19, 2019 at 02:14:54PM -0400, Tom Lane wrote: >Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: >> On Sun, May 19, 2019 at 10:28:43AM -0400, Tom Lane wrote: >>> No, wait, scratch that. We could fold the three existing types >>> pg_ndistinct, pg_dependencies, pg_mcv_list into one new type, say >>> "pg_stats_ext_data", where the actual storage would need to have an >>> ID field (so we'd waste a byte or two duplicating the externally >>> visible stxkind field inside stxdata). The output function for this >>> type is just a switch over the existing code. The big advantage of >>> this way compared to the current approach is that adding a new >>> ext-stats type requires *zero* work with adding new catalog entries. >>> Just add another switch case in pg_stats_ext_data_out() and you're >>> done. > >> The annoying thing is that this undoes the protections provided by special >> data types generated only in internally. It's not possible to generate >> e.g. pg_mcv_list values in user code (except for C code, at which points >> all bets are off anyway). By abandoning this and reverting to bytea anyone >> could craft a bytea and claim it's a statistic value. > >That would have been true of the original proposal, but not of this >modified one. > Oh, right. It still has the disadvantage that it obfuscates the actual data stored in the pg_stats_ext_data (or whatever would it be called), so e.g. functions would have to do additional checks to make sure it actually is the right statistic type. For example pg_mcv_list_items() could not rely on receiving pg_mcv_list values, as per the signature, but would have to check the value. Of course, I don't expect to have too many such functions, but overall this approach with a single type feels a bit too like EAV to my taste. I think Dean is right we should not expect many more statistic types than what we already have - a histogram, and perhaps one or two more. So I agree with Dean the current design with separate statistic types is not such a big issue. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, 19 May 2019 at 23:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > Oh, right. It still has the disadvantage that it obfuscates the actual > data stored in the pg_stats_ext_data (or whatever would it be called), > so e.g. functions would have to do additional checks to make sure it > actually is the right statistic type. For example pg_mcv_list_items() > could not rely on receiving pg_mcv_list values, as per the signature, > but would have to check the value. > Yes. In fact, since the user-accessible view would want to expose datatypes specific to the stats kinds rather than bytea or cstring values, we would need SQL-callable conversion functions for each kind: * to_pg_ndistinct(pg_extended_stats_ext_data) returns pg_ndistinct * to_pg_dependencies(pg_extended_stats_ext_data) returns pg_dependencies * to_pg_mcv(pg_extended_stats_ext_data) returns pg_mcv * ... and each of these would throw an error if it weren't given an extended stats object of the right kind. Then to extract MCV data, you'd have to do pg_mcv_list_items(to_pg_mcv(ext_data)), and presumably there'd be something similar for histograms. IMO, that's not a nice model, compared to just having columns of the right types in the first place. Also this model presupposes that all future stats kinds are most conveniently represented in a single column, but maybe that won't be the case. It's conceivable that a future stats kind would benefit from splitting its data across multiple columns. > Of course, I don't expect to have too many such functions, but overall > this approach with a single type feels a bit too like EAV to my taste. > Yes, I think it is an EAV model. I think EAV models do have their place, but I think that's largely where adding new columns is a common operation and involves adding little to no extra code. I don't think either of those is true for extended stats. What we've seen over the last couple of years is that adding each new stats kind is a large undertaking, involving lots of new code. That alone is going to limit just how many ever get added, and compared to that effort, adding new columns to the catalog is small fry. Regards, Dean
Dean Rasheed <dean.a.rasheed@gmail.com> writes: > On Sun, 19 May 2019 at 23:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> Oh, right. It still has the disadvantage that it obfuscates the actual >> data stored in the pg_stats_ext_data (or whatever would it be called), >> so e.g. functions would have to do additional checks to make sure it >> actually is the right statistic type. For example pg_mcv_list_items() >> could not rely on receiving pg_mcv_list values, as per the signature, >> but would have to check the value. > Yes. In fact, since the user-accessible view would want to expose > datatypes specific to the stats kinds rather than bytea or cstring > values, we would need SQL-callable conversion functions for each kind: It seems like people are willfully misunderstanding my suggestion. You'd only need *one* conversion function, which would look at the embedded ID field and then emit the appropriate text representation. I don't see a reason why we'd have the separate pg_ndistinct etc. types any more at all. > Also this model presupposes that all future stats kinds are most > conveniently represented in a single column, but maybe that won't be > the case. It's conceivable that a future stats kind would benefit from > splitting its data across multiple columns. Hm, that's possible I suppose, but it seems a little far-fetched. You could equally well argue that pg_ndistinct etc. should have been broken down into smaller types, but we didn't. > Yes, I think it is an EAV model. I think EAV models do have their > place, but I think that's largely where adding new columns is a common > operation and involves adding little to no extra code. I don't think > either of those is true for extended stats. What we've seen over the > last couple of years is that adding each new stats kind is a large > undertaking, involving lots of new code. That alone is going to limit > just how many ever get added, and compared to that effort, adding new > columns to the catalog is small fry. I can't argue with that --- the make-work is just a small part of the total. But it's still make-work. Anyway, it was just a suggestion, and if people don't like it that's fine. But I don't want it to be rejected on the basis of false arguments. regards, tom lane
On Mon, May 20, 2019 at 09:32:24AM -0400, Tom Lane wrote: >Dean Rasheed <dean.a.rasheed@gmail.com> writes: >> On Sun, 19 May 2019 at 23:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >>> Oh, right. It still has the disadvantage that it obfuscates the actual >>> data stored in the pg_stats_ext_data (or whatever would it be called), >>> so e.g. functions would have to do additional checks to make sure it >>> actually is the right statistic type. For example pg_mcv_list_items() >>> could not rely on receiving pg_mcv_list values, as per the signature, >>> but would have to check the value. > >> Yes. In fact, since the user-accessible view would want to expose >> datatypes specific to the stats kinds rather than bytea or cstring >> values, we would need SQL-callable conversion functions for each kind: > >It seems like people are willfully misunderstanding my suggestion. >You'd only need *one* conversion function, which would look at the >embedded ID field and then emit the appropriate text representation. >I don't see a reason why we'd have the separate pg_ndistinct etc. types >any more at all. > That would however require having input functions, which we currently don't have. Otherwise people could not process the statistic values using functions like pg_mcv_list_items(). Which I think is useful. Of course, we could add input functions, but there was a reasoning for not having them (similarly to pg_node_tree). >> Also this model presupposes that all future stats kinds are most >> conveniently represented in a single column, but maybe that won't be >> the case. It's conceivable that a future stats kind would benefit from >> splitting its data across multiple columns. > >Hm, that's possible I suppose, but it seems a little far-fetched. >You could equally well argue that pg_ndistinct etc. should have been >broken down into smaller types, but we didn't. > True. I can't rule out adding such "split" statistic type, but don't think it's very likely. The extended statistic values tend to be complex and easier to represent in a single value. >> Yes, I think it is an EAV model. I think EAV models do have their >> place, but I think that's largely where adding new columns is a common >> operation and involves adding little to no extra code. I don't think >> either of those is true for extended stats. What we've seen over the >> last couple of years is that adding each new stats kind is a large >> undertaking, involving lots of new code. That alone is going to limit >> just how many ever get added, and compared to that effort, adding new >> columns to the catalog is small fry. > >I can't argue with that --- the make-work is just a small part of the >total. But it's still make-work. > >Anyway, it was just a suggestion, and if people don't like it that's >fine. But I don't want it to be rejected on the basis of false >arguments. > Sure. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, 20 May 2019 at 14:32, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Dean Rasheed <dean.a.rasheed@gmail.com> writes: > > On Sun, 19 May 2019 at 23:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > >> Oh, right. It still has the disadvantage that it obfuscates the actual > >> data stored in the pg_stats_ext_data (or whatever would it be called), > >> so e.g. functions would have to do additional checks to make sure it > >> actually is the right statistic type. For example pg_mcv_list_items() > >> could not rely on receiving pg_mcv_list values, as per the signature, > >> but would have to check the value. > > > Yes. In fact, since the user-accessible view would want to expose > > datatypes specific to the stats kinds rather than bytea or cstring > > values, we would need SQL-callable conversion functions for each kind: > > It seems like people are willfully misunderstanding my suggestion. I'm more than capable of inadvertently misunderstanding, without the need to willfully do so :-) > You'd only need *one* conversion function, which would look at the > embedded ID field and then emit the appropriate text representation. > I don't see a reason why we'd have the separate pg_ndistinct etc. types > any more at all. Hmm, OK. So then would you also make the user-accessible view agnostic about the kinds of stats supported in the same way, returning zero or more rows per STATISTICS object, depending on how many kinds of stats have been built? That would have the advantage of never needing to change the view definition again, as more stats kinds are supported. We'd need to change pg_mcv_list_items() to accept a pg_stats_ext_data value rather than a pg_mcv value, and it would be the user's responsibility to call it if they wanted to see the contents of the MCV list (I was originally thinking that we'd include a call to pg_mcv_list_items() in the view definition, so that it produced friendlier looking output, since the default textual representation of an MCV list is completely opaque, unlike the other stats kinds). Actually, I can see another advantage to not including pg_mcv_list_items() in the view definition -- in the future, we may dream up a better version of pg_mcv_list_items(), like say one that produced JSON, and then we'd regret using the current function. > Anyway, it was just a suggestion, and if people don't like it that's > fine. But I don't want it to be rejected on the basis of false > arguments. To be clear, I'm not intentionally rejecting your idea. I'm merely trying to fully understand the implications. At this stage, perhaps it would be helpful to prototype something for comparison. Regards, Dean
On Mon, May 20, 2019 at 04:09:24PM +0100, Dean Rasheed wrote: >On Mon, 20 May 2019 at 14:32, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >> Dean Rasheed <dean.a.rasheed@gmail.com> writes: >> > On Sun, 19 May 2019 at 23:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> >> Oh, right. It still has the disadvantage that it obfuscates the actual >> >> data stored in the pg_stats_ext_data (or whatever would it be called), >> >> so e.g. functions would have to do additional checks to make sure it >> >> actually is the right statistic type. For example pg_mcv_list_items() >> >> could not rely on receiving pg_mcv_list values, as per the signature, >> >> but would have to check the value. >> >> > Yes. In fact, since the user-accessible view would want to expose >> > datatypes specific to the stats kinds rather than bytea or cstring >> > values, we would need SQL-callable conversion functions for each kind: >> >> It seems like people are willfully misunderstanding my suggestion. > >I'm more than capable of inadvertently misunderstanding, without the >need to willfully do so :-) > >> You'd only need *one* conversion function, which would look at the >> embedded ID field and then emit the appropriate text representation. >> I don't see a reason why we'd have the separate pg_ndistinct etc. types >> any more at all. > >Hmm, OK. So then would you also make the user-accessible view agnostic >about the kinds of stats supported in the same way, returning zero or >more rows per STATISTICS object, depending on how many kinds of stats >have been built? That would have the advantage of never needing to >change the view definition again, as more stats kinds are supported. > If I got Tom's proposal right, there'd be only one statistic value in each pg_stats_ext_data value. It'd be a very thin wrapper, essentially just the value itself + type flag. So for example if you did CREATE STATISTICS s (ndistinct, mcv, dependencies) ... you'd get three rows in pg_statistic_ext_data (assuming all the stats get actually built). >We'd need to change pg_mcv_list_items() to accept a pg_stats_ext_data >value rather than a pg_mcv value, and it would be the user's >responsibility to call it if they wanted to see the contents of the >MCV list (I was originally thinking that we'd include a call to >pg_mcv_list_items() in the view definition, so that it produced >friendlier looking output, since the default textual representation of >an MCV list is completely opaque, unlike the other stats kinds). >Actually, I can see another advantage to not including >pg_mcv_list_items() in the view definition -- in the future, we may >dream up a better version of pg_mcv_list_items(), like say one that >produced JSON, and then we'd regret using the current function. > Yeah. As I said, it obfuscates the "actual" type of the stats value, so we can no longer rely on the function machinery to verify the type. All functions dealing with the "wrapper" type would have to verify it actually contains the right statistic type. >> Anyway, it was just a suggestion, and if people don't like it that's >> fine. But I don't want it to be rejected on the basis of false >> arguments. > >To be clear, I'm not intentionally rejecting your idea. I'm merely >trying to fully understand the implications. > >At this stage, perhaps it would be helpful to prototype something for >comparison. > I'll look into that. I'll try to whip something up before pgcon, but I can't guarantee that :-( regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, Attached are three patches tweaking the stats - two were already posted in this thread, the third one is just updating docs. 1) 0001 - split pg_statistic_ext into definition + data This is pretty much the patch Dean posted some time ago, rebased to current master (fixing just minor pgindent bitrot). 2) 0002 - update sgml docs to reflect changes from 0001 3) 0003 - define pg_stats_ext view, similar to pg_stats The question is whether we want to also redesign pg_statistic_ext_data per Tom's proposal (more about that later), but I think we can treat that as an additional step on top of 0001. So I propose we get those changes committed, and then perhaps also switch the data table to the EAV model. Barring objections, I'll do that early next week, after cleaning up those patches a bit more. One thing I think we should fix is naming of the attributes in the 0001 patch. At the moment both catalogs use "stx" prefix - e.g. "stxkind" is in pg_statistic_ext, and "stxmcv" is in pg_statistic_ext_data. We should probably switch to "stxd" in the _data catalog. Opinions? Now, back to the proposal to split the _data catalog rows to EAV form, with a new data type replacing the multiple types we have at the moment. I've started hacking on it today, but the more I work on it the less useful it seems to me. My understanding is that with that approach we'd replace the _data catalog (which currently has one column per statistic type, with a separate data type) with 1:M generic rows, with a generic data type. That is, we'd replace this CREATE TABLE pg_statistic_ext_data ( stxoid OID, stxdependencies pg_dependencies, stxndistinct pg_ndistinct, stxmcv pg_mcv_list, ... histograms ... ); with something like this: CREATE TABLE pg_statistiex_ext_data ( stxoid OID, stxkind CHAR, stxdata pg_statistic_ext_type ); where pg_statistic_ext would store all existing statistic types. along with a "flag" saying which value it actually stored (essentially a copy of the stxkind column, which we however need to lookup a statistic of a certain type, without having to detoast the statistic itself). As I mentioned before, I kinda dislike the fact that this obfuscates the actual statistic type by hiding it behing the "wrapper" type. The other thing is that we have to deal with 1:M relationship every time we (re)build the statistics, or when we need to access them. Now, it may not be a huge amount of code, but it just seems unnecessary. It would make sense if we planned to add large number of additional statistic types, but that seems unlikely - I personally can think of maybe one new statistic type, but that's about it. I'll continue working on it and I'll share the results early next week, after playing with it a bit, but I think we should get the existing patches committed and then continue discussing this as an additional improvement. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Thu, 6 Jun 2019 at 21:33, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > Hi, > > Attached are three patches tweaking the stats - two were already posted > in this thread, the third one is just updating docs. > > 1) 0001 - split pg_statistic_ext into definition + data > > This is pretty much the patch Dean posted some time ago, rebased to > current master (fixing just minor pgindent bitrot). > > 2) 0002 - update sgml docs to reflect changes from 0001 > > 3) 0003 - define pg_stats_ext view, similar to pg_stats > Seems reasonable on a quick read-through, except I spotted a bug in the view (my fault) -- the statistics_owner column should come from s.stxowner rather than c.relowner. > The question is whether we want to also redesign pg_statistic_ext_data > per Tom's proposal (more about that later), but I think we can treat > that as an additional step on top of 0001. So I propose we get those > changes committed, and then perhaps also switch the data table to the > EAV model. > > Barring objections, I'll do that early next week, after cleaning up > those patches a bit more. > > One thing I think we should fix is naming of the attributes in the 0001 > patch. At the moment both catalogs use "stx" prefix - e.g. "stxkind" is > in pg_statistic_ext, and "stxmcv" is in pg_statistic_ext_data. We should > probably switch to "stxd" in the _data catalog. Opinions? > Yes, that makes sense. Especially when joining the 2 tables, since it makes it more obvious which table a given column is coming from in a join clause. > Now, back to the proposal to split the _data catalog rows to EAV form, > with a new data type replacing the multiple types we have at the moment. > I've started hacking on it today, but the more I work on it the less > useful it seems to me. > > My understanding is that with that approach we'd replace the _data > catalog (which currently has one column per statistic type, with a > separate data type) with 1:M generic rows, with a generic data type. > That is, we'd replace this > > CREATE TABLE pg_statistic_ext_data ( > stxoid OID, > stxdependencies pg_dependencies, > stxndistinct pg_ndistinct, > stxmcv pg_mcv_list, > ... histograms ... > ); > > with something like this: > > CREATE TABLE pg_statistiex_ext_data ( > stxoid OID, > stxkind CHAR, > stxdata pg_statistic_ext_type > ); > > where pg_statistic_ext would store all existing statistic types. along > with a "flag" saying which value it actually stored (essentially a copy > of the stxkind column, which we however need to lookup a statistic of a > certain type, without having to detoast the statistic itself). > > As I mentioned before, I kinda dislike the fact that this obfuscates the > actual statistic type by hiding it behing the "wrapper" type. > > The other thing is that we have to deal with 1:M relationship every time > we (re)build the statistics, or when we need to access them. Now, it may > not be a huge amount of code, but it just seems unnecessary. It would > make sense if we planned to add large number of additional statistic > types, but that seems unlikely - I personally can think of maybe one new > statistic type, but that's about it. > > I'll continue working on it and I'll share the results early next week, > after playing with it a bit, but I think we should get the existing > patches committed and then continue discussing this as an additional > improvement. > I wonder ... would it be completely crazy to just use a JSON column to store the extended stats data? It wouldn't be as compact as your representation, but it would allow for future stats kinds without changing the catalog definitions, and it wouldn't obfuscate the stats types. You could keep the 1:1 relationship, and have top-level JSON keys for each stats kind built, and you wouldn't need the pg_mcv_list_items() function because you could just put the MCV data in JSON arrays, which would be much more transparent, and would make the user-accessible view much simpler. One could also imagine writing regression tests that checked for specific expected MCV values like "stxdata->'mcv'->'frequency'->0". Just a thought. Regards, Dean
On Fri, Jun 7, 2019 at 4:33 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > 2) 0002 - update sgml docs to reflect changes from 0001 There is some copypasta here in the new section referring to the old catalog: + <sect1 id="catalog-pg-statistic-ext-data"> + <title><structname>pg_statistic_ext_data</structname></title> + + <indexterm zone="catalog-pg-statistic-ext"> + <primary>pg_statistic_ext</primary> + </indexterm> + + <para> + The catalog <structname>pg_statistic_ext</structname> + holds extended planner statistics. + Each row in this catalog corresponds to a <firstterm>statistics object</firstterm> + created with <xref linkend="sql-createstatistics"/>. + </para> And a minor stylistic nit -- it might be good to capitalize "JOIN" and "ON" in the queries in the docs and tests. > One thing I think we should fix is naming of the attributes in the 0001 > patch. At the moment both catalogs use "stx" prefix - e.g. "stxkind" is > in pg_statistic_ext, and "stxmcv" is in pg_statistic_ext_data. We should > probably switch to "stxd" in the _data catalog. Opinions? That's probably a good idea. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jun 10, 2019 at 02:32:04PM +0100, Dean Rasheed wrote: >On Thu, 6 Jun 2019 at 21:33, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> >> Hi, >> >> Attached are three patches tweaking the stats - two were already posted >> in this thread, the third one is just updating docs. >> >> 1) 0001 - split pg_statistic_ext into definition + data >> >> This is pretty much the patch Dean posted some time ago, rebased to >> current master (fixing just minor pgindent bitrot). >> >> 2) 0002 - update sgml docs to reflect changes from 0001 >> >> 3) 0003 - define pg_stats_ext view, similar to pg_stats >> > >Seems reasonable on a quick read-through, except I spotted a bug in >the view (my fault) -- the statistics_owner column should come from >s.stxowner rather than c.relowner. > > >> The question is whether we want to also redesign pg_statistic_ext_data >> per Tom's proposal (more about that later), but I think we can treat >> that as an additional step on top of 0001. So I propose we get those >> changes committed, and then perhaps also switch the data table to the >> EAV model. >> >> Barring objections, I'll do that early next week, after cleaning up >> those patches a bit more. >> >> One thing I think we should fix is naming of the attributes in the 0001 >> patch. At the moment both catalogs use "stx" prefix - e.g. "stxkind" is >> in pg_statistic_ext, and "stxmcv" is in pg_statistic_ext_data. We should >> probably switch to "stxd" in the _data catalog. Opinions? >> > >Yes, that makes sense. Especially when joining the 2 tables, since it >makes it more obvious which table a given column is coming from in a >join clause. > OK, attached are patches fixing the issues reported by you and John Naylor, and squashing the parts into just two patches (catalog split and pg_stats_ext). Barring objections, I'll push those tomorrow. I've renamed columns in the _data catalog from 'stx' to 'stxd', which I think is appropriate given the "data" in catalog name. I'm wondering if we should change the examples in SGML docs (say, in planstats.sgml) to use the new pg_stats_ext view, instead of querying the catalogs directly. I've tried doing that, but I found the results less readable than what we currently have (especially for the MCV list, where it'd require matching elements in multiple arrays). So I've left this unchanged for now. > >> Now, back to the proposal to split the _data catalog rows to EAV form, >> with a new data type replacing the multiple types we have at the moment. >> I've started hacking on it today, but the more I work on it the less >> useful it seems to me. >> >> My understanding is that with that approach we'd replace the _data >> catalog (which currently has one column per statistic type, with a >> separate data type) with 1:M generic rows, with a generic data type. >> That is, we'd replace this >> >> CREATE TABLE pg_statistic_ext_data ( >> stxoid OID, >> stxdependencies pg_dependencies, >> stxndistinct pg_ndistinct, >> stxmcv pg_mcv_list, >> ... histograms ... >> ); >> >> with something like this: >> >> CREATE TABLE pg_statistiex_ext_data ( >> stxoid OID, >> stxkind CHAR, >> stxdata pg_statistic_ext_type >> ); >> >> where pg_statistic_ext would store all existing statistic types. along >> with a "flag" saying which value it actually stored (essentially a copy >> of the stxkind column, which we however need to lookup a statistic of a >> certain type, without having to detoast the statistic itself). >> >> As I mentioned before, I kinda dislike the fact that this obfuscates the >> actual statistic type by hiding it behing the "wrapper" type. >> >> The other thing is that we have to deal with 1:M relationship every time >> we (re)build the statistics, or when we need to access them. Now, it may >> not be a huge amount of code, but it just seems unnecessary. It would >> make sense if we planned to add large number of additional statistic >> types, but that seems unlikely - I personally can think of maybe one new >> statistic type, but that's about it. >> >> I'll continue working on it and I'll share the results early next week, >> after playing with it a bit, but I think we should get the existing >> patches committed and then continue discussing this as an additional >> improvement. >> > >I wonder ... would it be completely crazy to just use a JSON column to >store the extended stats data? > >It wouldn't be as compact as your representation, but it would allow >for future stats kinds without changing the catalog definitions, and >it wouldn't obfuscate the stats types. You could keep the 1:1 >relationship, and have top-level JSON keys for each stats kind built, >and you wouldn't need the pg_mcv_list_items() function because you >could just put the MCV data in JSON arrays, which would be much more >transparent, and would make the user-accessible view much simpler. One >could also imagine writing regression tests that checked for specific >expected MCV values like "stxdata->'mcv'->'frequency'->0". > You mean storing it as JSONB, I presume? I've actually considered that at some point, but eventually concluded it's not a good match. I mean, JSON(B) is pretty versatile and can be whacked to store pretty much anything, but it has various limitations - e.g. it does not support arbitrary data types, so we'd have to store a lot of stuff as text (through input/output functions). That doesn't seem very nice, I guess. If we want JSONB output, that should not be difficult to generate. But I guess your point was about generic storage format. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Thu, Jun 13, 2019 at 07:37:45PM +0200, Tomas Vondra wrote: > ... > >OK, attached are patches fixing the issues reported by you and John >Naylor, and squashing the parts into just two patches (catalog split and >pg_stats_ext). Barring objections, I'll push those tomorrow. > >I've renamed columns in the _data catalog from 'stx' to 'stxd', which I >think is appropriate given the "data" in catalog name. > >I'm wondering if we should change the examples in SGML docs (say, in >planstats.sgml) to use the new pg_stats_ext view, instead of querying the >catalogs directly. I've tried doing that, but I found the results less >readable than what we currently have (especially for the MCV list, where >it'd require matching elements in multiple arrays). So I've left this >unchanged for now. > I've pushed those changes, after adding docs for the pg_stats_ext view. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, 13 May 2019 at 23:36, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Fri, May 10, 2019 at 10:19:44AM +0100, Dean Rasheed wrote: > >While working on 1aebfbea83c, I noticed that the new multivariate MCV > >stats feature suffers from the same problem, and also the original > >problems that were fixed in e2d4ef8de8 and earlier --- namely that a > >user can see values in the MCV lists that they shouldn't see (values > >from tables that they don't have privileges on). > > > >I think there are 2 separate issues here: > > > >1). The table pg_statistic_ext is accessible to anyone, so any user > >can see the MCV lists of any table. I think we should give this the > >same treatment as pg_statistic, and hide it behind a security barrier > >view, revoking public access from the table. > > > >2). The multivariate MCV stats planner code can be made to invoke > >user-defined operators, so a user can create a leaky operator and use > >it to reveal data values from the MCV lists even if they have no > >permissions on the table. > > > >Attached is a draft patch to fix (2), which hooks into > >statext_is_compatible_clause(). > > > > I think that patch is good. > I realised that we forgot to push this second part, so I've just done so. Regards, Dean
On Sun, Jun 23, 2019 at 06:56:53PM +0100, Dean Rasheed wrote: >On Mon, 13 May 2019 at 23:36, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> >> On Fri, May 10, 2019 at 10:19:44AM +0100, Dean Rasheed wrote: >> >While working on 1aebfbea83c, I noticed that the new multivariate MCV >> >stats feature suffers from the same problem, and also the original >> >problems that were fixed in e2d4ef8de8 and earlier --- namely that a >> >user can see values in the MCV lists that they shouldn't see (values >> >from tables that they don't have privileges on). >> > >> >I think there are 2 separate issues here: >> > >> >1). The table pg_statistic_ext is accessible to anyone, so any user >> >can see the MCV lists of any table. I think we should give this the >> >same treatment as pg_statistic, and hide it behind a security barrier >> >view, revoking public access from the table. >> > >> >2). The multivariate MCV stats planner code can be made to invoke >> >user-defined operators, so a user can create a leaky operator and use >> >it to reveal data values from the MCV lists even if they have no >> >permissions on the table. >> > >> >Attached is a draft patch to fix (2), which hooks into >> >statext_is_compatible_clause(). >> > >> >> I think that patch is good. >> > >I realised that we forgot to push this second part, so I've just done so. > Whoops! Too many patches in this thread. Thanks for noticing. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services