Re: API change advice: Passing plan invalidation info from the rewriter into the planner? - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: API change advice: Passing plan invalidation info from the rewriter into the planner?
Date
Msg-id 20140618181832.GE16098@tamriel.snowman.net
Whole thread Raw
In response to Re: API change advice: Passing plan invalidation info from the rewriter into the planner?  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: API change advice: Passing plan invalidation info from the rewriter into the planner?  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
* Robert Haas (robertmhaas@gmail.com) wrote:
> On Tue, Jun 17, 2014 at 10:25 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > I agree that we want to support that, if we can do so reasonably.  What
> > I was trying to get at is simply this- don't we provide that already
> > with the leakproof attribute and functions?  If we don't have enough
> > there to allow index scans then we should be looking to add more, I'm
> > thinking.
>
> So the reason why we got onto this particular topic was because of the
> issue of multiple security policies for a single table.  Of course,
> multiple security policies can always be merged into a single
> more-complex policy, but the resulting policy may be so complex that
> the query-planner is no longer capable of doing a good job optimizing
> it.

Yeah, I could see that happening with some use-cases.

> >> ALTER TABLE tab ADD POLICY polname WHERE quals;
> >> GRANT SELECT (polname) ON TABLE tab TO role;
> >
> > Right, if we were to support multiple policies on a given table then we
> > would have to support adding and removing them individually, as well as
> > specify when they are to be applied- and what if that "when" overlaps?
> > Do we apply both and only a row which passed them all gets sent to the
> > user?  Essentially we'd be defining the RLS policies to be AND'd
> > together, right?  Would we want to support both AND-based and OR-based,
> > and allow users to pick what set of conditionals they want applied to
> > their various overlapping RLS policies?
>
> AND is not a sensible policy; it would need to be OR.  If you grant
> someone access to two different subsets of the rows in a table, it
> stands to reason that they will expect to have access to all of the
> rows that are in at least one of those subsets.

I think I can buy off on this.  What that also means is that any
'short-circuiting' that we try to do here would be based on "stop once
we get back a 'true'".  This could seriously change how we actually
implement RLS though as doing it all through query rewrites and making
this work with multiple security policies which are OR'd together and
yet keeping the optimization and qual push-down and index-based plans is
looking pretty daunting.

I'm also of the opinion that this isn't strictly necessary for the
initial RLS offering in PG- there's a clear way we could migrate
existing users to a multi-policy system from a single-policy system.
Sure, to get the performance and optimization benefits that we'd
presumably have in the multi-policy case they'd need to re-work their
RLS configuration, but for users who care, they'll likely be very happy
to do so to gain those benefits.

Perhaps the question here is- if we implement RLS one way for the single
case and then change the implementation all around for the multi case,
will we end up breaking the single case?  Or destroying the performance
for it?  I can't see either of those cases being allowed- if and when we
support multi, we must still support single and the whole point of multi
would be to allow more performant implementations and that solution will
require the single case to be at least as performant as what we're
proposing to do today, I believe.

Or are you thinking that we would never support calling user-defined
functions in any RLS scheme because we want to be able to do that
optimization?  I don't see that being acceptable from a feature
standpoint.

> Alternatively, we could:
>
> - Require the user to specify in some way which of the available
> policies they want applied, and then apply only that one.

I'd want to at least see a way to apply an ordering to the policies
being applied, or have PG work out which one is "cheapest" and try that
one first.

> - Decide that such scenarios constitute misconfiguration. Throw an
> error and make the table owner or other relevant local authority fix
> it.

Having them all be OR'd together feels simpler and easier to work with
than trying to provide the user with all the knobs necessary to select
which subset of users they want the policy applied to when (user X from
IP range a.b.c.d/24 at time 1500).  We could probably make it work with
exclusion constraints, range types, etc, and perhaps it'd be a reason to
bring btree_gist into core (which I'm all for) and make it work with
catalog tables, but... just 'yuck' all around, for my part.

> > Sounds all rather painful and much better done programatically by the
> > user in a language which is suited to that task- eg: pl/pgsql, perl, C,
> > or something besides our ALTER syntax + catalog representation.
>
> I think exactly the opposite, for the query planning reasons
> previously stated.  I think the policies will quickly get so
> complicated that they're no longer optimizable.  Here's a simple
> example:
>
> - Policy 1 allows the user to access rows for which complexfunc() returns true.
> - Policy 2 allows the user to access rows for which a = 1.
>
> Most users have access only through policy 2, but some have access
> through policy 1.  Users who have access through policy 1 will always
> get a sequential scan,

This is the thing which I most object to- if the quals being provided at
any level are leakproof and would be able to reduce the returned set
sufficiently that an index scan is the best bet, we should be doing
that.  I don't anticipate the RLS quals to be as selective as the
quals which the user is adding.

I agree that in cases where the user isn't using a leakproof function in
their quals and the policy is complex, a sequential scan would have to
be done over the table, but looking at the set of leakproof vs not
leakproof functions used by operators which return boolean, certainly
the most common of the index using cases are covered and we may be able
to add more leakproof functions, should we get user complaints that the
function they're using works fine with an index but isn't leakproof.

> but users who have access through policy 2 have
> an excellent chance of getting an index scan if the selectivity of a =
> 1 is high.  When you merge those two things into a single policy, no
> matter how you do it, everyone gets sequential scans all the time.
> That sucks.

It just strikes me as unlikely that in such a simple policy the
selectivity of the RLS qual used will be high and this feels like a lot
of mechanism and complication to be adding for that use-case.  If the
selectivity is actually high in terms of what the RLS qual will allow,
then it seems likely, to me at least, that it's going to need to depend
on another table or function, eg:

exists(select 1 from security_table      where (current_user(),a) = (sec_user,sec_label))

Still thinking about this approach in general.  Having a good answer to
the question about granularity and how this multiple RLS-policy would
actually work would certainly help.  Being able to pick a single policy
(rather than deal with overlapping policies that all have to be tested)
would definitely make this simpler, but I suppose we could build up "X
OR Y OR Z..." inside the query..
Thanks,
    Stephen

pgsql-hackers by date:

Previous
From: Josh Berkus
Date:
Subject: Re: Set new system identifier using pg_resetxlog
Next
From: Stephen Frost
Date:
Subject: Re: How about a proper TEMPORARY TABLESPACE?