From: Robert Haas
Subject: Re: hyrax vs. RelationBuildPartitionDesc
Msg-id: CA+Tgmoa5rT+ZR+Vv+q1XLwQtDMCqkNL6B4PjR4V6YAC9K_LBxw@mail.gmail.com
In response to: Re: hyrax vs. RelationBuildPartitionDesc (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On Fri, Mar 15, 2019 at 3:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> More to the point, we turned *one* rebuild = false situation into
> a bunch of rebuild = true situations.  I haven't studied it closely,
> but I think a CCA animal would probably see O(N^2) rebuild = true
> invocations in a query with N partitions, since each time
> expand_partitioned_rtentry opens another child table, we'll incur
> an AcceptInvalidationMessages call which leads to forced rebuilds
> of the previously-pinned rels.  In non-CCA situations, almost always
> nothing happens with the previously-examined relcache entries.

That's rather unfortunate.  I'm starting to think that clobbering the
cache "always" is too hard a line.

> I agree that copying data isn't great.  What I don't agree is that this
> solution is better.  In particular, copying data out of the relcache
> rather than expecting the relcache to hold still over long periods
> is the way we've done things everywhere else, cf RelationGetIndexList,
> RelationGetStatExtList, RelationGetIndexExpressions,
> RelationGetIndexPredicate, RelationGetIndexAttrBitmap,
> RelationGetExclusionInfo, GetRelationPublicationActions.  I don't care
> for a patch randomly deciding to do things differently on the basis of an
> unsupported-by-evidence argument that it might cost too much to copy the
> data.  If we're going to make a push to reduce the amount of copying of
> that sort that we do, it should be a separately (and carefully) designed
> thing that applies to all the relcache substructures that have the issue,
> not one-off hacks that haven't been reviewed thoroughly.

That's not an unreasonable argument.  On the other hand, if you never
try new stuff, you lose opportunities to explore things that might
turn out to be better and worth adopting more widely.

I am not very convinced that it makes sense to lump something like
RelationGetIndexAttrBitmap in with something like
RelationGetPartitionDesc.  RelationGetIndexAttrBitmap is returning a
single Bitmapset, whereas the data structure RelationGetPartitionDesc
is vastly larger and more complex.  You can't say "well, if it's best
to copy 32 bytes of data out of the relcache every time we need it, it
must also be right to copy 10k or 100k of data out of the relcache
every time we need it."

There is another difference as well: there's a good chance that
somebody is going to want to mutate a Bitmapset, whereas they had
BETTER NOT think that they can mutate the PartitionDesc.  So returning
an uncopied Bitmapset is kinda risky in a way that returning an
uncopied PartitionDesc is not.
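
To make the size argument concrete, here is roughly what the existing
copy-out convention looks like for a small structure.  This is only a
sketch, not the actual RelationGetIndexAttrBitmap code, and
compute_key_attrs() is a made-up stand-in for the recomputation step:

#include "postgres.h"
#include "nodes/bitmapset.h"
#include "utils/memutils.h"

static Bitmapset *cached_key_attrs = NULL;

extern Bitmapset *compute_key_attrs(void);  /* hypothetical */

Bitmapset *
rel_key_attrs(void)
{
    if (cached_key_attrs == NULL)
    {
        /* Build the cached value in a context that outlives the query. */
        MemoryContext oldcxt = MemoryContextSwitchTo(CacheMemoryContext);

        cached_key_attrs = compute_key_attrs();
        MemoryContextSwitchTo(oldcxt);
    }

    /*
     * Hand back a copy.  The caller can bms_add_member() or bms_free()
     * the result without corrupting the cache, and an invalidation that
     * rebuilds the cached value can't yank it out from under the caller.
     * For a ~32-byte Bitmapset that copy is cheap; for a PartitionDesc
     * of tens or hundreds of kilobytes it is not.
     */
    return bms_copy(cached_key_attrs);
}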

If we want an at-least-somewhat unified solution here, I think we need
to bite the bullet and make the planner hold a reference to the
relcache entry throughout planning.  (The executor already keeps it
open, I believe.)  Otherwise, how is the relcache supposed to know when it can
throw stuff away and when it can't?  The only alternative seems to be
to have each subsystem hold its own reference count, as I did with the
PartitionDirectory stuff, which is not something we'd want to scale
out.
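
To sketch what I mean -- plan_rel() is hypothetical, but
relation_open()/relation_close() are the ordinary way to take and
drop such a reference:

#include "postgres.h"
#include "access/relation.h"
#include "storage/lockdefs.h"
#include "utils/rel.h"

void
plan_rel(Oid relid)
{
    /* Opening the relation bumps its relcache reference count. */
    Relation rel = relation_open(relid, AccessShareLock);

    /*
     * While the count is nonzero, the relcache knows this entry's
     * substructures (such as its PartitionDesc) may still be in use
     * and must not be freed, even if an invalidation forces a rebuild.
     */
    /* ... run the planner against rel ... */

    /* Dropping the reference tells the relcache it can clean up. */
    relation_close(rel, AccessShareLock);
}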

> I especially don't care for the much-less-than-half-baked kluge of
> chaining the old rd_pdcxt onto the new one and hoping that it will go away
> at a suitable time.

It will go away at a suitable time, but maybe not at the soonest
suitable time.  It wouldn't be hard to improve that, though.  The
easiest thing to do, I think, would be to have an rd_oldpdcxt and
stuff the old context there.  If there already is one there, parent
the new one under it.  When RelationDecrementReferenceCount reduces
the reference count to zero, destroy anything found in rd_oldpdcxt.
With that change, old data would get destroyed at the earliest time at
which we know it is no longer referenced, rather than at the earliest
time at which it is both unreferenced and a further invalidation
happens to arrive.
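
In code, the scheme would look something like this -- rd_oldpdcxt is
the proposed field, so none of this exists today:

/* At rebuild time, stash the outgoing context instead of chaining it
 * onto the new one: */
if (relation->rd_oldpdcxt == NULL)
    relation->rd_oldpdcxt = relation->rd_pdcxt;
else
    MemoryContextSetParent(relation->rd_pdcxt,
                           relation->rd_oldpdcxt);
relation->rd_pdcxt = new_pdcxt;

/* In RelationDecrementReferenceCount, once the count hits zero,
 * nothing can still point into the old descriptors, so reclaim them
 * (deleting a context also deletes everything parented under it): */
if (relation->rd_refcnt == 0 && relation->rd_oldpdcxt != NULL)
{
    MemoryContextDelete(relation->rd_oldpdcxt);
    relation->rd_oldpdcxt = NULL;
}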

> regression=# create table parent (a text, b int) partition by list (a);
> CREATE TABLE
> regression=# create table child (a text, b int);
> CREATE TABLE
> regression=# do $$
> regression$# begin
> regression$# for i in 1..10000000 loop
> regression$# alter table parent attach partition child for values in ('x');
> regression$# alter table parent detach partition child;
> regression$# end loop;
> regression$# end $$;

Interesting example.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

