Thread: Switch to multi-inserts for pg_depend

Switch to multi-inserts for pg_depend

From
Michael Paquier
Date:
Hi all,

This is a continuation of the work that has been previously discussed
here, resulting mainly in e3931d0 for pg_attribute and pg_shdepend:
https://www.postgresql.org/message-id/20190213182737.mxn6hkdxwrzgxk35@alap3.anarazel.de

I have been looking at the amount of work that could be done
independently for pg_depend, and attached are two patches:
- 0001 switches recordMultipleDependencies() to use multi-inserts.
Contrary to pg_attribute and pg_shdepend, the number of items to
insert is known in advance, but some of them can be skipped if known
as a pinned dependency.  The data insertion is capped at 64kB, and the
number of slots is basically calculation from the maximum cap and the
number of items to insert.
- 0002 switches a bunch of code paths to make use of multi-inserts
instead of individual calls to recordDependencyOn(), grouping the
insertions of dependencies of the same time.  This relies on the
existing set of APIs to manipulate a set of object addresses, without
any new addition there (no reset-like routine either as I noticed that
it would have been useful in only one place).  The set of changes is
honestly a bit bulky here.

I am adding this thread to the next commit fest.  Thoughts are
welcome.

Thanks,
--
Michael

Attachment

Re: Switch to multi-inserts for pg_depend

From
Michael Paquier
Date:
On Fri, Aug 07, 2020 at 03:16:19PM +0900, Michael Paquier wrote:
> I am adding this thread to the next commit fest.  Thoughts are
> welcome.

Forgot to mention that this is based on some initial work from Daniel,
and that we have discussed the issue before I worked on it.
--
Michael

Attachment

Re: Switch to multi-inserts for pg_depend

From
Andres Freund
Date:
Hi,

On 2020-08-07 15:16:19 +0900, Michael Paquier wrote:
> From cd117fa88938c89ac953a5e3c036828337150b07 Mon Sep 17 00:00:00 2001
> From: Michael Paquier <michael@paquier.xyz>
> Date: Fri, 7 Aug 2020 10:57:40 +0900
> Subject: [PATCH 1/2] Use multi-inserts for pg_depend
> 
> This is a follow-up of the work done in e3931d01.  This case is a bit
> different than pg_attribute and pg_shdepend: the maximum number of items
> to insert is known in advance, but there is no need to handle pinned
> dependencies.  Hence, the base allocation for slots is done based on the
> number of items and the maximum allowed with a cap at 64kB, and items
> are initialized once used to minimize the overhead of the operation.
> 
> Author: Daniel Gustafsson, Michael Paquier
> Discussion: https://postgr.es/m/XXX
> ---
>  src/backend/catalog/pg_depend.c | 95 ++++++++++++++++++++++++---------
>  1 file changed, 69 insertions(+), 26 deletions(-)
> 
> diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
> index 70baf03178..596f0c5e29 100644
> --- a/src/backend/catalog/pg_depend.c
> +++ b/src/backend/catalog/pg_depend.c
> @@ -47,6 +47,12 @@ recordDependencyOn(const ObjectAddress *depender,
>      recordMultipleDependencies(depender, referenced, 1, behavior);
>  }
>  
> +/*
> + * Cap the maximum amount of bytes allocated for recordMultipleDependencies()
> + * slots.
> + */
> +#define MAX_PGDEPEND_INSERT_BYTES    65535
> +

Do we really want to end up with several separate defines for different
type of catalog batch inserts? That doesn't seem like a good
thing. Think there should be a single define for all catalog bulk
inserts.


>  /*
>   * Record multiple dependencies (of the same kind) for a single dependent
>   * object.  This has a little less overhead than recording each separately.
> @@ -59,10 +65,10 @@ recordMultipleDependencies(const ObjectAddress *depender,
>  {
>      Relation    dependDesc;
>      CatalogIndexState indstate;
> -    HeapTuple    tup;
> -    int            i;
> -    bool        nulls[Natts_pg_depend];
> -    Datum        values[Natts_pg_depend];
> +    int            slotCount, i;
> +    TupleTableSlot **slot;
> +    int            nslots, max_slots;
> +    bool        slot_init = true;
>  
>      if (nreferenced <= 0)
>          return;                    /* nothing to do */
> @@ -76,11 +82,18 @@ recordMultipleDependencies(const ObjectAddress *depender,
>  
>      dependDesc = table_open(DependRelationId, RowExclusiveLock);
>  
> +    /*
> +     * Allocate the slots to use, but delay initialization until we know that
> +     * they will be used.
> +     */
> +    max_slots = Min(nreferenced,
> +                    MAX_PGDEPEND_INSERT_BYTES / sizeof(FormData_pg_depend));
> +    slot = palloc(sizeof(TupleTableSlot *) * max_slots);
> +
>      /* Don't open indexes unless we need to make an update */
>      indstate = NULL;
>  
> -    memset(nulls, false, sizeof(nulls));
> -
> +    slotCount = 0;
>      for (i = 0; i < nreferenced; i++, referenced++)
>      {
>          /*
> @@ -88,38 +101,68 @@ recordMultipleDependencies(const ObjectAddress *depender,
>           * need to record dependencies on it.  This saves lots of space in
>           * pg_depend, so it's worth the time taken to check.
>           */
> -        if (!isObjectPinned(referenced, dependDesc))
> +        if (isObjectPinned(referenced, dependDesc))
> +            continue;
> +

Hm, would it be better to first iterate over the dependencies, compute
the number of dependencies to be inserted, and then go ahead and create
the right number of slots?


> From fcc0a11e9fc94d2fedc71dd10ba2a23713225963 Mon Sep 17 00:00:00 2001
> From: Michael Paquier <michael@paquier.xyz>
> Date: Fri, 7 Aug 2020 15:14:51 +0900
> Subject: [PATCH 2/2] Switch to multi-insert dependencies for many code paths
> 
> This makes use of the new APIs to insert dependencies in groups, instead
> of doing the operation one-by-one.

Seems several places have been modified to new APIs despite only
covering a single dependency. Perhaps worth mentioning?

Greetings,

Andres Freund



Re: Switch to multi-inserts for pg_depend

From
Michael Paquier
Date:
On Mon, Aug 10, 2020 at 05:32:21PM -0700, Andres Freund wrote:
> Do we really want to end up with several separate defines for different
> type of catalog batch inserts? That doesn't seem like a good
> thing. Think there should be a single define for all catalog bulk
> inserts.

Unlikely so, but I kept them separate to potentially lower the
threshold of 64kB for catalog rows that have a lower average size than
pg_attribute.  catalog.h would be the natural location I would choose
for a single definition.

> Hm, would it be better to first iterate over the dependencies, compute
> the number of dependencies to be inserted, and then go ahead and create
> the right number of slots?

Not sure about that, but I am not wedded to the approach of the patch
either as the most consuming portion is the slot initialization/reset.
Computing the number of items in advance forces to go through the
dependency list twice, while doing a single pass makes the code
allocate 64 extra bytes for each slot not used.  It is of course
better to avoid calling isObjectPinned() twice for each dependency, so
we could use a bitmap, or just simply build a secondary list of
dependencies that we are sure will be inserted after doing a first
pass to discard the unwanted entries.

> Seems several places have been modified to new APIs despite only
> covering a single dependency. Perhaps worth mentioning?

Yeah, I need to think more about this commit message.
--
Michael

Attachment

Re: Switch to multi-inserts for pg_depend

From
Robert Haas
Date:
On Tue, Aug 11, 2020 at 1:59 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Mon, Aug 10, 2020 at 05:32:21PM -0700, Andres Freund wrote:
> > Do we really want to end up with several separate defines for different
> > type of catalog batch inserts? That doesn't seem like a good
> > thing. Think there should be a single define for all catalog bulk
> > inserts.
>
> Unlikely so, but I kept them separate to potentially lower the
> threshold of 64kB for catalog rows that have a lower average size than
> pg_attribute.

Uh, why would we want to do that?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Switch to multi-inserts for pg_depend

From
Alvaro Herrera
Date:
On 2020-Aug-11, Robert Haas wrote:

> On Tue, Aug 11, 2020 at 1:59 AM Michael Paquier <michael@paquier.xyz> wrote:
> > On Mon, Aug 10, 2020 at 05:32:21PM -0700, Andres Freund wrote:
> > > Do we really want to end up with several separate defines for different
> > > type of catalog batch inserts? That doesn't seem like a good
> > > thing. Think there should be a single define for all catalog bulk
> > > inserts.
> >
> > Unlikely so, but I kept them separate to potentially lower the
> > threshold of 64kB for catalog rows that have a lower average size than
> > pg_attribute.
> 
> Uh, why would we want to do that?

Yeah.  As I understand, the only reason to have this number is to avoid
an arbitrarily large number of entries created as a single multi-insert
WAL record ... but does that really ever happen?  I guess if you create
a table with some really complicated schema you might get, say, a
hundred pg_depend rows at once.  But to fill eight complete pages of
pg_depend entries sounds astoundingly ridiculous already -- I'd say it's
just an easy way to spell "infinity" for this.  Tweaking one infinity
value to become some other infinity value sounds useless.

So I agree with what Andres said.  Let's have just one such define and
be done with it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Switch to multi-inserts for pg_depend

From
Michael Paquier
Date:
On Wed, Aug 12, 2020 at 07:52:42PM -0400, Alvaro Herrera wrote:
> Yeah.  As I understand, the only reason to have this number is to avoid
> an arbitrarily large number of entries created as a single multi-insert
> WAL record ... but does that really ever happen?  I guess if you create
> a table with some really complicated schema you might get, say, a
> hundred pg_depend rows at once.  But to fill eight complete pages of
> pg_depend entries sounds astoundingly ridiculous already -- I'd say it's
> just an easy way to spell "infinity" for this.  Tweaking one infinity
> value to become some other infinity value sounds useless.
>
> So I agree with what Andres said.  Let's have just one such define and
> be done with it.

Okay.  Would src/include/catalog/catalog.h be a suited location for
this flag or somebody has a better idea?
--
Michael

Attachment

Re: Switch to multi-inserts for pg_depend

From
Alvaro Herrera
Date:
On 2020-Aug-13, Michael Paquier wrote:

> Okay.  Would src/include/catalog/catalog.h be a suited location for
> this flag or somebody has a better idea?

Next to the API definition I guess, is that dependency.h?



-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Switch to multi-inserts for pg_depend

From
Michael Paquier
Date:
On Thu, Aug 13, 2020 at 05:35:14AM -0400, Alvaro Herrera wrote:
> Next to the API definition I guess, is that dependency.h?

We need something more central, see also MAX_PGATTRIBUTE_INSERT_BYTES
and MAX_PGSHDEPEND_INSERT_BYTES.  And the definition should be named
something like MAX_CATALOG_INSERT_BYTES or such I guess.
--
Michael

Attachment

Re: Switch to multi-inserts for pg_depend

From
Alvaro Herrera
Date:
On 2020-Aug-13, Michael Paquier wrote:

> On Thu, Aug 13, 2020 at 05:35:14AM -0400, Alvaro Herrera wrote:
> > Next to the API definition I guess, is that dependency.h?
> 
> We need something more central, see also MAX_PGATTRIBUTE_INSERT_BYTES
> and MAX_PGSHDEPEND_INSERT_BYTES.  And the definition should be named
> something like MAX_CATALOG_INSERT_BYTES or such I guess.

MAX_CATALOG_INSERT_BYTES sounds decent to me.  I mentioned dependency.h
because I was uncaffeinatedly thinking that this was used with API
defined there -- but in reality it's used with indexing.h functions, and
it seems to me that that file would be the place for it.

Looking at the existing contents of catalog.h, I would say it does not
fit in there.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Switch to multi-inserts for pg_depend

From
Michael Paquier
Date:
On Thu, Aug 13, 2020 at 11:45:52AM -0400, Alvaro Herrera wrote:
> MAX_CATALOG_INSERT_BYTES sounds decent to me.  I mentioned dependency.h
> because I was uncaffeinatedly thinking that this was used with API
> defined there -- but in reality it's used with indexing.h functions, and
> it seems to me that that file would be the place for it.

OK, let's live with indexing.h then.

Regarding the maximum number of slots allocated.  Do people like the
current approach taken by the patch to do a single loop of the
dependency entries at the cost of more allocating perhaps too much for
the array holding the set of TupleTableSlots (the actual slot
initialization happens only if necessary)?  Or would it be preferred
to scan twice the set of dependencies, discarding pinned dependencies
in a first scan to build the list of dependencies that would be
inserted?  This way, you can know the exact amount memory to allocated
for TupleTableSlots, though that's just 64B for each one of them.

I have read today through the patch set of Julien and Thomas to add a
version string to pg_depend, and I get the impression that the
current approach taken by the patch fits better in the whole picture.
--
Michael

Attachment

Re: Switch to multi-inserts for pg_depend

From
Alvaro Herrera
Date:
On 2020-Aug-14, Michael Paquier wrote:

> Regarding the maximum number of slots allocated.  Do people like the
> current approach taken by the patch to do a single loop of the
> dependency entries at the cost of more allocating perhaps too much for
> the array holding the set of TupleTableSlots (the actual slot
> initialization happens only if necessary)?  Or would it be preferred
> to scan twice the set of dependencies, discarding pinned dependencies
> in a first scan to build the list of dependencies that would be
> inserted?  This way, you can know the exact amount memory to allocated
> for TupleTableSlots, though that's just 64B for each one of them.

It seems a bit silly to worry about allocating just the exact amount
needed; the current approach looked fine to me.  The logic to keep track
number of used slots used is baroque, though -- that could use a lot of
simplification.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Switch to multi-inserts for pg_depend

From
Michael Paquier
Date:
On Fri, Aug 14, 2020 at 02:23:16PM -0400, Alvaro Herrera wrote:
> It seems a bit silly to worry about allocating just the exact amount
> needed; the current approach looked fine to me.

Okay, thanks.

> The logic to keep track
> number of used slots used is baroque, though -- that could use a lot of
> simplification.

What are you suggesting here?  A new API layer to manage a set of
slots?
--
Michael

Attachment

Re: Switch to multi-inserts for pg_depend

From
Michael Paquier
Date:
On Sat, Aug 15, 2020 at 10:50:37AM +0900, Michael Paquier wrote:
> What are you suggesting here?  A new API layer to manage a set of
> slots?

It has been a couple of weeks, and I am not really sure what is the
suggestion here.  So I would like to move on with this patch set as
the changes are straight-forward using the existing routines for
object addresses by grouping all insert dependencies of the same type.
Are there any objections?

Attached is a rebased set, where I have added in indexing.h a unique
definition for the hard limit of 64kB for the amount of data that can
be inserted at once, based on the suggestion from Alvaro and Andres.
--
Michael

Attachment

Re: Switch to multi-inserts for pg_depend

From
Daniel Gustafsson
Date:
> On 14 Aug 2020, at 20:23, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> The logic to keep track number of used slots used is baroque, though -- that
> could use a lot of simplification.

What if slot_init was an integer which increments together with the loop
variable until max_slots is reached?  If so, maybe it should be renamed
slot_init_count and slotCount renamed slot_stored_count to make the their use
clearer?

> On 31 Aug 2020, at 09:56, Michael Paquier <michael@paquier.xyz> wrote:

> It has been a couple of weeks, and I am not really sure what is the
> suggestion here.  So I would like to move on with this patch set as
> the changes are straight-forward using the existing routines for
> object addresses by grouping all insert dependencies of the same type.
> Are there any objections?

I'm obviously biased but I think this patchset is a net win.  There are more
things we can do in this space, but it's a good start.

> Attached is a rebased set, where I have added in indexing.h a unique
> definition for the hard limit of 64kB for the amount of data that can
> be inserted at once, based on the suggestion from Alvaro and Andres.

+#define MAX_CATALOG_INSERT_BYTES 65535
This name, and inclusion in a headerfile, implies that the definition is
somewhat generic as opposed to its actual use.  Using MULTIINSERT rather than
INSERT in the name would clarify I reckon.

A few other comments:

+    /*
+     * Allocate the slots to use, but delay initialization until we know that
+     * they will be used.
+     */
I think this comment warrants a longer explanation on why they wont all be
used, or perhaps none of them (which is the real optimization win here).

+    ObjectAddresses *addrs_auto;
+    ObjectAddresses *addrs_normal;
It's not for this patch, but it seems a logical next step would be to be able
to record the DependencyType as well when collecting deps rather than having to
create multiple buckets.

+    /* Normal dependency from a domain to its collation. */
+    /* We know the default collation is pinned, so don't bother recording it */
It's just moved and not introduced in this patch, but shouldn't these two lines
be joined into a normal block comment?

cheers ./daniel


Re: Switch to multi-inserts for pg_depend

From
Michael Paquier
Date:
On Tue, Sep 01, 2020 at 11:53:36AM +0200, Daniel Gustafsson wrote:
> On 14 Aug 2020, at 20:23, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
>> The logic to keep track number of used slots used is baroque, though -- that
>> could use a lot of simplification.
>
> What if slot_init was an integer which increments together with the loop
> variable until max_slots is reached?  If so, maybe it should be renamed
> slot_init_count and slotCount renamed slot_stored_count to make the their use
> clearer?

Good idea, removing slot_init looks like a good thing for readability.
And the same can be done for pg_shdepend.

> On 31 Aug 2020, at 09:56, Michael Paquier <michael@paquier.xyz> wrote:
> +#define MAX_CATALOG_INSERT_BYTES 65535
> This name, and inclusion in a headerfile, implies that the definition is
> somewhat generic as opposed to its actual use.  Using MULTIINSERT rather than
> INSERT in the name would clarify I reckon.

Makes sense, I have switched to MAX_CATALOG_MULTI_INSERT_BYTES.

> A few other comments:
>
> +    /*
> +     * Allocate the slots to use, but delay initialization until we know that
> +     * they will be used.
> +     */
> I think this comment warrants a longer explanation on why they wont all be
> used, or perhaps none of them (which is the real optimization win here).

Okay, I have updated the comments where this formulation is used.
Does that look adapted to you?

> +    ObjectAddresses *addrs_auto;
> +    ObjectAddresses *addrs_normal;
> It's not for this patch, but it seems a logical next step would be to be able
> to record the DependencyType as well when collecting deps rather than having to
> create multiple buckets.

Yeah, agreed.  I am not sure yet how to design those APIs.  One option
is to use a set of an array with DependencyType elements, each one
storing a list of dependencies of the same type.

> +    /* Normal dependency from a domain to its collation. */
> +    /* We know the default collation is pinned, so don't bother recording it */
> It's just moved and not introduced in this patch, but shouldn't these two lines
> be joined into a normal block comment?

Okay, done.
--
Michael

Attachment

Re: Switch to multi-inserts for pg_depend

From
Daniel Gustafsson
Date:
> On 3 Sep 2020, at 07:35, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Sep 01, 2020 at 11:53:36AM +0200, Daniel Gustafsson wrote:
>> On 14 Aug 2020, at 20:23, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>
>>> The logic to keep track number of used slots used is baroque, though -- that
>>> could use a lot of simplification.
>>
>> What if slot_init was an integer which increments together with the loop
>> variable until max_slots is reached?  If so, maybe it should be renamed
>> slot_init_count and slotCount renamed slot_stored_count to make the their use
>> clearer?
>
> Good idea, removing slot_init looks like a good thing for readability.
> And the same can be done for pg_shdepend.

I think this version is a clear improvement.  Nothing more sticks out from a
read-through.

cheers ./daniel


Re: Switch to multi-inserts for pg_depend

From
Michael Paquier
Date:
On Thu, Sep 03, 2020 at 09:47:07AM +0200, Daniel Gustafsson wrote:
> I think this version is a clear improvement.  Nothing more sticks out from a
> read-through.

Thanks for taking the time to look at it, Daniel.  We of course could
still try to figure out how we could group all dependencies without
worrying about their type, but I'd like to leave that as future work
for now.  This is much more complex than what's proposed on this
thread, and I am not sure if we really need to make this stuff more
complex for this purpose.
--
Michael

Attachment

Re: Switch to multi-inserts for pg_depend

From
Daniel Gustafsson
Date:
> On 3 Sep 2020, at 12:19, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Sep 03, 2020 at 09:47:07AM +0200, Daniel Gustafsson wrote:
>> I think this version is a clear improvement.  Nothing more sticks out from a
>> read-through.
>
> Thanks for taking the time to look at it, Daniel.  We of course could
> still try to figure out how we could group all dependencies without
> worrying about their type, but I'd like to leave that as future work
> for now.  This is much more complex than what's proposed on this
> thread, and I am not sure if we really need to make this stuff more
> complex for this purpose.

Agreed,  I think that's a separate piece of work and discussion.

cheers ./daniel


Re: Switch to multi-inserts for pg_depend

From
Alvaro Herrera
Date:
I agree, this version looks much better, thanks.  Two very minor things:

On 2020-Sep-03, Michael Paquier wrote:

> @@ -76,11 +77,23 @@ recordMultipleDependencies(const ObjectAddress *depender,
>  
>      dependDesc = table_open(DependRelationId, RowExclusiveLock);
>  
> +    /*
> +     * Allocate the slots to use, but delay initialization until we know that
> +     * they will be used.  The slot initialization is the costly part, and the
> +     * exact number of dependencies inserted cannot be known in advance as it
> +     * depends on what is pinned by the system.
> +     */

I'm not sure you need the second sentence in this comment; keeping the
"delay initialization until ..." part seems sufficient.  If you really
want to highlight that initialization is costly, maybe just say "delay
costly initialization".

> +        /*
> +         * Record the Dependency.  Note we don't bother to check for duplicate
> +         * dependencies; there's no harm in them.
> +         */

No need to uppercase "dependency".  (I know this is carried forward from
prior comment, but it was equally unnecessary there.)

>      /*
>       * Allocate the slots to use, but delay initialization until we know that
> -     * they will be used.
> +     * they will be used.  A full scan of pg_shdepend is done to find all the
> +     * dependencies from the template database to copy.  Their number is not
> +     * known in advance and the slot initialization is the costly part.
>       */

As above, this change is not needed.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Switch to multi-inserts for pg_depend

From
Michael Paquier
Date:
On Thu, Sep 03, 2020 at 10:50:49AM -0400, Alvaro Herrera wrote:
> I'm not sure you need the second sentence in this comment; keeping the
> "delay initialization until ..." part seems sufficient.  If you really
> want to highlight that initialization is costly, maybe just say "delay
> costly initialization".

Thanks for the review.

This extra comment was to answer to Daniel's suggestion upthread, and
the simple wording you are suggesting is much better than what I did,
so I have just added "costly initialization" in those two places.

>> +        /*
>> +         * Record the Dependency.  Note we don't bother to check for duplicate
>> +         * dependencies; there's no harm in them.
>> +         */
>
> No need to uppercase "dependency".  (I know this is carried forward from
> prior comment, but it was equally unnecessary there.)

Thanks, fixed.
--
Michael

Attachment

Re: Switch to multi-inserts for pg_depend

From
Michael Paquier
Date:
On Fri, Sep 04, 2020 at 10:15:57AM +0900, Michael Paquier wrote:
> Thanks, fixed.

With the two comment fixes included, I have looked at both patches
again today, and applied them.
--
Michael

Attachment