Thread: brininsert optimization opportunity

brininsert optimization opportunity

From
Soumyadeep Chakraborty
Date:
Hello hackers,

My colleague, Ashwin, pointed out to me that brininsert's per-tuple init
of the revmap access struct can have non-trivial overhead.

Turns out he is right. We are saving 24 bytes of memory per-call for
the access struct, and a bit on buffer/locking overhead, with the
attached patch.

The implementation registers the revmap cleanup as a MemoryContext
callback on the IndexInfo struct's MemoryContext, as the index AM
provides no teardown function that runs at end-of-insert-command.
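
For reference, the shape of the idea, using the MemoryContextCallback
machinery (a rough sketch; the exact names and wiring are in the
attached patch):

/* Sketch only -- the attached patch is authoritative. */
typedef struct BrinInsertState
{
    BrinRevmap *bs_rmAccess;            /* revmap access struct, cached
                                         * across brininsert() calls */
    MemoryContextCallback bs_callback;  /* fires when ii_Context is
                                         * reset or deleted */
} BrinInsertState;

static void
brininsertCleanupCallback(void *arg)
{
    BrinInsertState *bistate = (BrinInsertState *) arg;

    /* release the metapage/revmap buffers pinned since the first insert */
    brinRevmapTerminate(bistate->bs_rmAccess);
}

/* on the first brininsert() call of the command: */
bistate = MemoryContextAllocZero(indexInfo->ii_Context, sizeof(*bistate));
/* ... brinRevmapInitialize() fills in bistate->bs_rmAccess ... */
bistate->bs_callback.func = brininsertCleanupCallback;
bistate->bs_callback.arg = bistate;
MemoryContextRegisterResetCallback(indexInfo->ii_Context,
                                   &bistate->bs_callback);
indexInfo->ii_AmCache = bistate;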

Test setup (local Ubuntu workstation):

# Drop caches and restart between each run:
sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches;"
pg_ctl -D /usr/local/pgsql/data/ -l /tmp/logfile restart

\timing
DROP TABLE heap;
CREATE TABLE heap(i int);
CREATE INDEX ON heap USING brin(i) WITH (pages_per_range=1);
INSERT INTO heap SELECT 1 FROM generate_series(1, 200000000);

Results:
We see an improvement for 100M tuples and an even bigger improvement for
200M tuples.

Master (29cf61ade3f245aa40f427a1d6345287ef77e622):

test=# INSERT INTO heap SELECT 1 FROM generate_series(1, 100000000);
INSERT 0 100000000
Time: 222762.159 ms (03:42.762)

-- 3 runs
test=# INSERT INTO heap SELECT 1 FROM generate_series(1, 200000000);
INSERT 0 200000000
Time: 471168.181 ms (07:51.168)
Time: 457071.883 ms (07:37.072)
Time: 486969.205 ms (08:06.969)

Branch:

test2=# INSERT INTO heap SELECT 1 FROM generate_series(1, 100000000);
INSERT 0 100000000
Time: 200046.519 ms (03:20.047)

-- 3 runs
test2=# INSERT INTO heap SELECT 1 FROM generate_series(1, 200000000);
INSERT 0 200000000
Time: 369041.832 ms (06:09.042)
Time: 365483.382 ms (06:05.483)
Time: 375506.144 ms (06:15.506)

# Profiled backend running INSERT of 100000000 rows
sudo perf record -p 11951 --call-graph fp sleep 180

Please see attached perf diff between master and branch. We see that we
save on a bit of overhead from brinRevmapInitialize(),
brinRevmapTerminate() and lock routines.

Regards,
Soumyadeep (VMware)

Attachment

Re: brininsert optimization opportunity

From
Alvaro Herrera
Date:
On 2023-Jul-03, Soumyadeep Chakraborty wrote:

> My colleague, Ashwin, pointed out to me that brininsert's per-tuple init
> of the revmap access struct can have non-trivial overhead.
> 
> Turns out he is right. We are saving 24 bytes of memory per-call for
> the access struct, and a bit on buffer/locking overhead, with the
> attached patch.

Hmm, yeah, I remember being a bit bothered by this repeated
initialization.  Your patch looks reasonable to me.  I would set
bistate->bs_rmAccess to NULL in the cleanup callback, just to be sure.
Also, please add comments atop these two new functions, to explain what
they are.

Nice results.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/



Re: brininsert optimization opportunity

From
Tomas Vondra
Date:

On 7/4/23 13:23, Alvaro Herrera wrote:
> On 2023-Jul-03, Soumyadeep Chakraborty wrote:
> 
>> My colleague, Ashwin, pointed out to me that brininsert's per-tuple init
>> of the revmap access struct can have non-trivial overhead.
>>
>> Turns out he is right. We are saving 24 bytes of memory per-call for
>> the access struct, and a bit on buffer/locking overhead, with the
>> attached patch.
> 
> Hmm, yeah, I remember being a bit bothered by this repeated
> initialization.  Your patch looks reasonable to me.  I would set
> bistate->bs_rmAccess to NULL in the cleanup callback, just to be sure.
> Also, please add comments atop these two new functions, to explain what
> they are.
> 
> Nice results.
> 

Yeah. I wonder how much of that runtime is the generate_series(),
though. What's the speedup if that part is subtracted? It's guaranteed
to be even more significant, but by how much?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: brininsert optimization opportunity

From
Soumyadeep Chakraborty
Date:
Thank you both for reviewing!

On Tue, Jul 4, 2023 at 4:24 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> Hmm, yeah, I remember being a bit bothered by this repeated
> initialization. Your patch looks reasonable to me. I would set
> bistate->bs_rmAccess to NULL in the cleanup callback, just to be sure.
> Also, please add comments atop these two new functions, to explain what
> they are.

Done. Set bistate->bs_desc = NULL; as well. Added comments.


On Tue, Jul 4, 2023 at 4:59 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

> Yeah. I wonder how much of that runtime is the generate_series(),
> though. What's the speedup if that part is subtracted? It's guaranteed
> to be even more significant, but by how much?

When trying COPY, I got tripped by the following:

We get a buffer leak WARNING for the meta page and a revmap page.

WARNING:  buffer refcount leak: [094] (rel=base/156912/206068,
blockNum=1, flags=0x83000000, refcount=1 1)
WARNING:  buffer refcount leak: [093] (rel=base/156912/206068,
blockNum=0, flags=0x83000000, refcount=1 1)

PrintBufferLeakWarning bufmgr.c:3240
ResourceOwnerReleaseInternal resowner.c:554
ResourceOwnerRelease resowner.c:494
PortalDrop portalmem.c:563
exec_simple_query postgres.c:1284

We release the buffer during this resowner release and then we crash
with:

TRAP: failed Assert("bufnum <= NBuffers"), File:
"../../../../src/include/storage/bufmgr.h", Line: 305, PID: 86833
postgres: pivotal test4 [local] COPY(ExceptionalCondition+0xbb)[0x5572b55bcc79]
postgres: pivotal test4 [local] COPY(+0x61ccfc)[0x5572b537dcfc]
postgres: pivotal test4 [local] COPY(ReleaseBuffer+0x19)[0x5572b5384db2]
postgres: pivotal test4 [local] COPY(brinRevmapTerminate+0x1e)[0x5572b4e3fd39]
postgres: pivotal test4 [local] COPY(+0xcfc44)[0x5572b4e30c44]
postgres: pivotal test4 [local] COPY(+0x89e7f2)[0x5572b55ff7f2]
postgres: pivotal test4 [local] COPY(MemoryContextDelete+0xd7)[0x5572b55ff683]
postgres: pivotal test4 [local] COPY(PortalDrop+0x374)[0x5572b5602dc7]

Unfortunately, when we do COPY, the MemoryContext where makeIndexInfo
gets called is PortalContext and that is what is set in ii_Context.
Furthermore, we clean up the resource owner stuff before we can clean
up the MemoryContexts in PortalDrop().

The CurrentMemoryContext when initialize_brin_insertstate() is called
depends on the code path. For CopyMultiInsertBufferFlush() ->
ExecInsertIndexTuples() it is PortalContext, and for CopyFrom() ->
ExecInsertIndexTuples() it is ExecutorState/ExprContext. We can't rely
on it to register the callback either.

What we can do is create a new MemoryContext for holding the
BrinInsertState, and we tie the callback to that so that cleanup is not
affected by all of these variables. See v2 patch attached. Passes make
installcheck-world and make installcheck -C src/test/modules/brin.
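
Roughly, the v2 arrangement is (sketch only; the context name and its
parent here are assumptions, the attached patch is authoritative):

/* Dedicated context owning the BrinInsertState, so the reset callback
 * fires when this context is destroyed, regardless of what
 * CurrentMemoryContext happened to be when brininsert() first ran. */
MemoryContext brincxt = AllocSetContextCreate(indexInfo->ii_Context,
                                              "brininsert state cxt",
                                              ALLOCSET_DEFAULT_SIZES);
BrinInsertState *bistate = MemoryContextAllocZero(brincxt,
                                                  sizeof(*bistate));

bistate->bs_callback.func = brininsertCleanupCallback;
bistate->bs_callback.arg = bistate;
MemoryContextRegisterResetCallback(brincxt, &bistate->bs_callback);
indexInfo->ii_AmCache = bistate;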

However, we do still have 1 issue with the v2 patch:
When we try to cancel (Ctrl-c) a running COPY command:
ERROR:  buffer 151 is not owned by resource owner TopTransaction

#4  0x0000559cbc54a934 in ResourceOwnerForgetBuffer
(owner=0x559cbd6fcf28, buffer=143) at resowner.c:997
#5  0x0000559cbc2c45e7 in UnpinBuffer (buf=0x7f8d4a8f3f80) at bufmgr.c:2390
#6  0x0000559cbc2c7e49 in ReleaseBuffer (buffer=143) at bufmgr.c:4488
#7  0x0000559cbbd82d53 in brinRevmapTerminate (revmap=0x559cbd7a03b8)
at brin_revmap.c:105
#8  0x0000559cbbd73c44 in brininsertCleanupCallback
(arg=0x559cbd7a5b68) at brin.c:168
#9  0x0000559cbc54280c in MemoryContextCallResetCallbacks
(context=0x559cbd7a5a50) at mcxt.c:506
#10 0x0000559cbc54269d in MemoryContextDelete (context=0x559cbd7a5a50)
at mcxt.c:421
#11 0x0000559cbc54273e in MemoryContextDeleteChildren
(context=0x559cbd69ae90) at mcxt.c:457
#12 0x0000559cbc54625c in AtAbort_Portals () at portalmem.c:850

Haven't found a way to fix this ^ yet.

Maybe there is a better way of doing our cleanup? I'm not sure. Would
love your input!

The other alternative for all this is to introduce new AM callbacks for
insert_begin and insert_end. That might be a tougher sell?

Now, to finally answer your question about the speedup without
generate_series(). We do see an even higher speedup!

seq 1 200000000 > /tmp/data.csv
\timing
DROP TABLE heap;
CREATE TABLE heap(i int);
CREATE INDEX ON heap USING brin(i) WITH (pages_per_range=1);
COPY heap FROM '/tmp/data.csv';

-- 3 runs (master 29cf61ade3f245aa40f427a1d6345287ef77e622)
COPY 200000000
Time: 205072.444 ms (03:25.072)
Time: 215380.369 ms (03:35.380)
Time: 203492.347 ms (03:23.492)

-- 3 runs (branch v2)

COPY 200000000
Time: 135052.752 ms (02:15.053)
Time: 135093.131 ms (02:15.093)
Time: 138737.048 ms (02:18.737)

Regards,
Soumyadeep (VMware)

Attachment

Re: brininsert optimization opportunity

From
Tomas Vondra
Date:
On 7/4/23 21:25, Soumyadeep Chakraborty wrote:
> Thank you both for reviewing!
> 
> On Tue, Jul 4, 2023 at 4:24AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> 
>> Hmm, yeah, I remember being a bit bothered by this repeated
>> initialization. Your patch looks reasonable to me. I would set
>> bistate->bs_rmAccess to NULL in the cleanup callback, just to be sure.
>> Also, please add comments atop these two new functions, to explain what
>> they are.
> 
> Done. Set bistate->bs_desc = NULL; as well. Added comments.
> 
> 
> On Tue, Jul 4, 2023 at 4:59AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> 
>> Yeah. I wonder how much of that runtime is the generate_series(),
>> though. What's the speedup if that part is subtracted? It's guaranteed
>> to be even more significant, but by how much?
> 
> When trying COPY, I got tripped by the following:
> 
> We get a buffer leak WARNING for the meta page and a revmap page.
> 
> WARNING:  buffer refcount leak: [094] (rel=base/156912/206068,
> blockNum=1, flags=0x83000000, refcount=1 1)
> WARNING:  buffer refcount leak: [093] (rel=base/156912/206068,
> blockNum=0, flags=0x83000000, refcount=1 1)
> 
> PrintBufferLeakWarning bufmgr.c:3240
> ResourceOwnerReleaseInternal resowner.c:554
> ResourceOwnerRelease resowner.c:494
> PortalDrop portalmem.c:563
> exec_simple_query postgres.c:1284
> 
> We release the buffer during this resowner release and then we crash
> with:
> 
> TRAP: failed Assert("bufnum <= NBuffers"), File:
> "../../../../src/include/storage/bufmgr.h", Line: 305, PID: 86833
> postgres: pivotal test4 [local] COPY(ExceptionalCondition+0xbb)[0x5572b55bcc79]
> postgres: pivotal test4 [local] COPY(+0x61ccfc)[0x5572b537dcfc]
> postgres: pivotal test4 [local] COPY(ReleaseBuffer+0x19)[0x5572b5384db2]
> postgres: pivotal test4 [local] COPY(brinRevmapTerminate+0x1e)[0x5572b4e3fd39]
> postgres: pivotal test4 [local] COPY(+0xcfc44)[0x5572b4e30c44]
> postgres: pivotal test4 [local] COPY(+0x89e7f2)[0x5572b55ff7f2]
> postgres: pivotal test4 [local] COPY(MemoryContextDelete+0xd7)[0x5572b55ff683]
> postgres: pivotal test4 [local] COPY(PortalDrop+0x374)[0x5572b5602dc7]
> 
> Unfortunately, when we do COPY, the MemoryContext where makeIndexInfo
> gets called is PortalContext and that is what is set in ii_Context.
> Furthermore, we clean up the resource owner stuff before we can clean
> up the MemoryContexts in PortalDrop().
> 
> The CurrentMemoryContext when initialize_brin_insertstate() is called
> depends on the code path. For CopyMultiInsertBufferFlush() ->
> ExecInsertIndexTuples() it is PortalContext, and for CopyFrom() ->
> ExecInsertIndexTuples() it is ExecutorState/ExprContext. We can't rely
> on it to register the callback either.
>
> What we can do is create a new MemoryContext for holding the
> BrinInsertState, and we tie the callback to that so that cleanup is not
> affected by all of these variables. See v2 patch attached. Passes make
> installcheck-world and make installcheck -C src/test/modules/brin.
>
> However, we do still have 1 issue with the v2 patch:
> When we try to cancel (Ctrl-c) a running COPY command:
> ERROR:  buffer 151 is not owned by resource owner TopTransaction
> 

I'm not sure if memory context callbacks are the right way to rely on
for this purpose. The primary purpose of memory contexts is to track
memory, so using them for this seems a bit weird.

There are cases that do something similar, like expandedrecord.c where
we track a refcounted tuple slot, but IMHO there's a big difference
between tracking one slot allocated right there, and an unknown number
of buffers allocated much later.

The fact that even with the extra context it still doesn't handle query
cancellations is another argument against that approach (I wonder how
expandedrecord.c handles that, but I haven't checked).

> 
> Maybe there is a better way of doing our cleanup? I'm not sure. Would
> love your input!
> 
> The other alternative for all this is to introduce new AM callbacks for
> insert_begin and insert_end. That might be a tougher sell?
> 

That's the approach I wanted to suggest, more or less - to do the
cleanup from ExecCloseIndices() before index_close(). I wonder if it's
even correct to do that later, once we release the locks etc.

I don't think ii_AmCache was intended for stuff like this - GIN and GiST
only use it to cache stuff that can be just pfree-d, but for buffers
that's not enough. It's not surprising we need to improve this.

FWIW while debugging this (breakpoint on MemoryContextDelete) I was
rather annoyed the COPY keeps dropping and recreating the two BRIN
contexts - brininsert cxt / brin dtuple. I wonder if we could keep and
reuse those too, but I don't know how much it'd help.

> Now, to finally answer your question about the speedup without
> generate_series(). We do see an even higher speedup!
> 
> seq 1 200000000 > /tmp/data.csv
> \timing
> DROP TABLE heap;
> CREATE TABLE heap(i int);
> CREATE INDEX ON heap USING brin(i) WITH (pages_per_range=1);
> COPY heap FROM '/tmp/data.csv';
> 
> -- 3 runs (master 29cf61ade3f245aa40f427a1d6345287ef77e622)
> COPY 200000000
> Time: 205072.444 ms (03:25.072)
> Time: 215380.369 ms (03:35.380)
> Time: 203492.347 ms (03:23.492)
> 
> -- 3 runs (branch v2)
> 
> COPY 200000000
> Time: 135052.752 ms (02:15.053)
> Time: 135093.131 ms (02:15.093)
> Time: 138737.048 ms (02:18.737)
> 

That's nice, but it still doesn't say how much of that is reading the
data. If you do just copy into a table without any indexes, how long
does it take?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: brininsert optimization opportunity

From
Soumyadeep Chakraborty
Date:
On Tue, Jul 4, 2023 at 2:54 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

> I'm not sure if memory context callbacks are the right way to rely on
> for this purpose. The primary purpose of memory contexts is to track
> memory, so using them for this seems a bit weird.

Yeah, this just kept getting dirtier and dirtier.

> There are cases that do something similar, like expandedrecord.c where
> we track a refcounted tuple slot, but IMHO there's a big difference
> between tracking one slot allocated right there, and an unknown number
> of buffers allocated much later.

Yeah, the following code in ER_mc_callback is there, I think, to prevent
double-freeing the tupdesc (since it might be freed in
ResourceOwnerReleaseInternal()) (the part about: /* Ditto for tupdesc
references */).

if (tupdesc->tdrefcount > 0)
{
    if (--tupdesc->tdrefcount == 0)
        FreeTupleDesc(tupdesc);
}
Plus, the above code doesn't touch any resource owner stuff, whereas
releasing a buffer means:
ReleaseBuffer() -> UnpinBuffer() ->
ResourceOwnerForgetBuffer(CurrentResourceOwner, b);

> The fact that even with the extra context it still doesn't handle query
> cancellations is another argument against that approach (I wonder how
> expandedrecord.c handles that, but I haven't checked).
>
> >
> > Maybe there is a better way of doing our cleanup? I'm not sure. Would
> > love your input!
> >
> > The other alternative for all this is to introduce new AM callbacks for
> > insert_begin and insert_end. That might be a tougher sell?
> >
>
> That's the approach I wanted to suggest, more or less - to do the
> cleanup from ExecCloseIndices() before index_close(). I wonder if it's
> even correct to do that later, once we release the locks etc.

I'll try this out and introduce a couple of new index AM callbacks. I
think it's best to do it before releasing the locks - otherwise it might
be weird to manipulate buffers of an index relation without having some
sort of lock on it. I'll think about it some more.

> I don't think ii_AmCache was intended for stuff like this - GIN and GiST
> only use it to cache stuff that can be just pfree-d, but for buffers
> that's not enough. It's not surprising we need to improve this.

Hmmm, yes, although the docs state:
"If the index AM wishes to cache data across successive index insertions within
an SQL statement, it can allocate space in indexInfo->ii_Context and store a
pointer to the data in indexInfo->ii_AmCache (which will be NULL initially)."
they don't mention anything about buffer usage. Well, we will fix it!

PS: It should be possible to make GIN and GiST use the new index AM APIs
as well.

> FWIW while debugging this (breakpoint on MemoryContextDelete) I was
> rather annoyed the COPY keeps dropping and recreating the two BRIN
> contexts - brininsert cxt / brin dtuple. I wonder if we could keep and
> reuse those too, but I don't know how much it'd help.
>

Interesting, I will investigate that.

> > Now, to finally answer your question about the speedup without
> > generate_series(). We do see an even higher speedup!
> >
> > seq 1 200000000 > /tmp/data.csv
> > \timing
> > DROP TABLE heap;
> > CREATE TABLE heap(i int);
> > CREATE INDEX ON heap USING brin(i) WITH (pages_per_range=1);
> > COPY heap FROM '/tmp/data.csv';
> >
> > -- 3 runs (master 29cf61ade3f245aa40f427a1d6345287ef77e622)
> > COPY 200000000
> > Time: 205072.444 ms (03:25.072)
> > Time: 215380.369 ms (03:35.380)
> > Time: 203492.347 ms (03:23.492)
> >
> > -- 3 runs (branch v2)
> >
> > COPY 200000000
> > Time: 135052.752 ms (02:15.053)
> > Time: 135093.131 ms (02:15.093)
> > Time: 138737.048 ms (02:18.737)
> >
>
> That's nice, but it still doesn't say how much of that is reading the
> data. If you do just copy into a table without any indexes, how long
> does it take?

So, I loaded the same heap table without any indexes and at the same
scale. I got:

postgres=# COPY heap FROM '/tmp/data.csv';
COPY 200000000
Time: 116161.545 ms (01:56.162)
Time: 114182.745 ms (01:54.183)
Time: 114975.368 ms (01:54.975)

perf diff also attached between the three: w/ no indexes (baseline),
master and v2.

Regards,
Soumyadeep (VMware)

Attachment

Re: brininsert optimization opportunity

From
Tomas Vondra
Date:

On 7/5/23 02:35, Soumyadeep Chakraborty wrote:
> On Tue, Jul 4, 2023 at 2:54 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> 
>> I'm not sure if memory context callbacks are the right way to rely on
>> for this purpose. The primary purpose of memory contexts is to track
>> memory, so using them for this seems a bit weird.
> 
> Yeah, this just kept getting dirtier and dirtier.
> 
>> There are cases that do something similar, like expandedrecord.c where
>> we track a refcounted tuple slot, but IMHO there's a big difference
>> between tracking one slot allocated right there, and an unknown number
>> of buffers allocated much later.
> 
> Yeah, the following code in ER_mc_callback is there, I think, to prevent
> double-freeing the tupdesc (since it might be freed in
> ResourceOwnerReleaseInternal()) (the part about: /* Ditto for tupdesc
> references */).
> 
> if (tupdesc->tdrefcount > 0)
> {
>     if (--tupdesc->tdrefcount == 0)
>         FreeTupleDesc(tupdesc);
> }
> Plus, the above code doesn't touch any resource owner stuff, whereas
> releasing a buffer means:
> ReleaseBuffer() -> UnpinBuffer() ->
> ResourceOwnerForgetBuffer(CurrentResourceOwner, b);
> 

Well, my understanding is the expandedrecord.c code increments the
refcount exactly to prevent the resource owner from releasing the slot
too early. My assumption is we'd need to do something similar for the
revmap buffers by calling IncrBufferRefCount or something. But that's
going to be messy, because the buffers are read elsewhere.

>> The fact that even with the extra context it still doesn't handle query
>> cancellations is another argument against that approach (I wonder how
>> expandedrecord.c handles that, but I haven't checked).
>>
>>>
>>> Maybe there is a better way of doing our cleanup? I'm not sure. Would
>>> love your input!
>>>
>>> The other alternative for all this is to introduce new AM callbacks for
>>> insert_begin and insert_end. That might be a tougher sell?
>>>
>>
>> That's the approach I wanted to suggest, more or less - to do the
>> cleanup from ExecCloseIndices() before index_close(). I wonder if it's
>> even correct to do that later, once we release the locks etc.
> 
> I'll try this out and introduce a couple of new index AM callbacks. I
> think it's best to do it before releasing the locks - otherwise it might
> be weird to manipulate buffers of an index relation without having some
> sort of lock on it. I'll think about it some more.
> 

I don't understand why this would need more than just a callback to
release the cache.

>> I don't think ii_AmCache was intended for stuff like this - GIN and GiST
>> only use it to cache stuff that can be just pfree-d, but for buffers
>> that's not enough. It's not surprising we need to improve this.
> 
> Hmmm, yes, although the docs state:
> "If the index AM wishes to cache data across successive index insertions within
> an SQL statement, it can allocate space in indexInfo->ii_Context and store a
> pointer to the data in indexInfo->ii_AmCache (which will be NULL initially)."
> they don't mention anything about buffer usage. Well, we will fix it!
> 
> PS: It should be possible to make GIN and GiST use the new index AM APIs
> as well.
> 

Why should GIN/GiST use the new API? I think it's perfectly sensible to
only require the "cleanup callback" when just pfree() is not enough.

>> FWIW while debugging this (breakpoint on MemoryContextDelete) I was
>> rather annoyed the COPY keeps dropping and recreating the two BRIN
>> contexts - brininsert cxt / brin dtuple. I wonder if we could keep and
>> reuse those too, but I don't know how much it'd help.
>>
> 
> Interesting, I will investigate that.
> 
>>> Now, to finally answer your question about the speedup without
>>> generate_series(). We do see an even higher speedup!
>>>
>>> seq 1 200000000 > /tmp/data.csv
>>> \timing
>>> DROP TABLE heap;
>>> CREATE TABLE heap(i int);
>>> CREATE INDEX ON heap USING brin(i) WITH (pages_per_range=1);
>>> COPY heap FROM '/tmp/data.csv';
>>>
>>> -- 3 runs (master 29cf61ade3f245aa40f427a1d6345287ef77e622)
>>> COPY 200000000
>>> Time: 205072.444 ms (03:25.072)
>>> Time: 215380.369 ms (03:35.380)
>>> Time: 203492.347 ms (03:23.492)
>>>
>>> -- 3 runs (branch v2)
>>>
>>> COPY 200000000
>>> Time: 135052.752 ms (02:15.053)
>>> Time: 135093.131 ms (02:15.093)
>>> Time: 138737.048 ms (02:18.737)
>>>
>>
>> That's nice, but it still doesn't say how much of that is reading the
>> data. If you do just copy into a table without any indexes, how long
>> does it take?
> 
> So, I loaded the same heap table without any indexes and at the same
> scale. I got:
> 
> postgres=# COPY heap FROM '/tmp/data.csv';
> COPY 200000000
> Time: 116161.545 ms (01:56.162)
> Time: 114182.745 ms (01:54.183)
> Time: 114975.368 ms (01:54.975)
> 

OK, so the baseline is 115 seconds. With the current code, an index
means +100 seconds. With the optimization, it's just +20. Nice.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: brininsert optimization opportunity

From
Soumyadeep Chakraborty
Date:
On Wed, Jul 5, 2023 at 3:16 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

> > I'll try this out and introduce a couple of new index AM callbacks. I
> > think it's best to do it before releasing the locks - otherwise it might
> > be weird to manipulate buffers of an index relation without having some
> > sort of lock on it. I'll think about it some more.
> >
>
> I don't understand why this would need more than just a callback to
> release the cache.

We wouldn't. I thought that it would be slightly cleaner and slightly
more performant if we moved the (if !state) branches out of the
XXXinsert() functions. But I guess let's minimize the changes here. One
cleanup callback is enough.

> > PS: It should be possible to make GIN and GiST use the new index AM APIs
> > as well.
> >
>
> Why should GIN/GiST use the new API? I think it's perfectly sensible to
> only require the "cleanup callback" when just pfree() is not enough.

Yeah no need.

Attached v3 of the patch w/ a single index AM callback.
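
Roughly, the single-callback shape is (a sketch; the attached v3 is
authoritative):

/* new optional member of IndexAmRoutine: */
typedef void (*aminsertcleanup_function) (IndexInfo *indexInfo);

/* BRIN's implementation releases what brininsert() cached: */
void
brininsertcleanup(IndexInfo *indexInfo)
{
    BrinInsertState *bistate = (BrinInsertState *) indexInfo->ii_AmCache;

    /* unpin the revmap buffers held since the first insert */
    brinRevmapTerminate(bistate->bs_rmAccess);
    indexInfo->ii_AmCache = NULL;
}

ExecCloseIndices() then invokes the callback (via a small
index_insert_cleanup() wrapper) before index_close(), i.e. while we
still hold the lock on the index.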

Regards,
Soumyadeep (VMware)

Attachment

Re: brininsert optimization opportunity

From
Soumyadeep Chakraborty
Date:
Attached v4 of the patch, rebased against latest HEAD.

Regards,
Soumyadeep (VMware)

Attachment

Re: brininsert optimization opportunity

From
Soumyadeep Chakraborty
Date:
Created an entry for the Sep CF: https://commitfest.postgresql.org/44/4484/

Regards,
Soumyadeep (VMware)

On Sat, Jul 29, 2023 at 9:28 AM Soumyadeep Chakraborty
<soumyadeep2007@gmail.com> wrote:
>
> Attached v4 of the patch, rebased against latest HEAD.
>
> Regards,
> Soumyadeep (VMware)



Re: brininsert optimization opportunity

From
Soumyadeep Chakraborty
Date:
Rebased against latest HEAD.

Regards,
Soumyadeep (VMware)

Attachment

Re: brininsert optimization opportunity

From
Tomas Vondra
Date:
Hi,

I took a look at this patch today. I had to rebase the patch, due to
some minor bitrot related to 9f0602539d (but nothing major). I also did
a couple tiny cosmetic tweaks, but other than that the patch seems OK.
See the attached v6.

I did some simple performance tests too, similar to those in the initial
message:

  CREATE UNLOGGED TABLE heap (i int);
  CREATE INDEX ON heap USING brin(i) WITH (pages_per_range=1);
  --
  TRUNCATE heap;
  INSERT INTO heap SELECT 1 FROM generate_series(1, 20000000);

And the results look like this (5 runs each):

  master:   16448.338  16066.473  16039.166  16067.420  16080.066
  patched:  13260.065  13229.800  13254.454  13265.479  13273.693

So that's a nice improvement, even though enabling WAL will make the
relative speedup somewhat smaller.

The one thing I'm not entirely sure about is adding new stuff to the
IndexAmRoutine. I don't think we want to end up with too many callbacks
that all AMs have to initialize etc. I can't think of a different/better
way to do this, though.

Barring objections, I'll try to push this early next week, after another
round of cleanup.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: brininsert optimization opportunity

From
Matthias van de Meent
Date:
On Fri, 3 Nov 2023 at 19:37, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Hi,
>
> I took a look at this patch today. I had to rebase the patch, due to
> some minor bitrot related to 9f0602539d (but nothing major). I also did
> a couple tiny cosmetic tweaks, but other than that the patch seems OK.
> See the attached v6.
> [...]
> Barring objections, I'll try to push this early next week, after another
> round of cleanup.

No hard objections: The principle looks fine.

I do think we should choose a better namespace than bs_* for the
fields of BrinInsertState, as BrinBuildState already uses the bs_*
namespace for its fields in the same file, but that's only cosmetic.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: brininsert optimization opportunity

From
Soumyadeep Chakraborty
Date:
On Fri, 3 Nov 2023 at 19:37, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

> The one thing I'm not entirely sure about is adding new stuff to the
> IndexAmRoutine. I don't think we want to end up with too many callbacks
> that all AMs have to initialize etc. I can't think of a different/better
> way to do this, though.

Yes, there is not really an alternative. Also, aminsertcleanup() is very similar
to amvacuumcleanup(), so it is not awkward. Why should vacuum be an
exclusive VIP? :)
And there are other indexam callbacks that not every AM implements. So this
addition is not unprecedented in that sense.

> Barring objections, I'll try to push this early next week, after another
> round of cleanup.

Many thanks for resurrecting this patch!

On Fri, Nov 3, 2023 at 12:16 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

>
> I do think we should choose a better namespace than bs_* for the
> fields of BrinInsertState, as BrinBuildState already uses the bs_*
> namespace for its fields in the same file, but that's only cosmetic.
>

bis_* then.

Regards,
Soumyadeep (VMware)



Re: brininsert optimization opportunity

From
Tomas Vondra
Date:
I've done a bit more cleanup on the last version of the patch (renamed
the fields to start with bis_ as agreed, rephrased the comments / docs /
commit message a bit) and pushed.

Thanks for the patch!

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: brininsert optimization opportunity

From
Ashwin Agrawal
Date:
On Sat, Nov 25, 2023 at 12:06 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> I've done a bit more cleanup on the last version of the patch (renamed
> the fields to start with bis_ as agreed, rephrased the comments / docs /
> commit message a bit) and pushed.

Thanks a lot Tomas for helping to drive the patch to completion iteratively and realizing the benefits.

- Ashwin


Re: brininsert optimization opportunity

From
Soumyadeep Chakraborty
Date:
Thanks a lot for reviewing and pushing! :)

Regards,
Soumyadeep (VMware)



Re: brininsert optimization opportunity

From
Richard Guo
Date:

On Sun, Nov 26, 2023 at 4:06 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> I've done a bit more cleanup on the last version of the patch (renamed
> the fields to start with bis_ as agreed, rephrased the comments / docs /
> commit message a bit) and pushed.

It seems that we have an oversight in this commit.  If there is no tuple
that has been inserted, we wouldn't have an available insert state in
the cleanup phase.  So the Assert in brininsertcleanup() is not always
right.  For example:

regression=# update brin_summarize set value = brin_summarize.value;
server closed the connection unexpectedly

So I wonder if we should check 'bistate' and do the cleanup only if
there is an available one, something like below.

--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -359,7 +359,9 @@ brininsertcleanup(IndexInfo *indexInfo)
 {
    BrinInsertState *bistate = (BrinInsertState *) indexInfo->ii_AmCache;

-   Assert(bistate);
+   /* We don't have an available insert state, nothing to do */
+   if (!bistate)
+       return;

Thanks
Richard

Re: brininsert optimization opportunity

From
Soumyadeep Chakraborty
Date:
On Sun, Nov 26, 2023 at 9:28 PM Richard Guo <guofenglinux@gmail.com> wrote:
>
>
> On Sun, Nov 26, 2023 at 4:06 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> I've done a bit more cleanup on the last version of the patch (renamed
>> the fields to start with bis_ as agreed, rephrased the comments / docs /
>> commit message a bit) and pushed.
>
>
> It seems that we have an oversight in this commit.  If there is no tuple
> that has been inserted, we wouldn't have an available insert state in
> the cleanup phase.  So the Assert in brininsertcleanup() is not always
> right.  For example:
>
> regression=# update brin_summarize set value = brin_summarize.value;
> server closed the connection unexpectedly
>

I wasn't able to repro the issue on 86b64bafc19c4c60136a4038d2a8d1e6eecc59f2
with UPDATE/INSERT:

postgres=# drop table a;
DROP TABLE
postgres=# create table a(i int);
CREATE TABLE
postgres=# create index on a using brin(i);
CREATE INDEX
postgres=# insert into a select 1 where 1!=1;
INSERT 0 0
postgres=# update a set i = 2 where i = 0;
UPDATE 0
postgres=# update a set i = a.i;
UPDATE 0

This could be because since c5b7ba4e67aeb5d6f824b74f94114d99ed6e42b7, we
have moved ExecOpenIndices() from ExecInitModifyTable() to ExecInsert().
Since we never open the indices if nothing is inserted, we would never
attempt to close them with ExecCloseIndices() while the ii_AmCache is
NULL (which is what causes this assertion failure).

However, it is possible to repro the issue with:
# create empty file
touch /tmp/a.csv
postgres=# create table a(i int);
CREATE TABLE
postgres=# create index on a using brin(i);
CREATE INDEX
postgres=# copy a from '/tmp/a.csv';
TRAP: failed Assert("bistate"), File: "brin.c", Line: 362, PID: 995511

> So I wonder if we should check 'bistate' and do the cleanup only if
> there is an available one, something like below.

Yes, this is the right way to go. Thanks for reporting!

Regards,
Soumyadeep (VMware)



Re: brininsert optimization opportunity

From
Richard Guo
Date:

On Mon, Nov 27, 2023 at 1:53 PM Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote:
> On Sun, Nov 26, 2023 at 9:28 PM Richard Guo <guofenglinux@gmail.com> wrote:
> > It seems that we have an oversight in this commit.  If there is no tuple
> > that has been inserted, we wouldn't have an available insert state in
> > the cleanup phase.  So the Assert in brininsertcleanup() is not always
> > right.  For example:
> >
> > regression=# update brin_summarize set value = brin_summarize.value;
> > server closed the connection unexpectedly
>
> I wasn't able to repro the issue on 86b64bafc19c4c60136a4038d2a8d1e6eecc59f2
> with UPDATE/INSERT:
>
> This could be because since c5b7ba4e67aeb5d6f824b74f94114d99ed6e42b7, we
> have moved ExecOpenIndices() from ExecInitModifyTable() to ExecInsert().
> Since we never open the indices if nothing is inserted, we would never
> attempt to close them with ExecCloseIndices() while the ii_AmCache is
> NULL (which is what causes this assertion failure).

AFAICS we would also open the indices from ExecUpdate().  So if we
update the table in a way that no new tuples are inserted, we will have
this issue.  As I showed previously, the query below crashes for me on
latest master (dc9f8a7983).

regression=# update brin_summarize set value = brin_summarize.value;
server closed the connection unexpectedly

There are other code paths that call ExecOpenIndices(), such as
ExecMerge().  I believe it's not hard to create queries that trigger this
Assert for those cases.

Thanks
Richard

Re: brininsert optimization opportunity

From
Tomas Vondra
Date:

On 11/27/23 08:37, Richard Guo wrote:
> 
> On Mon, Nov 27, 2023 at 1:53 PM Soumyadeep Chakraborty
> <soumyadeep2007@gmail.com> wrote:
> 
>     On Sun, Nov 26, 2023 at 9:28 PM Richard Guo <guofenglinux@gmail.com> wrote:
>     > It seems that we have an oversight in this commit.  If there is no tuple
>     > that has been inserted, we wouldn't have an available insert state in
>     > the cleanup phase.  So the Assert in brininsertcleanup() is not always
>     > right.  For example:
>     >
>     > regression=# update brin_summarize set value = brin_summarize.value;
>     > server closed the connection unexpectedly
> 
>     I wasn't able to repro the issue on 86b64bafc19c4c60136a4038d2a8d1e6eecc59f2
>     with UPDATE/INSERT:
>
>     This could be because since c5b7ba4e67aeb5d6f824b74f94114d99ed6e42b7, we
>     have moved ExecOpenIndices() from ExecInitModifyTable() to ExecInsert().
>     Since we never open the indices if nothing is inserted, we would never
>     attempt to close them with ExecCloseIndices() while the ii_AmCache is
>     NULL (which is what causes this assertion failure).
> 
> 
> AFAICS we would also open the indices from ExecUpdate().  So if we
> update the table in a way that no new tuples are inserted, we will have
> this issue.  As I showed previously, the query below crashes for me on
> latest master (dc9f8a7983).
> 
> regression=# update brin_summarize set value = brin_summarize.value;
> server closed the connection unexpectedly
> 
> There are other code paths that call ExecOpenIndices(), such as 
> ExecMerge().  I believe it's not hard to create queries that trigger
> this Assert for those cases.
> 

FWIW I can readily reproduce it like this:

  drop table t;
  create table t (a int);
  insert into t values (1);
  create index on t using brin (a);
  update t set a = a;

However, I wonder if maybe we should do the check in index_insert_cleanup
and not in the AM callback. That seems simpler / better, because the AM
callbacks then can't make this mistake.
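
Something like this (sketch):

void
index_insert_cleanup(Relation indexRelation, IndexInfo *indexInfo)
{
    RELATION_CHECKS;
    Assert(indexInfo);

    /* nothing to do if the AM never cached anything */
    if (indexRelation->rd_indam->aminsertcleanup && indexInfo->ii_AmCache)
        indexRelation->rd_indam->aminsertcleanup(indexInfo);
}

With that, brininsertcleanup() can keep assuming it gets a valid
bistate.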

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: brininsert optimization opportunity

From
Tomas Vondra
Date:
On 11/27/23 11:34, Tomas Vondra wrote:
> 
> 
> On 11/27/23 08:37, Richard Guo wrote:
>>
>> On Mon, Nov 27, 2023 at 1:53 PM Soumyadeep Chakraborty
>> <soumyadeep2007@gmail.com> wrote:
>>
>>     On Sun, Nov 26, 2023 at 9:28 PM Richard Guo <guofenglinux@gmail.com> wrote:
>>     > It seems that we have an oversight in this commit.  If there is no tuple
>>     > that has been inserted, we wouldn't have an available insert state in
>>     > the cleanup phase.  So the Assert in brininsertcleanup() is not always
>>     > right.  For example:
>>     >
>>     > regression=# update brin_summarize set value = brin_summarize.value;
>>     > server closed the connection unexpectedly
>>
>>     I wasn't able to repro the issue on 86b64bafc19c4c60136a4038d2a8d1e6eecc59f2
>>     with UPDATE/INSERT:
>>
>>     This could be because since c5b7ba4e67aeb5d6f824b74f94114d99ed6e42b7, we
>>     have moved ExecOpenIndices() from ExecInitModifyTable() to ExecInsert().
>>     Since we never open the indices if nothing is inserted, we would never
>>     attempt to close them with ExecCloseIndices() while the ii_AmCache is
>>     NULL (which is what causes this assertion failure).
>>
>>
>> AFAICS we would also open the indices from ExecUpdate().  So if we
>> update the table in a way that no new tuples are inserted, we will have
>> this issue.  As I showed previously, the query below crashes for me on
>> latest master (dc9f8a7983).
>>
>> regression=# update brin_summarize set value = brin_summarize.value;
>> server closed the connection unexpectedly
>>
>> There are other code paths that call ExecOpenIndices(), such as 
>> ExecMerge().  I believe it's not hard to create queries that trigger
>> this Assert for those cases.
>>
> 
> FWIW I can readily reproduce it like this:
> 
>   drop table t;
>   create table t (a int);
>   insert into t values (1);
>   create index on t using brin (a);
>   update t set a = a;
> 
> However, I wonder if maybe we should do the check in index_insert_cleanup
> and not in the AM callback. That seems simpler / better, because the AM
> callbacks then can't make this mistake.
> 

I did it this way (check in index_insert_cleanup) and pushed. Thanks for
the report!

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: brininsert optimization opportunity

From
Alexander Lakhin
Date:
Hello Tomas and Soumyadeep,

25.11.2023 23:06, Tomas Vondra wrote:
> I've done a bit more cleanup on the last version of the patch (renamed
> the fields to start with bis_ as agreed, rephrased the comments / docs /
> commit message a bit) and pushed.

Please look at a query, which produces warnings similar to the ones
observed upthread:
CREATE TABLE t(a INT);
INSERT INTO t SELECT x FROM generate_series(1,10000) x;
CREATE INDEX idx ON t USING brin (a);
REINDEX index CONCURRENTLY idx;

WARNING:  resource was not closed: [1863] (rel=base/16384/16389, blockNum=1, flags=0x93800000, refcount=1 1)
WARNING:  resource was not closed: [1862] (rel=base/16384/16389, blockNum=0, flags=0x93800000, refcount=1 1)

The first bad commit for this anomaly is c1ec02be1.

Maybe you would also want to fix, in passing, some typos/inconsistencies
introduced with recent brin-related commits:
s/bs_blkno/bt_blkno/
s/emptry/empty/
s/firt/first/
s/indexinsertcleanup/aminsertcleanup/
s/ maxRange/ nextRange/
s/paga /page /

Best regards,
Alexander



Re: brininsert optimization opportunity

From
Tomas Vondra
Date:
On 12/11/23 16:00, Alexander Lakhin wrote:
> Hello Tomas and Soumyadeep,
> 
> 25.11.2023 23:06, Tomas Vondra wrote:
>> I've done a bit more cleanup on the last version of the patch (renamed
>> the fields to start with bis_ as agreed, rephrased the comments / docs /
>> commit message a bit) and pushed.
> 
> Please look at a query, which produces warnings similar to the ones
> observed upthread:
> CREATE TABLE t(a INT);
> INSERT INTO t SELECT x FROM generate_series(1,10000) x;
> CREATE INDEX idx ON t USING brin (a);
> REINDEX index CONCURRENTLY idx;
> 
> WARNING:  resource was not closed: [1863] (rel=base/16384/16389,
> blockNum=1, flags=0x93800000, refcount=1 1)
> WARNING:  resource was not closed: [1862] (rel=base/16384/16389,
> blockNum=0, flags=0x93800000, refcount=1 1)
> 
> The first bad commit for this anomaly is c1ec02be1.
> 

Thanks for the report. I haven't investigated what the issue is, but it
seems we fail to release the buffers in some situations - I'd bet we
fail to call the cleanup callback in some place, or something like that.
I'll take a look tomorrow.

> Maybe you would also want to fix, in passing, some typos/inconsistencies
> introduced with recent brin-related commits:
> s/bs_blkno/bt_blkno/
> s/emptry/empty/
> s/firt/first/
> s/indexinsertcleanup/aminsertcleanup/
> s/ maxRange/ nextRange/
> s/paga /page /
> 

Definitely. Thanks for noticing those!


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: brininsert optimization opportunity

From
Tomas Vondra
Date:

On 12/11/23 16:41, Tomas Vondra wrote:
> On 12/11/23 16:00, Alexander Lakhin wrote:
>> Hello Tomas and Soumyadeep,
>>
>> 25.11.2023 23:06, Tomas Vondra wrote:
>>> I've done a bit more cleanup on the last version of the patch (renamed
>>> the fields to start with bis_ as agreed, rephrased the comments / docs /
>>> commit message a bit) and pushed.
>>
>> Please look at a query, which produces warnings similar to the ones
>> observed upthread:
>> CREATE TABLE t(a INT);
>> INSERT INTO t SELECT x FROM generate_series(1,10000) x;
>> CREATE INDEX idx ON t USING brin (a);
>> REINDEX index CONCURRENTLY idx;
>>
>> WARNING:  resource was not closed: [1863] (rel=base/16384/16389,
>> blockNum=1, flags=0x93800000, refcount=1 1)
>> WARNING:  resource was not closed: [1862] (rel=base/16384/16389,
>> blockNum=0, flags=0x93800000, refcount=1 1)
>>
>> The first bad commit for this anomaly is c1ec02be1.
>>
> 
> Thanks for the report. I haven't investigated what the issue is, but it
> seems we fail to release the buffers in some situations - I'd bet we
> fail to call the cleanup callback in some place, or something like that.
> I'll take a look tomorrow.
> 

Yeah, just as I expected, this seems to be a case of failing to call the
index_insert_cleanup after doing some inserts - in this case the inserts
happen in table_index_validate_scan, but validate_index has no idea it
needs to do the cleanup.

The attached 0001 patch fixes this by adding the call to validate_index,
which seems like the proper place as it's where the indexInfo is
allocated and destroyed.
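
That is, roughly (a sketch of the 0001 change; the attached patch is
authoritative):

/* in validate_index(), after the scan that may have inserted tuples */
table_index_validate_scan(heapRelation, indexRelation, indexInfo,
                          snapshot, &state);

/* Release anything the AM cached in indexInfo->ii_AmCache during
 * those index_insert() calls (e.g. BRIN's pinned revmap buffers). */
index_insert_cleanup(indexRelation, indexInfo);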

But this makes me think - are there any other places that might call
index_insert without then also doing the cleanup? I did grep for the
index_insert() calls, and it seems OK except for this one.

There's a call in toast_internals.c, but that seems OK because that only
deals with btree indexes (and those don't need any cleanup). The same
logic applies to unique_key_recheck(). The rest goes through
execIndexing.c, which should do the cleanup in ExecCloseIndices().

Note: We should probably call the cleanup even in the btree cases, even
if only to make it clear it needs to be called after index_insert().

I was thinking maybe we should have some explicit call to destroy the
IndexInfo, but that seems rather inconvenient - it'd force everyone to
carefully track lifetimes of the IndexInfo instead of just relying on
memory context reset/destruction. That seems quite error-prone.

I propose we do a much simpler thing instead - allow the cache to be
initialized / cleaned up repeatedly, and make sure it gets reset at a
convenient place (typically after index_insert calls that don't go
through execIndexing). That'd mean the cleanup does not need to happen
very far from the index_insert(), which makes the reasoning much easier.
0002 does this.

But maybe there's a better way to ensure the cleanup gets called even
when not using execIndexing.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: brininsert optimization opportunity

From
Soumyadeep Chakraborty
Date:
On Tue, Dec 12, 2023 at 3:25 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:


> The attached 0001 patch fixes this by adding the call to validate_index,
> which seems like the proper place as it's where the indexInfo is
> allocated and destroyed.

Yes, and by reading validate_index's header comment, there is a clear
expectation that we will be adding tuples to the index in the table AM call
table_index_validate_scan (albeit "validating" doesn't necessarily convey this
side effect). So, it makes perfect sense to call it here.


> But this makes me think - are there any other places that might call
> index_insert without then also doing the cleanup? I did grep for the
> index_insert() calls, and it seems OK except for this one.
>
> There's a call in toast_internals.c, but that seems OK because that only
> deals with btree indexes (and those don't need any cleanup). The same
> logic applies to unique_key_recheck(). The rest goes through
> execIndexing.c, which should do the cleanup in ExecCloseIndices().
>
> Note: We should probably call the cleanup even in the btree cases, even
> if only to make it clear it needs to be called after index_insert().

Agreed. Doesn't feel great, but yeah, all of this btree-specific code does call
through index_* functions, instead of calling btree functions directly. So,
ideally we should follow through with that pattern and call the cleanup
explicitly. But we are introducing per-tuple overhead that is totally wasted.
Maybe we can add a comment instead like:

void
toast_close_indexes(Relation *toastidxs, int num_indexes, LOCKMODE lock)
{
int i;

/*
* Save a few cycles by choosing not to call index_insert_cleanup(). Toast
* indexes are btree, which don't have an aminsertcleanup() anyway.
*/

/* Close relations and clean up things */
...
}

And add something similar for unique_key_recheck()? That said, I am also not
opposed to just accepting these wasted cycles, if the commenting seems wonky.

> I propose we do a much simpler thing instead - allow the cache to be
> initialized / cleaned up repeatedly, and make sure it gets reset at a
> convenient place (typically after index_insert calls that don't go
> through execIndexing). That'd mean the cleanup does not need to happen
> very far from the index_insert(), which makes the reasoning much easier.
> 0002 does this.

That kind of goes against the ethos of the ii_AmCache, which is meant to capture
state to be used across multiple index inserts. Also, quoting the current docs:

"If the index AM wishes to cache data across successive index insertions
within an SQL statement, it can allocate space
in <literal>indexInfo->ii_Context</literal> and store a pointer to the
data in <literal>indexInfo->ii_AmCache</literal> (which will be NULL
initially). After the index insertions complete, the memory will be freed
automatically. If additional cleanup is required (e.g. if the cache contains
buffers and tuple descriptors), the AM may define an optional function
<literal>indexinsertcleanup</literal>, called before the memory is released."

The memory will be freed automatically (as soon as ii_Context goes away). So,
why would we explicitly free it, like in the attached 0002 patch? And the whole
point of the cleanup function is to do something other than freeing memory, as
the docs note highlights so well.

Also, the 0002 patch does introduce per-tuple function call overhead in
heapam_index_validate_scan().

Also, now we have split brain...in certain situations we want to call
index_insert_cleanup() per index insert and in certain cases we would like it
called once for all inserts. Not very easy to understand IMO.

Finally, I don't think that any index AM would have anything to clean up after
every insert.

I had tried (abused) an approach with MemoryContextCallbacks upthread and that
seems a no-go. And yes I agree, having a dual to makeIndexInfo() (like
destroyIndexInfo()) means that we lose the benefits of ii_Context. That could
be disruptive to existing AMs in-core and outside.

All said and done, v1 has my vote.

Regards,
Soumyadeep (VMware)



Re: brininsert optimization opportunity

From
Tomas Vondra
Date:
On 12/13/23 09:12, Soumyadeep Chakraborty wrote:
> On Tue, Dec 12, 2023 at 3:25 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> 
> 
>> The attached 0001 patch fixes this by adding the call to validate_index,
>> which seems like the proper place as it's where the indexInfo is
>> allocated and destroyed.
> 
> Yes, and by reading validate_index's header comment, there is a clear
> expectation that we will be adding tuples to the index in the table AM call
> table_index_validate_scan (albeit "validating" doesn't necessarily convey this
> side effect). So, it makes perfect sense to call it here.
> 

OK

> 
>> But this makes me think - are there any other places that might call
>> index_insert without then also doing the cleanup? I did grep for the
>> index_insert() calls, and it seems OK except for this one.
>>
>> There's a call in toast_internals.c, but that seems OK because that only
>> deals with btree indexes (and those don't need any cleanup). The same
>> logic applies to unique_key_recheck(). The rest goes through
>> execIndexing.c, which should do the cleanup in ExecCloseIndices().
>>
>> Note: We should probably call the cleanup even in the btree cases, even
>> if only to make it clear it needs to be called after index_insert().
> 
> Agreed. Doesn't feel great, but yeah, all of this btree-specific code does call
> through index_* functions, instead of calling btree functions directly. So,
> ideally we should follow through with that pattern and call the cleanup
> explicitly. But we are introducing per-tuple overhead that is totally wasted.

I haven't tried but I very much doubt this will be measurable. It's just
a trivial check if a pointer is NULL. We do far more expensive stuff in
this code path.

> Maybe we can add a comment instead like:
> 
> void
> toast_close_indexes(Relation *toastidxs, int num_indexes, LOCKMODE lock)
> {
> int i;
> 
> /*
> * Save a few cycles by choosing not to call index_insert_cleanup(). Toast
> * indexes are btree, which don't have an aminsertcleanup() anyway.
> */
> 
> /* Close relations and clean up things */
> ...
> }
> 
> And add something similar for unique_key_recheck()? That said, I am also not
> opposed to just accepting these wasted cycles, if the commenting seems wonky.
> 

I really don't want to do this sort of stuff unless we know it actually
saves something.

>> I propose we do a much simpler thing instead - allow the cache to be
>> initialized / cleaned up repeatedly, and make sure it gets reset at a
>> convenient place (typically after index_insert calls that don't go
>> through execIndexing). That'd mean the cleanup does not need to happen
>> very far from the index_insert(), which makes the reasoning much easier.
>> 0002 does this.
> 
> That kind of goes against the ethos of the ii_AmCache, which is meant
> to capture state to be used across multiple index inserts.

Why would it be against the ethos? The point is that we reuse stuff over
multiple index_insert() calls. If we can do that for all inserts, cool.
But if that's difficult, it's maybe better to cache for smaller batches
of inserts (instead of making it more complex for all index AMs, even
those not doing any caching).

> Also, quoting the current docs:
> 
> "If the index AM wishes to cache data across successive index insertions
> within an SQL statement, it can allocate space
> in <literal>indexInfo->ii_Context</literal> and store a pointer to the
> data in <literal>indexInfo->ii_AmCache</literal> (which will be NULL
> initially). After the index insertions complete, the memory will be freed
> automatically. If additional cleanup is required (e.g. if the cache contains
> buffers and tuple descriptors), the AM may define an optional function
> <literal>indexinsertcleanup</literal>, called before the memory is released."
> 
> The memory will be freed automatically (as soon as ii_Context goes away). So,
> why would we explicitly free it, like in the attached 0002 patch? And the whole
> point of the cleanup function is to do something other than freeing memory, as
> the docs note highlights so well.
> 

Not sure I follow. The whole reason for having the index_insert_cleanup
callback is we can't rely on ii_Context going away, exactly because that
just throws away the memory and we need to release buffers etc.

The only reason why the 0002 patch does pfree() is that it clearly
indicates whether the cache is initialized. We could have a third state
"allocated but not initialized", but that doesn't seem worth it.

If you're saying the docs are misleading in some way, then maybe we need
to clarify that.

> Also, the 0002 patch does introduce per-tuple function call overhead in
> heapam_index_validate_scan().
> 

How come? The cleanup is done only once after the scan completes. Isn't
that exactly what we want to do?

> Also, now we have split brain...in certain situations we want to call
> index_insert_cleanup() per index insert and in certain cases we would like it
> called once for all inserts. Not very easy to understand IMO.
> 
> Finally, I don't think that any index AM would have anything to clean up after
> every insert.
> 

But I didn't suggest calling the cleanup after every index insert. If a
function does a bunch of index_insert calls, it'd do the cleanup only
once. The point is that it'd happen "close" to the inserts, when we know
it needs to be done. Yes, it might happen multiple times for the same
query, but that still likely saves quite a bit of work (compared to
per-insert init+cleanup).

We need to call the cleanup at some point, and the only alternative I
can think of is to call it in every place that calls BuildIndexInfo
(unless we can guarantee the place can't do index_insert).

> I had tried (abused) an approach with MemoryContextCallbacks upthread and that
> seems a no-go. And yes I agree, having a dual to makeIndexInfo() (like
> destroyIndexInfo()) means that we lose the benefits of ii_Context. That could
> be disruptive to existing AMs in-core and outside.
> 

What do you mean by "dual makeIndexInfo"?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: brininsert optimization opportunity

From
James Wang
Date:
Hi All, not sure how to "Specify thread msgid" - chose one which I think is close to my new feature request.

query:

SELECT count(1) FROM table1 t1 JOIN table2 t2 ON t1.id = t2.id WHERE t1.a_indexed_col='some_value' OR
t2.a_indexed_col='some_value';

can the server automatically replace the OR logic above with UNION please? i.e. replace it with:

(SELECT count(1) FROM table1 t1 JOIN table2 t2 ON t1.id = t2.id WHERE t1.a_indexed_col='some_value' )
UNION
(SELECT count(1) FROM table1 t1 JOIN table2 t2 ON t1.id = t2.id WHERE  t2.a_indexed_col='some_value');

Thanks

Re: brininsert optimization opportunity

From
Alvaro Herrera
Date:
On 2023-Dec-21, James Wang wrote:

> Hi All, not sure how to "Specify thread msgid" - chose one which I think is close to my new feature request.

Hello James, based on the "Specify thread msgid" message it looks like
you were trying to request a feature using the Commitfest website?  That
won't work; the commitfest website is only intended as a tracker of
in-progress development work.  Without a Postgres code patch, that
website doesn't help you any.  What you have done amounts to hijacking
an unrelated mailing list thread, which is discouraged and frowned upon.

That said, sadly we don't have any official feature request system.
Please start a new thread by composing an entirely new message to
pgsql-hackers@lists.postgresql.org, and don't use the commitfest
website for it.

That said,

> query:
> 
> SELECT count(1) FROM table1 t1 JOIN table2 t2 ON t1.id = t2.id WHERE t1.a_indexed_col='some_value' OR
> t2.a_indexed_col='some_value';
> 
> Can the server automatically replace the OR logic above with a UNION, please? I.e. replace it with:
> 
> (SELECT count(1) FROM table1 t1 JOIN table2 t2 ON t1.id = t2.id WHERE t1.a_indexed_col='some_value')
> UNION
> (SELECT count(1) FROM table1 t1 JOIN table2 t2 ON t1.id = t2.id WHERE t2.a_indexed_col='some_value');

I have the feeling that this has already been discussed, but I can't
find anything useful in the mailing list archives.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Saca el libro que tu religión considere como el indicado para encontrar la
oración que traiga paz a tu alma. Luego rebootea el computador
y ve si funciona" (Carlos Duclós)



Re: brininsert optimization opportunity

From
Alvaro Herrera
Date:
On 2023-Dec-12, Tomas Vondra wrote:

> I propose we do a much simpler thing instead - allow the cache to be
> initialized / cleaned up repeatedly, and make sure it gets reset at a
> convenient place (typically after index_insert calls that don't go
> through execIndexing). That'd mean the cleanup does not need to happen
> very far from the index_insert(), which makes the reasoning much easier.
> 0002 does this.

I'm not in love with this 0002 patch; I think the layering after 0001 is
correct in that the insert_cleanup call should remain in validate_index
and be called after the whole thing is done, but 0002 changes things so
that now every table AM has to worry about doing this correctly; and a
bug of omission will not be detected unless you have a BRIN index on
such a table and happen to use CREATE INDEX CONCURRENTLY.  So a
developer has essentially zero chance of noticing a mistake, which I
think we'd rather avoid.

So I think we should do 0001 and perhaps some further tweaks to the
original brininsert optimization commit: I think the aminsertcleanup
callback should receive the indexRelation as first argument; and also I
think it's not index_insert_cleanup()'s job to worry about whether
ii_AmCache is NULL or not; instead the callback should always be
invoked, and then it's aminsertcleanup's job to do nothing if ii_AmCache
is NULL.  That way, the index AM API doesn't have to worry about which
parts of IndexInfo (or the indexRelation) aminsertcleanup is going to
care about.  If we decide to change this, then the docs also need a bit
of tweaking I think.

Lastly, I kinda disagree with the notion that only some of the callers
of aminsert should call aminsertcleanup, even though btree doesn't have
an aminsertcleanup and thus it can't affect TOAST or catalogs.  Maybe we
can turn index_insert_cleanup into an inline function, which can quickly
do nothing if aminsertcleanup isn't defined.  Then we no longer have the
layering violation where we assume that btree doesn't care.  But the
proposed change in this paragraph can maybe be handled separately to
avoid confusing things with the bugfix in the two paragraphs above.
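
A rough sketch of both ideas together (untested; the extra Relation
argument to the callback is the proposal above, not what's committed
today):

    /* e.g. as a static inline in a header, so callers pay almost nothing */
    static inline void
    index_insert_cleanup(Relation indexRelation, IndexInfo *indexInfo)
    {
        Assert(indexRelation != NULL && indexInfo != NULL);

        /*
         * Always invoke the callback if the AM defines one; it's then the
         * AM's business to decide whether anything needs cleaning up.
         */
        if (indexRelation->rd_indam->aminsertcleanup != NULL)
            indexRelation->rd_indam->aminsertcleanup(indexRelation, indexInfo);
    }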

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Essentially, you're proposing Kevlar shoes as a solution for the problem
that you want to walk around carrying a loaded gun aimed at your foot.
(Tom Lane)



Re: brininsert optimization opportunity

From
Alvaro Herrera
Date:
On 2024-Jan-08, Alvaro Herrera wrote:

> So I think we should do 0001 and perhaps some further tweaks to the
> original brininsert optimization commit: [...]

So I propose the attached patch, which should fix the reported bug and
the things I mentioned above, and also the typos Alexander mentioned
elsewhere in the thread.

> Lastly, I kinda disagree with the notion that only some of the callers
> of aminsert should call aminsertcleanup, even though btree doesn't have
> an aminsertcleanup and thus it can't affect TOAST or catalogs. [...]

I didn't do anything about this.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/

Attachment

Re: brininsert optimization opportunity

From
Tomas Vondra
Date:
On 1/8/24 16:51, Alvaro Herrera wrote:
> On 2023-Dec-12, Tomas Vondra wrote:
> 
>> I propose we do a much simpler thing instead - allow the cache to be
>> initialized / cleaned up repeatedly, and make sure it gets reset at a
>> convenient place (typically after index_insert calls that don't go
>> through execIndexing). That'd mean the cleanup does not need to happen
>> very far from the index_insert(), which makes the reasoning much easier.
>> 0002 does this.
> 
> I'm not in love with this 0002 patch; I think the layering after 0001 is
> correct in that the insert_cleanup call should remain in validate_index
> and be called after the whole thing is done, but 0002 changes things so
> that now every table AM has to worry about doing this correctly; and a
> bug of omission will not be detected unless you have a BRIN index on
> such a table and happen to use CREATE INDEX CONCURRENTLY.  So a
> developer has essentially zero chance of noticing a mistake, which I
> think we'd rather avoid.
> 

True. If the AM code does not need to worry about this kind of stuff,
that would be good / less error prone.

One thing that is not very clear to me: I don't think there's a very
good way to determine which places need the cleanup call. Because
it depends on (a) what kind of index is used and (b) what happens in the
code called earlier (which may easily do arbitrary stuff). Which means
we have to call the cleanup whenever the code *might* have done inserts
into the index. Maybe it's not such an issue in practice, though.

> So I think we should do 0001 and perhaps some further tweaks to the
> original brininsert optimization commit: I think the aminsertcleanup
> callback should receive the indexRelation as first argument; and also I
> think it's not index_insert_cleanup()'s job to worry about whether
> ii_AmCache is NULL or not; instead the callback should always be
> invoked, and then it's aminsertcleanup's job to do nothing if ii_AmCache
> is NULL.  That way, the index AM API doesn't have to worry about which
> parts of IndexInfo (or the indexRelation) aminsertcleanup is going to
> care about.  If we decide to change this, then the docs also need a bit
> of tweaking I think.
> 

Yeah, passing the indexRelation to the am callback seems reasonable.
It's more consistent with what we do for other callbacks, and perhaps the
callback might need the indexRelation.

I don't quite see why we should invoke the callback with ii_AmCache=NULL
though. If there's nothing cached, why bother? It just means all cleanup
callbacks have to do this NULL check on their own.

> Lastly, I kinda disagree with the notion that only some of the callers
> of aminsert should call aminsertcleanup, even though btree doesn't have
> an aminsertcleanup and thus it can't affect TOAST or catalogs.  Maybe we
> can turn index_insert_cleanup into an inline function, which can quickly
> do nothing if aminsertcleanup isn't defined.  Then we no longer have the
> layering violation where we assume that btree doesn't care.  But the
> proposed change in this paragraph can maybe be handled separately to
> avoid confusing things with the bugfix in the two paragraphs above.
> 

After thinking about this a bit more I agree with you - we should call
the cleanup from each place calling aminsert, even if it's for nbtree
(or other index types that don't require cleanup at the moment).

I wonder if there's a nice way to check this in assert-enabled builds?
Could we tweak nbtree (or index AM in general) to check that all places
that called aminsert also called aminsertcleanup?

For example, I was thinking we might add a flag to IndexInfo (separate
from the ii_AmCache), tracking if aminsert() was called, and then later
check that aminsertcleanup() got called too. The problem, however, is
there's no obviously convenient place for this check, because IndexInfo
is not freed explicitly ...
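
Something like this, perhaps (a purely hypothetical sketch, all field
names invented):

    /* in IndexInfo: assert-only bookkeeping */
    #ifdef USE_ASSERT_CHECKING
        bool        ii_InsertCalled;    /* set by index_insert() */
        bool        ii_CleanupCalled;   /* set by index_insert_cleanup() */
    #endif

    /* and then, wherever the IndexInfo reaches its end of life: */
    Assert(!indexInfo->ii_InsertCalled || indexInfo->ii_CleanupCalled);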


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: brininsert optimization opportunity

From
Alvaro Herrera
Date:
On 2024-Feb-13, Tomas Vondra wrote:

> One thing that is not very clear to me: I don't think there's a very
> good way to determine which places need the cleanup call. Because
> it depends on (a) what kind of index is used and (b) what happens in the
> code called earlier (which may easily do arbitrary stuff). Which means
> we have to call the cleanup whenever the code *might* have done inserts
> into the index. Maybe it's not such an issue in practice, though.

I think it's not an issue, or rather that we should not try to guess.
Instead make it a simple rule: if aminsert is called, then
aminsertcleanup must be called afterwards, period.

> On 1/8/24 16:51, Alvaro Herrera wrote:

> > So I think we should do 0001 and perhaps some further tweaks to the
> > original brininsert optimization commit: I think the aminsertcleanup
> > callback should receive the indexRelation as first argument; and also I
> > think it's not index_insert_cleanup()'s job to worry about whether
> > ii_AmCache is NULL or not; instead the callback should always be
> > invoked, and then it's aminsertcleanup's job to do nothing if
> > ii_AmCache is NULL. [...]

> I don't quite see why we should invoke the callback with ii_AmCache=NULL
> though. If there's nothing cached, why bother? It just means all cleanup
> callbacks have to do this NULL check on their own.

Guessing that aminsertcleanup is not needed when ii_AmCache is NULL
seems like a leaky abstraction.  I propose to have only the AM know
whether the cleanup call is important or not, without
index_insert_cleanup assuming that it's related to ii_AmCache.  Somebody
could decide to clean up something completely different during insert
cleanup, something that is not in ii_AmCache.
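
A contrived example of what I mean (an entirely hypothetical AM;
foo_release_insert_state() is invented):

    static void
    fooinsertcleanup(Relation index, IndexInfo *indexInfo)
    {
        /*
         * This AM keeps its insert state in a backend-local hash keyed by
         * the index OID, so there's nothing in ii_AmCache to test - only
         * the AM itself knows whether a cleanup is needed.
         */
        foo_release_insert_state(RelationGetRelid(index));
    }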

> I wonder if there's a nice way to check this in assert-enabled builds?
> Could we tweak nbtree (or index AM in general) to check that all places
> that called aminsert also called aminsertcleanup?
> 
> For example, I was thinking we might add a flag to IndexInfo (separate
> from the ii_AmCache), tracking if aminsert() was called, and then later
> check that aminsertcleanup() got called too. The problem, however, is
> there's no obviously convenient place for this check, because IndexInfo
> is not freed explicitly ...

I agree it would be nice to have a way to verify, but it doesn't seem
100% essential.  After all, it's not very common to add new calls to
aminsert.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/



Re: brininsert optimization opportunity

From
Michael Paquier
Date:
On Thu, Feb 29, 2024 at 01:20:39PM +0100, Alvaro Herrera wrote:
> I think it's not an issue, or rather that we should not try to guess.
> Instead make it a simple rule: if aminsert is called, then
> aminsertcleanup must be called afterwards, period.
>
> I agree it would be nice to have a way to verify, but it doesn't seem
> 100% essential.  After all, it's not very common to add new calls to
> aminsert.

This thread is listed as an open item.  What's the follow-up plan?
The last email on this thread is from the 29th of February, which was
6 weeks ago.
--
Michael

Attachment

Re: brininsert optimization opportunity

From
Tomas Vondra
Date:
On 4/18/24 09:07, Michael Paquier wrote:
> On Thu, Feb 29, 2024 at 01:20:39PM +0100, Alvaro Herrera wrote:
>> I think it's not an issue, or rather that we should not try to guess.
>> Instead make it a simple rule: if aminsert is called, then
>> aminsertcleanup must be called afterwards, period.
>>
>> I agree it would be nice to have a way to verify, but it doesn't seem
>> 100% essential.  After all, it's not very common to add new calls to
>> aminsert.
> 
> This thread is listed as an open item.  What's the follow-up plan?
> The last email on this thread is from the 29th of February, which was
> 6 weeks ago.

Apologies, I got distracted by the other patches. The bug is still
there; I believe the patch shared by Alvaro in [1] is the right way to
deal with it. I'll take care of that today/tomorrow.


[1]
https://www.postgresql.org/message-id/202401091043.e3nrqiad6gb7@alvherre.pgsql

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: brininsert optimization opportunity

From
Tomas Vondra
Date:
Hi,

Here are two patches to deal with this open item. 0001 is a trivial fix
of typos and wording; I moved it into a separate commit for clarity.
0002 does the actual fix - adding the missing index_insert_cleanup()
calls. It's 99% the patch Alvaro shared some time ago, with only some
minor formatting tweaks by me.

I've also returned to this comment of Alvaro's:

> Lastly, I kinda disagree with the notion that only some of the callers
> of aminsert should call aminsertcleanup, even though btree doesn't
> have an aminsertcleanup and thus it can't affect TOAST or catalogs.

which was a reaction to my earlier statement about places calling
index_insert():

> There's a call in toast_internals.c, but that seems OK because that
> only deals with btree indexes (and those don't need any cleanup). The
> same logic applies to unique_key_recheck(). The rest goes through
> execIndexing.c, which should do the cleanup in ExecCloseIndices().

I think Alvaro is right, so I went through all index_insert() callers
and checked which need the cleanup. Luckily there are not that many of
them; only 5 places call index_insert() directly:

1) toast_save_datum (src/backend/access/common/toast_internals.c)

  This is safe, because the index_insert() call passes indexInfo=NULL,
  so there can't possibly be any cache. If we ever decide to pass a
  valid indexInfo, we can add the cleanup; for now it seems pointless.

  Note: If someone created a BRIN index on a TOAST table, that'd already
  crash, because BRIN blindly dereferences the indexInfo. Maybe that
  should be fixed, but we don't support CREATE INDEX on TOAST tables.

2) heapam_index_validate_scan (src/backend/access/heap/heapam_handler.c)

  Covered by the committed fix, adding cleanup to validate_index.

3) CatalogIndexInsert (src/backend/catalog/indexing.c)

  Covered by all callers also calling CatalogCloseIndexes, which in turn
  calls ExecCloseIndices and cleanup.

4) unique_key_recheck (src/backend/commands/constraint.c)

  This seems like the only place missing the cleanup call.

5) ExecInsertIndexTuples (src/backend/executor/execIndexing.c)

  Should be covered by ExecCloseIndices, called after the insertions.


So it seems only (4) unique_key_recheck needs the extra call (it can't
happen at a higher level, because the indexInfo is a local variable), so
the 0002 patch adds the call there.
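
Roughly like this (a sketch of the placement only, not the exact hunk in
the patch):

    /* unique_key_recheck(), after the index_insert() call */
    index_insert(indexRel, values, isnull, &checktid,
                 trigdata->tg_relation, UNIQUE_CHECK_EXISTING,
                 false, indexInfo);

    /*
     * The IndexInfo is a local variable, so nobody else gets a chance to
     * clean up after us - do it here, right after the insert.
     */
    index_insert_cleanup(indexRel, indexInfo);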

The patch also adds a test for this (or rather tweaks an existing one).


It's a bit too late for me to push this now; I'll do so early tomorrow.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: brininsert optimization opportunity

From
Tomas Vondra
Date:
On 4/19/24 00:13, Tomas Vondra wrote:
> ...
> 
> It's a bit too late for me to push this now; I'll do so early tomorrow.
> 

FWIW I've pushed both patches, which resolves the open item, so I've
moved it to the "resolved" part on the wiki.


thanks

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: brininsert optimization opportunity

From
Michael Paquier
Date:
On Fri, Apr 19, 2024 at 04:14:00PM +0200, Tomas Vondra wrote:
> FWIW I've pushed both patches, which resolves the open item, so I've
> moved it to the "resolved" part on wiki.

Thanks, Tomas!
--
Michael

Attachment