Thread: [PoC] Improve dead tuple storage for lazy vacuum

[PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
Hi all,

Index vacuuming is one of the most time-consuming processes in lazy
vacuuming, and lazy_tid_reaped() is a large part of it. The attached
flame graph shows a profile of a vacuum on a table that has one index,
80 million live rows, and 20 million dead rows, where lazy_tid_reaped()
accounts for about 47% of the total vacuum execution time.

lazy_tid_reaped() is essentially an existence check: for every index
tuple, it checks whether the heap TID it points to exists in the set of
dead tuple TIDs. The maximum amount of memory for dead tuple TIDs is
limited by maintenance_work_mem, and if the upper limit is reached, the
heap scan is suspended, index vacuum and heap vacuum are performed, and
then the heap scan is resumed. Therefore, in terms of index vacuuming
performance, there are two important factors: the performance of
looking up TIDs in the set of dead tuples, and the memory usage of that
set. The former is obvious, whereas the latter affects the number of
index vacuuming passes. In many index AMs, index vacuuming (i.e.,
ambulkdelete) performs a full scan of the index, so for performance it
is important to avoid executing index vacuuming more than once during a
lazy vacuum.

Currently, the TIDs of dead tuples are stored in an array that is
collectively allocated at the start of lazy vacuum and TID lookup uses
bsearch(). There are the following challenges and limitations:

1. It cannot allocate more than 1GB. There was a discussion about
eliminating this limitation by using MemoryContextAllocHuge(), but
there were concerns about point 2[1].

2. The whole memory space is allocated at once.

3. Lookup performance is slow (O(log N)).
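
For reference, the current approach boils down to a single bsearch()
over one big sorted TID array, along these lines (a simplified
standalone sketch, not the actual vacuumlazy.c code):

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-in for PostgreSQL's 6-byte ItemPointerData. */
typedef struct DeadTid
{
    uint32_t block;   /* heap block number */
    uint16_t offset;  /* line pointer number within the block */
} DeadTid;

/* Same ordering as vacuumlazy.c's vac_cmp_itemptr(). */
static int
tid_cmp(const void *left, const void *right)
{
    const DeadTid *l = left;
    const DeadTid *r = right;

    if (l->block != r->block)
        return (l->block < r->block) ? -1 : 1;
    if (l->offset != r->offset)
        return (l->offset < r->offset) ? -1 : 1;
    return 0;
}

/* O(log N) existence check, called once per index tuple. */
static bool
tid_reaped(const DeadTid *tid, const DeadTid *dead, size_t ndead)
{
    return bsearch(tid, dead, ndead, sizeof(DeadTid), tid_cmp) != NULL;
}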

I’ve done some experiments in this area and would like to share the
results and discuss ideas.

Possible Solutions
===============

Firstly, I've considered using existing data structures: IntegerSet
(src/backend/lib/integerset.c) and TIDBitmap
(src/backend/nodes/tidbitmap.c). Both address point 1, but each
addresses only one of points 2 and 3. IntegerSet uses less memory
thanks to Simple-8b encoding but is slow at lookup, still O(log N),
since it's a tree structure. On the other hand, TIDBitmap has good
lookup performance, O(1), but can use more memory than necessary in
some cases, since it always allocates enough bitmap space to store all
possible offsets. With 8kB blocks, the maximum number of line pointers
in a heap page is 291 (cf. MaxHeapTuplesPerPage), so the bitmap is 40
bytes long and we always need 46 bytes in total per block, including
other meta information.

So I prototyped a new data structure dedicated to storing dead tuples
during lazy vacuum, borrowing the idea from Roaring Bitmap[2]. The
authors provide an implementation of Roaring Bitmap[3] (Apache 2.0
license), but I've implemented this idea from scratch because we need
to integrate it with Dynamic Shared Memory/Area to support parallel
vacuum, and we need to support ItemPointerData, a 6-byte integer in
total, whereas that implementation supports only 4-byte integers. Also,
when it comes to vacuum, we don't need to compute the intersection,
union, or difference between sets; we need only an existence check.

The data structure is somewhat similar to TIDBitmap. It consists of a
hash table and a container area; the hash table has one entry per
block, and each block entry allocates its own memory space, called a
container, in the container area to store its offset numbers. The
container area is simply an array of bytes and can be enlarged as
needed. Within the container area, the representation of the offset
numbers varies depending on their cardinality. There are three
container types: array, bitmap, and run.

For example, if there are two dead tuples at offsets 1 and 150, it uses
an array container holding an array of two 2-byte integers representing
1 and 150, using 4 bytes in total. If we used a bitmap container in
this case, we would need 20 bytes instead. On the other hand, if there
are 20 consecutive dead tuples from offset 1 to 20, it uses a run
container holding an array of pairs of 2-byte integers: the first value
in each pair represents a starting offset number, and the second value
represents the run's length. Therefore, in this case, the run container
uses only 4 bytes in total. Finally, if there are dead tuples at every
other offset from 1 to 100, it uses a bitmap container holding an
uncompressed bitmap, using 13 bytes. In every case we need another 16
bytes per block for the hash table entry.

The lookup complexity of a bitmap container is O(1), whereas that of an
array or a run container is O(log N) or O(N); but since the number of
elements in those two containers should not be large, this should not
be a problem.
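
To make the lookup concrete, here is a minimal sketch of the container
dispatch (the type and field names are made up for illustration; this
is not the PoC code itself):

#include <stdbool.h>
#include <stdint.h>

typedef enum { CONT_ARRAY, CONT_BITMAP, CONT_RUN } ContainerType;

/* Hash table entry for one heap block (illustrative layout). */
typedef struct BlockEntry
{
    uint32_t      blkno;    /* hash key: heap block number */
    ContainerType type;
    uint16_t      nitems;   /* #offsets (array) or #runs (run) */
    uint8_t      *data;     /* points into the container area */
} BlockEntry;

/* Does this block's container contain 'off'? */
static bool
container_lookup(const BlockEntry *e, uint16_t off)
{
    const uint16_t *v = (const uint16_t *) e->data;

    switch (e->type)
    {
        case CONT_BITMAP:
            /* O(1): test one bit of the uncompressed bitmap */
            return (e->data[off / 8] & (1 << (off % 8))) != 0;

        case CONT_ARRAY:
        {
            /* sorted 2-byte offsets: binary search, O(log N) */
            uint16_t lo = 0, hi = e->nitems;

            while (lo < hi)
            {
                uint16_t mid = (lo + hi) / 2;

                if (v[mid] < off)
                    lo = mid + 1;
                else
                    hi = mid;
            }
            return lo < e->nitems && v[lo] == off;
        }

        case CONT_RUN:
            /* (start, length) pairs of 2-byte values: O(N) scan */
            for (uint16_t i = 0; i < e->nitems; i++)
                if (off >= v[2 * i] && off < v[2 * i] + v[2 * i + 1])
                    return true;
            return false;
    }
    return false;
}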

Evaluation
========

Before implementing this idea and integrating it with the lazy vacuum
code, I've implemented a benchmark tool dedicated to evaluating
lazy_tid_reaped() performance[4]. It provides functions for generating
TIDs for both index tuples and dead tuples, loading the dead tuples
into a data structure, and simulating lazy_tid_reaped() using those
virtual index tuples and dead tuples. The code lacks many features such
as iteration and DSM/DSA support, but it makes testing data structures
easier.
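
The lookup benchmark itself is conceptually just a timed probe loop,
something like this (deadtuples_lookup() is a hypothetical stand-in for
whichever data structure is under test, and DeadTid is the stand-in TID
type from the sketch above):

#include <stdbool.h>
#include <stddef.h>
#include <time.h>

extern bool deadtuples_lookup(void *store, const DeadTid *tid);

/* Probe the store once per virtual index tuple; return elapsed ms. */
static double
bench_lookups(void *store, const DeadTid *tids, size_t ntids)
{
    struct timespec t0, t1;
    volatile size_t hits = 0;   /* defeat dead-code elimination */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < ntids; i++)
        hits += deadtuples_lookup(store, &tids[i]);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (t1.tv_sec - t0.tv_sec) * 1000.0 +
           (t1.tv_nsec - t0.tv_nsec) / 1e6;
}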

FYI, I've confirmed the validity of this tool. When I ran a vacuum on a
3GB table, index vacuuming took 12.3 sec and lazy_tid_reaped() took
approximately 8.5 sec. Simulating a similar situation with the tool,
the lookup benchmark with the array data structure took approximately
8.0 sec. Given that the tool doesn't simulate the cost of function
calls, it seems to simulate the real thing reasonably well.

I've evaluated the lookup performance and memory footprint of four data
structures: array, integerset (intset), tidbitmap (tbm), and roaring
tidbitmap (rtbm), while changing the distribution of dead tuples across
blocks. Since tbm doesn't have a function for existence checks, I added
one, and I allocated enough memory to make sure that tbm never becomes
lossy during the evaluation. In all test cases, I simulated a table
with 1,000,000 blocks where every block has at least one dead tuple.
The benchmark scenario is that, for each virtual index tuple, we check
whether its heap TID exists in the dead tuple storage. Here are the
results, with execution time in milliseconds and memory usage in bytes:

* Test-case 1 (10 dead tuples at 20-offset intervals)

An array container is selected in this test case, using 20 bytes for each block.

          Execution Time  Memory Usage
array          14,140.91    60,008,248
intset          9,350.08    50,339,840
tbm             1,299.62   100,671,544
rtbm            1,892.52    58,744,944

* Test-case 2 (10 consecutive dead tuples from offset 1)

A bitmap container is selected in this test case, using 2 bytes for each block.

          Execution Time  Memory Usage
array           1,056.60    60,008,248
intset            650.85    50,339,840
tbm               194.61   100,671,544
rtbm              154.57    27,287,664

* Test-case 3 (2 dead tuples at offsets 1 and 100)

An array container is selected in this test case, using 4 bytes for
each block. Since the 'array' data structure (not the array container
of rtbm) uses only 12 bytes per block, and rtbm pays the additional
hash table entry per block, the 'array' data structure uses less memory
here.

          Execution Time  Memory Usage
array           6,054.22    12,008,248
intset          4,203.41    16,785,408
tbm               759.17   100,671,544
rtbm              750.08    29,384,816

* Test-case 4 (100 consecutive dead tuples from offset 1)

A run container is selected in this test case, using 4 bytes for each block.

          Execution Time  Memory Usage
array           8,883.03   600,008,248
intset          7,358.23   100,671,488
tbm               758.81   100,671,544
rtbm              764.33    29,384,816

Overall, 'rtbm' has a much better lookup performance and good memory
usage especially if there are relatively many dead tuples. However, in
some cases, 'intset' and 'array' have a better memory usage.

Feedback is very welcome. Thank you for reading the email through to the end.

Regards,

[1] https://www.postgresql.org/message-id/CAGTBQpbDCaR6vv9%3DscXzuT8fSbckf%3Da3NgZdWFWZbdVugVht6Q%40mail.gmail.com
[2] http://roaringbitmap.org/
[3] https://github.com/RoaringBitmap/CRoaring
[4] https://github.com/MasahikoSawada/pgtools/tree/master/bdbench

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Matthias van de Meent
Date:
On Wed, 7 Jul 2021 at 13:47, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi all,
>
> Index vacuuming is one of the most time-consuming processes in lazy
> vacuuming, and lazy_tid_reaped() is a large part of it. The attached
> flame graph shows a profile of a vacuum on a table that has one index,
> 80 million live rows, and 20 million dead rows, where lazy_tid_reaped()
> accounts for about 47% of the total vacuum execution time.
>
> [...]
>
> Overall, 'rtbm' has a much better lookup performance and good memory
> usage especially if there are relatively many dead tuples. However, in
> some cases, 'intset' and 'array' have a better memory usage.

Those are some great results, with a good path to meaningful improvements.

> Feedback is very welcome. Thank you for reading the email through to the end.

The currently available infrastructure for TIDs is quite ill-defined
for TableAM authors [0], and other TableAMs might want to use more than
just the 11 bits that heapam's MaxHeapTuplesPerPage requires at the
maximum BLCKSZ to identify tuples. (MaxHeapTuplesPerPage is 1169 at the
maximum 32kB BLCKSZ, which requires 11 bits to fit.)

Could you also check what the (performance, memory) impact would be if
these proposed structures were to support the maximum
MaxHeapTuplesPerPage of 1169 or the full uint16-range of offset
numbers that could be supported by our current TID struct?
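
For reference, MaxHeapTuplesPerPage is derived from the block size
roughly as follows (a back-of-envelope version of the real macro in
htup_details.h, assuming 8-byte MAXALIGN):

/* (BLCKSZ - page header) / (aligned tuple header + line pointer) */
#define PAGE_HEADER_SZ   24   /* SizeOfPageHeaderData */
#define TUPLE_HEADER_SZ  24   /* MAXALIGN(SizeofHeapTupleHeader) */
#define ITEM_ID_SZ        4   /* sizeof(ItemIdData) */

#define MAX_HEAP_TUPLES_PER_PAGE(blcksz) \
    (((blcksz) - PAGE_HEADER_SZ) / (TUPLE_HEADER_SZ + ITEM_ID_SZ))

/* MAX_HEAP_TUPLES_PER_PAGE(8192)  == 291   (9 bits)
 * MAX_HEAP_TUPLES_PER_PAGE(32768) == 1169  (11 bits) */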

Kind regards,

Matthias van de Meent

[0] https://www.postgresql.org/message-id/flat/0bbeb784050503036344e1f08513f13b2083244b.camel%40j-davis.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Peter Geoghegan
Date:
On Wed, Jul 7, 2021 at 4:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Currently, the TIDs of dead tuples are stored in an array that is
> collectively allocated at the start of lazy vacuum and TID lookup uses
> bsearch(). There are the following challenges and limitations:
>
> 1. It cannot allocate more than 1GB. There was a discussion about
> eliminating this limitation by using MemoryContextAllocHuge(), but
> there were concerns about point 2[1].

I think that the main problem with the 1GB limitation is that it is
surprising -- it can cause disruption when we first exceed the magical
limit of ~174 million TIDs. This can cause us to dirty index pages a
second time when we might have been able to just do it once with
sufficient memory for TIDs. OTOH there are actually cases where having
less memory for TIDs makes performance *better* because of locality
effects. This perverse behavior with memory sizing isn't a rare case
that we can safely ignore -- unfortunately it's fairly common.

My point is that we should be careful to choose the correct goal.
Obviously memory use matters. But it might be more helpful to think of
memory use as just a proxy for what truly matters, not a goal in
itself. It's hard to know what this means (what is the "real goal"?),
and hard to measure it even if you know for sure. It could still be
useful to think of it like this.

> A run container is selected in this test case, using 4 bytes for each block.
>
>           Execution Time  Memory Usage
> array       8,883.03        600,008,248
> intset       7,358.23        100,671,488
> tbm            758.81         100,671,544
> rtbm           764.33          29,384,816
>
> Overall, 'rtbm' has a much better lookup performance and good memory
> usage especially if there are relatively many dead tuples. However, in
> some cases, 'intset' and 'array' have a better memory usage.

This seems very promising.

I wonder how much you have thought about the index AM side. It makes
sense to initially evaluate these techniques using this approach of
separating the data structure from how it is used by VACUUM -- I think
that that was a good idea. But at the same time there may be certain
important theoretical questions that cannot be answered this way --
questions about how everything "fits together" in a real VACUUM might
matter a lot. You've probably thought about this at least a little
already. Curious to hear how you think it "fits together" with the
work that you've done already.

The loop inside btvacuumpage() makes each loop iteration call the
callback -- this is always a call to lazy_tid_reaped() in practice.
And that's where we do binary searches. These binary searches are
usually where we see a huge number of cycles spent when we look at
profiles, including the profile that produced your flame graph. But I
worry that that might be a bit misleading -- the way that profilers
attribute costs is very complicated and can never be fully trusted.
While it is true that lazy_tid_reaped() often accesses main memory,
which will of course add a huge amount of latency and make it a huge
bottleneck, the "big picture" is still relevant.

I think that the compiler currently has to make very conservative
assumptions when generating the machine code used by the loop inside
btvacuumpage(), which calls through an opaque function pointer at
least once per loop iteration -- anything can alias, so the compiler
must be conservative. The data dependencies are hard for both the
compiler and the CPU to analyze. The cost of using a function pointer
compared to a direct function call is usually quite low, but there are
important exceptions -- cases where it prevents other useful
optimizations. Maybe this is an exception.

I wonder how much it would help to break up that loop into two loops.
Make the callback into a batch operation that generates state that
describes what to do with each and every index tuple on the leaf page.
The first loop would build a list of TIDs, then you'd call into
vacuumlazy.c and get it to process the TIDs, and finally the second
loop would physically delete the TIDs that need to be deleted. This
would mean that there would be only one call per leaf page per
btbulkdelete(). This would reduce the number of calls to the callback
by at least 100x, and maybe more than 1000x.

This approach would make btbulkdelete() similar to
_bt_simpledel_pass() + _bt_delitems_delete_check(). This is not really
an independent idea to your ideas -- I imagine that this would work
far better when combined with a more compact data structure, which is
naturally more capable of batch processing than a simple array of
TIDs. Maybe this will help the compiler and the CPU to fully
understand the *natural* data dependencies, so that they can be as
effective as possible in making the code run fast. It's possible that
a modern CPU will be able to *hide* the latency more intelligently
than what we have today. The latency is such a big problem that we may
be able to justify "wasting" other CPU resources, just because it
sometimes helps with hiding the latency. For example, it might
actually be okay to sort all of the TIDs on the page to make the bulk
processing work -- though you might still do a precheck that is
similar to the precheck inside lazy_tid_reaped() that was added by you
in commit bbaf315309e.

Of course it's very easy to be wrong about stuff like this. But it
might not be that hard to prototype. You can literally copy and paste
code from _bt_delitems_delete_check() to do this. It does the same
basic thing already.
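
To sketch the shape I have in mind (all helper names here are
hypothetical; only the structure matters):

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical batched callback into vacuumlazy.c: one call per leaf
 * page, setting deletable[i] for each TID found in the dead TID set.
 * DeadTid is the stand-in TID type from the earlier sketch. */
extern void vac_tid_reaped_batch(void *vacstate, const DeadTid *tids,
                                 size_t ntids, bool *deletable);

static void
vacuum_one_leaf_page(void *vacstate, const DeadTid *page_tids,
                     size_t ntids, bool *deletable,
                     size_t *to_delete, size_t *ndelete)
{
    /* Loop 1 (in the caller): collect page_tids from the leaf page. */

    /* One batched callback instead of ntids separate callback calls. */
    vac_tid_reaped_batch(vacstate, page_tids, ntids, deletable);

    /* Loop 2: gather the positions to physically delete. */
    *ndelete = 0;
    for (size_t i = 0; i < ntids; i++)
        if (deletable[i])
            to_delete[(*ndelete)++] = i;
}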

-- 
Peter Geoghegan



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Peter Geoghegan
Date:
On Wed, Jul 7, 2021 at 1:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I wonder how much it would help to break up that loop into two loops.
> Make the callback into a batch operation that generates state that
> describes what to do with each and every index tuple on the leaf page.
> The first loop would build a list of TIDs, then you'd call into
> vacuumlazy.c and get it to process the TIDs, and finally the second
> loop would physically delete the TIDs that need to be deleted. This
> would mean that there would be only one call per leaf page per
> btbulkdelete(). This would reduce the number of calls to the callback
> by at least 100x, and maybe more than 1000x.

Maybe for something like rtbm.c (which is inspired by Roaring
bitmaps), you would really want to use an "intersection" operation for
this. The TIDs that we need to physically delete from the leaf page
inside btvacuumpage() are the intersection of two bitmaps: our bitmap
of all TIDs on the leaf page, and our bitmap of all TIDs that need to
be deleted by the ongoing btbulkdelete() call.

Obviously the typical case is that most TIDs in the index do *not* get
deleted -- needing to delete more than ~20% of all TIDs in the index
will be rare. Ideally it would be very cheap to figure out that a TID
does not need to be deleted at all. Something a little like a negative
cache (but not a true negative cache). This is a little bit like how
hash joins can be made faster by adding a Bloom filter -- most hash
probes don't need to join a tuple in the real world, and we can make
these hash probes even faster by using a Bloom filter as a negative
cache.

If you had the list of TIDs from a leaf page sorted for batch
processing, and if you had roaring bitmap style "chunks" with
"container" metadata stored in the data structure, you could then use
merging/intersection -- that has some of the same advantages. I think
that this would be a lot more efficient than having one binary search
per TID. Most TIDs from the leaf page can be skipped over very
quickly, in large groups. It's very rare for VACUUM to need to delete
TIDs from completely random heap table blocks in the real world (some
kind of pattern is much more common).

When this merging process finds 1 TID that might really be deletable
then it's probably going to find much more than 1 -- better to make
that cache miss take care of all of the TIDs together. Also seems like
the CPU could do some clever prefetching with this approach -- it
could prefetch TIDs where the initial chunk metadata is insufficient
to eliminate them early -- these are the groups of TIDs that will have
many TIDs that we actually need to delete. ISTM that improving
temporal locality through batching could matter a lot here.
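
In code, the merge could look roughly like this, reusing the
illustrative BlockEntry/container_lookup() sketch from upthread plus a
hypothetical rtbm_find_block():

extern const BlockEntry *rtbm_find_block(void *rtbm, uint32_t blkno);

/* Intersect a leaf page's sorted TIDs with the dead-TID structure,
 * probing the per-block hash table once per heap block rather than
 * once per TID; blocks with no dead tuples are skipped as a group. */
static size_t
intersect_leaf_tids(void *rtbm, const DeadTid *tids, size_t ntids,
                    bool *deletable)
{
    size_t ndeletable = 0;
    size_t i = 0;

    while (i < ntids)
    {
        uint32_t blk = tids[i].block;
        const BlockEntry *e = rtbm_find_block(rtbm, blk);

        /* consume the whole run of TIDs from this heap block */
        for (; i < ntids && tids[i].block == blk; i++)
        {
            deletable[i] = (e != NULL) &&
                container_lookup(e, tids[i].offset);
            if (deletable[i])
                ndeletable++;
        }
    }
    return ndeletable;
}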

-- 
Peter Geoghegan



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Jul 7, 2021 at 11:25 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
>
> On Wed, 7 Jul 2021 at 13:47, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Hi all,
> >
> > Index vacuuming is one of the most time-consuming processes in lazy
> > vacuuming, and lazy_tid_reaped() is a large part of it. The attached
> > flame graph shows a profile of a vacuum on a table that has one index,
> > 80 million live rows, and 20 million dead rows, where lazy_tid_reaped()
> > accounts for about 47% of the total vacuum execution time.
> >
> > [...]
> >
> > Overall, 'rtbm' has a much better lookup performance and good memory
> > usage especially if there are relatively many dead tuples. However, in
> > some cases, 'intset' and 'array' have a better memory usage.
>
> Those are some great results, with a good path to meaningful improvements.
>
> > Feedback is very welcome. Thank you for reading the email through to the end.
>
> The currently available infrastructure for TIDs is quite ill-defined
> for TableAM authors [0], and other TableAMs might want to use more than
> just the 11 bits that heapam's MaxHeapTuplesPerPage requires at the
> maximum BLCKSZ to identify tuples. (MaxHeapTuplesPerPage is 1169 at the
> maximum 32kB BLCKSZ, which requires 11 bits to fit.)
>
> Could you also check what the (performance, memory) impact would be if
> these proposed structures were to support the maximum
> MaxHeapTuplesPerPage of 1169 or the full uint16-range of offset
> numbers that could be supported by our current TID struct?

I think tbm will be the most affected by the memory impact of the
larger maximum MaxHeapTuplesPerPage. For example, with 32kB blocks
(MaxHeapTuplesPerPage = 1169), even if there is only one dead tuple in
a block, it will always require at least 147 bytes per block.

Rtbm chooses the container type among array, bitmap, or run depending
on the number and distribution of dead tuples in a block, and only
bitmap containers can be searched with O(1). Run containers depend on
the distribution of dead tuples within a block. So let’s compare array
and bitmap containers.

With 8kB blocks (MaxHeapTuplesPerPage = 291), at most 36 bytes are
needed for a bitmap container. In other words, when compared to an
array container, a bitmap will be chosen if there are more than 18 dead
tuples in a block. On the other hand, with 32kB blocks
(MaxHeapTuplesPerPage = 1169), at most 147 bytes are needed for a
bitmap container, so a bitmap container will be chosen if there are
more than 74 dead tuples in a block. And with the full uint16 range
(MaxHeapTuplesPerPage = 65535), at most 8192 bytes are needed, so a
bitmap container will be chosen if there are more than 4096 dead tuples
in a block. Therefore, in any case, if more than about 6% of the tuples
in a block are garbage, a bitmap container will be chosen, bringing
faster lookup performance. (Of course, if a run container is chosen,
the container size gets smaller but the lookup performance is
O(log N).) But if the number of dead tuples in the table is small and
we have the larger MaxHeapTuplesPerPage, an array container is likely
to be chosen, and the lookup performance becomes O(log N). Still, it
should be faster than the array data structure, because the range of
search targets in an array container is much smaller.
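
Put differently, the break-even point between array and bitmap
containers scales with the block size. Roughly (my own back-of-envelope
sketch, not code from the PoC):

/* An array container costs 2 bytes per dead tuple; a bitmap container
 * costs about max_offsets/8 bytes regardless of the count, so the
 * break-even is near max_offsets/16 dead tuples per block. */
static int
container_break_even(int max_offsets)
{
    int bitmap_bytes = (max_offsets + 7) / 8;

    return bitmap_bytes / 2;
}

/* container_break_even(291)   ~  18
 * container_break_even(1169)  ~  73
 * container_break_even(65535) == 4096 */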

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Jul 8, 2021 at 5:24 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Jul 7, 2021 at 4:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Currently, the TIDs of dead tuples are stored in an array that is
> > collectively allocated at the start of lazy vacuum and TID lookup uses
> > bsearch(). There are the following challenges and limitations:
> >
> > 1. It cannot allocate more than 1GB. There was a discussion about
> > eliminating this limitation by using MemoryContextAllocHuge(), but
> > there were concerns about point 2[1].
>
> I think that the main problem with the 1GB limitation is that it is
> surprising -- it can cause disruption when we first exceed the magical
> limit of ~174 million TIDs. This can cause us to dirty index pages a
> second time when we might have been able to just do it once with
> sufficient memory for TIDs. OTOH there are actually cases where having
> less memory for TIDs makes performance *better* because of locality
> effects. This perverse behavior with memory sizing isn't a rare case
> that we can safely ignore -- unfortunately it's fairly common.
>
> My point is that we should be careful to choose the correct goal.
> Obviously memory use matters. But it might be more helpful to think of
> memory use as just a proxy for what truly matters, not a goal in
> itself. It's hard to know what this means (what is the "real goal"?),
> and hard to measure it even if you know for sure. It could still be
> useful to think of it like this.

As I wrote in the first email, I think there are two important factors
in index vacuuming performance: the performance of checking whether the
heap TID that an index tuple points to is dead, and the number of times
index bulk-deletion is performed. The flame graph I attached in the
first mail shows that the CPU spent much time in lazy_tid_reaped(), but
vacuum is a disk-intensive operation in practice. Given that most index
AMs' bulk-deletion does a full index scan and a table can have multiple
indexes, reducing the number of index bulk-deletion passes really
contributes to reducing the execution time, especially for large
tables. I think that a more compact data structure for dead tuple TIDs
is one of the ways to achieve that.

>
> > A run container is selected in this test case, using 4 bytes for each block.
> >
> >           Execution Time  Memory Usage
> > array       8,883.03        600,008,248
> > intset       7,358.23        100,671,488
> > tbm            758.81         100,671,544
> > rtbm           764.33          29,384,816
> >
> > Overall, 'rtbm' has a much better lookup performance and good memory
> > usage especially if there are relatively many dead tuples. However, in
> > some cases, 'intset' and 'array' have a better memory usage.
>
> This seems very promising.
>
> I wonder how much you have thought about the index AM side. It makes
> sense to initially evaluate these techniques using this approach of
> separating the data structure from how it is used by VACUUM -- I think
> that that was a good idea. But at the same time there may be certain
> important theoretical questions that cannot be answered this way --
> questions about how everything "fits together" in a real VACUUM might
> matter a lot. You've probably thought about this at least a little
> already. Curious to hear how you think it "fits together" with the
> work that you've done already.

Yeah, that definitely needs to be considered. Currently, what we need
from the dead tuple storage for lazy vacuum is store, lookup, and
iteration. And given parallel vacuum, it has to be able to be allocated
in DSM or DSA. While implementing the PoC code, I'm trying to integrate
it with the current lazy vacuum code. As far as I've seen so far, the
integration is not hard, at least with the *current* lazy vacuum code
and index AM code.

>
> The loop inside btvacuumpage() makes each loop iteration call the
> callback -- this is always a call to lazy_tid_reaped() in practice.
> And that's where we do binary searches. These binary searches are
> usually where we see a huge number of cycles spent when we look at
> profiles, including the profile that produced your flame graph. But I
> worry that that might be a bit misleading -- the way that profilers
> attribute costs is very complicated and can never be fully trusted.
> While it is true that lazy_tid_reaped() often accesses main memory,
> which will of course add a huge amount of latency and make it a huge
> bottleneck, the "big picture" is still relevant.
>
> I think that the compiler currently has to make very conservative
> assumptions when generating the machine code used by the loop inside
> btvacuumpage(), which calls through an opaque function pointer at
> least once per loop iteration -- anything can alias, so the compiler
> must be conservative. The data dependencies are hard for both the
> compiler and the CPU to analyze. The cost of using a function pointer
> compared to a direct function call is usually quite low, but there are
> important exceptions -- cases where it prevents other useful
> optimizations. Maybe this is an exception.
>
> I wonder how much it would help to break up that loop into two loops.
> Make the callback into a batch operation that generates state that
> describes what to do with each and every index tuple on the leaf page.
> The first loop would build a list of TIDs, then you'd call into
> vacuumlazy.c and get it to process the TIDs, and finally the second
> loop would physically delete the TIDs that need to be deleted. This
> would mean that there would be only one call per leaf page per
> btbulkdelete(). This would reduce the number of calls to the callback
> by at least 100x, and maybe more than 1000x.
>
> This approach would make btbulkdelete() similar to
> _bt_simpledel_pass() + _bt_delitems_delete_check(). This is not really
> an independent idea to your ideas -- I imagine that this would work
> far better when combined with a more compact data structure, which is
> naturally more capable of batch processing than a simple array of
> TIDs. Maybe this will help the compiler and the CPU to fully
> understand the *natural* data dependencies, so that they can be as
> effective as possible in making the code run fast. It's possible that
> a modern CPU will be able to *hide* the latency more intelligently
> than what we have today. The latency is such a big problem that we may
> be able to justify "wasting" other CPU resources, just because it
> sometimes helps with hiding the latency. For example, it might
> actually be okay to sort all of the TIDs on the page to make the bulk
> processing work -- though you might still do a precheck that is
> similar to the precheck inside lazy_tid_reaped() that was added by you
> in commit bbaf315309e.

Interesting idea. I remember you mentioned this idea somewhere, and
I've considered it too while implementing the PoC code. It's definitely
worth trying. Maybe we can write this as a separate patch? It would
change the index AM interface and could also improve the current
bulk-deletion. We can consider a better data structure on top of this
idea.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Hannu Krosing
Date:
Very nice results.

I have been working on the same problem but with a bit different
solution - a mix of binary search for (sub)pages and 32-bit bitmaps for
TID-in-page.

Even with the current allocation heuristics (allocating 291 TIDs per
page), it initially allocates much less space: instead of the current
291*6=1746 bytes per page, it needs to allocate only 80 bytes.

Also, it can be laid out so that it is friendly to parallel SIMD
searches doing up to 8 TID lookups in parallel.

That said, for allocating the TID array, the best solution is to
postpone it as much as possible and to do the initial collection into a
file, which

1) postpones the memory allocation to the beginning of index cleanups

2) lets you select the correct size and structure, as you know more
about the distribution at that time

3) lets you do the first heap pass in one go and then advance
frozenxmin *before* index cleanup

Also, collecting dead TIDs into a file makes it trivial (well, almost
:) ) to parallelize the initial heap scan, so more resources can be
thrown at it if available.

Cheers
-----
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.




On Thu, Jul 8, 2021 at 10:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> [...]



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Hannu Krosing
Date:
Resending as I forgot to send it to the list (thanks Peter :) )

On Wed, Jul 7, 2021 at 10:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> The loop inside btvacuumpage() makes each loop iteration call the
> callback -- this is always a call to lazy_tid_reaped() in practice.
> And that's where we do binary searches. These binary searches are
> usually where we see a huge number of cycles spent when we look at
> profiles, including the profile that produced your flame graph. But I
> worry that that might be a bit misleading -- the way that profilers
> attribute costs is very complicated and can never be fully trusted.
> While it is true that lazy_tid_reaped() often accesses main memory,
> which will of course add a huge amount of latency and make it a huge
> bottleneck, the "big picture" is still relevant.

This is why I have mainly focused on making it possible to use SIMD and
run 4-8 binary searches in parallel, mostly 8, for AVX2.

How I am approaching this is separating the "page search" to run over
a (naturally) sorted array of 32-bit page pointers, and only when the
page is found are the indexes in this array used to look up the in-page
bitmaps. This allows the heavier bsearch activity to run over a smaller
range of memory, hopefully reducing the cache thrashing.

There are opportunities to optimise this further for cache hits, by
collecting the TIDs from indexes in larger batches and then
constraining the searches in the main is-deleted bitmap to run over
sections of it, but at some point this becomes a very complex balancing
act, as the manipulation of the bits-to-check from indexes also takes
time, not to mention the need to release the index pages and then later
chase the TID pointers in case they have moved while checking them.

I have not measured anything yet, but one of my concerns is that, with
very large dead tuple collections, an 8-way parallel bsearch could
actually get close to saturating RAM bandwidth by reading (8 x 32 bits
x cache-line-size) bytes from main memory every few cycles, so we may
need some inner-loop-level throttling similar to the current
vacuum_cost_limit for data pages.

> I think that the compiler currently has to make very conservative
> assumptions when generating the machine code used by the loop inside
> btvacuumpage(), which calls through an opaque function pointer at
> least once per loop iteration -- anything can alias, so the compiler
> must be conservative.

Definitely this! The lookup function needs to be turned into an inline
function or #define as well, to give the compiler maximum freedom.

> The data dependencies are hard for both the
> compiler and the CPU to analyze. The cost of using a function pointer
> compared to a direct function call is usually quite low, but there are
> important exceptions -- cases where it prevents other useful
> optimizations. Maybe this is an exception.

Yes. Also, this could be a place where unrolling the loop could make a
real difference.

Maybe not unrolling the full 32 iterations for a 32-bit bsearch, but
something like an 8-iteration unroll to get most of the benefit.

The 32x unroll would not really be that bad for performance if all 32
iterations were needed, but mostly we would need to jump into the last
10 to 20 iterations when looking up 1,000 to 1,000,000 pages, and I
suspect this is such a weird corner case that the compiler is really
unlikely to support this optimisation. Of course I may be wrong and it
is a common enough case for the optimiser.

>
> I wonder how much it would help to break up that loop into two loops.
> Make the callback into a batch operation that generates state that
> describes what to do with each and every index tuple on the leaf page.
> The first loop would build a list of TIDs, then you'd call into
> vacuumlazy.c and get it to process the TIDs, and finally the second
> loop would physically delete the TIDs that need to be deleted. This
> would mean that there would be only one call per leaf page per
> btbulkdelete(). This would reduce the number of calls to the callback
> by at least 100x, and maybe more than 1000x.

While it may make sense to have different bitmap encodings for
different distributions, it likely would not be good for optimisations
if all these are used at the same time.

This is why I propose the first bitmap collecting phase to collect
into a file and then - when reading into memory for lookups phase -
possibly rewrite the initial structure to something else if it sees
that it is more efficient. Like for example where the first half of
the file consists of only empty pages.

> This approach would make btbulkdelete() similar to
> _bt_simpledel_pass() + _bt_delitems_delete_check(). This is not really
> an independent idea to your ideas -- I imagine that this would work
> far better when combined with a more compact data structure, which is
> naturally more capable of batch processing than a simple array of
> TIDs. Maybe this will help the compiler and the CPU to fully
> understand the *natural* data dependencies, so that they can be as
> effective as possible in making the code run fast. It's possible that
> a modern CPU will be able to *hide* the latency more intelligently
> than what we have today. The latency is such a big problem that we may
> be able to justify "wasting" other CPU resources, just because it
> sometimes helps with hiding the latency. For example, it might
> actually be okay to sort all of the TIDs on the page to make the bulk
> processing work

Then again it may be so much extra work that it starts to dominate
some parts of profiles.

For example see the work that was done in improving the mini-vacuum
part where it was actually faster to copy data out to a separate
buffer and then back in than shuffle it around inside the same 8k page
:)

So only testing will tell.

> -- though you might still do a precheck that is
> similar to the precheck inside lazy_tid_reaped() that was added by you
> in commit bbaf315309e.
>
> Of course it's very easy to be wrong about stuff like this. But it
> might not be that hard to prototype. You can literally copy and paste
> code from _bt_delitems_delete_check() to do this. It does the same
> basic thing already.

Also, a lot of testing would be needed to figure out which strategy
fits best for which distribution of dead tuples, and possibly their
relation to the order of tuples to check from indexes.


Cheers

--
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Peter Geoghegan
Date:
On Thu, Jul 8, 2021 at 1:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> As I wrote in the first email, I think there are two important factors
> in index vacuuming performance: the performance of checking whether the
> heap TID that an index tuple points to is dead, and the number of times
> index bulk-deletion is performed. The flame graph I attached in the
> first mail shows that the CPU spent much time in lazy_tid_reaped(), but
> vacuum is a disk-intensive operation in practice.

Maybe. But I recently bought an NVME SSD that can read at over
6GB/second. So "disk-intensive" is not what it used to be -- at least
not for reads. In general it's not good if we do multiple scans of an
index -- no question. But there is a danger in paying a little too
much attention to what is true in general -- we should not ignore what
might be true in specific cases either. Maybe we can solve some
problems by spilling the TID data structure to disk -- if we trade
sequential I/O for random I/O, we may be able to do only one pass over
the index (especially when we have *almost* enough memory to fit all
TIDs, but not quite enough).

The big problem with multiple passes over the index is not the extra
read bandwidth -- it's the extra page dirtying (writes), especially
with things like indexes on UUID columns. We want to dirty each leaf
page in each index at most once per VACUUM, and should be willing to
pay some cost in order to get a larger benefit with page dirtying.
After all, writes are much more expensive on modern flash devices --
if we have to do more random read I/O to spill the TIDs then that
might actually be 100% worth it. And, we don't need much memory for
something that works well as a negative cache, either -- so maybe the
extra random read I/O needed to spill the TIDs will be very limited
anyway.

There are many possibilities. You can probably think of other
trade-offs yourself. We could maybe use a cost model for all this --
it is a little like a hash join IMV. This is just something to think
about while refining the design.

> Interesting idea. I remember you mentioned this idea somewhere, and
> I've considered it too while implementing the PoC code. It's definitely
> worth trying. Maybe we can write this as a separate patch? It would
> change the index AM interface and could also improve the current
> bulk-deletion. We can consider a better data structure on top of this
> idea.

I'm happy for it to be written as a separate patch, either leaving it
to you or collaborating directly. It's not necessary to tie it to the
first patch. But at the same time it is highly related to what you're
already doing.

As I said I am totally prepared to be wrong here. But it seems worth
it to try. In Postgres 14, the _bt_delitems_vacuum() function (which
actually carries out VACUUM's physical page modifications to a leaf
page) is almost identical to _bt_delitems_delete(). And
_bt_delitems_delete() was already built with these kinds of problems
in mind -- it batches work to get the most out of synchronizing with
distant state describing which tuples to delete. It's not exactly the
same situation, but it's *kinda* similar. More importantly, it's a
relatively cheap and easy experiment to run, since we already have
most of what we need (we can take it from
_bt_delitems_delete_check()).

Usually this kind of micro optimization is not very valuable -- 99.9%+
of all code just isn't that sensitive to having the right
optimizations. But this is one of the rare important cases where we
really should look at the raw machine code, and do some kind of
microarchitectural level analysis through careful profiling, using
tools like perf. The laws of physics (or electronic engineering) make
it inevitable that searching for TIDs to match is going to be kind of
slow. But we should at least make sure that we use every trick
available to us to reduce the bottleneck, since it really does matter
a lot to users. Users should be able to expect that this code will at
least be as fast as the hardware that they paid for can allow (or
close to it). There is a great deal of microarchitectural
sophistication with modern CPUs, much of which is designed to make
problems like this one less bad [1].

[1] https://www.agner.org/optimize/microarchitecture.pdf
-- 
Peter Geoghegan



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Peter Geoghegan
Date:
On Thu, Jul 8, 2021 at 1:53 PM Hannu Krosing <hannuk@google.com> wrote:
> How I am approaching this is separating the "page search" to run over
> a (naturally) sorted array of 32-bit page pointers, and only when the
> page is found are the indexes in this array used to look up the
> in-page bitmaps.
> This allows the heavier bsearch activity to run over a smaller range
> of memory, hopefully reducing the cache thrashing.

I think that the really important thing is to figure out roughly the
right data structure first.

> There are opportunities to optimise this further for cache hits, by
> collecting the TIDs from indexes in larger batches and then
> constraining the searches in the main is-deleted bitmap to run over
> sections of it, but at some point this becomes a very complex
> balancing act, as the manipulation of the bits-to-check from indexes
> also takes time, not to mention the need to release the index pages
> and then later chase the TID pointers in case they have moved while
> checking them.

I would say that 200 TIDs per leaf page is common and ~1350 TIDs per
leaf page is not uncommon (with deduplication). Seems like that might
be enough?

> I have not measured anything yet, but one of my concerns is that,
> with very large dead tuple collections, an 8-way parallel bsearch
> could actually get close to saturating RAM bandwidth by reading (8 x
> 32 bits x cache-line-size) bytes from main memory every few cycles,
> so we may need some inner-loop-level throttling similar to the
> current vacuum_cost_limit for data pages.

If it happens then it'll be a nice problem to have, I suppose.

> Maybe not unrolling the full 32 iterations for a 32-bit bsearch, but
> something like an 8-iteration unroll to get most of the benefit.

My current assumption is that we're bound by memory speed right now,
and that that is the big bottleneck to eliminate -- we must keep the
CPU busy with data to process first. That seems like the most
promising thing to focus on right now.

> While it may make sense to have different bitmap encodings for
> different distributions, it likely would not be good for optimisations
> if all these are used at the same time.

To some degree designs like Roaring bitmaps are just that -- a way of
dynamically figuring out which strategy to use based on data
characteristics.

> This is why I propose the first bitmap collecting phase to collect
> into a file and then - when reading into memory for lookups phase -
> possibly rewrite the initial structure to something else if it sees
> that it is more efficient. Like for example where the first half of
> the file consists of only empty pages.

Yeah, I agree that something like that could make sense. Although
rewriting it doesn't seem particularly promising, since we can easily
make it cheap to process any TID that falls into a range of blocks
that have no dead tuples. We don't need to rewrite the data structure
to make it do that well, AFAICT.

When I said that I thought of this a little like a hash join, I was
being more serious than you might imagine. Note that the number of
index tuples that VACUUM will delete from each index can now be far
less than the total number of TIDs stored in memory. So even when we
have (say) 20% of all of the TIDs from the table in our in memory list
managed by vacuumlazy.c, it's now quite possible that VACUUM will only
actually "match"/"join" (i.e. delete) as few as 2% of the index tuples
it finds in the index (there really is no way to predict how many).
The opportunistic deletion stuff could easily be doing most of the
required cleanup in an eager fashion following recent improvements --
VACUUM need only take care of "floating garbage" these days. In other
words, thinking about this as something that is a little bit like a
hash join makes sense because hash joins do very well with high join
selectivity, and high join selectivity is common in the real world.
The intersection of TIDs from each leaf page with the in-memory TID
delete structure will often be very small indeed.

> Then again it may be so much extra work that it starts to dominate
> some parts of profiles.
>
> For example see the work that was done in improving the mini-vacuum
> part where it was actually faster to copy data out to a separate
> buffer and then back in than shuffle it around inside the same 8k page

Some of what I'm saying is based on the experience of improving
similar code used by index tuple deletion in Postgres 14. That did
quite a lot of sorting of TIDs and things like that. In the end the
sorting had no more than a negligible impact on performance. What
really mattered was that we efficiently coordinate with distant heap
pages that describe which index tuples we can delete from a given leaf
page. Sorting hundreds of TIDs is cheap. Reading hundreds of random
locations in memory (or even far fewer) is not so cheap. It might even
be very slow indeed. Sorting in order to batch could end up looking
like cheap insurance that we should be glad to pay for.

> So only testing will tell.

True.

-- 
Peter Geoghegan



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Hannu Krosing
Date:
On Fri, Jul 9, 2021 at 12:34 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
...
>
> I would say that 200 TIDs per leaf page is common and ~1350 TIDs per
> leaf page is not uncommon (with deduplication). Seems like that might
> be enough?

Likely yes, and it would also have the nice property of not changing
the index page locking behaviour.

Are deduplicated TIDs in the leaf page already sorted in heap order?
This could potentially simplify / speed up the sort.

> > I have not measured anything yet, but one of my concerns is that,
> > with very large dead tuple collections, an 8-way parallel bsearch
> > could actually get close to saturating RAM bandwidth by reading (8 x
> > 32 bits x cache-line-size) bytes from main memory every few cycles,
> > so we may need some inner-loop-level throttling similar to the
> > current vacuum_cost_limit for data pages.
>
> If it happens then it'll be a nice problem to have, I suppose.
>
> > Maybe not unrolling the full 32 iterations for a 32-bit bsearch, but
> > something like an 8-iteration unroll to get most of the benefit.
>
> My current assumption is that we're bound by memory speed right now,

Most likely yes, and this should also be easy to check by manually
unrolling perhaps 4 loops and measuring any speed increase.

> and that that is the big bottleneck to eliminate -- we must keep the
> CPU busy with data to process first. That seems like the most
> promising thing to focus on right now.

This actually has two parts:
  - trying to make sure that we can do as much as possible from cache
  - if we need to go out of cache, then trying to parallelise this as
much as possible

At the same time we need to watch that we are not making the index
tuple preparation work so heavy that it starts to dominate over memory
access.

> > While it may make sense to have different bitmap encodings for
> > different distributions, it likely would not be good for optimisations
> > if all these are used at the same time.
>
> To some degree designs like Roaring bitmaps are just that -- a way of
> dynamically figuring out which strategy to use based on data
> characteristics.

It is, but as I am keeping one eye open for vectorisation, I don't
like it when different parts of the same bitmap have radically
different encoding strategies.

> > This is why I propose the first bitmap collecting phase to collect
> > into a file and then - when reading into memory for lookups phase -
> > possibly rewrite the initial structure to something else if it sees
> > that it is more efficient. Like for example where the first half of
> > the file consists of only empty pages.
>
> Yeah, I agree that something like that could make sense. Although
> rewriting it doesn't seem particularly promising,

Yeah, I hope to prove (or verify :) ) that the structure is good
enough that it does not need the rewrite.

> since we can easily
> make it cheap to process any TID that falls into a range of blocks
> that have no dead tuples.

I actually meant the opposite case, where we could replace a full
80-byte, 291-bit "all dead" bitmap with just a range - an int4 for the
page and two int2-s for the min and max tid-in-page - for an extra 10x
reduction, on top of the original 21x reduction from the current 6
bytes per dead tuple to my page_bsearch_vector bitmaps, which encode
one page in at most 80 bytes (5 x int4 sub-page pointers + 5 x int4
bitmaps).

I also started out by investigating Roaring Bitmaps, but when I
realized that we would likely have to rewrite it anyway, I continued
working on getting to a single uniform encoding which fits most use
cases Good Enough, and then using that uniformity to let the compiler
do its optimisation and hopefully also vectorisation magic.

> We don't need to rewrite the data structure
> to make it do that well, AFAICT.
>
> When I said that I thought of this a little like a hash join, I was
> being more serious than you might imagine. Note that the number of
> index tuples that VACUUM will delete from each index can now be far
> less than the total number of TIDs stored in memory. So even when we
> have (say) 20% of all of the TIDs from the table in our in memory list
> managed by vacuumlazy.c, it's now quite possible that VACUUM will only
> actually "match"/"join" (i.e. delete) as few as 2% of the index tuples
> it finds in the index (there really is no way to predict how many).
> The opportunistic deletion stuff could easily be doing most of the
> required cleanup in an eager fashion following recent improvements --
> VACUUM need only take care of "floating garbage" these days.

Ok, this points to the need to mainly optimise for a quite sparse
population of dead tuples, which is still mostly clustered page-wise?

> In other
> words, thinking about this as something that is a little bit like a
> hash join makes sense because hash joins do very well with high join
> selectivity, and high join selectivity is common in the real world.
> The intersection of TIDs from each leaf page with the in-memory TID
> delete structure will often be very small indeed.

The hard-to-optimize case is still when we have dead tuple counts in
the hundreds of millions, or even billions, like on an HTAP database
where a few hours of an OLAP query have let loads of dead tuples
accumulate in tables getting heavy OLTP traffic.

There, of course, we could do a totally different optimisation, where
we also allow reaping tuples newer than the OLAP query's snapshot if
we can prove that when the snapshot next moves forward, it has to jump
over said transactions, making them indeed DEAD and not RECENTLY
DEAD. Currently we let a single OLAP query ruin everything :)

> > Then again it may be so much extra work that it starts to dominate
> > some parts of profiles.
> >
> > For example see the work that was done in improving the mini-vacuum
> > part where it was actually faster to copy data out to a separate
> > buffer and then back in than shuffle it around inside the same 8k page
>
> Some of what I'm saying is based on the experience of improving
> similar code used by index tuple deletion in Postgres 14. That did
> quite a lot of sorting of TIDs and things like that. In the end the
> sorting had no more than a negligible impact on performance.

Good to know :)

> What
> really mattered was that we efficiently coordinate with distant heap
> pages that describe which index tuples we can delete from a given leaf
> page. Sorting hundreds of TIDs is cheap. Reading hundreds of random
> locations in memory (or even far fewer) is not so cheap. It might even
> be very slow indeed. Sorting in order to batch could end up looking
> like cheap insurance that we should be glad to pay for.

If the most expensive operation is sorting a few hundred tids, then
this should be fast enough.

My worries were more that after the sorting we cannot do simple index
lookups for them; each needs to be found via bsearch - or maybe even a
plain linear search, if that is faster under some size limit - and
these could add up. Or that some other thing also has to be done, like
allocating extra memory or moving other data around in a way that the
CPU does not like.

Cheers
-----
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
Hi,


On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
> 1. Don't allocate more than 1GB. There was a discussion to eliminate
> this limitation by using MemoryContextAllocHuge() but there were
> concerns about point 2[1].
>
> 2. Allocate the whole memory space at once.
>
> 3. Slow lookup performance (O(logN)).
>
> I’ve done some experiments in this area and would like to share the
> results and discuss ideas.

Yea, this is a serious issue.


3) could possibly be addressed to a decent degree without changing the
fundamental datastructure too much. There are some sizable and trivial
wins to be had just by changing vac_cmp_itemptr() to compare int64s
and by using an open-coded bsearch().

The big problem with bsearch isn't imo the O(log(n)) complexity - it's
that it has abominably bad cache locality. And that can be addressed:
https://arxiv.org/ftp/arxiv/papers/1509/1509.05053.pdf
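
To illustrate, a minimal untested sketch of those two changes (the
encoding matches the existing itemptr_encode() helper; the function
names here are made up):

/* Sketch: encode each TID as an int64 once, then search with an
 * open-coded, branch-free binary search, so the hot loop is dominated
 * by the memory loads rather than by branch mispredictions. */
static inline int64
encode_tid(ItemPointer tid)
{
    /* same layout as the existing itemptr_encode() */
    return (int64) (((uint64) ItemPointerGetBlockNumber(tid) << 16) |
                    ItemPointerGetOffsetNumber(tid));
}

static bool
encoded_tid_exists(const int64 *tids, size_t ntids, int64 key)
{
    const int64 *base = tids;

    while (ntids > 1)
    {
        size_t      half = ntids / 2;

        /* compiles to a conditional move, not a branch, on most compilers */
        base += (base[half - 1] < key) ? half : 0;
        ntids -= half;
    }
    return ntids == 1 && base[0] == key;
}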

Imo 2) isn't really that hard a problem to improve, even if we were to
stay with the current bsearch approach. Reallocation with an aggressive
growth factor or such isn't that bad.


That's not to say we ought to stay with binary search...



> Problems Solutions
> ===============
>
> Firstly, I've considered using existing data structures:
> IntegerSet(src/backend/lib/integerset.c)  and
> TIDBitmap(src/backend/nodes/tidbitmap.c). Those address point 1 but
> only either point 2 or 3. IntegerSet uses lower memory thanks to
> simple-8b encoding but is slow at lookup, still O(logN), since it’s a
> tree structure. On the other hand, TIDBitmap has a good lookup
> performance, O(1), but could unnecessarily use larger memory in some
> cases since it always allocates the space for bitmap enough to store
> all possible offsets. With 8kB blocks, the maximum number of line
> pointers in a heap page is 291 (c.f., MaxHeapTuplesPerPage) so the
> bitmap is 40 bytes long and we always need 46 bytes in total per block
> including other meta information.

Imo tidbitmap isn't particularly good, even in the current use cases -
it's constraining in what we can store (a problem for other AMs), not
actually that dense, the lossy mode doesn't choose well what
information to lose, etc.

It'd be nice if we came up with a datastructure that could also replace
the bitmap scan cases.


> The data structure is somewhat similar to TIDBitmap. It consists of
> the hash table and the container area; the hash table has entries per
> block and each block entry allocates its memory space, called a
> container, in the container area to store its offset numbers. The
> container area is actually an array of bytes and can be enlarged as
> needed. In the container area, the data representation of offset
> numbers varies depending on their cardinality. It has three container
> types: array, bitmap, and run.

Not a huge fan of encoding this much knowledge about the tid layout...


> For example, if there are two dead tuples at offset 1 and 150, it uses
> the array container that has an array of two 2-byte integers
> representing 1 and 150, using 4 bytes in total. If we used the bitmap
> container in this case, we would need 20 bytes instead. On the other
> hand, if there are consecutive 20 dead tuples from offset 1 to 20, it
> uses the run container that has an array of 2-byte integers. The first
> value in each pair represents a starting offset number, whereas the
> second value represents its length. Therefore, in this case, the run
> container uses only 4 bytes in total. Finally, if there are dead
> tuples at every other offset from 1 to 100, it uses the bitmap
> container that has an uncompressed bitmap, using 13 bytes. We need
> another 16 bytes per block entry for hash table entry.
>
> The lookup complexity of a bitmap container is O(1) whereas the one of
> an array and a run container is O(N) or O(logN) but the number of
> elements in those two containers should not be large it would not be a
> problem.

Hm. Why is O(N) not an issue? Consider e.g. the case of a table in
which many tuples have been deleted. In cases where the "run" storage
is cheaper (e.g. because there are high offset numbers due to HOT
pruning), we could end up regularly scanning a few hundred entries for
a match. That's not cheap anymore.


> Evaluation
> ========
>
> Before implementing this idea and integrating it with lazy vacuum
> code, I've implemented a benchmark tool dedicated to evaluating
> lazy_tid_reaped() performance[4].

Good idea!


> In all test cases, I simulated that the table has 1,000,000 blocks and
> every block has at least one dead tuple.

That doesn't strike me as a particularly common scenario? I think it's
quite rare for dead tuples to be spread so evenly yet sparsely. In
particular it's very common for there to be long runs of dead tuples
separated by long ranges of no dead tuples at all...


> The benchmark scenario is that for
> each virtual heap tuple we check if there is its TID in the dead
> tuple storage. Here are the results of execution time in milliseconds
> and memory usage in bytes:

In which order are the dead tuples checked? Looks like in sequential
order? In the case of an index over a column that's not correlated
with the heap order, the lookups are often much more random - which
can influence lookup performance drastically, due to differences in
cache locality. That will make some structures look worse/better than
others.


Greetings,

Andres Freund



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
Hi,

On 2021-07-08 20:53:32 -0700, Andres Freund wrote:
> On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
> > 1. Don't allocate more than 1GB. There was a discussion to eliminate
> > this limitation by using MemoryContextAllocHuge() but there were
> > concerns about point 2[1].
> >
> > 2. Allocate the whole memory space at once.
> >
> > 3. Slow lookup performance (O(logN)).
> >
> > I’ve done some experiments in this area and would like to share the
> > results and discuss ideas.
>
> Yea, this is a serious issue.
>
>
> 3) could possibly be addressed to a decent degree without changing the
> fundamental datastructure too much. There's some sizable and trivial
> wins by just changing vac_cmp_itemptr() to compare int64s and by using
> an open coded bsearch().

Just using itemptr_encode() makes array in test #1 go from 8s to 6.5s on my
machine.

Another thing I just noticed is that you didn't include the build times for the
datastructures. They are lower than the lookups currently, but it does seem
like a relevant thing to measure as well. E.g. for #1 I see the following build
times

array    24.943 ms
tbm     206.456 ms
intset   93.575 ms
vtbm    134.315 ms
rtbm    145.964 ms

that's a significant range...


Randomizing the lookup order (using a random shuffle in
generate_index_tuples()) changes the benchmark results for #1 significantly:

        shuffled time    unshuffled time
array    6551.726 ms      6478.554 ms
intset  67590.879 ms     10815.810 ms
rtbm    17992.487 ms      2518.492 ms
tbm       364.917 ms       360.128 ms
vtbm    12227.884 ms      1288.123 ms



FWIW, I get an assertion failure when using an assertion build:

#2  0x0000561800ea02e0 in ExceptionalCondition (conditionName=0x7f9115a88e91 "found",
    errorType=0x7f9115a88d11 "FailedAssertion", fileName=0x7f9115a88e8a "rtbm.c",
    lineNumber=242) at /home/andres/src/postgresql/src/backend/utils/error/assert.c:69
#3  0x00007f9115a87645 in rtbm_add_tuples (rtbm=0x561806293280, blkno=0,
    offnums=0x7fffdccabb00, nitems=10) at rtbm.c:242
#4  0x00007f9115a8363d in load_rtbm (rtbm=0x561806293280, itemptrs=0x7f908a203050,
    nitems=10000000) at bdbench.c:618
#5  0x00007f9115a834b9 in rtbm_attach (lvtt=0x7f9115a8c300 <LVTestSubjects+352>,
    nitems=10000000, minblk=2139062143, maxblk=2139062143, maxoff=32639) at bdbench.c:587
#6  0x00007f9115a83837 in attach (lvtt=0x7f9115a8c300 <LVTestSubjects+352>,
    nitems=10000000, minblk=2139062143, maxblk=2139062143, maxoff=32639) at bdbench.c:658
#7  0x00007f9115a84190 in attach_dead_tuples (fcinfo=0x56180322d690) at bdbench.c:873

I assume you just inverted the Assert(found) assertion?

Greetings,

Andres Freund



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Jul 9, 2021 at 12:53 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
>
> On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
> > 1. Don't allocate more than 1GB. There was a discussion to eliminate
> > this limitation by using MemoryContextAllocHuge() but there were
> > concerns about point 2[1].
> >
> > 2. Allocate the whole memory space at once.
> >
> > 3. Slow lookup performance (O(logN)).
> >
> > I’ve done some experiments in this area and would like to share the
> > results and discuss ideas.
>
> Yea, this is a serious issue.
>
>
> 3) could possibly be addressed to a decent degree without changing the
> fundamental datastructure too much. There's some sizable and trivial
> wins by just changing vac_cmp_itemptr() to compare int64s and by using
> an open coded bsearch().
>
> The big problem with bsearch isn't imo the O(log(n)) complexity - it's
> that it has an abominally bad cache locality. And that can be addressed
> https://arxiv.org/ftp/arxiv/papers/1509/1509.05053.pdf
>
> Imo 2) isn't really that a hard problem to improve, even if we were to
> stay with the current bsearch approach. Reallocation with an aggressive
> growth factor or such isn't that bad.
>
>
> That's not to say we ought to stay with binary search...
>
>
>
> > Problems Solutions
> > ===============
> >
> > Firstly, I've considered using existing data structures:
> > IntegerSet(src/backend/lib/integerset.c)  and
> > TIDBitmap(src/backend/nodes/tidbitmap.c). Those address point 1 but
> > only either point 2 or 3. IntegerSet uses lower memory thanks to
> > simple-8b encoding but is slow at lookup, still O(logN), since it’s a
> > tree structure. On the other hand, TIDBitmap has a good lookup
> > performance, O(1), but could unnecessarily use larger memory in some
> > cases since it always allocates the space for bitmap enough to store
> > all possible offsets. With 8kB blocks, the maximum number of line
> > pointers in a heap page is 291 (c.f., MaxHeapTuplesPerPage) so the
> > bitmap is 40 bytes long and we always need 46 bytes in total per block
> > including other meta information.
>
> Imo tidbitmap isn't particularly good, even in the current use cases -
> it's constraining in what we can store (a problem for other AMs), not
> actually that dense, the lossy mode doesn't choose what information to
> loose well etc.
>
> It'd be nice if we came up with a datastructure that could also replace
> the bitmap scan cases.

Agreed.

>
>
> > The data structure is somewhat similar to TIDBitmap. It consists of
> > the hash table and the container area; the hash table has entries per
> > block and each block entry allocates its memory space, called a
> > container, in the container area to store its offset numbers. The
> > container area is actually an array of bytes and can be enlarged as
> > needed. In the container area, the data representation of offset
> > numbers varies depending on their cardinality. It has three container
> > types: array, bitmap, and run.
>
> Not a huge fan of encoding this much knowledge about the tid layout...
>
>
> > For example, if there are two dead tuples at offset 1 and 150, it uses
> > the array container that has an array of two 2-byte integers
> > representing 1 and 150, using 4 bytes in total. If we used the bitmap
> > container in this case, we would need 20 bytes instead. On the other
> > hand, if there are consecutive 20 dead tuples from offset 1 to 20, it
> > uses the run container that has an array of 2-byte integers. The first
> > value in each pair represents a starting offset number, whereas the
> > second value represents its length. Therefore, in this case, the run
> > container uses only 4 bytes in total. Finally, if there are dead
> > tuples at every other offset from 1 to 100, it uses the bitmap
> > container that has an uncompressed bitmap, using 13 bytes. We need
> > another 16 bytes per block entry for hash table entry.
> >
> > The lookup complexity of a bitmap container is O(1) whereas the one of
> > an array and a run container is O(N) or O(logN) but the number of
> > elements in those two containers should not be large it would not be a
> > problem.
>
> Hm. Why is O(N) not an issue? Consider e.g. the case of a table in which
> many tuples have been deleted. In cases where the "run" storage is
> cheaper (e.g. because there's high offset numbers due to HOT pruning),
> we could end up regularly scanning a few hundred entries for a
> match. That's not cheap anymore.

With 8kB blocks, the maximum size of a bitmap container is 37 bytes.
IOW, the other two container types are always smaller than 37 bytes.
Since the run container uses 4 bytes per run, the number of runs in a
run container can never be more than 9. Even with 32kB blocks, we
don't have more than 37 runs. So I think N is small enough in this
case.
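
To make the constant concrete, here is an untested sketch of what a
run-container probe could look like, assuming the (start, length)
pairs are kept sorted by start offset:

/* Sketch of a run-container existence check.  Layout as described
 * above: pairs of 2-byte values (start offset, run length), at most
 * 9 pairs with 8kB blocks. */
static bool
run_container_contains(const uint16 *runs, int nruns, OffsetNumber off)
{
    for (int i = 0; i < nruns; i++)
    {
        uint16      start = runs[2 * i];
        uint16      len = runs[2 * i + 1];

        if (off < start)
            return false;       /* sorted, so it can't appear later */
        if (off < start + len)
            return true;
    }
    return false;
}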

>
>
> > Evaluation
> > ========
> >
> > Before implementing this idea and integrating it with lazy vacuum
> > code, I've implemented a benchmark tool dedicated to evaluating
> > lazy_tid_reaped() performance[4].
>
> Good idea!
>
>
> > In all test cases, I simulated that the table has 1,000,000 blocks and
> > every block has at least one dead tuple.
>
> That doesn't strike me as a particularly common scenario? I think it's
> quite rare for there to be so evenly but sparse dead tuples. In
> particularly it's very common for there to be long runs of dead tuples
> separated by long ranges of no dead tuples at all...

Agreed. I'll test with such scenarios.

>
>
> > The benchmark scenario is that for
> > each virtual heap tuple we check if there is its TID in the dead
> > tuple storage. Here are the results of execution time in milliseconds
> > and memory usage in bytes:
>
> In which order are the dead tuples checked? Looks like in sequential
> order? In the case of an index over a column that's not correlated with
> the heap order the lookups are often much more random - which can
> influence lookup performance drastically, due to cache differences in
> cache locality. Which will make some structures look worse/better than
> others.

Good point. It's sequential order, which is not good. I'll test again
after shuffling virtual index tuples.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Jul 9, 2021 at 2:37 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-07-08 20:53:32 -0700, Andres Freund wrote:
> > On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
> > > 1. Don't allocate more than 1GB. There was a discussion to eliminate
> > > this limitation by using MemoryContextAllocHuge() but there were
> > > concerns about point 2[1].
> > >
> > > 2. Allocate the whole memory space at once.
> > >
> > > 3. Slow lookup performance (O(logN)).
> > >
> > > I’ve done some experiments in this area and would like to share the
> > > results and discuss ideas.
> >
> > Yea, this is a serious issue.
> >
> >
> > 3) could possibly be addressed to a decent degree without changing the
> > fundamental datastructure too much. There's some sizable and trivial
> > wins by just changing vac_cmp_itemptr() to compare int64s and by using
> > an open coded bsearch().
>
> Just using itemptr_encode() makes array in test #1 go from 8s to 6.5s on my
> machine.
>
> Another thing I just noticed is that you didn't include the build times for the
> datastructures. They are lower than the lookups currently, but it does seem
> like a relevant thing to measure as well. E.g. for #1 I see the following build
> times
>
> array    24.943 ms
> tbm     206.456 ms
> intset   93.575 ms
> vtbm    134.315 ms
> rtbm    145.964 ms
>
> that's a significant range...

Good point. I got similar results when measuring on my machine:

array 57.987 ms
tbm 297.720 ms
intset 113.796 ms
vtbm 165.268 ms
rtbm 199.658 ms

>
> Randomizing the lookup order (using a random shuffle in
> generate_index_tuples()) changes the benchmark results for #1 significantly:
>
>         shuffled time    unshuffled time
> array    6551.726 ms      6478.554 ms
> intset  67590.879 ms     10815.810 ms
> rtbm    17992.487 ms      2518.492 ms
> tbm       364.917 ms       360.128 ms
> vtbm    12227.884 ms      1288.123 ms

I believe that in your test, tbm_reaped() actually always returned
true. That could explain why tbm was very fast in both cases. Since
TIDBitmap in core doesn't support an existence check, tbm_reaped() in
bdbench.c always returns true. I added a patch to the repository to
add existence check support to TIDBitmap, although it assumes the
bitmap is never lossy.
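
For reference, the added check could look roughly like the sketch
below. tbm_find_pageentry() and the WORDNUM/BITNUM macros are private
to tidbitmap.c, so this is only a sketch of the idea, and it assumes
the bitmap is never lossy:

/* Sketch of an existence check for a never-lossy TIDBitmap. */
static bool
tbm_contains(const TIDBitmap *tbm, ItemPointer tid)
{
    OffsetNumber off = ItemPointerGetOffsetNumber(tid);
    const PagetableEntry *page;

    page = tbm_find_pageentry(tbm, ItemPointerGetBlockNumber(tid));
    if (page == NULL)
        return false;           /* no dead tuples on this block */
    return (page->words[WORDNUM(off - 1)] &
            ((bitmapword) 1 << BITNUM(off - 1))) != 0;
}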

That being said, I'm surprised that rtbm is slower than array even in
the unshuffled case. I've also measured the shuffled cases and got
different results. To be clear, I used the prepare() SQL function to
prepare both virtual dead tuples and index tuples, loaded them with
the attach_dead_tuples() SQL function, and executed the bench() SQL
function for each data structure. Here are the results:

        shuffled time    unshuffled time
array   88899.513 ms       12616.521 ms
intset  73476.055 ms       10063.405 ms
rtbm    22264.671 ms        2073.171 ms
tbm     10285.092 ms        1417.312 ms
vtbm    14488.581 ms        1240.666 ms

>
> FWIW, I get an assertion failure when using an assertion build:
>
> #2  0x0000561800ea02e0 in ExceptionalCondition (conditionName=0x7f9115a88e91 "found", errorType=0x7f9115a88d11
"FailedAssertion",
>     fileName=0x7f9115a88e8a "rtbm.c", lineNumber=242) at
/home/andres/src/postgresql/src/backend/utils/error/assert.c:69
> #3  0x00007f9115a87645 in rtbm_add_tuples (rtbm=0x561806293280, blkno=0, offnums=0x7fffdccabb00, nitems=10) at
rtbm.c:242
> #4  0x00007f9115a8363d in load_rtbm (rtbm=0x561806293280, itemptrs=0x7f908a203050, nitems=10000000) at bdbench.c:618
> #5  0x00007f9115a834b9 in rtbm_attach (lvtt=0x7f9115a8c300 <LVTestSubjects+352>, nitems=10000000, minblk=2139062143,
maxblk=2139062143,maxoff=32639) 
>     at bdbench.c:587
> #6  0x00007f9115a83837 in attach (lvtt=0x7f9115a8c300 <LVTestSubjects+352>, nitems=10000000, minblk=2139062143,
maxblk=2139062143,maxoff=32639) 
>     at bdbench.c:658
> #7  0x00007f9115a84190 in attach_dead_tuples (fcinfo=0x56180322d690) at bdbench.c:873
>
> I assume you just inverted the Assert(found) assertion?

Right. Fixed it.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Jul 8, 2021 at 7:51 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Jul 7, 2021 at 1:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I wonder how much it would help to break up that loop into two loops.
> > Make the callback into a batch operation that generates state that
> > describes what to do with each and every index tuple on the leaf page.
> > The first loop would build a list of TIDs, then you'd call into
> > vacuumlazy.c and get it to process the TIDs, and finally the second
> > loop would physically delete the TIDs that need to be deleted. This
> > would mean that there would be only one call per leaf page per
> > btbulkdelete(). This would reduce the number of calls to the callback
> > by at least 100x, and maybe more than 1000x.
>
> Maybe for something like rtbm.c (which is inspired by Roaring
> bitmaps), you would really want to use an "intersection" operation for
> this. The TIDs that we need to physically delete from the leaf page
> inside btvacuumpage() are the intersection of two bitmaps: our bitmap
> of all TIDs on the leaf page, and our bitmap of all TIDs that need to
> be deleting by the ongoing btbulkdelete() call.

Agreed. In such a batch operation, what we need to do here is to
compute the intersection of two bitmaps.

>
> Obviously the typical case is that most TIDs in the index do *not* get
> deleted -- needing to delete more than ~20% of all TIDs in the index
> will be rare. Ideally it would be very cheap to figure out that a TID
> does not need to be deleted at all. Something a little like a negative
> cache (but not a true negative cache). This is a little bit like how
> hash joins can be made faster by adding a Bloom filter -- most hash
> probes don't need to join a tuple in the real world, and we can make
> these hash probes even faster by using a Bloom filter as a negative
> cache.

Agreed.

>
> If you had the list of TIDs from a leaf page sorted for batch
> processing, and if you had roaring bitmap style "chunks" with
> "container" metadata stored in the data structure, you could then use
> merging/intersection -- that has some of the same advantages. I think
> that this would be a lot more efficient than having one binary search
> per TID. Most TIDs from the leaf page can be skipped over very
> quickly, in large groups. It's very rare for VACUUM to need to delete
> TIDs from completely random heap table blocks in the real world (some
> kind of pattern is much more common).
>
> When this merging process finds 1 TID that might really be deletable
> then it's probably going to find much more than 1 -- better to make
> that cache miss take care of all of the TIDs together. Also seems like
> the CPU could do some clever prefetching with this approach -- it
> could prefetch TIDs where the initial chunk metadata is insufficient
> to eliminate them early -- these are the groups of TIDs that will have
> many TIDs that we actually need to delete. ISTM that improving
> temporal locality through batching could matter a lot here.

That's a promising approach.

In rtbm, one pair of hash entry and container is used per block.
Therefore, if a block has no dead tuples at all, we can skip its TIDs
on the leaf page just by checking the hash table. If there is a hash
entry, meaning the block has at least one dead tuple, we look up the
offset of each leaf-page TID in the container.
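
In code, the batched check could look roughly like this sketch;
rtbm_find_container() and container_contains() are hypothetical names
for the two lookups just described:

/* Sketch: leaf-page TIDs are sorted, so one hash probe per distinct
 * block is enough; blocks with no entry are skipped wholesale. */
static void
rtbm_mark_deletable(RTbm *rtbm, ItemPointerData *deltids, int ndeltids,
                    bool *deletable)
{
    BlockNumber cur_blk = InvalidBlockNumber;
    RTbmContainer *cont = NULL;     /* hypothetical container type */

    for (int i = 0; i < ndeltids; i++)
    {
        BlockNumber blk = ItemPointerGetBlockNumber(&deltids[i]);

        if (blk != cur_blk)
        {
            cont = rtbm_find_container(rtbm, blk);  /* hypothetical */
            cur_blk = blk;
        }
        deletable[i] = cont != NULL &&
            container_contains(cont,
                               ItemPointerGetOffsetNumber(&deltids[i]));
    }
}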

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Jul 8, 2021 at 10:40 PM Hannu Krosing <hannuk@google.com> wrote:
>
> Very nice results.
>
> I have been working on the same problem but a bit different solution -
> a mix of binary search for (sub)pages and 32-bit bitmaps for
> tid-in-page.
>
> Even with current allocation heuristics (allocate 291 tids per page)
> it initially allocate much less space, instead of current 291*6=1746
> bytes per page it needs to allocate 80 bytes.
>
> Also it can be laid out so that it is friendly to parallel SIMD
> searches doing up to 8 tid lookups in parallel.

Interesting.

>
> That said, for allocating the tid array, the best solution is to
> postpone it as much as possible and to do the initial collection into
> a file, which
>
> 1) postpones the memory allocation to the beginning of index cleanups
>
> 2) lets you select the correct size and structure as you know more
> about the distribution at that time
>
> 3) do the first heap pass in one go and then advance frozenxmin
> *before* index cleanup

I think we have to do index vacuuming before heap vacuuming (2nd heap
pass). So do you mean that it advances relfrozenxid of pg_class before
both index vacuuming and heap vacuuming?

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
Hi,

On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
> Currently, the TIDs of dead tuples are stored in an array that is
> collectively allocated at the start of lazy vacuum and TID lookup uses
> bsearch(). There are the following challenges and limitations:

> So I prototyped a new data structure dedicated to storing dead tuples
> during lazy vacuum while borrowing the idea from Roaring Bitmap[2].
> The authors provide an implementation of Roaring Bitmap[3]  (Apache
> 2.0 license). But I've implemented this idea from scratch because we
> need to integrate it with Dynamic Shared Memory/Area to support
> parallel vacuum and need to support ItemPointerData, 6-bytes integer
> in total, whereas the implementation supports only 4-bytes integers.
> Also, when it comes to vacuum, we neither need to compute the
> intersection, the union, nor the difference between sets, but need
> only an existence check.
> 
> The data structure is somewhat similar to TIDBitmap. It consists of
> the hash table and the container area; the hash table has entries per
> block and each block entry allocates its memory space, called a
> container, in the container area to store its offset numbers. The
> container area is actually an array of bytes and can be enlarged as
> needed. In the container area, the data representation of offset
> numbers varies depending on their cardinality. It has three container
> types: array, bitmap, and run.

How are you thinking of implementing iteration efficiently for rtbm? The
second heap pass needs that obviously... I think the only option would
be to qsort the whole thing?

Greetings,

Andres Freund



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
Hi,

On 2021-07-09 10:17:49 -0700, Andres Freund wrote:
> On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
> > Currently, the TIDs of dead tuples are stored in an array that is
> > collectively allocated at the start of lazy vacuum and TID lookup uses
> > bsearch(). There are the following challenges and limitations:
> 
> > So I prototyped a new data structure dedicated to storing dead tuples
> > during lazy vacuum while borrowing the idea from Roaring Bitmap[2].
> > The authors provide an implementation of Roaring Bitmap[3]  (Apache
> > 2.0 license). But I've implemented this idea from scratch because we
> > need to integrate it with Dynamic Shared Memory/Area to support
> > parallel vacuum and need to support ItemPointerData, 6-bytes integer
> > in total, whereas the implementation supports only 4-bytes integers.
> > Also, when it comes to vacuum, we neither need to compute the
> > intersection, the union, nor the difference between sets, but need
> > only an existence check.
> > 
> > The data structure is somewhat similar to TIDBitmap. It consists of
> > the hash table and the container area; the hash table has entries per
> > block and each block entry allocates its memory space, called a
> > container, in the container area to store its offset numbers. The
> > container area is actually an array of bytes and can be enlarged as
> > needed. In the container area, the data representation of offset
> > numbers varies depending on their cardinality. It has three container
> > types: array, bitmap, and run.
> 
> How are you thinking of implementing iteration efficiently for rtbm? The
> second heap pass needs that obviously... I think the only option would
> be to qsort the whole thing?

I experimented further, trying to use an old radix tree implementation I
had lying around to store dead tuples. With a bit of trickery that seems
to work well.

The radix tree implementation I have basically maps an int64 to
another int64. Each level of the radix tree stores 6 bits of the key,
and uses those 6 bits to index a 64-entry (1<<6) array leading to the
next level.

My first idea was to use itemptr_encode() to convert tids into an
int64 and store the lower 6 bits in the value part of the radix
tree. That turned out to work well performance-wise, but awfully
memory-usage-wise. The problem is that we use at most 9 bits for
offsets, but reserve 16 bits for them in ItemPointerData. Which means
that there are often a lot of empty "tree levels" for those 0 bits,
making it hard to get to a decent memory usage.

The simplest way to address that was to simply compress out those
guaranteed-to-be-zero bits. That results in memory usage that's quite
good - nearly always beating array, occasionally beating rtbm. It's an
ordered datastructure, so the latter isn't too surprising. For lookup
performance the radix approach is commonly among the best, if not the
best.
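
A sketch of such a compressed key (the helper name and constant are
illustrative; 9 bits is enough since MaxHeapTuplesPerPage is 291 on
8kB pages):

/* Sketch: give the offset only the bits it can actually need, instead
 * of the 16 bits it occupies in ItemPointerData, so no radix levels
 * are wasted on guaranteed-zero bits. */
#define RADIX_OFFSET_BITS 9     /* MaxHeapTuplesPerPage = 291 < 512 */

static inline uint64
radix_key_from_tid(ItemPointer tid)
{
    return ((uint64) ItemPointerGetBlockNumber(tid) << RADIX_OFFSET_BITS) |
        ItemPointerGetOffsetNumber(tid);
}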

A variation of the storage approach is to just use the block number as
the index, and store the tids as the value. Even with the absolutely
naive approach of just using a Bitmapset that reduces memory usage
substantially - at a small cost to search performance. Of course it'd be
better to use an adaptive approach like you did for rtbm, I just thought
this is good enough.
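
That variation could amount to something like the following sketch,
where radix_lookup() and radix_tree are hypothetical and
Bitmapset/bms_is_member() are the existing nodes/bitmapset.h
facilities:

/* Sketch: block number as the radix key, a Bitmapset of offsets as
 * the value. */
static bool
dead_tuple_exists(radix_tree *tree, BlockNumber blkno, OffsetNumber off)
{
    Bitmapset  *offsets = (Bitmapset *) radix_lookup(tree, (uint64) blkno);

    return offsets != NULL && bms_is_member((int) off, offsets);
}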


This largely works well, except when there are a large number of evenly
spread out dead tuples. I don't think that's a particularly common
situation, but it's worth considering anyway.


The reason the memory usage can be larger for sparse workloads is
obvious: sparse keys can lead to tree nodes with only one child. As
the nodes are quite large (1<<6 pointers to further children), that
can then lead to a large increase in memory usage.

I have toyed with implementing adaptively large radix nodes like
proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't
gotten it quite working.

Greetings,

Andres Freund



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Sat, Jul 10, 2021 at 2:17 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
> > Currently, the TIDs of dead tuples are stored in an array that is
> > collectively allocated at the start of lazy vacuum and TID lookup uses
> > bsearch(). There are the following challenges and limitations:
>
> > So I prototyped a new data structure dedicated to storing dead tuples
> > during lazy vacuum while borrowing the idea from Roaring Bitmap[2].
> > The authors provide an implementation of Roaring Bitmap[3]  (Apache
> > 2.0 license). But I've implemented this idea from scratch because we
> > need to integrate it with Dynamic Shared Memory/Area to support
> > parallel vacuum and need to support ItemPointerData, 6-bytes integer
> > in total, whereas the implementation supports only 4-bytes integers.
> > Also, when it comes to vacuum, we neither need to compute the
> > intersection, the union, nor the difference between sets, but need
> > only an existence check.
> >
> > The data structure is somewhat similar to TIDBitmap. It consists of
> > the hash table and the container area; the hash table has entries per
> > block and each block entry allocates its memory space, called a
> > container, in the container area to store its offset numbers. The
> > container area is actually an array of bytes and can be enlarged as
> > needed. In the container area, the data representation of offset
> > numbers varies depending on their cardinality. It has three container
> > types: array, bitmap, and run.
>
> How are you thinking of implementing iteration efficiently for rtbm? The
> second heap pass needs that obviously... I think the only option would
> be to qsort the whole thing?

Yes, I'm thinking that iteration over rtbm is somewhat similar to
tbm's. That is, we iterate over the hash table to collect its entries,
qsort the entries by block number, and then fetch each entry along
with its container, one by one, in block number order.
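
A sketch of that begin-iterate step (the entry layout and all names
here are made up):

typedef struct RTbmIterEntry
{
    BlockNumber blkno;
    void       *container;      /* hypothetical container reference */
} RTbmIterEntry;

static int
rtbm_entry_cmp(const void *a, const void *b)
{
    BlockNumber ba = ((const RTbmIterEntry *) a)->blkno;
    BlockNumber bb = ((const RTbmIterEntry *) b)->blkno;

    return (ba > bb) - (ba < bb);
}

/* at begin-iterate time: sort the collected hash entries once, then
 * walk them (and their containers) in block number order */
static void
rtbm_sort_entries(RTbmIterEntry *entries, int nentries)
{
    qsort(entries, nentries, sizeof(RTbmIterEntry), rtbm_entry_cmp);
}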

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
Sorry for the late reply.

On Sat, Jul 10, 2021 at 11:55 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-07-09 10:17:49 -0700, Andres Freund wrote:
> > On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
> > > Currently, the TIDs of dead tuples are stored in an array that is
> > > collectively allocated at the start of lazy vacuum and TID lookup uses
> > > bsearch(). There are the following challenges and limitations:
> >
> > > So I prototyped a new data structure dedicated to storing dead tuples
> > > during lazy vacuum while borrowing the idea from Roaring Bitmap[2].
> > > The authors provide an implementation of Roaring Bitmap[3]  (Apache
> > > 2.0 license). But I've implemented this idea from scratch because we
> > > need to integrate it with Dynamic Shared Memory/Area to support
> > > parallel vacuum and need to support ItemPointerData, 6-bytes integer
> > > in total, whereas the implementation supports only 4-bytes integers.
> > > Also, when it comes to vacuum, we neither need to compute the
> > > intersection, the union, nor the difference between sets, but need
> > > only an existence check.
> > >
> > > The data structure is somewhat similar to TIDBitmap. It consists of
> > > the hash table and the container area; the hash table has entries per
> > > block and each block entry allocates its memory space, called a
> > > container, in the container area to store its offset numbers. The
> > > container area is actually an array of bytes and can be enlarged as
> > > needed. In the container area, the data representation of offset
> > > numbers varies depending on their cardinality. It has three container
> > > types: array, bitmap, and run.
> >
> > How are you thinking of implementing iteration efficiently for rtbm? The
> > second heap pass needs that obviously... I think the only option would
> > be to qsort the whole thing?
>
> I experimented further, trying to use an old radix tree implementation I
> had lying around to store dead tuples. With a bit of trickery that seems
> to work well.

Thank you for experimenting with another approach.

>
> The radix tree implementation I have basically maps an int64 to another
> int64. Each level of the radix tree stores 6 bits of the key, and uses
> those 6 bits to index a 64-entry (1<<6) array leading to the next level.
>
> My first idea was to use itemptr_encode() to convert tids into an int64
> and store the lower 6 bits in the value part of the radix tree. That
> turned out to work well performance wise, but awfully memory usage
> wise. The problem is that we at most use 9 bits for offsets, but reserve
> 16 bits for it in the ItemPointerData. Which means that there's often a
> lot of empty "tree levels" for those 0 bits, making it hard to get to a
> decent memory usage.
>
> The simplest way to address that was to simply compress out those
> guaranteed-to-be-zero bits. That results in memory usage that's quite
> good - nearly always beating array, occasionally beating rtbm. It's an
> ordered datastructure, so the latter isn't too surprising. For lookup
> performance the radix approach is commonly among the best, if not the
> best.

How did both its lookup performance and memory usage compare to
intset? I guess the performance trends of the two approaches are
similar since both consist of a tree. Intset encodes uint64s with
Simple-8b encoding, so I'm also interested in the comparison in terms
of memory usage.

>
> A variation of the storage approach is to just use the block number as
> the index, and store the tids as the value. Even with the absolutely
> naive approach of just using a Bitmapset that reduces memory usage
> substantially - at a small cost to search performance. Of course it'd be
> better to use an adaptive approach like you did for rtbm, I just thought
> this is good enough.
>
>
> This largely works well, except when there are a large number of evenly
> spread out dead tuples. I don't think that's a particularly common
> situation, but it's worth considering anyway.
>
> The reason the memory usage can be larger for sparse workloads is
> obvious: sparse keys can lead to tree nodes with only one child. As
> the nodes are quite large (1<<6 pointers to further children), that
> can then lead to a large increase in memory usage.

Interesting. How big was it in such workloads compared to the other
data structures?

I personally like adaptive approaches, especially in the context of
vacuum improvements. We know common patterns of dead tuple
distribution, but they don't necessarily hold, since the actual
distribution depends on the data distribution, the timing of
autovacuum, etc., even with the same workload. And we might be able to
provide a new approach that works well in 95% of use cases, but if
things get worse than before in the other 5%, I don't think it is a
good approach. Ideally, it should be better in common cases and at
least no worse than before in the other cases.

BTW is the implementation of the radix tree approach available
somewhere? If so I'd like to experiment with that too.

>
> I have toyed with implementing adaptively large radix nodes like
> proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't
> gotten it quite working.

That seems a promising approach.

Regards,

[1] https://www.postgresql.org/message-id/CA%2BTgmoakKFXwUv1Cx2mspUuPQHzYF74BfJ8koF5YdgVLCvhpwA%40mail.gmail.com

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
Hi,

On 2021-07-19 15:20:54 +0900, Masahiko Sawada wrote:
> BTW is the implementation of the radix tree approach available
> somewhere? If so I'd like to experiment with that too.
>
> >
> > I have toyed with implementing adaptively large radix nodes like
> > proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't
> > gotten it quite working.
>
> That seems promising approach.

I've since implemented some, but not all of the ideas of that paper
(adaptive node sizes, but not the tree compression pieces).

E.g. for

select prepare(
1000000, -- max block
20, -- # of dead tuples per page
10, -- dead tuples interval within a page
1 -- page interval
);
        attach    size    shuffled   ordered
array    69 ms   120 MB    84.87 s    8.66 s
intset  173 ms    65 MB    68.82 s   11.75 s
rtbm    201 ms    67 MB    11.54 s    1.35 s
tbm     232 ms   100 MB     8.33 s    1.26 s
vtbm    162 ms    58 MB    10.01 s    1.22 s
radix    88 ms    42 MB    11.49 s    1.67 s

and for
select prepare(
1000000, -- max block
10, -- # of dead tuples per page
1, -- dead tuples interval within a page
1 -- page interval
);

        attach    size    shuffled   ordered
array    24 ms    60 MB     3.74 s    1.02 s
intset   97 ms    49 MB     3.14 s    0.75 s
rtbm    138 ms    36 MB     0.41 s    0.14 s
tbm     198 ms   101 MB     0.41 s    0.14 s
vtbm    118 ms    27 MB     0.39 s    0.12 s
radix    33 ms    10 MB     0.28 s    0.10 s

(this is an almost unfairly good case for radix)

Running out of time to format the results of the other testcases
before I have to run, unfortunately. radix uses 42MB in both test
cases 3 and 4.


The radix tree code isn't good right now. A ridiculous amount of
duplication etc. The naming clearly shows its origins from a buffer
mapping radix tree...


Currently, in a bunch of the cases, 20% of the time is spent in
radix_reaped(). If I move that into radix.c and allow bfm_lookup() to
be inlined, I get reduced overhead. rtbm, for example, essentially
already does that, because it does the splitting of the ItemPointer in
rtbm.c.


I've attached my current patches against your tree.

Greetings,

Andres Freund

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
Hi,

On 2021-07-19 16:49:15 -0700, Andres Freund wrote:
> E.g. for
> 
> select prepare(
> 1000000, -- max block
> 20, -- # of dead tuples per page
> 10, -- dead tuples interval within a page
> 1 -- page inteval
> );
>         attach  size    shuffled    ordered
> array    69 ms  120 MB  84.87 s          8.66 s
> intset  173 ms   65 MB  68.82 s         11.75 s
> rtbm    201 ms   67 MB  11.54 s          1.35 s
> tbm     232 ms  100 MB   8.33 s          1.26 s
> vtbm    162 ms   58 MB  10.01 s          1.22 s
> radix    88 ms   42 MB  11.49 s          1.67 s
> 
> and for
> select prepare(
> 1000000, -- max block
> 10, -- # of dead tuples per page
> 1, -- dead tuples interval within a page
> 1 -- page inteval
> );
> 
>         attach  size    shuffled    ordered
> array    24 ms   60MB   3.74s            1.02 s
> intset   97 ms   49MB   3.14s            0.75 s
> rtbm    138 ms   36MB   0.41s            0.14 s
> tbm     198 ms  101MB   0.41s            0.14 s
> vtbm    118 ms   27MB   0.39s            0.12 s
> radix    33 ms   10MB   0.28s            0.10 s

Oh, I forgot: The performance numbers are with the fixes in
https://www.postgresql.org/message-id/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de
applied.

Greetings,

Andres Freund



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Yura Sokolov
Date:
Hi,

I've dreamed of writing a more compact structure for vacuum for three
years, but life didn't give me the time.

Let me join the friendly competition.

I've bet on the HAMT approach: popcount-ing bitmaps for non-empty elements.

Novelties:
- 32 consecutive pages are stored together in a single sparse array
   (called a "chunk").
   A chunk contains:
   - its number,
   - a 4-byte bitmap of non-empty pages,
   - an array of non-empty page headers, 2 bytes each.
     A page header contains the offset of the page's bitmap in the
     bitmaps container. (Except if there is just one dead tuple in a
     page; then it is written into the header itself.)
   - a container of concatenated bitmaps.

   I.e., page metadata overhead varies from 2.4 bytes per page (32
   pages in a single chunk) to 18 bytes per page (1 page in a single
   chunk).

- If a page's bitmap is sparse, i.e. contains a lot of "all-zero"
   bytes, it is compressed by removing the zero bytes and indexing
   with a two-level bitmap index.
   Two-level index: zero bytes in the first level are removed using
   the second level. It is mostly done for 32kB pages, but let it stay
   since it is almost free.

- If a page's bitmap contains a lot of "all-one" bytes, it is inverted
   and then encoded as sparse.

- Chunks are allocated with a custom "allocator" that has no
   per-allocation overhead. This is possible because there is no need
   to perform "free": the allocator is freed as a whole at once.

- The array of pointers to chunks is also bitmap indexed. This saves
   cpu time when not every run of 32 consecutive pages has at least
   one dead tuple, but consumes time otherwise. Therefore an
   additional optimization is added to quickly skip the lookup for the
   first non-empty run of chunks.
   (Ahhh, I believe this explanation is awful - see the sketch of the
   popcount indexing below.)
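
A sketch of the popcount indexing (all field and function names are
illustrative; pg_popcount32() is the existing helper from
port/pg_bitutils.h):

/* Sketch of the chunk lookup.  A chunk covers 32 consecutive pages;
 * pagemask has bit i set iff page i of the chunk is non-empty, and
 * the page headers are stored densely, one per set bit. */
typedef struct SvtmChunk
{
    uint32      chunkno;        /* block number / 32 */
    uint32      pagemask;       /* bitmap of non-empty pages */
    uint16      pagehdr[FLEXIBLE_ARRAY_MEMBER]; /* one per set bit */
} SvtmChunk;

static inline int
svtm_page_index(const SvtmChunk *chunk, BlockNumber blkno)
{
    uint32      bit = (uint32) 1 << (blkno % 32);

    if ((chunk->pagemask & bit) == 0)
        return -1;              /* no dead tuples on this page */
    /* rank of this page among the chunk's non-empty pages */
    return pg_popcount32(chunk->pagemask & (bit - 1));
}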

Andres Freund wrote 2021-07-20 02:49:
> Hi,
> 
> On 2021-07-19 15:20:54 +0900, Masahiko Sawada wrote:
>> BTW is the implementation of the radix tree approach available
>> somewhere? If so I'd like to experiment with that too.
>> 
>> >
>> > I have toyed with implementing adaptively large radix nodes like
>> > proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't
>> > gotten it quite working.
>> 
>> That seems promising approach.
> 
> I've since implemented some, but not all of the ideas of that paper
> (adaptive node sizes, but not the tree compression pieces).
> 
> E.g. for
> 
> select prepare(
> 1000000, -- max block
> 20, -- # of dead tuples per page
> 10, -- dead tuples interval within a page
> 1 -- page inteval
> );
>         attach  size    shuffled    ordered
> array    69 ms  120 MB  84.87 s          8.66 s
> intset  173 ms   65 MB  68.82 s         11.75 s
> rtbm    201 ms   67 MB  11.54 s          1.35 s
> tbm     232 ms  100 MB   8.33 s          1.26 s
> vtbm    162 ms   58 MB  10.01 s          1.22 s
> radix    88 ms   42 MB  11.49 s          1.67 s
> 
> and for
> select prepare(
> 1000000, -- max block
> 10, -- # of dead tuples per page
> 1, -- dead tuples interval within a page
> 1 -- page inteval
> );
> 
>         attach  size    shuffled    ordered
> array    24 ms   60MB   3.74s            1.02 s
> intset   97 ms   49MB   3.14s            0.75 s
> rtbm    138 ms   36MB   0.41s            0.14 s
> tbm     198 ms  101MB   0.41s            0.14 s
> vtbm    118 ms   27MB   0.39s            0.12 s
> radix    33 ms   10MB   0.28s            0.10 s
> 
> (this is an almost unfairly good case for radix)
> 
> Running out of time to format the results of the other testcases before
> I have to run, unfortunately. radix uses 42MB both in test case 3 and
> 4.

My results (Ubuntu 20.04 Intel Core i7-1165G7):

Test1.

select prepare(1000000, 10, 20, 1); -- original

        attach  size   shuffled
array   29ms    60MB   93.99s
intset  93ms    49MB   80.94s
rtbm   171ms    67MB   14.05s
tbm    238ms   100MB    8.36s
vtbm   148ms    59MB    9.12s
radix  100ms    42MB   11.81s
svtm    75ms    29MB    8.90s

select prepare(1000000, 20, 10, 1); -- Andres's variant

        attach  size   shuffled
array   61ms   120MB  111.91s
intset 163ms    66MB   85.00s
rtbm   236ms    67MB   10.72s
tbm    290ms   100MB    8.40s
vtbm   190ms    59MB    9.28s
radix  117ms    42MB   12.00s
svtm    98ms    29MB    8.77s

Test2.

select prepare(1000000, 10, 1, 1);

        attach  size   shuffled
array   31ms    60MB    4.68s
intset  97ms    49MB    4.03s
rtbm   163ms    36MB    0.42s
tbm    240ms   100MB    0.42s
vtbm   136ms    27MB    0.36s
radix   60ms    10MB    0.72s
svtm    39ms     6MB    0.19s

(Bad radix result probably due to smaller cache in notebook's CPU ?)

Test3

select prepare(1000000, 2, 100, 1);

        attach  size   shuffled
array    6ms    12MB   53.42s
intset  23ms    16MB   54.99s
rtbm   115ms    38MB    8.19s
tbm    186ms   100MB    8.37s
vtbm   105ms    59MB    9.08s
radix   64ms    42MB   10.41s
svtm    73ms    10MB    7.49s

Test4

select prepare(1000000, 100, 1, 1);

        attach  size   shuffled
array  304ms   600MB   75.12s
intset 775ms    98MB   47.49s
rtbm   356ms    38MB    4.11s
tbm    539ms   100MB    4.20s
vtbm   493ms    42MB    4.44s
radix  263ms    42MB    6.05s
svtm   360ms     8MB    3.49s

Therefore the Specialized Vacuum Tid Map (svtm) always consumes the
least memory and is usually the fastest.


(I've applied Andres's patch for the slab allocator before testing.)

The attached patch is against commit 6753911a444e12e4b55 of your
pgtools, with Andres's patches for the radix method applied.

I've also pushed it to github:
https://github.com/funny-falcon/pgtools/tree/svtm/bdbench

regards,
Yura Sokolov
Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Jul 26, 2021 at 1:07 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
>
> Hi,
>
> I've dreamed to write more compact structure for vacuum for three
> years, but life didn't give me a time to.
>
> Let me join to friendly competition.
>
> I've bet on the HAMT approach: popcount-ing bitmaps for non-empty elements.

Thank you for proposing the new idea!

>
> Novelties:
> - 32 consecutive pages are stored together in a single sparse array
>    (called "chunks").
>    Chunk contains:
>    - its number,
>    - 4 byte bitmap of non-empty pages,
>    - array of non-empty page headers 2 byte each.
>      Page header contains offset of page's bitmap in bitmaps container.
>      (Except if there is just one dead tuple in a page. Then it is
>      written into header itself).
>    - container of concatenated bitmaps.
>
>    Ie, page metadata overhead varies from 2.4byte (32pages in single
> chunk)
>    to 18byte (1 page in single chunk) per page.
>
> - If page's bitmap is sparse ie contains a lot of "all-zero" bytes,
>    it is compressed by removing zero byte and indexing with two-level
>    bitmap index.
>    Two-level index - zero bytes in first level are removed using
>    second level. It is mostly done for 32kb pages, but let it stay since
>    it is almost free.
>
> - If page's bitmaps contains a lot of "all-one" bytes, it is inverted
>    and then encoded as sparse.
>
> - Chunks are allocated with custom "allocator" that has no
>    per-allocation overhead. It is possible because there is no need
>    to perform "free": allocator is freed as whole at once.
>
> - Array of pointers to chunks is also bitmap indexed. It saves cpu time
>    when not every 32 consecutive pages has at least one dead tuple.
>    But consumes time otherwise. Therefore additional optimization is
> added
>    to quick skip lookup for first non-empty run of chunks.
>    (Ahhh, I believe this explanation is awful).

It sounds better than my proposal.

>
> Andres Freund wrote 2021-07-20 02:49:
> > Hi,
> >
> > On 2021-07-19 15:20:54 +0900, Masahiko Sawada wrote:
> >> BTW is the implementation of the radix tree approach available
> >> somewhere? If so I'd like to experiment with that too.
> >>
> >> >
> >> > I have toyed with implementing adaptively large radix nodes like
> >> > proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't
> >> > gotten it quite working.
> >>
> >> That seems promising approach.
> >
> > I've since implemented some, but not all of the ideas of that paper
> > (adaptive node sizes, but not the tree compression pieces).
> >
> > E.g. for
> >
> > select prepare(
> > 1000000, -- max block
> > 20, -- # of dead tuples per page
> > 10, -- dead tuples interval within a page
> > 1 -- page inteval
> > );
> >         attach  size    shuffled      ordered
> > array    69 ms  120 MB  84.87 s          8.66 s
> > intset  173 ms   65 MB  68.82 s         11.75 s
> > rtbm    201 ms   67 MB  11.54 s          1.35 s
> > tbm     232 ms  100 MB   8.33 s          1.26 s
> > vtbm    162 ms   58 MB  10.01 s          1.22 s
> > radix    88 ms   42 MB  11.49 s          1.67 s
> >
> > and for
> > select prepare(
> > 1000000, -- max block
> > 10, -- # of dead tuples per page
> > 1, -- dead tuples interval within a page
> > 1 -- page inteval
> > );
> >
> >         attach  size    shuffled      ordered
> > array    24 ms   60MB   3.74s            1.02 s
> > intset   97 ms   49MB   3.14s            0.75 s
> > rtbm    138 ms   36MB   0.41s            0.14 s
> > tbm     198 ms  101MB   0.41s            0.14 s
> > vtbm    118 ms   27MB   0.39s            0.12 s
> > radix    33 ms   10MB   0.28s            0.10 s
> >
> > (this is an almost unfairly good case for radix)
> >
> > Running out of time to format the results of the other testcases before
> > I have to run, unfortunately. radix uses 42MB both in test case 3 and
> > 4.
>
> My results (Ubuntu 20.04 Intel Core i7-1165G7):
>
> Test1.
>
> select prepare(1000000, 10, 20, 1); -- original
>
>         attach  size   shuffled
> array   29ms    60MB   93.99s
> intset  93ms    49MB   80.94s
> rtbm   171ms    67MB   14.05s
> tbm    238ms   100MB    8.36s
> vtbm   148ms    59MB    9.12s
> radix  100ms    42MB   11.81s
> svtm    75ms    29MB    8.90s
>
> select prepare(1000000, 20, 10, 1); -- Andres's variant
>
>         attach  size   shuffled
> array   61ms   120MB  111.91s
> intset 163ms    66MB   85.00s
> rtbm   236ms    67MB   10.72s
> tbm    290ms   100MB    8.40s
> vtbm   190ms    59MB    9.28s
> radix  117ms    42MB   12.00s
> svtm    98ms    29MB    8.77s
>
> Test2.
>
> select prepare(1000000, 10, 1, 1);
>
>         attach  size   shuffled
> array   31ms    60MB    4.68s
> intset  97ms    49MB    4.03s
> rtbm   163ms    36MB    0.42s
> tbm    240ms   100MB    0.42s
> vtbm   136ms    27MB    0.36s
> radix   60ms    10MB    0.72s
> svtm    39ms     6MB    0.19s
>
> (Bad radix result probably due to smaller cache in notebook's CPU ?)
>
> Test3
>
> select prepare(1000000, 2, 100, 1);
>
>         attach  size   shuffled
> array    6ms    12MB   53.42s
> intset  23ms    16MB   54.99s
> rtbm   115ms    38MB    8.19s
> tbm    186ms   100MB    8.37s
> vtbm   105ms    59MB    9.08s
> radix   64ms    42MB   10.41s
> svtm    73ms    10MB    7.49s
>
> Test4
>
> select prepare(1000000, 100, 1, 1);
>
>         attach  size   shuffled
> array  304ms   600MB   75.12s
> intset 775ms    98MB   47.49s
> rtbm   356ms    38MB    4.11s
> tbm    539ms   100MB    4.20s
> vtbm   493ms    42MB    4.44s
> radix  263ms    42MB    6.05s
> svtm   360ms     8MB    3.49s
>
> Therefore the Specialized Vacuum Tid Map always consumes the least memory
> and is usually faster.

I'll experiment with the proposed ideas including this idea in more
scenarios and share the results tomorrow.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Jul 26, 2021 at 11:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I'll experiment with the proposed ideas including this idea in more
> scenarios and share the results tomorrow.
>

I've done some benchmarks for the proposed data structures. In this
trial, I've added scenarios where dead tuples are concentrated in a
particular range of table blocks (tests 5-8), in addition to the
scenarios from the previous trial. Also, I've run each scenario while
increasing the table size: in the first test, the maximum block number
of the table is 1,000,000 (i.e., an 8GB table) and in the second test
it's 10,000,000 (an 80GB table), so we can see how performance and
memory consumption change with a large-scale table. Here are the
results (sizes are in MB, attach and shuffled-lookup times in seconds;
the _x10 columns are for the 10x larger table):

* Test 1
select prepare(
1000000, -- max block
10, -- # of dead tuples per page
1, -- dead tuples interval within  a page
1, -- # of consecutive pages having dead tuples
20 -- page interval
);

  name  |   size    | attach | shuffled |  size_x10  | attach_x10| shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
 array  | 57.23 MB  |  0.040 |   98.613 | 572.21 MB  |     0.387 |    1521.981
 intset | 46.88 MB  |  0.114 |   75.944 | 468.67 MB  |     0.961 |     997.760
 radix  | 40.26 MB  |  0.102 |   18.427 | 336.64 MB  |     0.797 |     266.146
 rtbm   | 64.02 MB  |  0.234 |   22.443 | 512.02 MB  |     2.230 |     275.143
 svtm   | 27.28 MB  |  0.060 |   13.568 | 274.07 MB  |     0.476 |     211.073
 tbm    | 96.01 MB  |  0.273 |   10.347 | 768.01 MB  |     2.882 |     128.103

* Test 2
select prepare(
1000000, -- max block
10, -- # of dead tuples per page
1, -- dead tuples interval within  a page
1, -- # of consecutive pages having dead tuples
1 -- page interval
);

  name  |   size    | attach | shuffled |  size_x10  | attach_x10| shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
 array  | 57.23 MB  |  0.041 |    4.757 | 572.21 MB  |     0.344 |      71.228
 intset | 46.88 MB  |  0.127 |    3.762 | 468.67 MB  |     1.093 |      49.573
 radix  | 9.95 MB   |  0.048 |    0.679 | 82.57 MB   |     0.371 |      16.211
 rtbm   | 34.02 MB  |  0.179 |    0.534 | 288.02 MB  |     2.092 |       8.693
 svtm   | 5.78 MB   |  0.043 |    0.239 | 54.60 MB   |     0.342 |       7.759
 tbm    | 96.01 MB  |  0.274 |    0.521 | 768.01 MB  |     2.685 |       6.360

* Test 3
select prepare(
1000000, -- max block
2, -- # of dead tuples per page
100, -- dead tuples interval within  a page
1, -- # of consecutive pages having dead tuples
1 -- page interval
);

  name  |   size    | attach | shuffled |  size_x10  | attach_x10| shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
 array  | 11.45 MB  |  0.009 |   57.698 | 114.45 MB  |     0.076 |    1045.639
 intset | 15.63 MB  |  0.031 |   46.083 | 156.23 MB  |     0.243 |     848.525
 radix  | 40.26 MB  |  0.063 |   13.755 | 336.64 MB  |     0.501 |     223.413
 rtbm   | 36.02 MB  |  0.123 |   11.527 | 320.02 MB  |     1.843 |     180.977
 svtm   | 9.28 MB   |  0.053 |    9.631 | 92.59 MB   |     0.438 |     212.626
 tbm    | 96.01 MB  |  0.228 |   10.381 | 768.01 MB  |     2.258 |     126.630

* Test 4
select prepare(
1000000, -- max block
100, -- # of dead tuples per page
1, -- dead tuples interval within  a page
1, -- # of consecutive pages having dead tuples
1 -- page interval
);

  name  |   size    | attach | shuffled |  size_x10  | attach_x10| shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
 array  | 572.21 MB |  0.367 |   78.047 | 5722.05 MB |     3.942 |    1154.776
 intset | 93.74 MB  |  0.777 |   45.146 | 937.34 MB  |     7.716 |     643.708
 radix  | 40.26 MB  |  0.203 |    9.015 | 336.64 MB  |     1.775 |     133.294
 rtbm   | 36.02 MB  |  0.369 |    5.639 | 320.02 MB  |     3.823 |      88.832
 svtm   | 7.28 MB   |  0.294 |    3.891 | 73.60 MB   |     2.690 |     103.744
 tbm    | 96.01 MB  |  0.534 |    5.223 | 768.01 MB  |     5.679 |      60.632


* Test 5
select prepare(
1000000, -- max block
150, -- # of dead tuples per page
1, -- dead tuples interval within  a page
10000, -- # of consecutive pages having dead tuples
20000 -- page interval
);

Every 20000 pages, there are 10000 consecutive pages each having 150
dead tuples.

  name  |   size    | attach | shuffled |  size_x10  | attach_x10| shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
 array  | 429.16 MB |  0.274 |   75.664 | 4291.54 MB |     3.067 |    1259.501
 intset | 46.88 MB  |  0.559 |   36.449 | 468.67 MB  |     4.565 |     517.445
 radix  | 20.26 MB  |  0.166 |    8.466 | 196.90 MB  |     1.273 |     166.587
 rtbm   | 18.02 MB  |  0.242 |    8.491 | 160.02 MB  |     2.407 |     171.725
 svtm   | 3.66 MB   |  0.243 |    3.635 | 37.10 MB   |     2.022 |      86.165
 tbm    | 48.01 MB  |  0.344 |    9.763 | 384.01 MB  |     3.327 |     151.824

* Test 6
select prepare(
1000000, -- max block
10, -- # of dead tuples per page
1, -- dead tuples interval within  a page
10000, -- # of consecutive pages having dead tuples
20000 -- page interval
);

Every 20000 pages, there are 10000 consecutive pages each having 10 dead tuples.

  name  |   size    | attach | shuffled |  size_x10  | attach_x10| shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
 array  | 28.62 MB  |  0.022 |    2.791 | 286.11 MB  |     0.170 |      46.920
 intset | 23.45 MB  |  0.061 |    2.156 | 234.34 MB  |     0.501 |      32.577
 radix  | 5.04 MB   |  0.026 |    0.433 | 48.57 MB   |     0.191 |      11.060
 rtbm   | 17.02 MB  |  0.074 |    0.533 | 144.02 MB  |     0.954 |      11.502
 svtm   | 3.16 MB   |  0.023 |    0.206 | 27.60 MB   |     0.175 |       4.886
 tbm    | 48.01 MB  |  0.132 |    0.656 | 384.01 MB  |     1.284 |      10.231

* Test 7
select prepare(
1000000, -- max block
150, -- # of dead tuples per page
1, -- dead tuples interval within  a page
1000, -- # of consecutive pages having dead tuples
999000 -- page interval
);

The first 1000 blocks and the last 1000 blocks each have 150 dead
tuples per page.

  name  |   size    | attach | shuffled |  size_x10  | attach_x10| shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
 array  | 1.72 MB   |  0.002 |    7.507 | 17.17 MB   |     0.011 |      76.510
 intset | 0.20 MB   |  0.003 |    6.742 | 1.89 MB    |     0.022 |      52.122
 radix  | 0.20 MB   |  0.001 |    1.023 | 1.07 MB    |     0.007 |      12.023
 rtbm   | 0.15 MB   |  0.001 |    2.637 | 0.65 MB    |     0.009 |      34.528
 svtm   | 0.52 MB   |  0.002 |    0.721 | 0.61 MB    |     0.010 |       6.434
 tbm    | 0.20 MB   |  0.002 |    2.733 | 1.51 MB    |     0.015 |      38.538

* Test 8
select prepare(
1000000, -- max block
100, -- # of dead tuples per page
1, -- dead tuples interval within  a page
50, -- # of consecutive pages having dead tuples
100 -- page interval
);

Every 100 pages, there are 50 consecutive pages each having 100 dead tuples.

  name  |   size    | attach | shuffled |  size_x10  | attach_x10| shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
 array  | 286.11 MB |  0.184 |   67.233 | 2861.03 MB |     1.743 |     979.070
 intset | 46.88 MB  |  0.389 |   35.176 | 468.67 MB  |     3.698 |     505.322
 radix  | 21.82 MB  |  0.116 |    6.160 | 186.86 MB  |     0.891 |     117.730
 rtbm   | 18.02 MB  |  0.182 |    5.909 | 160.02 MB  |     1.870 |     112.550
 svtm   | 4.28 MB   |  0.152 |    3.213 | 37.60 MB   |     1.383 |      79.073
 tbm    | 48.01 MB  |  0.265 |    6.673 | 384.01 MB  |     2.586 |     101.327

Overall, 'svtm' is faster and consumes less memory. The 'radix' tree
also has good performance and memory usage.

From these results, svtm is the best data structure among the proposed
ideas for dead tuple storage used during lazy vacuum in terms of
performance and memory usage. I think it can support iteration by
extracting the offsets of dead tuples for each block while iterating
over chunks.

Apart from performance and memory usage points of view, we also need
to consider the reusability of the code. When I started this thread, I
thought the best data structure would be the one optimized for
vacuum's dead tuple storage. However, if we can use a data structure
that can also be used in general, we can use it also for other
purposes. Moreover, if it's too optimized for the current TID system
(32 bits block number, 16 bits offset number, maximum block/offset
number, etc.) it may become a blocker for future changes.

In that sense, radix tree also seems good since it can also be used in
gist vacuum as a replacement for intset, or a replacement for hash
table for shared buffer as discussed before. Are there any other use
cases? On the other hand, I’m concerned that radix tree would be
over-engineering in terms of vacuum's dead tuples storage since the
dead tuple storage is static data and requires only lookup operation,
so if we want to use radix tree as dead tuple storage, I'd like to see
further use cases.


Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Yura Sokolov
Date:
Masahiko Sawada wrote 2021-07-27 07:06:
> On Mon, Jul 26, 2021 at 11:01 PM Masahiko Sawada 
> <sawada.mshk@gmail.com> wrote:
>> 
>> I'll experiment with the proposed ideas including this idea in more
>> scenarios and share the results tomorrow.
>> 
> 
> [benchmark scenarios and results snipped]
> 
> Overall, 'svtm' is faster and consumes less memory. The 'radix' tree
> also has good performance and memory usage.
> 
> From these results, svtm is the best data structure among the proposed
> ideas for dead tuple storage used during lazy vacuum in terms of
> performance and memory usage. I think it can support iteration by
> extracting the offsets of dead tuples for each block while iterating
> over chunks.
> 
> Apart from performance and memory usage points of view, we also need
> to consider the reusability of the code. When I started this thread, I
> thought the best data structure would be the one optimized for
> vacuum's dead tuple storage. However, if we can use a data structure
> that can also be used in general, we can use it also for other
> purposes. Moreover, if it's too optimized for the current TID system
> (32 bits block number, 16 bits offset number, maximum block/offset
> number, etc.) it may become a blocker for future changes.
> 
> In that sense, radix tree also seems good since it can also be used in
> gist vacuum as a replacement for intset, or a replacement for hash
> table for shared buffer as discussed before. Are there any other use
> cases? On the other hand, I’m concerned that radix tree would be
> over-engineering in terms of vacuum's dead tuples storage since the
> dead tuple storage is static data and requires only lookup operation,
> so if we want to use radix tree as dead tuple storage, I'd like to see
> further use cases.

I can certainly evolve svtm into a transparent intset replacement.
Using the same trick as radix_to_key, it will store TIDs efficiently:

   shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
   tid_i = ItemPointerGetOffsetNumber(tid);
   /* widen before shifting so high block-number bits are kept */
   tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;

Will do this evening.

regards
Yura Sokolov aka funny_falcon



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
Hi,


On 2021-07-25 19:07:18 +0300, Yura Sokolov wrote:
> I've dreamed of writing a more compact structure for vacuum for three
> years, but life didn't give me the time.
> 
> Let me join to friendly competition.
> 
> I've bet on the HATM approach: popcount-ing bitmaps for non-empty elements.

My concern with several of the proposals in this thread is that they
over-optimize for this specific case. It's not actually that crucial to
have a crazily optimized vacuum dead tid storage datatype. Having
something more general that performs reasonably for dead tuple storage
but also performs well in a number of other cases makes a lot more
sense to me.


> (The bad radix result is probably due to the smaller cache in the notebook's CPU?)

Probably largely due to the node dispatch: a) for some reason gcc likes
jump tables too much, and I get better numbers when disabling those;
b) the node type dispatch should be stuffed into the low bits of the
pointer.
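
To illustrate (b), a rough sketch of such pointer tagging (illustrative
names, not from my prototype; it assumes nodes are allocated with at
least 8-byte alignment, leaving the low three bits free for a tag):

    #include <stdint.h>

    typedef enum { NODE_4, NODE_16, NODE_48, NODE_256 } NodeKind;

    #define NODE_TAG_MASK ((uintptr_t) 0x7)

    /* store the node kind in the low bits of the pointer */
    static inline void *
    tag_node(void *node, NodeKind kind)
    {
        return (void *) ((uintptr_t) node | (uintptr_t) kind);
    }

    /* recover the kind without dereferencing the node */
    static inline NodeKind
    node_kind(void *tagged)
    {
        return (NodeKind) ((uintptr_t) tagged & NODE_TAG_MASK);
    }

    /* strip the tag before dereferencing */
    static inline void *
    node_ptr(void *tagged)
    {
        return (void *) ((uintptr_t) tagged & ~NODE_TAG_MASK);
    }

That way the dispatch needs no extra memory access to fetch the node type.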


> select prepare(1000000, 2, 100, 1);
> 
>        attach  size   shuffled
> array    6ms    12MB   53.42s
> intset  23ms    16MB   54.99s
> rtbm   115ms    38MB    8.19s
> tbm    186ms   100MB    8.37s
> vtbm   105ms    59MB    9.08s
> radix   64ms    42MB   10.41s
> svtm    73ms    10MB    7.49s

> Test4
> 
> select prepare(1000000, 100, 1, 1);
> 
>        attach  size   shuffled
> array  304ms   600MB   75.12s
> intset 775ms    98MB   47.49s
> rtbm   356ms    38MB    4.11s
> tbm    539ms   100MB    4.20s
> vtbm   493ms    42MB    4.44s
> radix  263ms    42MB    6.05s
> svtm   360ms     8MB    3.49s
> 
> Therefore Specialized Vacuum Tid Map always consumes the least amount
> of memory and is usually faster.

Impressive.

Greetings,

Andres Freund



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
Hi,

On 2021-07-27 13:06:56 +0900, Masahiko Sawada wrote:
> Apart from performance and memory usage points of view, we also need
> to consider the reusability of the code. When I started this thread, I
> thought the best data structure would be the one optimized for
> vacuum's dead tuple storage. However, if we can use a data structure
> that can also be used in general, we can use it also for other
> purposes. Moreover, if it's too optimized for the current TID system
> (32 bits block number, 16 bits offset number, maximum block/offset
> number, etc.) it may become a blocker for future changes.

Indeed.


> In that sense, radix tree also seems good since it can also be used in
> gist vacuum as a replacement for intset, or a replacement for hash
> table for shared buffer as discussed before. Are there any other use
> cases?

Yes, I think there are. Whenever there is some spatial locality it has a
decent chance of winning over a hash table, and it will most of the time
win over ordered data structures like rbtrees (which perform very poorly
due to the number of branches and pointer dispatches). There are plenty
of hashtables, e.g. for caches, locks, etc., in PG that have a
medium-high degree of locality, so I'd expect a few potential uses. When
adding "tree compression" (i.e. skipping inner nodes that have a single
incoming & outgoing node) radix trees can even deal quite performantly
with variable-width keys.
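
As a sketch of what that tree compression looks like (made-up names,
not from any posted patch): each inner node carries the run of key
bytes that would otherwise each need their own single-child node, and
lookup verifies that run before descending:

    #include <stdint.h>

    typedef struct RTNode
    {
        uint8_t kind;
        uint8_t prefix_len;     /* # of key bytes skipped by this node */
        uint8_t prefix[14];     /* the skipped bytes, checked on descent */
        /* ... children follow, laid out according to 'kind' ... */
    } RTNode;

    /*
     * Verify the compressed prefix at 'depth' within 'key'. Returns the
     * new depth, or -1 if the key cannot be present in this subtree.
     */
    static int
    check_prefix(const RTNode *node, const uint8_t *key, int depth)
    {
        for (int i = 0; i < node->prefix_len; i++)
        {
            if (key[depth + i] != node->prefix[i])
                return -1;
        }
        return depth + node->prefix_len;
    }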


> On the other hand, I’m concerned that radix tree would be
> over-engineering in terms of vacuum's dead tuples storage since the
> dead tuple storage is static data and requires only lookup operation,
> so if we want to use radix tree as dead tuple storage, I'd like to see
> further use cases.

I don't think we should rely on the read-only-ness. It seems pretty
clear that we'd want parallel dead-tuple scans at a point not too far
into the future?

Greetings,

Andres Freund



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Jul 29, 2021 at 3:53 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-07-27 13:06:56 +0900, Masahiko Sawada wrote:
> > Apart from performance and memory usage points of view, we also need
> > to consider the reusability of the code. When I started this thread, I
> > thought the best data structure would be the one optimized for
> > vacuum's dead tuple storage. However, if we can use a data structure
> > that can also be used in general, we can use it also for other
> > purposes. Moreover, if it's too optimized for the current TID system
> > (32 bits block number, 16 bits offset number, maximum block/offset
> > number, etc.) it may become a blocker for future changes.
>
> Indeed.
>
>
> > In that sense, radix tree also seems good since it can also be used in
> > gist vacuum as a replacement for intset, or a replacement for hash
> > table for shared buffer as discussed before. Are there any other use
> > cases?
>
> Yes, I think there are. Whenever there is some spatial locality it has a
> decent chance of winning over a hash table, and it will most of the time
> win over ordered data structures like rbtrees (which perform very poorly
> due to the number of branches and pointer dispatches). There are plenty
> of hashtables, e.g. for caches, locks, etc., in PG that have a
> medium-high degree of locality, so I'd expect a few potential uses. When
> adding "tree compression" (i.e. skipping inner nodes that have a single
> incoming & outgoing node) radix trees can even deal quite performantly
> with variable-width keys.

Good point.

>
> > On the other hand, I’m concerned that radix tree would be
> > over-engineering in terms of vacuum's dead tuples storage since the
> > dead tuple storage is static data and requires only lookup operation,
> > so if we want to use radix tree as dead tuple storage, I'd like to see
> > further use cases.
>
> I don't think we should rely on the read-only-ness. It seems pretty
> clear that we'd want parallel dead-tuple scans at a point not too far
> into the future?

Indeed. Given that the radix tree itself has other use cases, I have
no concern about using radix tree for vacuum's dead tuples storage. It
will be better to have one that can be generally used and has some
optimizations that are helpful also for vacuum's use case, rather than
having one that is very optimized only for vacuum's use case.

During the performance benchmark, I found some bugs in the radix tree
implementation. Also, we need tree iteration functionality, and if we
have the radix tree in the source tree as a general library, we need
some changes since the current implementation seems intended as a
replacement for the shared buffer hash table. I'll try to work on that
as a PoC if you don't. What do you think?

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Yura Sokolov
Date:
Masahiko Sawada wrote 2021-07-29 12:11:
> On Thu, Jul 29, 2021 at 3:53 AM Andres Freund <andres@anarazel.de> 
> wrote:
>> [...]
> 
> Indeed. Given that the radix tree itself has other use cases, I have
> no concern about using radix tree for vacuum's dead tuples storage. It
> will be better to have one that can be generally used and has some
> optimizations that are helpful also for vacuum's use case, rather than
> having one that is very optimized only for vacuum's use case.

The main portion of svtm that leads to memory savings is the
compression of many pages at once (CHUNK). It could be combined with
radix as a storage for pointers to CHUNKs.

At the moment I'm benchmarking an IntegerSet replacement based on a
Trie (HATM-like) and CHUNK compression, so the data structure could be
used for gist vacuum as well.

Since it is generic (it allows indexing the whole 64-bit space), it
lacks the trick used to speed up svtm. Still, on the 10x test it is
faster than radix.

I'll send the results later today after all benchmarks complete.

And then I'll try to make a mix of radix and CHUNK compression.

> During the performance benchmark, I found some bugs in the radix tree
> implementation.

There is a bug in radix_to_key_off as well:

     tid_i |= ItemPointerGetBlockNumber(tid) << shift;

ItemPointerGetBlockNumber returns uint32, therefore the result after
the shift is uint32 as well.

It leads to lower memory consumption (and therefore better times) on
the 10x test, when the page number exceeds 2^23 (8M). It still produces
a "correct" result for the test since every page is filled in the same
way.
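
The minimal fix is to widen the block number before shifting, e.g.:

     tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;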

Could you push your fixes for radix, please?

regards,
Yura Sokolov

y.sokolov@postgrespro.ru
funny.falcon@gmail.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Jul 29, 2021 at 8:03 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
>
> Masahiko Sawada wrote 2021-07-29 12:11:
> > [...]
>
> The main portion of svtm that leads to memory savings is the
> compression of many pages at once (CHUNK). It could be combined with
> radix as a storage for pointers to CHUNKs.
>
> At the moment I'm benchmarking an IntegerSet replacement based on a
> Trie (HATM-like) and CHUNK compression, so the data structure could be
> used for gist vacuum as well.
>
> Since it is generic (it allows indexing the whole 64-bit space), it
> lacks the trick used to speed up svtm. Still, on the 10x test it is
> faster than radix.

BTW, how does svtm work when we add two sets of dead tuple TIDs to one
svtm? Each set of dead tuple TIDs is unique, but two sets could have
TIDs at different offsets on the same block. The case I imagine is the
idea discussed on this thread[1]. With this idea, we store the
collected dead tuple TIDs somewhere and skip index vacuuming for some
reason (index skipping optimization, failsafe mode, interruptions,
etc.). Then, at the next lazy vacuum, we load the dead tuple TIDs and
start to scan the heap. During the heap scan in the second lazy vacuum,
it's possible that new dead tuples will be found on pages that we have
already stored in the svtm during the first lazy vacuum. How can we
efficiently update the chunk in the svtm?

Regards,

[1] https://www.postgresql.org/message-id/CA%2BTgmoZgapzekbTqdBrcH8O8Yifi10_nB7uWLB8ajAhGL21M6A%40mail.gmail.com


--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Yura Sokolov
Date:
Masahiko Sawada wrote 2021-07-29 17:29:
> On Thu, Jul 29, 2021 at 8:03 PM Yura Sokolov <y.sokolov@postgrespro.ru> 
> wrote:
>> [...]
>> 
>> The main portion of svtm that leads to memory savings is the
>> compression of many pages at once (CHUNK). It could be combined with
>> radix as a storage for pointers to CHUNKs.
>> 
>> At the moment I'm benchmarking an IntegerSet replacement based on a
>> Trie (HATM-like) and CHUNK compression, so the data structure could be
>> used for gist vacuum as well.
>> 
>> Since it is generic (it allows indexing the whole 64-bit space), it
>> lacks the trick used to speed up svtm. Still, on the 10x test it is
>> faster than radix.

I've attached the IntegerSet2 patch for the pgtools repo and benchmark
results. Branch https://github.com/funny-falcon/pgtools/tree/integerset2

SVTM is measured with a couple of changes from commit
5055ef72d23482dd3e11ce in that branch: 1) compress the bitmap more
often, though compression is slower, 2) a couple of popcount tricks.

IntegerSet2 consists of a trie index over CHUNKs. A CHUNK is a
compressed bitmap of 2^15 (6+9) bits (almost like in SVTM, but with a
fixed bit width).

Well, IntegerSet2 is always faster than IntegerSet and always uses
significantly less memory (radix uses more memory than IntegerSet in a
couple of tests and comparable amounts in others).

IntegerSet2 is not always faster than radix; it behaves more like
radix. That is because both are generic prefix trees with a comparable
number of memory accesses. SVTM did the trick by being not a multilevel
prefix tree but just a one-level bitmap index over chunks.

I believe the trie part of IntegerSet2 could be replaced with radix,
i.e., use radix as the storage for pointers to CHUNKs.

> BTW, how does svtm work when we add two sets of dead tuple TIDs to one
> svtm? Each set of dead tuple TIDs is unique, but two sets could have
> TIDs at different offsets on the same block. The case I imagine is the
> idea discussed on this thread[1]. With this idea, we store the
> collected dead tuple TIDs somewhere and skip index vacuuming for some
> reason (index skipping optimization, failsafe mode, interruptions,
> etc.). Then, at the next lazy vacuum, we load the dead tuple TIDs and
> start to scan the heap. During the heap scan in the second lazy vacuum,
> it's possible that new dead tuples will be found on pages that we have
> already stored in the svtm during the first lazy vacuum. How can we
> efficiently update the chunk in the svtm?

If we store the TID map to disk, then it will be serialized. Since SVTM
and IntegerSet2 are ordered, they can be loaded in order. Then we can
just merge tuples on a per-page basis: deserialize a page (or CHUNK),
put in the new tuples, and store it again. Since both scans (the scan
of the serialized map and the scan of the table) are in order, merging
will be cheap enough.

SVTM and IntegerSet2 already work in a "buffered" way on insertion.
(As does IntegerSet, which also does compression, but in small parts.)

regards,

Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com
Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Yura Sokolov
Date:
Yura Sokolov wrote 2021-07-29 18:29:

> I've attached IntegerSet2 patch for pgtools repo and benchmark results.
> Branch https://github.com/funny-falcon/pgtools/tree/integerset2

Strange web-mail client... I can never be sure what it will attach...

Reattaching the benchmark results.

> 
> regards,
> 
> Yura Sokolov
> y.sokolov@postgrespro.ru
> funny.falcon@gmail.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Robert Haas
Date:
On Thu, Jul 29, 2021 at 5:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Indeed. Given that the radix tree itself has other use cases, I have
> no concern about using radix tree for vacuum's dead tuples storage. It
> will be better to have one that can be generally used and has some
> optimizations that are helpful also for vacuum's use case, rather than
> having one that is very optimized only for vacuum's use case.

What I'm about to say might be a really stupid idea, especially since
I haven't looked at any of the code already posted, but what I'm
wondering about is whether we need a full radix tree or maybe just a
radix-like lookup aid. For example, suppose that for a relation <= 8MB
in size, we create an array of 1024 elements indexed by block number.
Each element of the array stores an offset into the dead TID array.
When you need to probe for a TID, you look up blkno and blkno + 1 in
the array and then bsearch only between those two offsets. For bigger
relations, a two or three level structure could be built, or it could
always be 3 levels. This could even be done on demand, so you
initialize all of the elements to some special value that means "not
computed yet" and then fill them the first time they're needed,
perhaps with another special value that means "no TIDs in that block".
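
To make that concrete, a rough sketch of the one-level case
(hypothetical names; vac_cmp_itemptr stands in for the comparator lazy
vacuum already uses with bsearch):

    /*
     * first[blkno] holds the index in the sorted dead-TID array of the
     * first TID for that block; first[nblocks] holds the total count.
     */
    typedef struct BlockIndex
    {
        BlockNumber nblocks;
        int        *first;
    } BlockIndex;

    static bool
    tid_is_dead(BlockIndex *bi, ItemPointerData *deadtids, ItemPointer tid)
    {
        BlockNumber blkno = ItemPointerGetBlockNumber(tid);
        int         lo = bi->first[blkno];
        int         hi = bi->first[blkno + 1];

        /* bsearch only within this block's slice of the dead-TID array */
        return bsearch(tid, deadtids + lo, hi - lo,
                       sizeof(ItemPointerData), vac_cmp_itemptr) != NULL;
    }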

I don't know if this is better, but I do kind of like the fact that
the basic representation is just an array. It makes it really easy to
predict how much memory will be needed for a given number of dead
TIDs, and it's very DSM-friendly as well.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Yura Sokolov
Date:
Robert Haas wrote 2021-07-29 20:15:
> On Thu, Jul 29, 2021 at 5:11 AM Masahiko Sawada <sawada.mshk@gmail.com> 
> wrote:
>> Indeed. Given that the radix tree itself has other use cases, I have
>> no concern about using radix tree for vacuum's dead tuples storage. It
>> will be better to have one that can be generally used and has some
>> optimizations that are helpful also for vacuum's use case, rather than
>> having one that is very optimized only for vacuum's use case.
> 
> What I'm about to say might be a really stupid idea, especially since
> I haven't looked at any of the code already posted, but what I'm
> wondering about is whether we need a full radix tree or maybe just a
> radix-like lookup aid. For example, suppose that for a relation <= 8MB
> in size, we create an array of 1024 elements indexed by block number.
> Each element of the array stores an offset into the dead TID array.
> When you need to probe for a TID, you look up blkno and blkno + 1 in
> the array and then bsearch only between those two offsets. For bigger
> relations, a two or three level structure could be built, or it could
> always be 3 levels. This could even be done on demand, so you
> initialize all of the elements to some special value that means "not
> computed yet" and then fill them the first time they're needed,
> perhaps with another special value that means "no TIDs in that block".

An 8MB relation is not a problem, imo. There is no need to do anything
to handle an 8MB relation.

The problem is a 2TB relation. It has 256M pages and, let's suppose, 3G
dead tuples.

Then the block offset array will be 2GB and the tuple offset array will
be 6GB (a 2-byte offset per tuple): 8GB in total.

We could make the offset array cover only the upper 3 bytes of the
block number. We would then have a 1M-entry offset array weighing 8MB,
and an array of 3-byte tuple pointers (1 remaining byte from the block
number, and 2 bytes from the tuple) weighing 9GB.

But using per-batch compression schemes, there could be an amortized 4
bytes per page and 1 byte per tuple: 1GB + 3GB = 4GB of memory. Yes, it
is not as guaranteed as in the array approach. But 95% of the time it
is this low or even lower. And better: the more tuples that are dead,
the better the compression works. A page with all tuples dead could be
encoded in as little as 5 bytes. Therefore, overall memory consumption
is more stable and predictable.

Lower memory consumption for tuple storage means there is less chance
that indexes must be scanned two or more times. That gives more
predictability in the user experience.

> I don't know if this is better, but I do kind of like the fact that
> the basic representation is just an array. It makes it really easy to
> predict how much memory will be needed for a given number of dead
> TIDs, and it's very DSM-friendly as well.

The whole thing could be encoded in one single array of bytes. Just
give "pointer-to-array" + "array-size" to the constructor, and use a
"bump allocator" inside. A complex logical structure doesn't imply
"DSM-unfriendliness"... I mean, if it is suitably designed.

In fact, my code uses a bump allocator internally to avoid the
"per-allocation overhead" of "aset", "slab" or "generational". And the
IntegerSet2 version even uses it for all allocations since it has no
reallocatable parts.

Well, if a data structure has reallocatable parts, it could be less
friendly to DSM.
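
A sketch of that interface (illustrative names, not my actual code):

    typedef struct BumpArena
    {
        char   *base;       /* caller-provided memory, e.g. in a DSM segment */
        Size    size;
        Size    used;
    } BumpArena;

    static void *
    bump_alloc(BumpArena *a, Size len)
    {
        char   *p;

        len = MAXALIGN(len);        /* keep every allocation aligned */
        if (a->used + len > a->size)
            return NULL;            /* arena exhausted */
        p = a->base + a->used;
        a->used += len;
        return p;
    }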

regards,

---
Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
Hi,

On 2021-07-29 13:15:53 -0400, Robert Haas wrote:
> I don't know if this is better, but I do kind of like the fact that
> the basic representation is just an array. It makes it really easy to
> predict how much memory will be needed for a given number of dead
> TIDs, and it's very DSM-friendly as well.

I think those advantages are far outstripped by the big disadvantage of
needing to either size the array accurately from the start, or to
reallocate the whole array.  Our current pre-allocation behaviour is
very wasteful for most vacuums but doesn't handle large work_mem at all,
causing unnecessary index scans.

Greetings,

Andres Freund



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Robert Haas
Date:
On Thu, Jul 29, 2021 at 3:14 PM Andres Freund <andres@anarazel.de> wrote:
> I think those advantages are far outstripped by the big disadvantage of
> needing to either size the array accurately from the start, or to
> reallocate the whole array.  Our current pre-allocation behaviour is
> very wasteful for most vacuums but doesn't handle large work_mem at all,
> causing unnecessary index scans.

I agree that the current pre-allocation behavior is bad, but I don't
really see that as an issue with my idea. Fixing that would require
allocating the array in chunks, but that doesn't really affect the
core of the idea much, at least as I see it.

But I accept that Yura has a very good point about the memory usage of
what I was proposing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
Hi,

On 2021-07-30 15:13:49 -0400, Robert Haas wrote:
> On Thu, Jul 29, 2021 at 3:14 PM Andres Freund <andres@anarazel.de> wrote:
> > I think those advantages are far outstripped by the big disadvantage of
> > needing to either size the array accurately from the start, or to
> > reallocate the whole array.  Our current pre-allocation behaviour is
> > very wasteful for most vacuums but doesn't handle large work_mem at all,
> > causing unnecessary index scans.
> 
> I agree that the current pre-allocation behavior is bad, but I don't
> really see that as an issue with my idea. Fixing that would require
> allocating the array in chunks, but that doesn't really affect the
> core of the idea much, at least as I see it.

Well, then it'd not really be the "simple array approach" anymore :)


> But I accept that Yura has a very good point about the memory usage of
> what I was proposing.

The lower memory usage will also often result in better cache
utilization - which is a crucial factor for index vacuuming when the
index order isn't correlated with the heap order. Cache misses really
are a crucial performance factor there.

Greetings,

Andres Freund



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Robert Haas
Date:
On Fri, Jul 30, 2021 at 3:34 PM Andres Freund <andres@anarazel.de> wrote:
> The lower memory usage also often will result in a better cache
> utilization - which is a crucial factor for index vacuuming when the
> index order isn't correlated with the heap order. Cache misses really
> are a crucial performance factor there.

Fair enough.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Matthias van de Meent
Date:
Hi,

Today I noticed the inefficiencies of our dead tuple storage once
again, and started theorizing about a better storage method; which is
when I remembered that this thread exists, and that this thread
already has amazing results.

Are there any plans to get the results of this thread from PoC to committable?

Kind regards,

Matthias van de Meent



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
Hi,

On 2022-02-11 13:47:01 +0100, Matthias van de Meent wrote:
> Today I noticed the inefficiencies of our dead tuple storage once
> again, and started theorizing about a better storage method; which is
> when I remembered that this thread exists, and that this thread
> already has amazing results.
> 
> Are there any plans to get the results of this thread from PoC to committable?

I'm not currently planning to work on it personally. It'd be awesome if
somebody did...

Greetings,

Andres Freund



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Sun, Feb 13, 2022 at 11:02 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-02-11 13:47:01 +0100, Matthias van de Meent wrote:
> > Today I noticed the inefficiencies of our dead tuple storage once
> > again, and started theorizing about a better storage method; which is
> > when I remembered that this thread exists, and that this thread
> > already has amazing results.
> >
> > Are there any plans to get the results of this thread from PoC to committable?
>
> I'm not currently planning to work on it personally. It'd be awesome if
> somebody did...

Actually, I'm working on simplifying and improving the radix tree
implementation for the PG16 dev cycle. From the discussion so far I
think it's better to have a general-purpose data structure that is also
good for storing TIDs, rather than one very specific to TID storage. So
I think the radix tree would be a potent candidate. I have done the
insertion and search implementation.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
On 2022-02-13 12:36:13 +0900, Masahiko Sawada wrote:
> Actually, I'm working on simplifying and improving the radix tree
> implementation for the PG16 dev cycle. From the discussion so far I
> think it's better to have a general-purpose data structure that is also
> good for storing TIDs, rather than one very specific to TID storage. So
> I think the radix tree would be a potent candidate. I have done the
> insertion and search implementation.

Awesome!



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
Hi,

On Sun, Feb 13, 2022 at 12:39 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2022-02-13 12:36:13 +0900, Masahiko Sawada wrote:
> > Actually, I'm working on simplifying and improving the radix tree
> > implementation for the PG16 dev cycle. From the discussion so far I
> > think it's better to have a general-purpose data structure that is also
> > good for storing TIDs, rather than one very specific to TID storage. So
> > I think the radix tree would be a potent candidate. I have done the
> > insertion and search implementation.
>
> Awesome!

To move this project forward, I've implemented a radix tree from
scratch while studying Andres's implementation. It supports insertion,
search, and iteration but not deletion yet. In my implementation, I use
Datum as the value so internal and leaf nodes have the same data
structure, simplifying the implementation. Iteration on the radix tree
returns keys with their values in ascending order of the key. The patch
has regression tests for the radix tree but is still in a PoC state:
many debugging codes are left in, SSE2 SIMD instructions are not
supported, and the added -mavx2 flag is hard-coded.

I've measured the size, load performance, and lookup performance of
each candidate data structure with two test cases, dense and sparse,
using the test tool[1]. Here are the results:

* Case 1 - Dense (simulating the case where there are runs of 1000
consecutive pages, each of which has 100 dead tuples, separated by
100-page gaps.)
select prepare(
1000000, -- max block
100, -- # of dead tuples per page
1, -- dead tuples interval within  a page
1000, -- # of consecutive pages having dead tuples
1100 -- page interval
);

name           size        attach         lookup
array        520 MB     248.60 ms    89891.92 ms
hash        3188 MB   28029.59 ms    50850.32 ms
intset        85 MB     644.96 ms    39801.17 ms
tbm           96 MB     474.06 ms     6641.38 ms
radix         37 MB     173.03 ms     9145.97 ms
radix_tree    36 MB     184.51 ms     9729.94 ms

* Case 2 - Sparse (simulating a case where every 1000 pages there is
one page with 2 dead tuples.)
select prepare(
10000000, -- max block
2, -- # of dead tuples per page
50, -- dead tuples interval within  a page
1, -- # of consecutive pages having dead tuples
1000 -- page interval
);

name           size       attach         lookup
array        125 kB      0.53 ms    82183.61 ms
hash        1032 kB      1.31 ms    28128.33 ms
intset       222 kB      0.51 ms    87775.68 ms
tbm          768 MB      1.24 ms    98674.60 ms
radix       1080 kB      1.66 ms    20698.07 ms
radix_tree   949 kB      1.50 ms    21465.23 ms

Each test virtually generates TIDs and loads them into the data
structure, and then searches for virtual index TIDs.
'array' is a sorted array, which is the current method; 'hash' is HTAB;
'intset' is IntegerSet; and 'tbm' is TIDBitmap. The last two results
are radix tree implementations: 'radix' is Andres's radix tree
implementation and 'radix_tree' is my radix tree implementation. In
both radix tree tests, I converted each TID into an int64 and stored
the lower 6 bits in the value part of the radix tree.
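
The conversion is roughly like this (a paraphrase of what the tests do,
not the exact patch code):

    /* the low 6 bits select a bit within the 64-bit value in the tree */
    #define ENCODE_BITS 6

    static inline uint64
    tid_to_key_off(ItemPointer tid, uint32 *off)
    {
        uint64  tid_i;
        uint32  shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);

        tid_i = ItemPointerGetOffsetNumber(tid);
        tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;

        *off = tid_i & ((1 << ENCODE_BITS) - 1);    /* bit within value */
        return tid_i >> ENCODE_BITS;                /* radix tree key */
    }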

Overall, radix tree implementations have good numbers. Once we get an
agreement on moving in this direction, I'll start a new thread for
that and move the implementation further; there are many things to do
and discuss: deletion, API design, SIMD support, more tests etc.

Regards,

[1] https://github.com/MasahikoSawada/pgtools/tree/master/bdbench
[2] https://www.postgresql.org/message-id/CAFiTN-visUO9VTz2%2Bh224z5QeUjKhKNdSfjaCucPhYJdbzxx0g%40mail.gmail.com

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Tue, May 10, 2022 at 8:52 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Overall, radix tree implementations have good numbers. Once we get an
> agreement on moving in this direction, I'll start a new thread for
> that and move the implementation further; there are many things to do
> and discuss: deletion, API design, SIMD support, more tests etc.

+1

(FWIW, I think the current thread is still fine.)

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, May 10, 2022 at 6:58 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Tue, May 10, 2022 at 8:52 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Overall, radix tree implementations have good numbers. Once we get an
> > agreement on moving in this direction, I'll start a new thread for
> > that and move the implementation further; there are many things to do
> > and discuss: deletion, API design, SIMD support, more tests etc.
>
> +1
>

Thanks!

I've attached an updated version patch. It is still WIP but I've
implemented deletion and improved test cases and comments.

> (FWIW, I think the current thread is still fine.)

Okay, agreed.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, May 25, 2022 at 11:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, May 10, 2022 at 6:58 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > On Tue, May 10, 2022 at 8:52 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > Overall, radix tree implementations have good numbers. Once we get an
> > > agreement on moving in this direction, I'll start a new thread for
> > > that and move the implementation further; there are many things to do
> > > and discuss: deletion, API design, SIMD support, more tests etc.
> >
> > +1
> >
>
> Thanks!
>
> I've attached an updated version patch. It is still WIP but I've
> implemented deletion and improved test cases and comments.

I've attached an updated version patch that changes the configure
script. I'm still studying how to support AVX2 on msvc build. Also,
added more regression tests.

The integration with lazy vacuum and parallel vacuum is missing for
now. In order to support parallel vacuum, the radix tree needs to
support being created on a DSA area.

Added this item to the next CF.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Jun 16, 2022 at 11:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I've attached an updated version patch that changes the configure
> script. I'm still studying how to support AVX2 on msvc build. Also,
> added more regression tests.

Thanks for the update, I will take a closer look at the patch in the
near future, possibly next week. For now, though, I'd like to question
why we even need to use 32-byte registers in the first place. For one,
the paper referenced has 16-pointer nodes, but none for 32 (next level
is 48 and uses a different method to find the index of the next
pointer). Andres' prototype has 32-pointer nodes, but in a quick read
of his patch a couple weeks ago I don't recall a reason mentioned for
it. Even if 32-pointer nodes are better from a memory perspective, I
imagine it should be possible to use two SSE2 registers to find the
index. It'd be locally slightly more complex, but not much. It might
not even cost much more in cycles since AVX2 would require indirecting
through a function pointer. It's much more convenient if we don't need
a runtime check. There are also thermal and power disadvantages when
using AVX2 in some workloads. I'm not sure that's the case here, but
if it is, we'd better be getting something in return.
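
To sketch what I mean (untested and with invented names; assumes a
sorted 32-entry chunk array, SSE2 intrinsics from emmintrin.h, and
pg_rightmost_one_pos32() from pg_bitutils.h):

static inline int
node32_search_eq_sse2(const uint8 *chunks, uint8 chunk)
{
    __m128i     key = _mm_set1_epi8(chunk);
    __m128i     lo = _mm_loadu_si128((const __m128i *) chunks);
    __m128i     hi = _mm_loadu_si128((const __m128i *) (chunks + 16));
    uint32      mask;

    /* compare all 32 chunks using two 16-byte registers */
    mask = (uint32) _mm_movemask_epi8(_mm_cmpeq_epi8(key, lo));
    mask |= (uint32) _mm_movemask_epi8(_mm_cmpeq_epi8(key, hi)) << 16;

    return mask ? pg_rightmost_one_pos32(mask) : -1;    /* index, or not found */
}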

One more thing in general: In an earlier version, I noticed that
Andres used the slab allocator and documented why. The last version of
your patch that I saw had the same allocator, but not the "why".
Especially in early stages of review, we want to document design
decisions so it's more clear for the reader.

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andrew Dunstan
Date:
On 2022-06-16 Th 00:56, Masahiko Sawada wrote:
>
> I've attached an updated version patch that changes the configure
> script. I'm still studying how to support AVX2 on msvc build. Also,
> added more regression tests.


I think you would need to add '/arch:AVX2' to the compiler flags in
MSBuildProject.pm.


See
<https://docs.microsoft.com/en-us/cpp/build/reference/arch-x64?view=msvc-170>


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
Hi,

On Thu, Jun 16, 2022 at 4:30 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Thu, Jun 16, 2022 at 11:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > I've attached an updated version patch that changes the configure
> > script. I'm still studying how to support AVX2 on msvc build. Also,
> > added more regression tests.
>
> Thanks for the update, I will take a closer look at the patch in the
> near future, possibly next week.

Thanks!

> For now, though, I'd like to question
> why we even need to use 32-byte registers in the first place. For one,
> the paper referenced has 16-pointer nodes, but none for 32 (next level
> is 48 and uses a different method to find the index of the next
> pointer). Andres' prototype has 32-pointer nodes, but in a quick read
> of his patch a couple weeks ago I don't recall a reason mentioned for
> it.

I might be wrong, but since the AVX2 instruction set was introduced
with the Haswell microarchitecture in 2013 and the referenced paper was
published in the same year, the ART paper didn't use AVX2.
32-pointer nodes are better from a memory perspective, as you
mentioned. Andres' prototype supports both 16-pointer and 32-pointer
nodes (out of 6 node types). This provides better memory usage but, on
the other hand, also brings the overhead of switching between node
types. Anyway, which node sizes to support is an important design
decision. It should be made based on experiment results and documented.

> Even if 32-pointer nodes are better from a memory perspective, I
> imagine it should be possible to use two SSE2 registers to find the
> index. It'd be locally slightly more complex, but not much. It might
> not even cost much more in cycles since AVX2 would require indirecting
> through a function pointer. It's much more convenient if we don't need
> a runtime check.

Right.

> There are also thermal and power disadvantages when
> using AXV2 in some workloads. I'm not sure that's the case here, but
> if it is, we'd better be getting something in return.

Good point.

> One more thing in general: In an earlier version, I noticed that
> Andres used the slab allocator and documented why. The last version of
> your patch that I saw had the same allocator, but not the "why".
> Especially in early stages of review, we want to document design
> decisions so it's more clear for the reader.

Indeed. I'll add comments in the next version patch.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Jun 20, 2022 at 7:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

[v3 patch]

Hi Masahiko,

Since there are new files, and they are pretty large, I've attached
most of the specific review comments and questions as a diff rather
than in the email body. This is not a full review, which will take
more time -- this is a first pass, mostly to aid my understanding and
to discuss some of the design and performance implications.

I tend to think it's a good idea to avoid most cosmetic review until
it's close to commit, but I did mention a couple things that might
enhance readability during review.

As I mentioned to you off-list, I have some thoughts on the nodes using SIMD:

> On Thu, Jun 16, 2022 at 4:30 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > For now, though, I'd like to question
> > why we even need to use 32-byte registers in the first place. For one,
> > the paper referenced has 16-pointer nodes, but none for 32 (next level
> > is 48 and uses a different method to find the index of the next
> > pointer). Andres' prototype has 32-pointer nodes, but in a quick read
> > of his patch a couple weeks ago I don't recall a reason mentioned for
> > it.
>
> I might be wrong but since AVX2 instruction set is introduced in
> Haswell microarchitecture in 2013 and the referenced paper is
> published in the same year, the art didn't use AVX2 instruction set.

Sure, but with a bit of work the same technique could be done on that
node size with two 16-byte registers.

> 32-pointer nodes are better from a memory perspective as you
> mentioned. Andres' prototype supports both 16-pointer nodes and
> 32-pointer nodes (out of 6 node types). This would provide better
> memory usage but on the other hand, it would also bring overhead of
> switching the node type.

Right, using more node types provides smaller increments of node size.
Just changing node type can be better or worse, depending on the
input.

> Anyway, it's an important design decision to
> support which size of node to support. It should be done based on
> experiment results and documented.

Agreed. I would add that in the first step, we want something
straightforward to read and easy to integrate into our codebase. I
suspect other optimizations would be worth a lot more than using AVX2:
- collapsing inner nodes
- taking care when constructing the key (more on this when we
integrate with VACUUM)
...and a couple Andres mentioned:
- memory management: in
https://www.postgresql.org/message-id/flat/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de
- node dispatch:
https://www.postgresql.org/message-id/20210728184139.qhvx6nbwdcvo63m6%40alap3.anarazel.de

Therefore, I would suggest that we use SSE2 only, because:
- portability is very easy
- to avoid a performance hit from indirecting through a function pointer

When the PG16 cycle opens, I will work separately on ensuring the
portability of using SSE2, so you can focus on other aspects. I think
it would be a good idea to have both node16 and node32 for testing.
During benchmarking we can delete one or the other and play with the
other thresholds a bit.

Ideally, node16 and node32 would have the same code with a different
loop count (1 or 2). More generally, there is too much duplication of
code (noted by Andres in his PoC), and there are many variable names
with the node size embedded. This is a bit tricky to make more
general, so we don't need to try it yet, but ideally we would have
something similar to:

switch (node->kind) // todo: inspect tagged pointer
{
  case RADIX_TREE_NODE_KIND_4:
       idx = node_search_eq(node, chunk, 4);
       do_action(node, idx, 4, ...);
       break;
  case RADIX_TREE_NODE_KIND_32:
       idx = node_search_eq(node, chunk, 32);
       do_action(node, idx, 32, ...);
  ...
}

static pg_alwaysinline void
node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout)
{
    if (node_fanout <= SIMPLE_LOOP_THRESHOLD)
        // do simple loop with (node_simple *) node;
    else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD)
        // do vectorized loop where available with (node_vec *) node;
    ...
}

...and let the compiler do loop unrolling and branch removal. Not sure
how difficult this is to do, but something to think about.

Another thought: for non-x86 platforms, the SIMD nodes degenerate to
"simple loop", and looping over up to 32 elements is not great
(although possibly okay). We could do binary search, but that has bad
branch prediction.

-- 
John Naylor
EDB: http://www.enterprisedb.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Hannu Krosing
Date:
> Another thought: for non-x86 platforms, the SIMD nodes degenerate to
> "simple loop", and looping over up to 32 elements is not great
> (although possibly okay). We could do binary search, but that has bad
> branch prediction.

I am not sure that SIMD / vector instructions would not be used on the
relevant non-x86 platforms (though it would be a good idea to verify).
Do you know of any modern platforms that do not have SIMD?

I would definitely test before assuming binary search is better.

Often other approaches, like a counting search over such small
vectors, are much better when the vector fits in cache (or even a
cache line) and you always visit all items, as this completely avoids
branch misprediction and allows the compiler to vectorize and/or
unroll the loop as needed.
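
A minimal sketch of such a counting search (illustrative names; every
element is always visited, so there is no data-dependent branch to
mispredict):

static inline int
counting_search(const uint8 *chunks, int count, uint8 key)
{
    int         pos = 0;

    for (int i = 0; i < count; i++)
        pos += (chunks[i] < key);       /* branch-free comparison */

    return pos;     /* position of the first chunk >= key */
}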

Cheers
Hannu



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
Hi,

On 2022-06-27 18:12:13 +0700, John Naylor wrote:
> Another thought: for non-x86 platforms, the SIMD nodes degenerate to
> "simple loop", and looping over up to 32 elements is not great
> (although possibly okay). We could do binary search, but that has bad
> branch prediction.

I'd be quite surprised if binary search were cheaper, particularly on
less fancy platforms.

- Andres



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Jun 27, 2022 at 10:23 PM Hannu Krosing <hannuk@google.com> wrote:
>
> > Another thought: for non-x86 platforms, the SIMD nodes degenerate to
> > "simple loop", and looping over up to 32 elements is not great
> > (although possibly okay). We could do binary search, but that has bad
> > branch prediction.
>
> I am not sure that for relevant non-x86 platforms SIMD / vector
> instructions would not be used (though it would be a good idea to
> verify)

By that logic, we can also dispense with intrinsics on x86 because the
compiler will autovectorize there too (if I understand your claim
correctly). I'm not quite convinced of that in this case.

> I would definitely test before assuming binary search is better.

I wasn't very clear in my language, but I did reject binary search as
having bad branch prediction.

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
Hi,

On 2022-06-28 11:17:42 +0700, John Naylor wrote:
> On Mon, Jun 27, 2022 at 10:23 PM Hannu Krosing <hannuk@google.com> wrote:
> >
> > > Another thought: for non-x86 platforms, the SIMD nodes degenerate to
> > > "simple loop", and looping over up to 32 elements is not great
> > > (although possibly okay). We could do binary search, but that has bad
> > > branch prediction.
> >
> > I am not sure that for relevant non-x86 platforms SIMD / vector
> > instructions would not be used (though it would be a good idea to
> > verify)
> 
> By that logic, we can also dispense with intrinsics on x86 because the
> compiler will autovectorize there too (if I understand your claim
> correctly). I'm not quite convinced of that in this case.

Last time I checked (maybe a year ago?) none of the popular compilers could
autovectorize that code pattern.

Greetings,

Andres Freund



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
Hi,

On Mon, Jun 27, 2022 at 8:12 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Mon, Jun 20, 2022 at 7:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> [v3 patch]
>
> Hi Masahiko,
>
> Since there are new files, and they are pretty large, I've attached
> most specific review comments and questions as a diff rather than in
> the email body. This is not a full review, which will take more time
> -- this is a first pass mostly to aid my understanding, and discuss
> some of the design and performance implications.
>
> I tend to think it's a good idea to avoid most cosmetic review until
> it's close to commit, but I did mention a couple things that might
> enhance readability during review.

Thank you for reviewing the patch!

>
> As I mentioned to you off-list, I have some thoughts on the nodes using SIMD:
>
> > On Thu, Jun 16, 2022 at 4:30 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > > For now, though, I'd like to question
> > > why we even need to use 32-byte registers in the first place. For one,
> > > the paper referenced has 16-pointer nodes, but none for 32 (next level
> > > is 48 and uses a different method to find the index of the next
> > > pointer). Andres' prototype has 32-pointer nodes, but in a quick read
> > > of his patch a couple weeks ago I don't recall a reason mentioned for
> > > it.
> >
> > I might be wrong but since AVX2 instruction set is introduced in
> > Haswell microarchitecture in 2013 and the referenced paper is
> > published in the same year, the art didn't use AVX2 instruction set.
>
> Sure, but with a bit of work the same technique could be done on that
> node size with two 16-byte registers.
>
> > 32-pointer nodes are better from a memory perspective as you
> > mentioned. Andres' prototype supports both 16-pointer nodes and
> > 32-pointer nodes (out of 6 node types). This would provide better
> > memory usage but on the other hand, it would also bring overhead of
> > switching the node type.
>
> Right, using more node types provides smaller increments of node size.
> Just changing node type can be better or worse, depending on the
> input.
>
> > Anyway, it's an important design decision to
> > support which size of node to support. It should be done based on
> > experiment results and documented.
>
> Agreed. I would add that in the first step, we want something
> straightforward to read and easy to integrate into our codebase.

Agreed.



> I
> suspect other optimizations would be worth a lot more than using AVX2:
> - collapsing inner nodes
> - taking care when constructing the key (more on this when we
> integrate with VACUUM)
> ...and a couple Andres mentioned:
> - memory management: in
> https://www.postgresql.org/message-id/flat/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de
> - node dispatch:
> https://www.postgresql.org/message-id/20210728184139.qhvx6nbwdcvo63m6%40alap3.anarazel.de
>
> Therefore, I would suggest that we use SSE2 only, because:
> - portability is very easy
> - to avoid a performance hit from indirecting through a function pointer

Okay, I'll try these optimizations and see if the performance becomes better.

>
> When the PG16 cycle opens, I will work separately on ensuring the
> portability of using SSE2, so you can focus on other aspects.

Thanks!

> I think it would be a good idea to have both node16 and node32 for testing.
> During benchmarking we can delete one or the other and play with the
> other thresholds a bit.

I've done benchmark tests while changing the node types. The code base
is the v3 patch, which doesn't have the optimizations you mentioned
below (memory management and node dispatch), but I added code to use
SSE2 for node-16 and node-32. The 'name' in the results below
indicates the instruction set used (AVX2 or SSE2) and the node types.
For instance, sse2_4_32_48_256 means the radix tree has four node
kinds, with 4, 32, 48, and 256 pointers respectively, and uses the
SSE2 instruction set.

* Case1 - Dense (simulating the case where there are runs of 1000
consecutive pages, each of which has 100 dead tuples, separated by
100-page gaps.)
select prepare(
1000000, -- max block
100, -- # of dead tuples per page
1, -- dead tuple interval within a page
1000, -- # of consecutive pages having dead tuples
1100 -- page interval
);

name                     size        attach         lookup
avx2_4_32_128_256       1154 MB   6742.53 ms   47765.63 ms
avx2_4_32_48_256        1839 MB   4239.35 ms   40528.39 ms
sse2_4_16_128_256       1154 MB   6994.43 ms   40383.85 ms
sse2_4_16_32_128_256    1154 MB   7239.35 ms   43542.39 ms
sse2_4_16_48_256        1839 MB   4404.63 ms   36048.96 ms
sse2_4_32_128_256       1154 MB   6688.50 ms   44902.64 ms

* Case2 - Sparse (simulating a case where there are pages that have 2
dead tuples every 1000 pages.)
select prepare(
10000000, -- max block
2, -- # of dead tuples per page
50, -- dead tuple interval within a page
1, -- # of consecutive pages having dead tuples
1000 -- page interval
);

name                     size       attach        lookup
avx2_4_32_128_256       1535 kB    1.85 ms   17427.42 ms
avx2_4_32_48_256        1472 kB    2.01 ms   22176.75 ms
sse2_4_16_128_256       1582 kB    2.16 ms   15391.12 ms
sse2_4_16_32_128_256    1535 kB    2.14 ms   18757.86 ms
sse2_4_16_48_256        1489 kB    1.91 ms   19210.39 ms
sse2_4_32_128_256       1535 kB    2.05 ms   17777.55 ms

The statistics on the number of each node type are:

* avx2_4_32_128_256 (dense and sparse)
    * nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31
    * nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n128 = 208, n256 = 1

* avx2_4_32_48_256
    * nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n48 = 227, n256 = 916433
    * nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n48 = 159, n256 = 50

* sse2_4_16_128_256
    * nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n128 = 916914, n256 = 31
    * nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n128 = 256, n256 = 1

* sse2_4_16_32_128_256
    * nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n32 = 285, n128 = 916629, n256 = 31
    * nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n32 = 48, n128 = 208, n256 = 1

* sse2_4_16_48_256
    * nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 916433
    * nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n48 = 207, n256 = 50

* sse2_4_32_128_256
    * nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31
    * nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n128 = 208, n256 = 1

Observations are:

In both test cases, there is not much difference between using AVX2
and SSE2. The more node types, the more time it takes to load the
data (see sse2_4_16_32_128_256).

In the dense case, since most nodes have around 100 children, the
radix tree that has node-128 showed good numbers in terms of memory
usage. On the other hand, the radix tree that doesn't have node-128
has better insertion performance. This is probably because we need to
iterate over the 'isset' flags from the beginning of the array in
order to find an empty slot when inserting new data. We do the same
thing for node-48, but it was better than node-128 since it only goes
up to 48 slots.

In terms of lookup performance, the results vary but I could not find
any common pattern that makes the performance better or worse. Getting
more statistics such as the number of each node type per tree level
might help me.

> Ideally, node16 and node32 would have the same code with a different
> loop count (1 or 2). More generally, there is too much duplication of
> code (noted by Andres in his PoC), and there are many variable names
> with the node size embedded. This is a bit tricky to make more
> general, so we don't need to try it yet, but ideally we would have
> something similar to:
>
> switch (node->kind) // todo: inspect tagged pointer
> {
>   case RADIX_TREE_NODE_KIND_4:
>        idx = node_search_eq(node, chunk, 4);
>        do_action(node, idx, 4, ...);
>        break;
>   case RADIX_TREE_NODE_KIND_32:
>        idx = node_search_eq(node, chunk, 32);
>        do_action(node, idx, 32, ...);
>   ...
> }
>
> static pg_alwaysinline void
> node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout)
> {
> if (node_fanout <= SIMPLE_LOOP_THRESHOLD)
>   // do simple loop with (node_simple *) node;
> else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD)
>   // do vectorized loop where available with (node_vec *) node;
> ...
> }
>
> ...and let the compiler do loop unrolling and branch removal. Not sure
> how difficult this is to do, but something to think about.

Agreed.

I'll update my patch based on your review comments and use SSE2.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Tue, Jun 28, 2022 at 1:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > I
> > suspect other optimizations would be worth a lot more than using AVX2:
> > - collapsing inner nodes
> > - taking care when constructing the key (more on this when we
> > integrate with VACUUM)
> > ...and a couple Andres mentioned:
> > - memory management: in
> > https://www.postgresql.org/message-id/flat/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de
> > - node dispatch:
> > https://www.postgresql.org/message-id/20210728184139.qhvx6nbwdcvo63m6%40alap3.anarazel.de
> >
> > Therefore, I would suggest that we use SSE2 only, because:
> > - portability is very easy
> > - to avoid a performance hit from indirecting through a function pointer
>
> Okay, I'll try these optimizations and see if the performance becomes better.

FWIW, I think it's fine if we delay these until after committing a
good-enough version. The exception is key construction and I think
that deserves some attention now (more on this below).

> I've done benchmark tests while changing the node types. The code base
> is v3 patch that doesn't have the optimization you mentioned below
> (memory management and node dispatch) but I added the code to use SSE2
> for node-16 and node-32.

Great, this is helpful to visualize what's going on!

> * sse2_4_16_48_256
>     * nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 916433
>     * nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n48 = 207, n256 = 50
>
> * sse2_4_32_128_256
>     * nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31
>     * nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n128 = 208, n256 = 1

> Observations are:
>
> In both test cases, There is not much difference between using AVX2
> and SSE2. The more mode types, the more time it takes for loading the
> data (see sse2_4_16_32_128_256).

Good to know. And as Andres mentioned in his PoC, more node types
would be a barrier for pointer tagging, since 32-bit platforms only
have two spare bits in the pointer.

> In dense case, since most nodes have around 100 children, the radix
> tree that has node-128 had a good figure in terms of memory usage. On

Looking at the node stats, and then your benchmark code, I think key
construction is a major influence, maybe more than node type. The
key/value scheme tested now makes sense:

blockhi || blocklo || 9 bits of item offset

(with the leaf nodes containing a bit map of the lowest few bits of
this whole thing)

We want the lower fanout nodes at the top of the tree and higher
fanout ones at the bottom.

Note some consequences: If the table has enough columns such that much
fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
dense case the nodes above the leaves will have lower fanout (maybe
they will fit in a node32). Also, the bitmap values in the leaves will
be more empty. In other words, many tables in the wild *resemble* the
sparse case a bit, even if truly all tuples on the page are dead.

Note also that the dense case in the benchmark above has ~4500 times
more keys than the sparse case, and uses about ~1000 times more
memory. But the runtime is only 2-3 times longer. That's interesting
to me.

To optimize for the sparse case, it seems to me that the key/value would be

blockhi || 9 bits of item offset || blocklo

I believe that would make the leaf nodes more dense, with fewer inner
nodes, and could drastically speed up the sparse case, and maybe many
realistic dense cases. I'm curious to hear your thoughts.
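
To make the comparison concrete, a sketch of the two layouts (helper
names are made up; assumes a 32-bit block number split into 16-bit
halves and 9 bits of item offset):

/* current layout: blockhi || blocklo || offset */
static inline uint64
key_dense(BlockNumber block, OffsetNumber offset)
{
    return ((uint64) block << 9) | (offset & 0x1FF);
}

/* proposed layout: blockhi || offset || blocklo */
static inline uint64
key_sparse(BlockNumber block, OffsetNumber offset)
{
    uint64      blockhi = block >> 16;
    uint64      blocklo = block & 0xFFFF;

    return (blockhi << 25) | ((uint64) (offset & 0x1FF) << 16) | blocklo;
}

With the second layout, nearby blocks differ only in the low bits of
the key, so their TIDs cluster into the same leaf nodes.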

> the other hand, the radix tree that doesn't have node-128 has a better
> number in terms of insertion performance. This is probably because we
> need to iterate over 'isset' flags from the beginning of the array in
> order to find an empty slot when inserting new data. We do the same
> thing also for node-48 but it was better than node-128 as it's up to
> 48.

I mentioned in my diff, but for those following along, I think we can
improve that by iterating over the bytes and if it's 0xFF all 8 bits
are set already so keep looking...
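
Roughly like this (a sketch, not the patch code):

static int
node_find_unused_slot(const uint8 *isset, int nbytes)
{
    for (int i = 0; i < nbytes; i++)
    {
        if (isset[i] == 0xFF)
            continue;           /* all 8 slots in this byte are taken */

        for (int bit = 0; bit < 8; bit++)
        {
            if ((isset[i] & (1 << bit)) == 0)
                return i * 8 + bit;
        }
    }

    return -1;                  /* node is full */
}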

> In terms of lookup performance, the results vary but I could not find
> any common pattern that makes the performance better or worse. Getting
> more statistics such as the number of each node type per tree level
> might help me.

I think that's a sign that the choice of node types might not be
terribly important for these two cases. That's good if that's true in
general -- a future performance-critical use of this code might tweak
things for itself without upsetting vacuum.

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Jun 28, 2022 at 10:10 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Tue, Jun 28, 2022 at 1:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > > I
> > > suspect other optimizations would be worth a lot more than using AVX2:
> > > - collapsing inner nodes
> > > - taking care when constructing the key (more on this when we
> > > integrate with VACUUM)
> > > ...and a couple Andres mentioned:
> > > - memory management: in
> > > https://www.postgresql.org/message-id/flat/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de
> > > - node dispatch:
> > > https://www.postgresql.org/message-id/20210728184139.qhvx6nbwdcvo63m6%40alap3.anarazel.de
> > >
> > > Therefore, I would suggest that we use SSE2 only, because:
> > > - portability is very easy
> > > - to avoid a performance hit from indirecting through a function pointer
> >
> > Okay, I'll try these optimizations and see if the performance becomes better.
>
> FWIW, I think it's fine if we delay these until after committing a
> good-enough version. The exception is key construction and I think
> that deserves some attention now (more on this below).

Agreed.

>
> > I've done benchmark tests while changing the node types. The code base
> > is v3 patch that doesn't have the optimization you mentioned below
> > (memory management and node dispatch) but I added the code to use SSE2
> > for node-16 and node-32.
>
> Great, this is helpful to visualize what's going on!
>
> > * sse2_4_16_48_256
> >     * nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 916433
> >     * nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n48 = 207, n256 = 50
> >
> > * sse2_4_32_128_256
> >     * nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31
> >     * nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n128 = 208, n256 = 1
>
> > Observations are:
> >
> > In both test cases, There is not much difference between using AVX2
> > and SSE2. The more mode types, the more time it takes for loading the
> > data (see sse2_4_16_32_128_256).
>
> Good to know. And as Andres mentioned in his PoC, more node types
> would be a barrier for pointer tagging, since 32-bit platforms only
> have two spare bits in the pointer.
>
> > In dense case, since most nodes have around 100 children, the radix
> > tree that has node-128 had a good figure in terms of memory usage. On
>
> Looking at the node stats, and then your benchmark code, I think key
> construction is a major influence, maybe more than node type. The
> key/value scheme tested now makes sense:
>
> blockhi || blocklo || 9 bits of item offset
>
> (with the leaf nodes containing a bit map of the lowest few bits of
> this whole thing)
>
> We want the lower fanout nodes at the top of the tree and higher
> fanout ones at the bottom.

So more inner nodes can fit in CPU cache, right?

>
> Note some consequences: If the table has enough columns such that much
> fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
> dense case the nodes above the leaves will have lower fanout (maybe
> they will fit in a node32). Also, the bitmap values in the leaves will
> be more empty. In other words, many tables in the wild *resemble* the
> sparse case a bit, even if truly all tuples on the page are dead.
>
> Note also that the dense case in the benchmark above has ~4500 times
> more keys than the sparse case, and uses about ~1000 times more
> memory. But the runtime is only 2-3 times longer. That's interesting
> to me.
>
> To optimize for the sparse case, it seems to me that the key/value would be
>
> blockhi || 9 bits of item offset || blocklo
>
> I believe that would make the leaf nodes more dense, with fewer inner
> nodes, and could drastically speed up the sparse case, and maybe many
> realistic dense cases.

Does it have an effect on the number of inner nodes?

>  I'm curious to hear your thoughts.

Thank you for your analysis. It's worth trying. We use 9 bits for the
item offset, but most pages don't use all of those bits in practice.
So it might be better to move the most significant bit of the item
offset to the left of blockhi. Or more simply:

9 bits of item offset || blockhi || blocklo

>
> > the other hand, the radix tree that doesn't have node-128 has a better
> > number in terms of insertion performance. This is probably because we
> > need to iterate over 'isset' flags from the beginning of the array in
> > order to find an empty slot when inserting new data. We do the same
> > thing also for node-48 but it was better than node-128 as it's up to
> > 48.
>
> I mentioned in my diff, but for those following along, I think we can
> improve that by iterating over the bytes and if it's 0xFF all 8 bits
> are set already so keep looking...

Right. Using 0xFF also makes the code readable so I'll change that.

>
> > In terms of lookup performance, the results vary but I could not find
> > any common pattern that makes the performance better or worse. Getting
> > more statistics such as the number of each node type per tree level
> > might help me.
>
> I think that's a sign that the choice of node types might not be
> terribly important for these two cases. That's good if that's true in
> general -- a future performance-critical use of this code might tweak
> things for itself without upsetting vacuum.

Agreed.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
Hi,

I just noticed that I had a reply forgotten in drafts...

On 2022-05-10 10:51:46 +0900, Masahiko Sawada wrote:
> To move this project forward, I've implemented a radix tree from
> scratch while studying Andres's implementation. It supports
> insertion, search, and iteration but not deletion yet. In my
> implementation, I use Datum as the value so internal and leaf nodes
> have the same data structure, simplifying the implementation.
> Iteration over the radix tree returns keys with their values in
> ascending key order. The patch has regression tests for the radix
> tree but is still in a PoC state: a lot of debugging code is left in,
> SSE2 SIMD instructions are not supported, and the added -mavx2 flag
> is hard-coded.

Very cool - thanks for picking this up.

Greetings,

Andres Freund



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
Hi,

On 2022-06-16 13:56:55 +0900, Masahiko Sawada wrote:
> diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
> new file mode 100644
> index 0000000000..bf87f932fd
> --- /dev/null
> +++ b/src/backend/lib/radixtree.c
> @@ -0,0 +1,1763 @@
> +/*-------------------------------------------------------------------------
> + *
> + * radixtree.c
> + *        Implementation for adaptive radix tree.
> + *
> + * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
> + * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
> + * Neumann, 2013.
> + *
> + * There are some differences from the proposed implementation.  For instance,
> + * this radix tree module utilizes AVX2 instructions, enabling us to use
> + * 256-bit wide SIMD vectors, whereas 128-bit wide SIMD vectors are used in
> + * the paper. Also, there is no support for path compression or lazy path
> + * expansion. The radix tree supports a fixed key length, so we don't expect
> + * the tree to become very high.

I think we're going to need path compression at some point, fwiw. I'd bet on
it being beneficial even for the tid case.


> + * The key is a 64-bit unsigned integer and the value is a Datum.

I don't think it's a good idea to define the value type to be a datum.


> +/*
> + * As we descend a radix tree, we push the node to the stack. The stack is used
> + * at deletion.
> + */
> +typedef struct radix_tree_stack_data
> +{
> +    radix_tree_node *node;
> +    struct radix_tree_stack_data *parent;
> +} radix_tree_stack_data;
> +typedef radix_tree_stack_data *radix_tree_stack;

I think it's a very bad idea for traversal to need allocations. I really want
to eventually use this for shared structures (eventually with lock-free
searches at least), and needing to do allocations while traversing the tree is
a no-go for that.

Particularly given that the tree currently has a fixed depth, can't you just
allocate this on the stack once?
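
E.g. something like this (a sketch; radix_tree_find_child() is a
stand-in for the actual descent step):

/* 64-bit key with 8-bit chunks => at most 8 levels */
#define RADIX_TREE_MAX_LEVEL    8

radix_tree_node *stack[RADIX_TREE_MAX_LEVEL];
int         level = 0;

while (node != NULL)
{
    stack[level++] = node;      /* remember the path, no palloc needed */
    node = radix_tree_find_child(node, key);
}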

> +/*
> + * Allocate a new node with the given node kind.
> + */
> +static radix_tree_node *
> +radix_tree_alloc_node(radix_tree *tree, radix_tree_node_kind kind)
> +{
> +    radix_tree_node *newnode;
> +
> +    newnode = (radix_tree_node *) MemoryContextAllocZero(tree->slabs[kind],
> +                                                         radix_tree_node_info[kind].size);
> +    newnode->kind = kind;
> +
> +    /* update the statistics */
> +    tree->mem_used += GetMemoryChunkSpace(newnode);
> +    tree->cnt[kind]++;
> +
> +    return newnode;
> +}

Why are you tracking the memory usage at this level of detail? It's *much*
cheaper to track memory usage via the memory contexts? Since they're dedicated
for the radix tree, that ought to be sufficient?


> +                    else if (idx != n4->n.count)
> +                    {
> +                        /*
> +                         * the key needs to be inserted in the middle of the
> +                         * array, make space for the new key.
> +                         */
> +                        memmove(&(n4->chunks[idx + 1]), &(n4->chunks[idx]),
> +                                sizeof(uint8) * (n4->n.count - idx));
> +                        memmove(&(n4->slots[idx + 1]), &(n4->slots[idx]),
> +                                sizeof(radix_tree_node *) * (n4->n.count - idx));
> +                    }

Maybe we could add a static inline helper for these memmoves? Both because
it's repetitive (for different node types) and because the last time I looked
gcc was generating quite bad code for this. And having to put workarounds into
multiple places is obviously worse than having to do it in one place.
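
Something along these lines, say (the helper name is made up):

/* make room at position 'idx' in parallel chunk/slot arrays of length 'count' */
static inline void
node_make_room(uint8 *chunks, radix_tree_node **slots, int idx, int count)
{
    memmove(&chunks[idx + 1], &chunks[idx],
            sizeof(uint8) * (count - idx));
    memmove(&slots[idx + 1], &slots[idx],
            sizeof(radix_tree_node *) * (count - idx));
}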


> +/*
> + * Insert the key with the val.
> + *
> + * found_p is set to true if the key already present, otherwise false, if
> + * it's not NULL.
> + *
> + * XXX: do we need to support update_if_exists behavior?
> + */

Yes, I think that's needed - hence using bfm_set() instead of insert() in the
prototype.


> +void
> +radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
> +{
> +    int            shift;
> +    bool        replaced;
> +    radix_tree_node *node;
> +    radix_tree_node *parent = tree->root;
> +
> +    /* Empty tree, create the root */
> +    if (!tree->root)
> +        radix_tree_new_root(tree, key, val);
> +
> +    /* Extend the tree if necessary */
> +    if (key > tree->max_val)
> +        radix_tree_extend(tree, key);

FWIW, the reason I used separate functions for these in the prototype is that
it turns out to generate a lot better code, because it allows non-inlined
function calls to be sibling calls - thereby avoiding the need for a dedicated
stack frame. That's not possible once you need a palloc or such, so splitting
off those call paths into dedicated functions is useful.


Greetings,

Andres Freund



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
Hi,

On 2022-06-28 15:24:11 +0900, Masahiko Sawada wrote:
> In both test cases, There is not much difference between using AVX2
> and SSE2. The more mode types, the more time it takes for loading the
> data (see sse2_4_16_32_128_256).

Yea, at some point the compiler starts using a jump table instead of branches,
and that turns out to be a good bit more expensive. And even with branches, it
obviously adds hard-to-predict branches. IIRC I fought a bit with the compiler
to avoid some of that cost; it's possible that got "lost" in Sawada-san's
patch.


Sawada-san, what led you to discard the 1 and 16 node types? IIRC the
1-pointer node is not unimportant, at least until we have path compression.

Right now the node struct sizes are:
4 - 48 bytes
32 - 296 bytes
128 - 1304 bytes
256 - 2088 bytes

I guess radix_tree_node_128->isset is just 16 bytes compared to 1288 other
bytes, but needing that separate isset array somehow is sad :/. I wonder if a
smaller "free index" would do the trick? Point to the element + 1 where we
searched last and start a plain loop there. Particularly in an insert-only
workload that'll always work, and in other cases it'll still often work I
think.
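
As a sketch ('free_hint' and the slot test are made up):

static int
node_128_find_free_slot(radix_tree_node_128 *node)
{
    for (int i = 0; i < 128; i++)
    {
        int     idx = (node->free_hint + i) % 128;

        if (!node_128_slot_is_used(node, idx))
        {
            node->free_hint = idx + 1;  /* insert-only workloads hit at i == 0 */
            return idx;
        }
    }

    return -1;          /* node is full */
}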


One thing I was wondering about is trying to choose node types in
roughly-power-of-two struct sizes. It's pretty easy to end up with significant
fragmentation in the slabs right now when inserting as you go, because some of
the smaller node types will be freed but not enough to actually free blocks of
memory. If we instead have ~power-of-two sizes we could just use a single slab
of the max size, and carve out the smaller node types out of that largest
allocation.

Btw, that fragmentation is another reason why I think it's better to track
memory usage via memory contexts, rather than doing so based on
GetMemoryChunkSpace().


> > Ideally, node16 and node32 would have the same code with a different
> > loop count (1 or 2). More generally, there is too much duplication of
> > code (noted by Andres in his PoC), and there are many variable names
> > with the node size embedded. This is a bit tricky to make more
> > general, so we don't need to try it yet, but ideally we would have
> > something similar to:
> >
> > switch (node->kind) // todo: inspect tagged pointer
> > {
> >   case RADIX_TREE_NODE_KIND_4:
> >        idx = node_search_eq(node, chunk, 4);
> >        do_action(node, idx, 4, ...);
> >        break;
> >   case RADIX_TREE_NODE_KIND_32:
> >        idx = node_search_eq(node, chunk, 32);
> >        do_action(node, idx, 32, ...);
> >   ...
> > }

FWIW, that should be doable with an inline function, if you pass it the memory
of the "array" rather than the node directly. I'm not so sure it's a good idea
to dispatch between node types / search methods inside the helper, as you
suggest below:


> > static pg_alwaysinline void
> > node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout)
> > {
> > if (node_fanout <= SIMPLE_LOOP_THRESHOLD)
> >   // do simple loop with (node_simple *) node;
> > else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD)
> >   // do vectorized loop where available with (node_vec *) node;
> > ...
> > }

Greetings,

Andres Freund



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Jul 4, 2022 at 2:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Jun 28, 2022 at 10:10 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > On Tue, Jun 28, 2022 at 1:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > > I
> > > > suspect other optimizations would be worth a lot more than using AVX2:
> > > > - collapsing inner nodes
> > > > - taking care when constructing the key (more on this when we
> > > > integrate with VACUUM)
> > > > ...and a couple Andres mentioned:
> > > > - memory management: in
> > > > https://www.postgresql.org/message-id/flat/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de
> > > > - node dispatch:
> > > > https://www.postgresql.org/message-id/20210728184139.qhvx6nbwdcvo63m6%40alap3.anarazel.de
> > > >
> > > > Therefore, I would suggest that we use SSE2 only, because:
> > > > - portability is very easy
> > > > - to avoid a performance hit from indirecting through a function pointer
> > >
> > > Okay, I'll try these optimizations and see if the performance becomes better.
> >
> > FWIW, I think it's fine if we delay these until after committing a
> > good-enough version. The exception is key construction and I think
> > that deserves some attention now (more on this below).
>
> Agreed.
>
> >
> > > I've done benchmark tests while changing the node types. The code base
> > > is v3 patch that doesn't have the optimization you mentioned below
> > > (memory management and node dispatch) but I added the code to use SSE2
> > > for node-16 and node-32.
> >
> > Great, this is helpful to visualize what's going on!
> >
> > > * sse2_4_16_48_256
> > >     * nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 916433
> > >     * nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n48 = 207, n256 = 50
> > >
> > > * sse2_4_32_128_256
> > >     * nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31
> > >     * nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n128 = 208, n256 = 1
> >
> > > Observations are:
> > >
> > > In both test cases, There is not much difference between using AVX2
> > > and SSE2. The more mode types, the more time it takes for loading the
> > > data (see sse2_4_16_32_128_256).
> >
> > Good to know. And as Andres mentioned in his PoC, more node types
> > would be a barrier for pointer tagging, since 32-bit platforms only
> > have two spare bits in the pointer.
> >
> > > In dense case, since most nodes have around 100 children, the radix
> > > tree that has node-128 had a good figure in terms of memory usage. On
> >
> > Looking at the node stats, and then your benchmark code, I think key
> > construction is a major influence, maybe more than node type. The
> > key/value scheme tested now makes sense:
> >
> > blockhi || blocklo || 9 bits of item offset
> >
> > (with the leaf nodes containing a bit map of the lowest few bits of
> > this whole thing)
> >
> > We want the lower fanout nodes at the top of the tree and higher
> > fanout ones at the bottom.
>
> So more inner nodes can fit in CPU cache, right?
>
> >
> > Note some consequences: If the table has enough columns such that much
> > fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
> > dense case the nodes above the leaves will have lower fanout (maybe
> > they will fit in a node32). Also, the bitmap values in the leaves will
> > be more empty. In other words, many tables in the wild *resemble* the
> > sparse case a bit, even if truly all tuples on the page are dead.
> >
> > Note also that the dense case in the benchmark above has ~4500 times
> > more keys than the sparse case, and uses about ~1000 times more
> > memory. But the runtime is only 2-3 times longer. That's interesting
> > to me.
> >
> > To optimize for the sparse case, it seems to me that the key/value would be
> >
> > blockhi || 9 bits of item offset || blocklo
> >
> > I believe that would make the leaf nodes more dense, with fewer inner
> > nodes, and could drastically speed up the sparse case, and maybe many
> > realistic dense cases.
>
> Does it have an effect on the number of inner nodes?
>
> >  I'm curious to hear your thoughts.
>
> Thank you for your analysis. It's worth trying. We use 9 bits for item
> offset but most pages don't use all bits in practice. So probably it
> might be better to move the most significant bit of item offset to the
> left of blockhi. Or more simply:
>
> 9 bits of item offset || blockhi || blocklo
>
> >
> > > the other hand, the radix tree that doesn't have node-128 has a better
> > > number in terms of insertion performance. This is probably because we
> > > need to iterate over 'isset' flags from the beginning of the array in
> > > order to find an empty slot when inserting new data. We do the same
> > > thing also for node-48 but it was better than node-128 as it's up to
> > > 48.
> >
> > I mentioned in my diff, but for those following along, I think we can
> > improve that by iterating over the bytes and if it's 0xFF all 8 bits
> > are set already so keep looking...
>
> Right. Using 0xFF also makes the code readable so I'll change that.
>
> >
> > > In terms of lookup performance, the results vary but I could not find
> > > any common pattern that makes the performance better or worse. Getting
> > > more statistics such as the number of each node type per tree level
> > > might help me.
> >
> > I think that's a sign that the choice of node types might not be
> > terribly important for these two cases. That's good if that's true in
> > general -- a future performance-critical use of this code might tweak
> > things for itself without upsetting vacuum.
>
> Agreed.
>

I've attached an updated patch that incorporates the comments from John.
Here are some comments I could not address, and the reasons:

+// bitfield is uint32, so we don't need UINT64_C
  bitfield &= ((UINT64_C(1) << node->n.count) - 1);

Since node->n.count could be 32, I think we need to use UINT64CONST()
here; shifting a 32-bit constant by 32 bits would be undefined behavior.

 /* Macros for radix tree nodes */
+// not sure why are we doing casts here?
 #define IS_LEAF_NODE(n) (((radix_tree_node *) (n))->shift == 0)
 #define IS_EMPTY_NODE(n) (((radix_tree_node *) (n))->count == 0)

I've left the casts as I use IS_LEAF_NODE for rt_node_4/16/32/128/256.

Also, I've dropped the configure script support for AVX2, and support
for SSE2 is missing. I'll update it later.

I've not yet addressed the comments from Andres, so I'll update the
patch according to that discussion, but the current patch should be
more readable than the previous one thanks to John's comments.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Jul 5, 2022 at 6:18 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-06-16 13:56:55 +0900, Masahiko Sawada wrote:
> > diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
> > new file mode 100644
> > index 0000000000..bf87f932fd
> > --- /dev/null
> > +++ b/src/backend/lib/radixtree.c
> > @@ -0,0 +1,1763 @@
> > +/*-------------------------------------------------------------------------
> > + *
> > + * radixtree.c
> > + *           Implementation for adaptive radix tree.
> > + *
> > + * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
> > + * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
> > + * Neumann, 2013.
> > + *
> > + * There are some differences from the proposed implementation.  For instance,
> > + * this radix tree module utilizes AVX2 instruction, enabling us to use 256-bit
> > + * width SIMD vector, whereas 128-bit width SIMD vector is used in the paper.
> > + * Also, there is no support for path compression and lazy path expansion. The
> > + * radix tree supports fixed length of the key so we don't expect the tree level
> > + * wouldn't be high.
>
> I think we're going to need path compression at some point, fwiw. I'd bet on
> it being beneficial even for the tid case.
>
>
> > + * The key is a 64-bit unsigned integer and the value is a Datum.
>
> I don't think it's a good idea to define the value type to be a datum.

A Datum value is convenient for representing both a pointer and a
value, so I used it to avoid defining separate node types for inner
and leaf nodes. Since a Datum could be 4 bytes or 8 bytes depending on
the platform, it might not be good for some platforms. But what
aspects of using a Datum don't you like?

>
>
> > +/*
> > + * As we descend a radix tree, we push the node to the stack. The stack is used
> > + * at deletion.
> > + */
> > +typedef struct radix_tree_stack_data
> > +{
> > +     radix_tree_node *node;
> > +     struct radix_tree_stack_data *parent;
> > +} radix_tree_stack_data;
> > +typedef radix_tree_stack_data *radix_tree_stack;
>
> I think it's a very bad idea for traversal to need allocations. I really want
> to eventually use this for shared structures (eventually with lock-free
> searches at least), and needing to do allocations while traversing the tree is
> a no-go for that.
>
> Particularly given that the tree currently has a fixed depth, can't you just
> allocate this on the stack once?

Yes, we can do that.

>
> > +/*
> > + * Allocate a new node with the given node kind.
> > + */
> > +static radix_tree_node *
> > +radix_tree_alloc_node(radix_tree *tree, radix_tree_node_kind kind)
> > +{
> > +     radix_tree_node *newnode;
> > +
> > +     newnode = (radix_tree_node *) MemoryContextAllocZero(tree->slabs[kind],
> > +
radix_tree_node_info[kind].size);
> > +     newnode->kind = kind;
> > +
> > +     /* update the statistics */
> > +     tree->mem_used += GetMemoryChunkSpace(newnode);
> > +     tree->cnt[kind]++;
> > +
> > +     return newnode;
> > +}
>
> Why are you tracking the memory usage at this level of detail? It's *much*
> cheaper to track memory usage via the memory contexts? Since they're dedicated
> for the radix tree, that ought to be sufficient?

Indeed. I'll use MemoryContextMemAllocated instead.

>
>
> > +                                     else if (idx != n4->n.count)
> > +                                     {
> > +                                             /*
> > +                                              * the key needs to be inserted in the middle of the
> > +                                              * array, make space for the new key.
> > +                                              */
> > +                                             memmove(&(n4->chunks[idx + 1]), &(n4->chunks[idx]),
> > +                                                             sizeof(uint8) * (n4->n.count - idx));
> > +                                             memmove(&(n4->slots[idx + 1]), &(n4->slots[idx]),
> > +                                                             sizeof(radix_tree_node *) * (n4->n.count - idx));
> > +                                     }
>
> Maybe we could add a static inline helper for these memmoves? Both because
> it's repetitive (for different node types) and because the last time I looked
> gcc was generating quite bad code for this. And having to put workarounds into
> multiple places is obviously worse than having to do it in one place.

Agreed, I'll update it.

>
>
> > +/*
> > + * Insert the key with the val.
> > + *
> > + * found_p is set to true if the key already present, otherwise false, if
> > + * it's not NULL.
> > + *
> > + * XXX: do we need to support update_if_exists behavior?
> > + */
>
> Yes, I think that's needed - hence using bfm_set() instead of insert() in the
> prototype.

Agreed.

>
>
> > +void
> > +radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
> > +{
> > +     int                     shift;
> > +     bool            replaced;
> > +     radix_tree_node *node;
> > +     radix_tree_node *parent = tree->root;
> > +
> > +     /* Empty tree, create the root */
> > +     if (!tree->root)
> > +             radix_tree_new_root(tree, key, val);
> > +
> > +     /* Extend the tree if necessary */
> > +     if (key > tree->max_val)
> > +             radix_tree_extend(tree, key);
>
> FWIW, the reason I used separate functions for these in the prototype is that
> it turns out to generate a lot better code, because it allows non-inlined
> function calls to be sibling calls - thereby avoiding the need for a dedicated
> stack frame. That's not possible once you need a palloc or such, so splitting
> off those call paths into dedicated functions is useful.

Thank you for the info. How much does sibling call optimization help
performance in this case? I think these two paths are taken only a
limited number of times: inserting the first key and extending the
tree.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Jul 5, 2022 at 7:00 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-06-28 15:24:11 +0900, Masahiko Sawada wrote:
> > In both test cases, There is not much difference between using AVX2
> > and SSE2. The more mode types, the more time it takes for loading the
> > data (see sse2_4_16_32_128_256).
>
> Yea, at some point the compiler starts using a jump table instead of branches,
> and that turns out to be a good bit more expensive. And even with branches, it
> obviously adds hard to predict branches. IIRC I fought a bit with the compiler
> to avoid some of that cost, it's possible that got "lost" in Sawada-san's
> patch.
>
>
> Sawada-san, what led you to discard the 1 and 16 node types? IIRC the 1 node
> one is not unimportant until we have path compression.

I wanted to start with a smaller number of node types for simplicity.
The 16-pointer node type has been added to the v4 patch I
submitted[1]. I think it's a trade-off between better memory usage and
the overhead of growing (and shrinking) between node types. I'm going
to add more node types once benchmarks show that it's beneficial.

>
> Right now the node struct sizes are:
> 4 - 48 bytes
> 32 - 296 bytes
> 128 - 1304 bytes
> 256 - 2088 bytes
>
> I guess radix_tree_node_128->isset is just 16 bytes compared to 1288 other
> bytes, but needing that separate isset array somehow is sad :/. I wonder if a
> smaller "free index" would do the trick? Point to the element + 1 where we
> searched last and start a plain loop there. Particularly in an insert-only
> workload that'll always work, and in other cases it'll still often work I
> think.

radix_tree_node_128->isset is used to distinguish between a null
pointer in inner nodes and the value 0 in leaf nodes. So I guess we
could have a flag indicating whether a node is a leaf or an inner
node, so that we can interpret (Datum) 0 as either a null pointer or
0. Or, if we define different data types for inner and leaf nodes, we
probably don't need it.


> One thing I was wondering about is trying to choose node types in
> roughly-power-of-two struct sizes. It's pretty easy to end up with significant
> fragmentation in the slabs right now when inserting as you go, because some of
> the smaller node types will be freed but not enough to actually free blocks of
> memory. If we instead have ~power-of-two sizes we could just use a single slab
> of the max size, and carve out the smaller node types out of that largest
> allocation.

Did you mean that we manage memory allocation (and freeing) for the
smaller node types ourselves?

How about using a different block size for each node type?

>
> Btw, that fragmentation is another reason why I think it's better to track
> memory usage via memory contexts, rather than doing so based on
> GetMemoryChunkSpace().

Agreed.

>
>
> > > Ideally, node16 and node32 would have the same code with a different
> > > loop count (1 or 2). More generally, there is too much duplication of
> > > code (noted by Andres in his PoC), and there are many variable names
> > > with the node size embedded. This is a bit tricky to make more
> > > general, so we don't need to try it yet, but ideally we would have
> > > something similar to:
> > >
> > > switch (node->kind) // todo: inspect tagged pointer
> > > {
> > >   case RADIX_TREE_NODE_KIND_4:
> > >        idx = node_search_eq(node, chunk, 4);
> > >        do_action(node, idx, 4, ...);
> > >        break;
> > >   case RADIX_TREE_NODE_KIND_32:
> > >        idx = node_search_eq(node, chunk, 32);
> > >        do_action(node, idx, 32, ...);
> > >   ...
> > > }
>
> FWIW, that should be doable with an inline function, if you pass it the memory
> to the "array" rather than the node directly. Not so sure it's a good idea to
> do dispatch between node types / search methods inside the helper, as you
> suggest below:
>
>
> > > static pg_alwaysinline void
> > > node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout)
> > > {
> > > if (node_fanout <= SIMPLE_LOOP_THRESHOLD)
> > >   // do simple loop with (node_simple *) node;
> > > else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD)
> > >   // do vectorized loop where available with (node_vec *) node;
> > > ...
> > > }

Yeah, it's worth trying at some point.
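
For example, the common helper could take the chunk array and fanout
directly rather than the node itself (just a sketch):

static inline int
node_search_eq(const uint8 *chunks, int count, uint8 key)
{
    for (int i = 0; i < count; i++)
    {
        if (chunks[i] == key)
            return i;
    }
    return -1;                  /* not found */
}

The switch in the caller would then pass node4->chunks, node16->chunks,
and so on, keeping the dispatch on node type out of the helper as
Andres suggests.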

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
Hi,

On 2022-07-05 16:33:17 +0900, Masahiko Sawada wrote:
> On Tue, Jul 5, 2022 at 6:18 AM Andres Freund <andres@anarazel.de> wrote:
> A datum value is convenient to represent both a pointer and a value so
> I used it to avoid defining node types for inner and leaf nodes
> separately.

I'm not convinced that's a good goal. I think we're going to want to have
different key and value types, and trying to unify leaf and inner nodes is
going to make that impossible.

Consider e.g. using it for something like a buffer mapping table - your key
might be way too wide to fit it sensibly into 64bit.


> Since a datum could be 4 bytes or 8 bytes depending on the platform, it
> might not be good for some platforms.

Right - that's another good reason why it's problematic. A lot of key types
aren't going to be 4/8 bytes dependent on 32/64bit, but either / or.


> > > +void
> > > +radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
> > > +{
> > > +     int                     shift;
> > > +     bool            replaced;
> > > +     radix_tree_node *node;
> > > +     radix_tree_node *parent = tree->root;
> > > +
> > > +     /* Empty tree, create the root */
> > > +     if (!tree->root)
> > > +             radix_tree_new_root(tree, key, val);
> > > +
> > > +     /* Extend the tree if necessary */
> > > +     if (key > tree->max_val)
> > > +             radix_tree_extend(tree, key);
> >
> > FWIW, the reason I used separate functions for these in the prototype is that
> > it turns out to generate a lot better code, because it allows non-inlined
> > function calls to be sibling calls - thereby avoiding the need for a dedicated
> > stack frame. That's not possible once you need a palloc or such, so splitting
> > off those call paths into dedicated functions is useful.
> 
> Thank you for the info. How much does using sibling call optimization
> help the performance in this case? I think that these two cases are
> used only a limited number of times: inserting the first key and
> extending the tree.

It's not that it helps in the cases moved into separate functions - it's that
not having that code in the "normal" paths keeps the normal path faster.

Greetings,

Andres Freund



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
Hi,

On 2022-07-05 16:33:29 +0900, Masahiko Sawada wrote:
> > One thing I was wondering about is trying to choose node types in
> > roughly-power-of-two struct sizes. It's pretty easy to end up with significant
> > fragmentation in the slabs right now when inserting as you go, because some of
> > the smaller node types will be freed but not enough to actually free blocks of
> > memory. If we instead have ~power-of-two sizes we could just use a single slab
> > of the max size, and carve out the smaller node types out of that largest
> > allocation.
> 
> Did you mean that we manage memory allocation (and freeing) for the
> smaller node types ourselves?

For all of them basically. Using a single slab allocator and then subdividing
the "common block size" into however many chunks that fit into a single node
type.
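
Something like this, with hypothetical node sizes rounded up to powers
of two and malloc standing in for the slab context (illustrative only):

#include <stdlib.h>

#define MAX_NODE_SIZE 2048      /* assumed size of the largest node */

typedef struct FreeNode
{
    struct FreeNode *next;
} FreeNode;

static FreeNode *freelist[4];   /* one list per node size class */
static const int node_size[4] = {64, 256, 1024, 2048};

static void *
node_alloc(int size_class)
{
    FreeNode   *n = freelist[size_class];

    if (n == NULL)
    {
        /* carve a fresh max-size chunk into pieces of this class */
        char       *chunk = malloc(MAX_NODE_SIZE);
        int         sz = node_size[size_class];

        for (int off = sz; off + sz <= MAX_NODE_SIZE; off += sz)
        {
            FreeNode   *piece = (FreeNode *) (chunk + off);

            piece->next = freelist[size_class];
            freelist[size_class] = piece;
        }
        return chunk;           /* the first piece */
    }
    freelist[size_class] = n->next;
    return n;
}

Freed nodes go back onto their class's list, so the blocks backing the
allocator are all one size and can't fragment the way per-type slabs do.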

> How about using a different block size for each node type?

Not following...


Greetings,

Andres Freund



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Jul 4, 2022 at 12:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> > Looking at the node stats, and then your benchmark code, I think key
> > construction is a major influence, maybe more than node type. The
> > key/value scheme tested now makes sense:
> >
> > blockhi || blocklo || 9 bits of item offset
> >
> > (with the leaf nodes containing a bit map of the lowest few bits of
> > this whole thing)
> >
> > We want the lower fanout nodes at the top of the tree and higher
> > fanout ones at the bottom.
>
> So more inner nodes can fit in CPU cache, right?

My thinking is, on average, there will be more dense space utilization
in the leaf bitmaps, and fewer inner nodes. I'm not quite sure about
cache, since with my idea a search might have to visit more nodes to
get the common negative result (indexed tid not found in vacuum's
list).
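
(For reference, the current scheme in code form would be roughly the
following -- my sketch, not the patch's actual function; it needs
storage/itemptr.h:

static inline uint64
tid_to_key(ItemPointer tid)
{
    uint64      block = ItemPointerGetBlockNumber(tid);
    uint64      off = ItemPointerGetOffsetNumber(tid);

    /* 9 bits is enough since MaxHeapTuplesPerPage is 291 */
    return (block << 9) | off;
}

with the leaf bitmap then covering the lowest few bits of that key, as
described above.)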

> > Note some consequences: If the table has enough columns such that much
> > fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
> > dense case the nodes above the leaves will have lower fanout (maybe
> > they will fit in a node32). Also, the bitmap values in the leaves will
> > be more empty. In other words, many tables in the wild *resemble* the
> > sparse case a bit, even if truly all tuples on the page are dead.
> >
> > Note also that the dense case in the benchmark above has ~4500 times
> > more keys than the sparse case, and uses about ~1000 times more
> > memory. But the runtime is only 2-3 times longer. That's interesting
> > to me.
> >
> > To optimize for the sparse case, it seems to me that the key/value would be
> >
> > blockhi || 9 bits of item offset || blocklo
> >
> > I believe that would make the leaf nodes more dense, with fewer inner
> > nodes, and could drastically speed up the sparse case, and maybe many
> > realistic dense cases.
>
> Does it have an effect on the number of inner nodes?
>
> >  I'm curious to hear your thoughts.
>
> Thank you for your analysis. It's worth trying. We use 9 bits for item
> offset but most pages don't use all bits in practice. So probably it
> might be better to move the most significant bit of item offset to the
> left of blockhi. Or more simply:
>
> 9 bits of item offset || blockhi || blocklo

A concern here is most tids won't use many bits in blockhi either,
most often far fewer, so this would make the tree higher, I think.
Each value of blockhi represents 0.5GB of heap (2^16 blocks of 8kB;
with 2^32 blocks the maximum is 32TB). Even with very large tables I'm
guessing most pages of interest to vacuum are concentrated in a few of
these 0.5GB "segments".

And it's possible path compression would change the tradeoffs here.

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Jul 5, 2022 at 5:09 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-07-05 16:33:17 +0900, Masahiko Sawada wrote:
> > On Tue, Jul 5, 2022 at 6:18 AM Andres Freund <andres@anarazel.de> wrote:
> > A datum value is convenient to represent both a pointer and a value so
> > I used it to avoid defining node types for inner and leaf nodes
> > separately.
>
> I'm not convinced that's a good goal. I think we're going to want to have
> different key and value types, and trying to unify leaf and inner nodes is
> going to make that impossible.
>
> Consider e.g. using it for something like a buffer mapping table - your key
> might be way too wide to fit it sensibly into 64bit.

Right. It seems better to have an interface so that the user of the
radix tree can specify an arbitrary key size (and perhaps value size
too?) at creation time. And we can have separate leaf node types that
store values instead of pointers. If the value size is smaller than the
pointer size, we can store values within leaf nodes; if it's bigger,
the leaf node can instead hold pointers to memory where the values are
stored.

>
>
> > Since a datum could be 4 bytes or 8 bytes depending on the platform, it
> > might not be good for some platforms.
>
> Right - that's another good reason why it's problematic. A lot of key types
> aren't going to be 4/8 bytes dependent on 32/64bit, but either / or.
>
>
> > > > +void
> > > > +radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
> > > > +{
> > > > +     int                     shift;
> > > > +     bool            replaced;
> > > > +     radix_tree_node *node;
> > > > +     radix_tree_node *parent = tree->root;
> > > > +
> > > > +     /* Empty tree, create the root */
> > > > +     if (!tree->root)
> > > > +             radix_tree_new_root(tree, key, val);
> > > > +
> > > > +     /* Extend the tree if necessary */
> > > > +     if (key > tree->max_val)
> > > > +             radix_tree_extend(tree, key);
> > >
> > > FWIW, the reason I used separate functions for these in the prototype is that
> > > it turns out to generate a lot better code, because it allows non-inlined
> > > function calls to be sibling calls - thereby avoiding the need for a dedicated
> > > stack frame. That's not possible once you need a palloc or such, so splitting
> > > off those call paths into dedicated functions is useful.
> >
> > Thank you for the info. How much does using sibling call optimization
> > help the performance in this case? I think that these two cases are
> > used only a limited number of times: inserting the first key and
> > extending the tree.
>
> It's not that it helps in the cases moved into separate functions - it's that
> not having that code in the "normal" paths keeps the normal path faster.

Thanks, understood.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Jul 5, 2022 at 5:49 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Mon, Jul 4, 2022 at 12:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > > Looking at the node stats, and then your benchmark code, I think key
> > > construction is a major influence, maybe more than node type. The
> > > key/value scheme tested now makes sense:
> > >
> > > blockhi || blocklo || 9 bits of item offset
> > >
> > > (with the leaf nodes containing a bit map of the lowest few bits of
> > > this whole thing)
> > >
> > > We want the lower fanout nodes at the top of the tree and higher
> > > fanout ones at the bottom.
> >
> > So more inner nodes can fit in CPU cache, right?
>
> My thinking is, on average, there will be more dense space utilization
> in the leaf bitmaps, and fewer inner nodes. I'm not quite sure about
> cache, since with my idea a search might have to visit more nodes to
> get the common negative result (indexed tid not found in vacuum's
> list).
>
> > > Note some consequences: If the table has enough columns such that much
> > > fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
> > > dense case the nodes above the leaves will have lower fanout (maybe
> > > they will fit in a node32). Also, the bitmap values in the leaves will
> > > be more empty. In other words, many tables in the wild *resemble* the
> > > sparse case a bit, even if truly all tuples on the page are dead.
> > >
> > > Note also that the dense case in the benchmark above has ~4500 times
> > > more keys than the sparse case, and uses about ~1000 times more
> > > memory. But the runtime is only 2-3 times longer. That's interesting
> > > to me.
> > >
> > > To optimize for the sparse case, it seems to me that the key/value would be
> > >
> > > blockhi || 9 bits of item offset || blocklo
> > >
> > > I believe that would make the leaf nodes more dense, with fewer inner
> > > nodes, and could drastically speed up the sparse case, and maybe many
> > > realistic dense cases.
> >
> > Does it have an effect on the number of inner nodes?
> >
> > >  I'm curious to hear your thoughts.
> >
> > Thank you for your analysis. It's worth trying. We use 9 bits for item
> > offset but most pages don't use all bits in practice. So probably it
> > might be better to move the most significant bit of item offset to the
> > left of blockhi. Or more simply:
> >
> > 9 bits of item offset || blockhi || blocklo
>
> A concern here is most tids won't use many bits in blockhi either,
> most often far fewer, so this would make the tree higher, I think.
> Each value of blockhi represents 0.5GB of heap (32TB max). Even with
> very large tables I'm guessing most pages of interest to vacuum are
> concentrated in a few of these 0.5GB "segments".

Right.

I guess that the tree height is affected by where the garbage is,
right? For example, even if all garbage in the table is concentrated
within 0.5GB, if it lies between blocks 2^17 and 2^18 we use the first
byte of blockhi. If the table is larger than 128GB, the second byte of
blockhi could be used depending on where the garbage is.

Another variation of how to store TIDs would be to use the block number
as the key and store a bitmap of the offsets as the value. We could use
Bitmapset for example, or an approach like Roaring bitmap.
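
With Bitmapset the per-block value handling could be as simple as this
sketch (the tree value would then have to be a pointer to the
variable-length bitmap):

static Bitmapset *
add_dead_offset(Bitmapset *offsets, OffsetNumber off)
{
    return bms_add_member(offsets, (int) off);
}

static bool
offset_is_dead(const Bitmapset *offsets, OffsetNumber off)
{
    return bms_is_member((int) off, offsets);
}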

I think that at this stage it's better to define the design first: for
example, the key size and value size, and whether these sizes are fixed
or can be set to an arbitrary size. Given the use case of buffer
mapping, we would need a wider key to store RelFileNode, ForkNumber,
and BlockNumber. On the other hand, limiting the key size to a 64-bit
integer keeps the logic simple, and it could possibly still be used in
buffer mapping cases by using a tree of trees. For the value size, if
we support different value sizes specified by the user, we can either
embed multiple values in the leaf node (called Multi-value leaves in
the ART paper) or introduce a leaf node that stores one value (called
Single-value leaves).

> And it's possible path compression would change the tradeoffs here.

Agreed.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Fri, Jul 8, 2022 at 9:10 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> I guess that the tree height is affected by where the garbage is,
> right? For example, even if all garbage in the table is concentrated
> within 0.5GB, if it lies between blocks 2^17 and 2^18 we use the first
> byte of blockhi. If the table is larger than 128GB, the second byte of
> blockhi could be used depending on where the garbage is.

Right.

> Another variation of how to store TID would be that we use the block
> number as a key and store a bitmap of the offset as a value. We can
> use Bitmapset for example,

I like the idea of using existing code to set/check a bitmap if it's
convenient. But (in case that was implied here) I'd really like to
stay away from variable-length values, which would require
"Single-value leaves" (slow). I also think it's fine to treat the
key/value as just bits, and not care where exactly they came from, as
we've been talking about.

> or an approach like Roaring bitmap.

This would require two new data structures instead of one. That
doesn't seem like a path to success.

> I think that at this stage it's better to define the design first: for
> example, the key size and value size, and whether these sizes are fixed
> or can be set to an arbitrary size.

I don't think we need to start over. Andres' prototype had certain
design decisions built in for the intended use case (although maybe
not clearly documented as such). Subsequent patches in this thread
substantially changed many design aspects. If there were any changes
that made things wonderful for vacuum, it wasn't explained, but Andres
did explain how some of these changes were not good for other uses.
Going to fixed 64-bit keys and values should still allow many future
applications, so let's do that if there's no reason not to.

> For value size, if we support
> different value sizes specified by the user, we can either embed
> multiple values in the leaf node (called Multi-value leaves in ART
> paper)

I don't think "Multi-value leaves" allow for variable-length values,
FWIW. And now I see I also used this term wrong in my earlier review
comment -- v3/4 don't actually use "multi-value leaves", but Andres'
does (going by the multiple leaf types). From the paper: "Multi-value
leaves: The values are stored in one of four different leaf node
types, which mirror the structure of inner nodes, but contain values
instead of pointers."

(It seems v3/v4 could be called a variation of "Combined pointer/value
slots: If values fit into pointers, no separate node types are
necessary. Instead, each pointer storage location in an inner node can
either store a pointer or a value." But without the advantage of
variable length keys).

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Jul 8, 2022 at 3:43 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Fri, Jul 8, 2022 at 9:10 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > I guess that the tree height is affected by where the garbage is,
> > right? For example, even if all garbage in the table is concentrated
> > within 0.5GB, if it lies between blocks 2^17 and 2^18 we use the first
> > byte of blockhi. If the table is larger than 128GB, the second byte of
> > blockhi could be used depending on where the garbage is.
>
> Right.
>
> > Another variation of how to store TID would be that we use the block
> > number as a key and store a bitmap of the offset as a value. We can
> > use Bitmapset for example,
>
> I like the idea of using existing code to set/check a bitmap if it's
> convenient. But (in case that was implied here) I'd really like to
> stay away from variable-length values, which would require
> "Single-value leaves" (slow). I also think it's fine to treat the
> key/value as just bits, and not care where exactly they came from, as
> we've been talking about.
>
> > or an approach like Roaring bitmap.
>
> This would require two new data structures instead of one. That
> doesn't seem like a path to success.

Agreed.

>
> > I think that at this stage it's better to define the design first: for
> > example, the key size and value size, and whether these sizes are fixed
> > or can be set to an arbitrary size.
>
> I don't think we need to start over. Andres' prototype had certain
> design decisions built in for the intended use case (although maybe
> not clearly documented as such). Subsequent patches in this thread
> substantially changed many design aspects. If there were any changes
> that made things wonderful for vacuum, it wasn't explained, but Andres
> did explain how some of these changes were not good for other uses.
> Going to fixed 64-bit keys and values should still allow many future
> applications, so let's do that if there's no reason not to.

I thought Andres pointed out that, given that we'd store a BufferTag
(or part of one) in the key, fixed 64-bit keys might not be enough for
the buffer mapping use case. If we want to use keys wider than 64 bits,
we would need to consider that.

>
> > For value size, if we support
> > different value sizes specified by the user, we can either embed
> > multiple values in the leaf node (called Multi-value leaves in ART
> > paper)
>
> I don't think "Multi-value leaves" allow for variable-length values,
> FWIW. And now I see I also used this term wrong in my earlier review
> comment -- v3/4 don't actually use "multi-value leaves", but Andres'
> does (going by the multiple leaf types). From the paper: "Multi-value
> leaves: The values are stored in one of four different leaf node
> types, which mirror the structure of inner nodes, but contain values
> instead of pointers."

Right, but sorry, I meant that the user specifies an arbitrary
fixed-size value length at creation time, like we do in dynahash.c.

>
> (It seems v3/v4 could be called a variation of "Combined pointer/value
> slots: If values fit into pointers, no separate node types are
> necessary. Instead, each pointer storage location in an inner node can
> either store a pointer or a value." But without the advantage of
> variable length keys).

Agreed.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Tue, Jul 12, 2022 at 8:16 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> > > I think that at this stage it's better to define the design first: for
> > > example, the key size and value size, and whether these sizes are fixed
> > > or can be set to an arbitrary size.
> >
> > I don't think we need to start over. Andres' prototype had certain
> > design decisions built in for the intended use case (although maybe
> > not clearly documented as such). Subsequent patches in this thread
> > substantially changed many design aspects. If there were any changes
> > that made things wonderful for vacuum, it wasn't explained, but Andres
> > did explain how some of these changes were not good for other uses.
> > Going to fixed 64-bit keys and values should still allow many future
> > applications, so let's do that if there's no reason not to.
>
> I thought Andres pointed out that, given that we'd store a BufferTag
> (or part of one) in the key, fixed 64-bit keys might not be enough for
> the buffer mapping use case. If we want to use keys wider than 64 bits,
> we would need to consider that.

It sounds like you've answered your own question, then. If so, I'm
curious what your current thinking is.

If we *did* want to have maximum flexibility, then "single-value
leaves" method would be the way to go, since it seems to be the
easiest way to have variable-length both keys and values. I do have a
concern that the extra pointer traversal would be a drag on
performance, and also require lots of small memory allocations. If we
happened to go that route, your idea upthread of using a bitmapset of
item offsets in the leaves sounds like a good fit for that.

I also have some concerns about also simultaneously trying to design
for the use for buffer mappings. I certainly want to make this good
for as many future uses as possible, and I'd really like to preserve
any optimizations already fought for. However, to make concrete
progress on the thread subject, I also don't think it's the most
productive use of time to get tied up about the fine details of
something that will not likely happen for several years at the
earliest.

--
John Naylor
EDB: http://www.enterprisedb.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Jul 14, 2022 at 1:17 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Tue, Jul 12, 2022 at 8:16 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > > > I think that at this stage it's better to define the design first: for
> > > > example, the key size and value size, and whether these sizes are fixed
> > > > or can be set to an arbitrary size.
> > >
> > > I don't think we need to start over. Andres' prototype had certain
> > > design decisions built in for the intended use case (although maybe
> > > not clearly documented as such). Subsequent patches in this thread
> > > substantially changed many design aspects. If there were any changes
> > > that made things wonderful for vacuum, it wasn't explained, but Andres
> > > did explain how some of these changes were not good for other uses.
> > > Going to fixed 64-bit keys and values should still allow many future
> > > applications, so let's do that if there's no reason not to.
> >
> > I thought Andres pointed out that, given that we'd store a BufferTag
> > (or part of one) in the key, fixed 64-bit keys might not be enough for
> > the buffer mapping use case. If we want to use keys wider than 64 bits,
> > we would need to consider that.
>
> It sounds like you've answered your own question, then. If so, I'm
> curious what your current thinking is.
>
> If we *did* want to have maximum flexibility, then "single-value
> leaves" method would be the way to go, since it seems to be the
> easiest way to have variable-length both keys and values. I do have a
> concern that the extra pointer traversal would be a drag on
> performance, and also require lots of small memory allocations.

Agreed.

> I also have some concerns about also simultaneously trying to design
> for the use for buffer mappings. I certainly want to make this good
> for as many future uses as possible, and I'd really like to preserve
> any optimizations already fought for. However, to make concrete
> progress on the thread subject, I also don't think it's the most
> productive use of time to get tied up about the fine details of
> something that will not likely happen for several years at the
> earliest.

I’d like to keep the first version simple. We can improve it and add
more optimizations later. Using radix tree for vacuum TID storage
would still be a big win compared to using a flat array, even without
all these optimizations. In terms of single-value leaves method, I'm
also concerned about an extra pointer traversal and extra memory
allocation. It's most flexible but multi-value leaves method is also
flexible enough for many use cases. Using the single-value method
seems to be too much as the first step for me.

Overall, using 64-bit keys and 64-bit values would be a reasonable
choice for me as the first step. It can cover wider use cases
including vacuum TID use cases. And possibly it can cover use cases by
combining a hash table or using tree of tree, for example.
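
With that, the public interface could be kept as simple as something
like the following (function names are tentative, not final):

radix_tree *rt_create(MemoryContext ctx);
void        rt_free(radix_tree *tree);
void        rt_set(radix_tree *tree, uint64 key, uint64 value);
bool        rt_search(radix_tree *tree, uint64 key, uint64 *value_p);
uint64      rt_memory_usage(radix_tree *tree);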

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
Hi,

On 2022-07-08 11:09:44 +0900, Masahiko Sawada wrote:
> I think that at this stage it's better to define the design first: for
> example, the key size and value size, and whether these sizes are fixed
> or can be set to an arbitrary size. Given the use case of buffer
> mapping, we would need a wider key to store RelFileNode, ForkNumber,
> and BlockNumber. On the other hand, limiting the key size to a 64-bit
> integer keeps the logic simple, and it could possibly still be used in
> buffer mapping cases by using a tree of trees. For the value size, if
> we support different value sizes specified by the user, we can either
> embed multiple values in the leaf node (called Multi-value leaves in
> the ART paper) or introduce a leaf node that stores one value (called
> Single-value leaves).

FWIW, I think the best path forward would be to do something similar to the
simplehash.h approach, so it can be customized to the specific user.

Greetings,

Andres Freund



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Tue, Jul 19, 2022 at 9:24 AM Andres Freund <andres@anarazel.de> wrote:
> FWIW, I think the best path forward would be to do something similar to the
> simplehash.h approach, so it can be customized to the specific user.

I figured that would come up at some point. It may be worth doing in the future, but I think it's way too much to ask for the first use case.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Peter Geoghegan
Date:
On Mon, Jul 18, 2022 at 9:10 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
> On Tue, Jul 19, 2022 at 9:24 AM Andres Freund <andres@anarazel.de> wrote:
> > FWIW, I think the best path forward would be to do something similar to the
> > simplehash.h approach, so it can be customized to the specific user.
>
> I figured that would come up at some point. It may be worth doing in the future, but I think it's way too much to ask
> for the first use case.
 

I have a prototype patch that creates a read-only snapshot of the
visibility map, and has vacuumlazy.c work off of that when determining
which pages to skip. The patch also gets rid of the
SKIP_PAGES_THRESHOLD stuff. This is very effective with TPC-C,
principally because it really cuts down on the number of scanned_pages
that are scanned only because the VM bit is unset concurrently by DML.
The window for this is very large when the table is large (and
naturally takes a long time to scan), resulting in many more "dead but
not yet removable" tuples being encountered than necessary. Which
itself causes bogus information in the FSM -- information about the
space that VACUUM could free from the page, which is often highly
misleading.

There are remaining questions about how to do this properly. Right now
I'm just copying pages from the VM into local memory, right after
OldestXmin is first acquired -- we "lock in" a snapshot of the VM at
the earliest opportunity, which is what lazy_scan_skip() actually
works off now. There needs to be some consideration given to the
resource management aspects of this -- it needs to use memory
sensibly, which the current prototype patch doesn't do at all. I'm
probably going to seriously pursue this as a project soon, and will
probably need some kind of data structure for the local copy. The raw
pages are usually quite space inefficient, considering we only need an
immutable snapshot of the VM.

I wonder if it makes sense to use this as part of this project. It
will be possible to know the exact heap pages that will become
scanned_pages before scanning even one page with this design (perhaps
with caveats about low memory conditions). It could also be very
effective as a way of speeding up TID lookups in the reasonably common
case where most scanned_pages don't have any LP_DEAD items -- just
look it up in our local/materialized copy of the VM first. But even
when LP_DEAD items are spread fairly evenly, it could still give us
reliable information about the distribution of LP_DEAD items very
early on.

Maybe the two data structures could even be combined in some way? You
can use more memory for the local copy of the VM if you know that you
won't need the memory for dead_items. It's kinda the same problem, in
a way.

-- 
Peter Geoghegan



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:


On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> I’d like to keep the first version simple. We can improve it and add
> more optimizations later. Using radix tree for vacuum TID storage
> would still be a big win compared to using a flat array, even without
> all these optimizations. In terms of single-value leaves method, I'm
> also concerned about an extra pointer traversal and extra memory
> allocation. It's most flexible but multi-value leaves method is also
> flexible enough for many use cases. Using the single-value method
> seems to be too much as the first step for me.
>
> Overall, using 64-bit keys and 64-bit values would be a reasonable
> choice for me as the first step. It can cover wider use cases
> including vacuum TID use cases. And possibly it can cover use cases by
> combining a hash table or using tree of tree, for example.

These two aspects would also bring it closer to Andres' prototype, which 1) makes review easier and 2) easier to preserve optimization work already done, so +1 from me.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Jul 19, 2022 at 1:30 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
>
>
> On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > I’d like to keep the first version simple. We can improve it and add
> > more optimizations later. Using radix tree for vacuum TID storage
> > would still be a big win compared to using a flat array, even without
> > all these optimizations. In terms of single-value leaves method, I'm
> > also concerned about an extra pointer traversal and extra memory
> > allocation. It's most flexible but multi-value leaves method is also
> > flexible enough for many use cases. Using the single-value method
> > seems to be too much as the first step for me.
> >
> > Overall, using 64-bit keys and 64-bit values would be a reasonable
> > choice for me as the first step. It can cover wider use cases
> > including vacuum TID use cases. And possibly it can cover use cases by
> > combining a hash table or using tree of tree, for example.
>
> These two aspects would also bring it closer to Andres' prototype, which 1) makes review easier and 2) easier to
> preserve optimization work already done, so +1 from me.

Thanks.

I've updated the patch. It now implements 64-bit keys, 64-bit values,
and the multi-value leaves method. I've tried to remove duplicated
code but we might find a better way to do that.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Jul 22, 2022 at 10:43 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Jul 19, 2022 at 1:30 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> >
> >
> > On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > > I’d like to keep the first version simple. We can improve it and add
> > > more optimizations later. Using radix tree for vacuum TID storage
> > > would still be a big win compared to using a flat array, even without
> > > all these optimizations. In terms of single-value leaves method, I'm
> > > also concerned about an extra pointer traversal and extra memory
> > > allocation. It's most flexible but multi-value leaves method is also
> > > flexible enough for many use cases. Using the single-value method
> > > seems to be too much as the first step for me.
> > >
> > > Overall, using 64-bit keys and 64-bit values would be a reasonable
> > > choice for me as the first step. It can cover wider use cases
> > > including vacuum TID use cases. And possibly it can cover use cases by
> > > combining a hash table or using tree of tree, for example.
> >
> > These two aspects would also bring it closer to Andres' prototype, which 1) makes review easier and 2) easier to
> > preserve optimization work already done, so +1 from me.
>
> Thanks.
>
> I've updated the patch. It now implements 64-bit keys, 64-bit values,
> and the multi-value leaves method. I've tried to remove duplicated
> code but we might find a better way to do that.
>

With the recent changes related to SIMD, I'm going to split the patch
into at least two parts: one introducing further SIMD-optimized
functions used by the radix tree, and one with the radix tree
implementation itself. In particular, we need two functions for the
radix tree: a function like pg_lfind32 but for 8-bit integers that
returns the index, and a function that returns the index of the first
element that is >= key.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Aug 15, 2022 at 12:39 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Jul 22, 2022 at 10:43 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Jul 19, 2022 at 1:30 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > >
> > >
> > > On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > > I’d like to keep the first version simple. We can improve it and add
> > > > more optimizations later. Using radix tree for vacuum TID storage
> > > > would still be a big win compared to using a flat array, even without
> > > > all these optimizations. In terms of single-value leaves method, I'm
> > > > also concerned about an extra pointer traversal and extra memory
> > > > allocation. It's most flexible but multi-value leaves method is also
> > > > flexible enough for many use cases. Using the single-value method
> > > > seems to be too much as the first step for me.
> > > >
> > > > Overall, using 64-bit keys and 64-bit values would be a reasonable
> > > > choice for me as the first step. It can cover wider use cases
> > > > including vacuum TID use cases. And possibly it can cover use cases by
> > > > combining a hash table or using tree of tree, for example.
> > >
> > > These two aspects would also bring it closer to Andres' prototype, which 1) makes review easier and 2) easier to
> > > preserve optimization work already done, so +1 from me.
> >
> > Thanks.
> >
> > I've updated the patch. It now implements 64-bit keys, 64-bit values,
> > and the multi-value leaves method. I've tried to remove duplicated
> > code but we might find a better way to do that.
> >
>
> With the recent changes related to SIMD, I'm going to split the patch
> into at least two parts: one introducing further SIMD-optimized
> functions used by the radix tree, and one with the radix tree
> implementation itself. In particular, we need two functions for the
> radix tree: a function like pg_lfind32 but for 8-bit integers that
> returns the index, and a function that returns the index of the first
> element that is >= key.

I recommend looking at

https://www.postgresql.org/message-id/CAFBsxsESLUyJ5spfOSyPrOvKUEYYNqsBosue9SV1j8ecgNXSKA%40mail.gmail.com

since I did the work just now for searching bytes and returning a
bool, both = and <=. Should be pretty close. Also, I believe if you
left this for last as a possible refactoring, it might save some work.
In any case, I'll take a look at the latest patch next month.

--
John Naylor
EDB: http://www.enterprisedb.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Aug 15, 2022 at 10:39 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Mon, Aug 15, 2022 at 12:39 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Jul 22, 2022 at 10:43 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Tue, Jul 19, 2022 at 1:30 PM John Naylor
> > > <john.naylor@enterprisedb.com> wrote:
> > > >
> > > >
> > > >
> > > > On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > > I’d like to keep the first version simple. We can improve it and add
> > > > > more optimizations later. Using radix tree for vacuum TID storage
> > > > > would still be a big win compared to using a flat array, even without
> > > > > all these optimizations. In terms of single-value leaves method, I'm
> > > > > also concerned about an extra pointer traversal and extra memory
> > > > > allocation. It's most flexible but multi-value leaves method is also
> > > > > flexible enough for many use cases. Using the single-value method
> > > > > seems to be too much as the first step for me.
> > > > >
> > > > > Overall, using 64-bit keys and 64-bit values would be a reasonable
> > > > > choice for me as the first step. It can cover wider use cases
> > > > > including vacuum TID use cases. And possibly it can cover use cases by
> > > > > combining a hash table or using tree of tree, for example.
> > > >
> > > > These two aspects would also bring it closer to Andres' prototype, which 1) makes review easier and 2) easier
> > > > to preserve optimization work already done, so +1 from me.
> > >
> > > Thanks.
> > >
> > > I've updated the patch. It now implements 64-bit keys, 64-bit values,
> > > and the multi-value leaves method. I've tried to remove duplicated
> > > code but we might find a better way to do that.
> > >
> >
> > With the recent changes related to SIMD, I'm going to split the patch
> > into at least two parts: one introducing further SIMD-optimized
> > functions used by the radix tree, and one with the radix tree
> > implementation itself. In particular, we need two functions for the
> > radix tree: a function like pg_lfind32 but for 8-bit integers that
> > returns the index, and a function that returns the index of the first
> > element that is >= key.
>
> I recommend looking at
>
> https://www.postgresql.org/message-id/CAFBsxsESLUyJ5spfOSyPrOvKUEYYNqsBosue9SV1j8ecgNXSKA%40mail.gmail.com
>
> since I did the work just now for searching bytes and returning a
> bool, both = and <=. Should be pretty close. Also, I believe if you
> left this for last as a possible refactoring, it might save some work.
> In any case, I'll take a look at the latest patch next month.

I've updated the radix tree patch. It's now separated into two patches.

The 0001 patch introduces pg_lsearch8() and pg_lsearch8_ge() (we may
find better names), which are similar to the pg_lfind8() family but
return the index of the key in the vector instead of true/false. The
patch includes regression tests.
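
The scalar fallback of pg_lsearch8_ge looks roughly like the following
(a sketch; the SSE2 and NEON paths do the same with vector comparisons,
and -1 here means there is no such element):

static inline int
pg_lsearch8_ge(uint8 key, const uint8 *base, uint32 nelem)
{
    for (uint32 i = 0; i < nelem; i++)
    {
        if (base[i] >= key)
            return (int) i;
    }
    return -1;
}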

The 0002 patch is the main radix tree implementation. I've removed
some duplicated node-manipulation code. For instance, since node-4,
node-16, and node-32 have similar structures with different fanouts, I
introduced a common function for them.

In addition to these two patches, I've attached a third patch. It's
not part of the radix tree implementation but introduces a contrib
module, bench_radix_tree, a tool for radix tree performance
benchmarking. It measures the loading and lookup performance of both
the radix tree and a flat array.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Aug 15, 2022 at 10:39 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > bool, both = and <=. Should be pretty close. Also, I believe if you
> > left this for last as a possible refactoring, it might save some work.

v6 demonstrates why this should have been put off towards the end. (more below)

> > In any case, I'll take a look at the latest patch next month.

Since the CF entry said "Needs Review", I began looking at v5 again
this week. Hopefully not too much has changed, but in the future I
strongly recommend setting to "Waiting on Author" if a new version is
forthcoming. I realize many here share updated patches at any time,
but I'd like to discourage the practice especially for large patches.

> I've updated the radix tree patch. It's now separated into two patches.
>
> 0001 patch introduces pg_lsearch8() and pg_lsearch8_ge() (we may find
> better names) that are similar to the pg_lfind8() family but they
> return the index of the key in the vector instead of true/false. The
> patch includes regression tests.

I don't want to do a full review of this just yet, but I'll just point
out some problems from a quick glance.

+/*
+ * Return the index of the first element in the vector that is greater than
+ * or eual to the given scalar. Return sizeof(Vector8) if there is no such
+ * element.

That's a bizarre API to indicate non-existence.

+ *
+ * Note that this function assumes the elements in the vector are sorted.
+ */

That is *completely* unacceptable for a general-purpose function.

+#else /* USE_NO_SIMD */
+ Vector8 r = 0;
+ uint8 *rp = (uint8 *) &r;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ rp[i] = (((const uint8 *) &v1)[i] == ((const uint8 *) &v2)[i]) ? 0xFF : 0;

I don't think we should try to force the non-simd case to adopt the
special semantics of vector comparisons. It's much easier to just use
the same logic as the assert builds.

+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+        1 << 0, 1 << 1, 1 << 2, 1 << 3,
+        1 << 4, 1 << 5, 1 << 6, 1 << 7,
+        1 << 0, 1 << 1, 1 << 2, 1 << 3,
+        1 << 4, 1 << 5, 1 << 6, 1 << 7,
+      };
+
+    uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t)
vshrq_n_s8(v, 7));
+    uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+    return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));

For Arm, we need to be careful here. This article goes into a lot of
detail for this situation:


https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon

Here again, I'd rather put this off and focus on getting the "large
details" in good enough shape so we can got towards integrating with
vacuum.

> In addition to two patches, I've attached the third patch. It's not
> part of radix tree implementation but introduces a contrib module
> bench_radix_tree, a tool for radix tree performance benchmarking. It
> measures loading and lookup performance of both the radix tree and a
> flat array.

Excellent! This was high on my wish list.

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Nathan Bossart
Date:
On Fri, Sep 16, 2022 at 02:54:14PM +0700, John Naylor wrote:
> Here again, I'd rather put this off and focus on getting the "large
> details" in good enough shape so we can got towards integrating with
> vacuum.

I started a new thread for the SIMD patch [0] so that this thread can
remain focused on the radix tree stuff.

[0] https://www.postgresql.org/message-id/20220917052903.GA3172400%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Sep 16, 2022 at 4:54 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Aug 15, 2022 at 10:39 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > > bool, both = and <=. Should be pretty close. Also, I believe if you
> > > left this for last as a possible refactoring, it might save some work.
>
> v6 demonstrates why this should have been put off towards the end. (more below)
>
> > > In any case, I'll take a look at the latest patch next month.
>
> Since the CF entry said "Needs Review", I began looking at v5 again
> this week. Hopefully not too much has changed, but in the future I
> strongly recommend setting to "Waiting on Author" if a new version is
> forthcoming. I realize many here share updated patches at any time,
> but I'd like to discourage the practice especially for large patches.

Understood. Sorry for the inconvenience.

>
> > I've updated the radix tree patch. It's now separated into two patches.
> >
> > 0001 patch introduces pg_lsearch8() and pg_lsearch8_ge() (we may find
> > better names) that are similar to the pg_lfind8() family but they
> > return the index of the key in the vector instead of true/false. The
> > patch includes regression tests.
>
> I don't want to do a full review of this just yet, but I'll just point
> out some problems from a quick glance.
>
> +/*
> + * Return the index of the first element in the vector that is greater than
> + * or eual to the given scalar. Return sizeof(Vector8) if there is no such
> + * element.
>
> That's a bizarre API to indicate non-existence.
>
> + *
> + * Note that this function assumes the elements in the vector are sorted.
> + */
>
> That is *completely* unacceptable for a general-purpose function.
>
> +#else /* USE_NO_SIMD */
> + Vector8 r = 0;
> + uint8 *rp = (uint8 *) &r;
> +
> + for (Size i = 0; i < sizeof(Vector8); i++)
> + rp[i] = (((const uint8 *) &v1)[i] == ((const uint8 *) &v2)[i]) ? 0xFF : 0;
>
> I don't think we should try to force the non-simd case to adopt the
> special semantics of vector comparisons. It's much easier to just use
> the same logic as the assert builds.
>
> +#ifdef USE_SSE2
> + return (uint32) _mm_movemask_epi8(v);
> +#elif defined(USE_NEON)
> + static const uint8 mask[16] = {
> +        1 << 0, 1 << 1, 1 << 2, 1 << 3,
> +        1 << 4, 1 << 5, 1 << 6, 1 << 7,
> +        1 << 0, 1 << 1, 1 << 2, 1 << 3,
> +        1 << 4, 1 << 5, 1 << 6, 1 << 7,
> +      };
> +
> +    uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t)
> vshrq_n_s8(v, 7));
> +    uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
> +
> +    return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
>
> For Arm, we need to be careful here. This article goes into a lot of
> detail for this situation:
>
>
https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon
>
> Here again, I'd rather put this off and focus on getting the "large
> details" in good enough shape so we can got towards integrating with
> vacuum.

Thank you for the comments! The above comments have been addressed by
Nathan in a newly derived thread. I'll work on the patch.

I'll consider how to integrate with vacuum as the next step. One
concern for me is how to limit the memory usage to
maintenance_work_mem. Unlike with a flat array, the memory needed to
add one TID varies depending on the situation. If we want to strictly
disallow using more memory than maintenance_work_mem, we probably need
to estimate the memory consumption conservatively.


Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Tue, Sep 20, 2022 at 3:19 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Sep 16, 2022 at 4:54 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:

> > Here again, I'd rather put this off and focus on getting the "large
> > details" in good enough shape so we can got towards integrating with
> > vacuum.
>
> Thank you for the comments! These above comments are addressed by
> Nathan in a newly derived thread. I'll work on the patch.

I still seem to be out-voted on when to tackle this particular optimization, so I've extended the v6 benchmark code with a hackish function that populates a fixed number of keys, but with different fanouts. (diff attached as a text file)

I didn't take particular care to make this scientific, but the following seems pretty reproducible. Note what happens to load and search performance when node16 has 15 entries versus 16:

 fanout | nkeys  | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+--------+------------------+------------+--------------
     15 | 327680 |          3776512 |         39 |           20
(1 row)
num_keys = 327680, height = 4, n4 = 1, n16 = 23408, n32 = 0, n128 = 0, n256 = 0

 fanout | nkeys  | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+--------+------------------+------------+--------------
     16 | 327680 |          3514368 |         25 |           11
(1 row)
num_keys = 327680, height = 4, n4 = 0, n16 = 21846, n32 = 0, n128 = 0, n256 = 0

In trying to wrap the SIMD code behind layers of abstraction, the latest patch (and Nathan's cleanup) effectively threw the SIMD code away in almost all cases. To explain, we need to talk about how vectorized code deals with the "tail" that is too small for the register:

1. Use a one-by-one algorithm, like we do for the pg_lfind* variants.
2. Read some junk into the register and mask off false positives from the result.

There are advantages to both depending on the situation.

Patch v5 and earlier used #2. Patch v6 used #1, so if a node16 has 15 elements or less, it will iterate over them one-by-one exactly like a node4. Only when full with 16 will the vector path be taken. When another entry is added, the elements are copied to the next bigger node, so there's a *small* window where it's fast.

In short, this code needs to be lower level so that we still have full control while being portable. I will work on this, and also the related code for node dispatch.
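
To be concrete, a node16 equality search done in style #2 could look
roughly like this sketch (assuming port/simd.h plus a vector8_eq and a
movemask-style vector8_highbit_mask helper like the NEON code quoted
upthread; pg_rightmost_one_pos32 is from pg_bitutils.h):

static inline int
node_16_search_eq(const uint8 *chunks, uint8 count, uint8 key)
{
    Vector8     spread = vector8_broadcast(key);
    Vector8     haystack;
    uint32      bitfield;

    /* 'chunks' is always 16 bytes long, so reading past 'count' is safe */
    vector8_load(&haystack, chunks);
    bitfield = vector8_highbit_mask(vector8_eq(haystack, spread));
    bitfield &= ((uint32) 1 << count) - 1;  /* discard matches in the junk */

    return bitfield ? pg_rightmost_one_pos32(bitfield) : -1;
}

That way the vector path is taken no matter how full the node is.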

Since v6 has some good infrastructure to do low-level benchmarking, I also want to do some experiments with memory management.

(I have further comments about the code, but I will put that off until later)

> I'll consider how to integrate with vacuum as the next step. One
> concern for me is how to limit the memory usage to
> maintenance_work_mem. Unlike with a flat array, the memory needed to
> add one TID varies depending on the situation. If we want to strictly
> disallow using more memory than maintenance_work_mem, we probably need
> to estimate the memory consumption conservatively.

+1

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Nathan Bossart
Date:
On Wed, Sep 21, 2022 at 01:17:21PM +0700, John Naylor wrote:
> In trying to wrap the SIMD code behind layers of abstraction, the latest
> patch (and Nathan's cleanup) threw it away in almost all cases. To explain,
> we need to talk about how vectorized code deals with the "tail" that is too
> small for the register:
> 
> 1. Use a one-by-one algorithm, like we do for the pg_lfind* variants.
> 2. Read some junk into the register and mask off false positives from the
> result.
> 
> There are advantages to both depending on the situation.
> 
> Patch v5 and earlier used #2. Patch v6 used #1, so if a node16 has 15
> elements or less, it will iterate over them one-by-one exactly like a
> node4. Only when full with 16 will the vector path be taken. When another
> entry is added, the elements are copied to the next bigger node, so there's
> a *small* window where it's fast.
> 
> In short, this code needs to be lower level so that we still have full
> control while being portable. I will work on this, and also the related
> code for node dispatch.

Is it possible to use approach #2 here, too?  AFAICT space is allocated for
all of the chunks, so there wouldn't be any danger in searching all of them
and discarding any results >= node->count.  Granted, we're depending on the
number of chunks always being a multiple of elements-per-vector in order to
avoid the tail path, but that seems like a reasonably safe assumption that
can be covered with comments.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Thu, Sep 22, 2022 at 1:01 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Wed, Sep 21, 2022 at 01:17:21PM +0700, John Naylor wrote:
>
> > In short, this code needs to be lower level so that we still have full
> > control while being portable. I will work on this, and also the related
> > code for node dispatch.
>
> Is it possible to use approach #2 here, too?  AFAICT space is allocated for
> all of the chunks, so there wouldn't be any danger in searching all of them
> and discarding any results >= node->count.

Sure, the caller could pass the maximum node capacity, and then check if the returned index is within the range of the node count.

> Granted, we're depending on the
> number of chunks always being a multiple of elements-per-vector in order to
> avoid the tail path, but that seems like a reasonably safe assumption that
> can be covered with comments.

Actually, we don't need to depend on that at all. When I said "junk" above, that can be any bytes, as long as we're not reading off the end of allocated memory. We'll never do that here, since the child pointers/values follow. In that case, the caller can hard-code the size (it would even happen to work now to multiply rt_node_kind by 16, to be sneaky). One thing I want to try soon is storing fewer than 16/32 etc entries, so that the whole node fits comfortably inside a power-of-two allocation. That would allow us to use aset without wasting space for the smaller nodes, which would be faster and possibly would solve the fragmentation problem Andres referred to in

https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de
While on the subject, I wonder how important it is to keep the chunks in the small nodes in sorted order. That adds branches and memmove calls, and is the whole reason for the recent "pg_lfind_ge" function.
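
(Concretely, the insert path for a sorted node has to do something like
the following sketch, where idx comes from a >= search of the chunk
array -- the pg_lfind_ge job -- and the caller bumps the count:

static inline void
node_insert_at(uint8 *chunks, uint64 *slots, int count, int idx,
               uint8 chunk, uint64 slot)
{
    /* shift the tail right by one to keep both arrays sorted */
    memmove(&chunks[idx + 1], &chunks[idx], (count - idx) * sizeof(uint8));
    memmove(&slots[idx + 1], &slots[idx], (count - idx) * sizeof(uint64));
    chunks[idx] = chunk;
    slots[idx] = slot;
}

plus the branches to find idx in the first place.)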

--
John Naylor
EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Sep 22, 2022 at 1:46 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
>
> On Thu, Sep 22, 2022 at 1:01 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >
> > On Wed, Sep 21, 2022 at 01:17:21PM +0700, John Naylor wrote:
> >
> > > In short, this code needs to be lower level so that we still have full
> > > control while being portable. I will work on this, and also the related
> > > code for node dispatch.
> >
> > Is it possible to use approach #2 here, too?  AFAICT space is allocated for
> > all of the chunks, so there wouldn't be any danger in searching all of them
> > and discarding any results >= node->count.
>
> Sure, the caller could pass the maximum node capacity, and then check if the returned index is within the range of
> the node count.
>
> > Granted, we're depending on the
> > number of chunks always being a multiple of elements-per-vector in order to
> > avoid the tail path, but that seems like a reasonably safe assumption that
> > can be covered with comments.
>
> Actually, we don't need to depend on that at all. When I said "junk" above, that can be any bytes, as long as we're
> not reading off the end of allocated memory. We'll never do that here, since the child pointers/values follow. In
> that case, the caller can hard-code the size (it would even happen to work now to multiply rt_node_kind by 16, to be
> sneaky). One thing I want to try soon is storing fewer than 16/32 etc entries, so that the whole node fits comfortably
> inside a power-of-two allocation. That would allow us to use aset without wasting space for the smaller nodes, which
> would be faster and possibly would solve the fragmentation problem Andres referred to in
>
> https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de
>
> While on the subject, I wonder how important it is to keep the chunks in the small nodes in sorted order. That adds
> branches and memmove calls, and is the whole reason for the recent "pg_lfind_ge" function.

Good point. While keeping the chunks in the small nodes in sorted
order is useful for visiting all keys in sorted order, additional
branches and memmove calls could be slow.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Thu, Sep 22, 2022 at 1:26 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Sep 22, 2022 at 1:46 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > While on the subject, I wonder how important it is to keep the chunks in the small nodes in sorted order. That adds branches and memmove calls, and is the whole reason for the recent "pg_lfind_ge" function.
>
> Good point. While keeping the chunks in the small nodes in sorted
> order is useful for visiting all keys in sorted order, additional
> branches and memmove calls could be slow.

Right, the ordering is a property that some users will need, so best to keep it. Although the node128 doesn't have that property -- too slow to do so, I think.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Thu, Sep 22, 2022 at 7:52 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
>
> On Thu, Sep 22, 2022 at 1:26 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Good point. While keeping the chunks in the small nodes in sorted
> > order is useful for visiting all keys in sorted order, additional
> > branches and memmove calls could be slow.
>
> Right, the ordering is a property that some users will need, so best to keep it. Although the node128 doesn't have that property -- too slow to do so, I think.

Nevermind, I must have been mixing up keys and values there...

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Thu, Sep 22, 2022 at 11:46 AM John Naylor <john.naylor@enterprisedb.com> wrote:
> One thing I want to try soon is storing fewer than 16/32 etc entries, so that the whole node fits comfortably inside a power-of-two allocation. That would allow us to use aset without wasting space for the smaller nodes, which would be faster and possibly would solve the fragmentation problem Andres referred to in

https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de

While calculating node sizes that fit within a power-of-two size, I noticed the current base node is a bit wasteful, taking up 8 bytes. The node kind only has a small number of values, so it doesn't really make sense to use an enum here in the struct (in fact, Andres' prototype used a uint8 for node_kind). We could use a bitfield for the count and kind:

uint16 -- kind and count bitfield
uint8 shift;
uint8 chunk;

That's only 4 bytes. Plus, if the kind is ever encoded in a pointer tag, the bitfield can just go back to being count only.
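
For illustration, a sketch of such a base node; the exact field widths are only an example:

typedef struct rt_node
{
    uint16      kind:2,     /* at most a handful of kinds to dispatch on */
                count:14;   /* fanout never exceeds 256 */
    uint8       shift;      /* key bits consumed below this node */
    uint8       chunk;      /* key byte this node represents in its parent */
} rt_node;                  /* sizeof == 4 */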

Here are the v6 node kinds:

node4:   8 +   4 +(4)    +   4*8 =   48 bytes
node16:  8 +  16         +  16*8 =  152
node32:  8 +  32         +  32*8 =  296
node128: 8 + 256 + 128/8 + 128*8 = 1304
node256: 8       + 256/8 + 256*8 = 2088

And here are the possible ways we could optimize nodes for space using aset allocation. Parentheses are padding bytes. Even if my math has mistakes, the numbers shouldn't be too far off:

node3:   4 +   3 +(1)    +   3*8 =   32 bytes
node6:   4 +   6 +(6)    +   6*8 =   64
node13:  4 +  13 +(7)    +  13*8 =  128
node28:  4 +  28         +  28*8 =  256
node31:  4 + 256 +  32/8 +  31*8 =  512 (XXX not good)
node94:  4 + 256 +  96/8 +  94*8 = 1024
node220: 4 + 256 + 224/8 + 220*8 = 2048
node256:                         = 4096

The main disadvantage is that node256 would balloon in size.

--

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Sep 23, 2022 at 12:11 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
>
> On Thu, Sep 22, 2022 at 11:46 AM John Naylor <john.naylor@enterprisedb.com> wrote:
> > One thing I want to try soon is storing fewer than 16/32 etc entries, so that the whole node fits comfortably
> > inside a power-of-two allocation. That would allow us to use aset without wasting space for the smaller nodes,
> > which would be faster and possibly would solve the fragmentation problem Andres referred to in
>
> > https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de
>
> While calculating node sizes that fit within a power-of-two size, I noticed the current base node is a bit wasteful,
> taking up 8 bytes. The node kind only has a small number of values, so it doesn't really make sense to use an enum
> here in the struct (in fact, Andres' prototype used a uint8 for node_kind). We could use a bitfield for the count and
> kind:
>
> uint16 -- kind and count bitfield
> uint8 shift;
> uint8 chunk;
>
> That's only 4 bytes. Plus, if the kind is ever encoded in a pointer tag, the bitfield can just go back to being
> count only.

Good point, agreed.

>
> Here are the v6 node kinds:
>
> node4:   8 +   4 +(4)    +   4*8 =   48 bytes
> node16:  8 +  16         +  16*8 =  152
> node32:  8 +  32         +  32*8 =  296
> node128: 8 + 256 + 128/8 + 128*8 = 1304
> node256: 8       + 256/8 + 256*8 = 2088
>
> And here are the possible ways we could optimize nodes for space using aset allocation. Parentheses are padding
> bytes. Even if my math has mistakes, the numbers shouldn't be too far off:
>
> node3:   4 +   3 +(1)    +   3*8 =   32 bytes
> node6:   4 +   6 +(6)    +   6*8 =   64
> node13:  4 +  13 +(7)    +  13*8 =  128
> node28:  4 +  28         +  28*8 =  256
> node31:  4 + 256 +  32/8 +  31*8 =  512 (XXX not good)
> node94:  4 + 256 +  96/8 +  94*8 = 1024
> node220: 4 + 256 + 224/8 + 220*8 = 2048
> node256:                         = 4096
>
> The main disadvantage is that node256 would balloon in size.

Yeah, node31 and node256 are bloated.  We probably could use slab for
node256 independently. It's worth trying a benchmark to see how it
affects the performance and the tree size.

BTW We need to consider not only aset/slab but also DSA since we
allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
the following size classes:

static const uint16 dsa_size_classes[] = {
    sizeof(dsa_area_span), 0,   /* special size classes */
    8, 16, 24, 32, 40, 48, 56, 64,  /* 8 classes separated by 8 bytes */
    80, 96, 112, 128,           /* 4 classes separated by 16 bytes */
    160, 192, 224, 256,         /* 4 classes separated by 32 bytes */
    320, 384, 448, 512,         /* 4 classes separated by 64 bytes */
    640, 768, 896, 1024,        /* 4 classes separated by 128 bytes */
    1280, 1560, 1816, 2048,     /* 4 classes separated by ~256 bytes */
    2616, 3120, 3640, 4096,     /* 4 classes separated by ~512 bytes */
    5456, 6552, 7280, 8192      /* 4 classes separated by ~1024 bytes */
};

node256 will be classed as 2616, which is still not good.

Anyway, I'll implement DSA support for radix tree.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Wed, Sep 28, 2022 at 10:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> BTW We need to consider not only aset/slab but also DSA since we
> allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
> the following size classes:
>
> static const uint16 dsa_size_classes[] = {
> [...]

Thanks for that info -- I wasn't familiar with the details of DSA. For the non-parallel case, I plan to at least benchmark using aset because I gather it's the most heavily optimized. I'm thinking that will allow other problem areas to be more prominent. I'll also want to compare total context size compared to slab to see if possibly less fragmentation makes up for other wastage.

Along those lines, one thing I've been thinking about is the number of size classes. There is a tradeoff between memory efficiency and number of branches when searching/inserting. My current thinking is there is too much coupling between size class and data type. Each size class currently uses a different data type and a different algorithm to search and set it, which in turn requires another branch. We've found that a larger number of size classes leads to poor branch prediction [1] and (I imagine) code density.

I'm thinking we can use "flexible array members" for the values/pointers, and keep the rest of the control data in the struct the same. That way, we never have more than 4 actual "kinds" to code and branch on. As a bonus, when migrating a node to a larger size class of the same kind, we can simply repalloc() to the next size. To show what I mean, consider this new table:

node2:   5 +  6       +(5)+  2*8 =   32 bytes
node6:   5 +  6       +(5)+  6*8 =   64

node12:  5 + 27       +     12*8 =  128
node27:  5 + 27       +     27*8 =  248(->256)

node91:  5 + 256 + 28 +(7)+ 91*8 = 1024
node219: 5 + 256 + 28 +(7)+219*8 = 2048

node256: 5 + 32       +(3)+256*8 = 2088(->4096)

Seven size classes are grouped into the four kinds.

The common base at the front is here 5 bytes because there is a new uint8 field for "capacity", which we can ignore for node256 since we assume we can always insert/update that node. The control data is the same in each pair, and so the offset to the pointer/value array is the same. Thus, migration would look something like:

case FOO_KIND:
    if (unlikely(count == capacity))
    {
        if (capacity == XYZ) /* for smaller size class of the pair */
        {
            <repalloc to next size class>;
            capacity = next-higher-capacity;
            goto do_insert;
        }
        else
            <migrate data to next node kind>;
    }
    else
    {
do_insert:
        <...>;
        break;
    }
    /* FALLTHROUGH */
...

One disadvantage is that this wastes some space by reserving the full set of control data in the smaller size class of the pair, but it's usually small compared to array size. Somewhat unrelated, we could still implement Andres' idea [1] to dispense with the isset array in inner nodes of the indirect array type (now node128), since we can just test if the pointer is null.
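
A minimal sketch of the flexible-array-member layout, using the node12/node27 pair from the table above; the struct and field names are hypothetical, and "base" stands for the common control data (kind, count, shift, chunk, capacity):

typedef struct rt_node_linear
{
    rt_node_base base;
    uint8        chunks[27];    /* sized for the larger class of the pair */
    rt_node     *children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_linear;

/* growing node12 -> node27 stays within the kind: */
node = repalloc(node, offsetof(rt_node_linear, children) +
                27 * sizeof(rt_node *));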

[1] https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Wed, Sep 28, 2022 at 1:18 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> [stuff about size classes]

I kind of buried the lede here on one thing: If we only have 4 kinds regardless of the number of size classes, we can use 2 bits of the pointer for dispatch, which would only require 4-byte alignment. That should make that technique more portable.
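
For example (hypothetical macros), assuming nodes are at least 4-byte aligned:

#define RT_KIND_MASK        ((uintptr_t) 0x3)
#define RT_PTR_GET_KIND(p)  ((int) ((uintptr_t) (p) & RT_KIND_MASK))
#define RT_PTR_TAG(p, kind) ((void *) ((uintptr_t) (p) | (uintptr_t) (kind)))
#define RT_PTR_UNTAG(p)     ((void *) ((uintptr_t) (p) & ~RT_KIND_MASK))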

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Sep 28, 2022 at 3:18 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Wed, Sep 28, 2022 at 10:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > BTW We need to consider not only aset/slab but also DSA since we
> > allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
> > the following size classes:
> >
> > static const uint16 dsa_size_classes[] = {
> > [...]
>
> Thanks for that info -- I wasn't familiar with the details of DSA. For the non-parallel case, I plan to at least
> benchmark using aset because I gather it's the most heavily optimized. I'm thinking that will allow other problem
> areas to be more prominent. I'll also want to compare total context size compared to slab to see if possibly less
> fragmentation makes up for other wastage.

Thanks!

>
> Along those lines, one thing I've been thinking about is the number of size classes. There is a tradeoff between
> memory efficiency and number of branches when searching/inserting. My current thinking is there is too much coupling
> between size class and data type. Each size class currently uses a different data type and a different algorithm to
> search and set it, which in turn requires another branch. We've found that a larger number of size classes leads to
> poor branch prediction [1] and (I imagine) code density.
>
> I'm thinking we can use "flexible array members" for the values/pointers, and keep the rest of the control data in
> the struct the same. That way, we never have more than 4 actual "kinds" to code and branch on. As a bonus, when
> migrating a node to a larger size class of the same kind, we can simply repalloc() to the next size.

Interesting idea. Using flexible array members for values would be
good also for the case in the future where we want to support other
value types than uint64.

With this idea, we can just repalloc() to grow to the larger size in a
pair, but I'm slightly concerned that the more size classes we use, the
more frequently a node needs to grow. If we want to support node
shrink, deletion is also affected.

> To show what I mean, consider this new table:
>
> node2:   5 +  6       +(5)+  2*8 =   32 bytes
> node6:   5 +  6       +(5)+  6*8 =   64
>
> node12:  5 + 27       +     12*8 =  128
> node27:  5 + 27       +     27*8 =  248(->256)
>
> node91:  5 + 256 + 28 +(7)+ 91*8 = 1024
> node219: 5 + 256 + 28 +(7)+219*8 = 2048
>
> node256: 5 + 32       +(3)+256*8 = 2088(->4096)
>
> Seven size classes are grouped into the four kinds.
>
> The common base at the front is here 5 bytes because there is a new uint8 field for "capacity", which we can ignore
> for node256 since we assume we can always insert/update that node. The control data is the same in each pair, and so
> the offset to the pointer/value array is the same. Thus, migration would look something like:

I think we can use a bitfield for capacity. That way, we can pack
count (9 bits), kind (2 bits), and capacity (4 bits) into a uint16.

> Somewhat unrelated, we could still implement Andres' idea [1] to dispense with the isset array in inner nodes of the
> indirect array type (now node128), since we can just test if the pointer is null.

Right. I didn't do that, in order to keep the logic common between
inner node128 and leaf node128.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
Hi,

On 2022-09-16 15:00:31 +0900, Masahiko Sawada wrote:
> I've updated the radix tree patch. It's now separated into two patches.

cfbot notices a compiler warning:
https://cirrus-ci.com/task/6247907681632256?logs=gcc_warning#L446

[11:03:05.343] radixtree.c: In function ‘rt_iterate_next’:
[11:03:05.343] radixtree.c:1758:15: error: ‘slot’ may be used uninitialized in this function
[-Werror=maybe-uninitialized]
[11:03:05.343]  1758 |    *value_p = *((uint64 *) slot);
[11:03:05.343]       |               ^~~~~~~~~~~~~~~~~~

Greetings,

Andres Freund



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Oct 3, 2022 at 2:04 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-09-16 15:00:31 +0900, Masahiko Sawada wrote:
> > I've updated the radix tree patch. It's now separated into two patches.
>
> cfbot notices a compiler warning:
> https://cirrus-ci.com/task/6247907681632256?logs=gcc_warning#L446
>
> [11:03:05.343] radixtree.c: In function ‘rt_iterate_next’:
> [11:03:05.343] radixtree.c:1758:15: error: ‘slot’ may be used uninitialized in this function
> [-Werror=maybe-uninitialized]
> [11:03:05.343]  1758 |    *value_p = *((uint64 *) slot);
> [11:03:05.343]       |               ^~~~~~~~~~~~~~~~~~
>

Thanks, I'll fix it in the next version patch.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Sep 28, 2022 at 12:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Sep 23, 2022 at 12:11 AM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> >
> > On Thu, Sep 22, 2022 at 11:46 AM John Naylor <john.naylor@enterprisedb.com> wrote:
> > > One thing I want to try soon is storing fewer than 16/32 etc entries, so that the whole node fits comfortably
> > > inside a power-of-two allocation. That would allow us to use aset without wasting space for the smaller nodes,
> > > which would be faster and possibly would solve the fragmentation problem Andres referred to in
> >
> > > https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de
> >
> > While calculating node sizes that fit within a power-of-two size, I noticed the current base node is a bit
> > wasteful, taking up 8 bytes. The node kind only has a small number of values, so it doesn't really make sense to
> > use an enum here in the struct (in fact, Andres' prototype used a uint8 for node_kind). We could use a bitfield
> > for the count and kind:
> >
> > uint16 -- kind and count bitfield
> > uint8 shift;
> > uint8 chunk;
> >
> > That's only 4 bytes. Plus, if the kind is ever encoded in a pointer tag, the bitfield can just go back to being
> > count only.
>
> Good point, agreed.
>
> >
> > Here are the v6 node kinds:
> >
> > node4:   8 +   4 +(4)    +   4*8 =   48 bytes
> > node16:  8 +  16         +  16*8 =  152
> > node32:  8 +  32         +  32*8 =  296
> > node128: 8 + 256 + 128/8 + 128*8 = 1304
> > node256: 8       + 256/8 + 256*8 = 2088
> >
> > And here are the possible ways we could optimize nodes for space using aset allocation. Parentheses are padding
> > bytes. Even if my math has mistakes, the numbers shouldn't be too far off:
> >
> > node3:   4 +   3 +(1)    +   3*8 =   32 bytes
> > node6:   4 +   6 +(6)    +   6*8 =   64
> > node13:  4 +  13 +(7)    +  13*8 =  128
> > node28:  4 +  28         +  28*8 =  256
> > node31:  4 + 256 +  32/8 +  31*8 =  512 (XXX not good)
> > node94:  4 + 256 +  96/8 +  94*8 = 1024
> > node220: 4 + 256 + 224/8 + 220*8 = 2048
> > node256:                         = 4096
> >
> > The main disadvantage is that node256 would balloon in size.
>
> Yeah, node31 and node256 are bloated.  We probably could use slab for
> node256 independently. It's worth trying a benchmark to see how it
> affects the performance and the tree size.
>
> BTW We need to consider not only aset/slab but also DSA since we
> allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
> the following size classes:
>
> static const uint16 dsa_size_classes[] = {
>     sizeof(dsa_area_span), 0,   /* special size classes */
>     8, 16, 24, 32, 40, 48, 56, 64,  /* 8 classes separated by 8 bytes */
>     80, 96, 112, 128,           /* 4 classes separated by 16 bytes */
>     160, 192, 224, 256,         /* 4 classes separated by 32 bytes */
>     320, 384, 448, 512,         /* 4 classes separated by 64 bytes */
>     640, 768, 896, 1024,        /* 4 classes separated by 128 bytes */
>     1280, 1560, 1816, 2048,     /* 4 classes separated by ~256 bytes */
>     2616, 3120, 3640, 4096,     /* 4 classes separated by ~512 bytes */
>     5456, 6552, 7280, 8192      /* 4 classes separated by ~1024 bytes */
> };
>
> node256 will be classed as 2616, which is still not good.
>
> Anyway, I'll implement DSA support for radix tree.
>

Regarding DSA support, IIUC we need to use dsa_pointer in inner nodes
to point to its child nodes, instead of C pointers (i.e., backend-local
address). I'm thinking of a straightforward approach as the first
step; inner nodes have a union of rt_node* and dsa_pointer and we
choose either one based on whether the radix tree is shared or not. We
allocate and free the shared memory for individual nodes by
dsa_allocate() and dsa_free(), respectively. Therefore we need to get
a C pointer from dsa_pointer by using dsa_get_address() while
descending the tree. I'm a bit concerned that calling
dsa_get_address() for every descent could be performance overhead but
I'm going to measure it anyway.
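
To sketch that straightforward approach concretely (the names, and the is_shared/area fields, are hypothetical):

typedef union rt_ptr
{
    rt_node    *local;      /* backend-local address */
    dsa_pointer shared;     /* offset into the DSA area */
} rt_ptr;

static inline rt_node *
rt_ptr_to_node(radix_tree *tree, rt_ptr ptr)
{
    if (tree->is_shared)
        return (rt_node *) dsa_get_address(tree->area, ptr.shared);
    else
        return ptr.local;
}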

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Wed, Oct 5, 2022 at 1:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Sep 28, 2022 at 12:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Sep 23, 2022 at 12:11 AM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > Yeah, node31 and node256 are bloated.  We probably could use slab for
> > node256 independently. It's worth trying a benchmark to see how it
> > affects the performance and the tree size.

This wasn't the focus of your current email, but while experimenting with v6 I had another thought about local allocation: If we use the default slab block size of 8192 bytes, then only 3 chunks of size 2088 can fit, right? If so, since aset and DSA also waste at least a few hundred bytes, we could store a useless 256-byte slot array within node256. That way, node128 and node256 share the same start of pointers/values array, so there would be one less branch for getting that address. In v6, rt_node_get_values and rt_node_get_children are not inlined (aside: gcc uses a jump table for 5 kinds but not for 4), but possibly should be, and the smaller the better.

> Regarding DSA support, IIUC we need to use dsa_pointer in inner nodes
> to point to its child nodes, instead of C pointers (i.e., backend-local
> address). I'm thinking of a straightforward approach as the first
> step; inner nodes have a union of rt_node* and dsa_pointer and we
> choose either one based on whether the radix tree is shared or not. We
> allocate and free the shared memory for individual nodes by
> dsa_allocate() and dsa_free(), respectively. Therefore we need to get
> a C pointer from dsa_pointer by using dsa_get_address() while
> descending the tree. I'm a bit concerned that calling
> dsa_get_address() for every descent could be performance overhead but
> I'm going to measure it anyway.

Are dsa pointers aligned the same as pointers to locally allocated memory? Meaning, is the offset portion always a multiple of 4 (or 8)? It seems that way from a glance, but I can't say for sure. If the lower 2 bits of a DSA pointer are never set, we can tag them the same way as a regular pointer. That same technique could help hide the latency of converting the pointer, the same way it would hide the latency of loading parts of a node into CPU registers.

One concern is, handling both local and dsa cases in the same code requires more (predictable) branches and reduces code density. That might be a reason in favor of templating to handle each case in its own translation unit. But that might be overkill.
--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Oct 5, 2022 at 6:40 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
>
> On Wed, Oct 5, 2022 at 1:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Sep 28, 2022 at 12:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Fri, Sep 23, 2022 at 12:11 AM John Naylor
> > > <john.naylor@enterprisedb.com> wrote:
> > > Yeah, node31 and node256 are bloated.  We probably could use slab for
> > > node256 independently. It's worth trying a benchmark to see how it
> > > affects the performance and the tree size.
>
> This wasn't the focus of your current email, but while experimenting with v6 I had another thought about local
> allocation: If we use the default slab block size of 8192 bytes, then only 3 chunks of size 2088 can fit, right? If
> so, since aset and DSA also waste at least a few hundred bytes, we could store a useless 256-byte slot array within
> node256. That way, node128 and node256 share the same start of pointers/values array, so there would be one less
> branch for getting that address. In v6, rt_node_get_values and rt_node_get_children are not inlined (aside: gcc uses
> a jump table for 5 kinds but not for 4), but possibly should be, and the smaller the better.

It would be good for performance, but I'm a bit concerned that it's
highly optimized to the design of aset and DSA. Since size 2088 is
currently classed as 2616 in DSA, DSA wastes 528 bytes. However, if
we introduce a new class of 2304 (= 2048 + 256) bytes, we cannot store a
useless 256-byte array and the assumption will be broken.

>
> > Regarding DSA support, IIUC we need to use dsa_pointer in inner nodes
> > to point to its child nodes, instead of C pointers (i.e., backend-local
> > address). I'm thinking of a straightforward approach as the first
> > step; inner nodes have a union of rt_node* and dsa_pointer and we
> > choose either one based on whether the radix tree is shared or not. We
> > allocate and free the shared memory for individual nodes by
> > dsa_allocate() and dsa_free(), respectively. Therefore we need to get
> > a C pointer from dsa_pointer by using dsa_get_address() while
> > descending the tree. I'm a bit concerned that calling
> > dsa_get_address() for every descent could be performance overhead but
> > I'm going to measure it anyway.
>
> Are dsa pointers aligned the same as pointers to locally allocated memory? Meaning, is the offset portion always a
> multiple of 4 (or 8)?

I think so.

> It seems that way from a glance, but I can't say for sure. If the lower 2 bits of a DSA pointer are never set, we
> can tag them the same way as a regular pointer. That same technique could help hide the latency of converting the
> pointer, the same way it would hide the latency of loading parts of a node into CPU registers.
>
> One concern is, handling both local and dsa cases in the same code requires more (predictable) branches and reduces
> code density. That might be a reason in favor of templating to handle each case in its own translation unit.

Right. We also need to support locking for shared radix tree, which
would require more branches.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Thu, Oct 6, 2022 at 2:53 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Oct 5, 2022 at 6:40 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> >
> > This wasn't the focus of your current email, but while experimenting with v6 I had another thought about local allocation: If we use the default slab block size of 8192 bytes, then only 3 chunks of size 2088 can fit, right? If so, since aset and DSA also waste at least a few hundred bytes, we could store a useless 256-byte slot array within node256. That way, node128 and node256 share the same start of pointers/values array, so there would be one less branch for getting that address. In v6, rt_node_get_values and rt_node_get_children are not inlined (asde: gcc uses a jump table for 5 kinds but not for 4), but possibly should be, and the smaller the better.
>
> It would be good for performance, but I'm a bit concerned that it's
> highly optimized to the design of aset and DSA. Since size 2088 is
> currently classed as 2616 in DSA, DSA wastes 528 bytes. However, if
> we introduce a new class of 2304 (= 2048 + 256) bytes, we cannot store a
> useless 256-byte array and the assumption will be broken.

A new DSA class is hypothetical. A better argument against my idea is that SLAB_DEFAULT_BLOCK_SIZE is arbitrary. FWIW, I looked at the prototype just now and the slab block sizes are:

Max(pg_nextpower2_32((MAXALIGN(inner_class_info[i].size) + 16) * 32), 1024)

...which would be 128kB for nodemax. I'm curious about the difference.
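
(As a rough check, taking the 2088-byte node256 from the earlier table as the largest class: MAXALIGN(2088) = 2088, (2088 + 16) * 32 = 67328, and pg_nextpower2_32 rounds that up to 131072 bytes, i.e. 128kB.)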

> > One concern is, handling both local and dsa cases in the same code requires more (predictable) branches and reduces code density. That might be a reason in favor of templating to handle each case in its own translation unit.
>
> Right. We also need to support locking for shared radix tree, which
> would require more branches.

Hmm, now it seems we'll likely want to template local vs. shared as a later step...

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> In addition to two patches, I've attached the third patch. It's not
> part of radix tree implementation but introduces a contrib module
> bench_radix_tree, a tool for radix tree performance benchmarking. It
> measures loading and lookup performance of both the radix tree and a
> flat array.

Hi Masahiko, I've been using these benchmarks, along with my own variations, to try various things that I've mentioned. I'm long overdue for an update, but the picture is not yet complete.

For now, I have two questions that I can't figure out on my own:

1. There seems to be some non-obvious limit on the number of keys that are loaded (or at least what the numbers report). This is independent of the number of tids per block. Example below:

john=# select * from bench_shuffle_search(0, 8*1000*1000);
NOTICE:  num_keys = 8000000, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 = 250000, n256 = 981
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 8000000 |        268435456 |            48000000 |        661 |            29 |          276 |             389

john=# select * from bench_shuffle_search(0, 9*1000*1000);
NOTICE:  num_keys = 8388608, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 = 262144, n256 = 1028
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 8388608 |        276824064 |            54000000 |        718 |            33 |          311 |             446

The array is the right size, but nkeys hasn't kept pace. Can you reproduce this? Attached is the patch I'm using to show the stats when running the test. (Side note: The numbers look unfavorable for radix tree because I'm using 1 tid per block here.)

2. I found that bench_shuffle_search() is much *faster* for traditional binary search on an array than bench_seq_search(). I've found this to be true in every case. This seems counterintuitive to me -- any idea why this is? Example:

john=# select * from bench_seq_search(0, 1000000);
NOTICE:  num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |         10199040 |           180000000 |        168 |           106 |          827 |            3348

john=# select * from bench_shuffle_search(0, 1000000);
NOTICE:  num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |         10199040 |           180000000 |        171 |           107 |          827 |            1400

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Oct 7, 2022 at 2:29 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > In addition to two patches, I've attached the third patch. It's not
> > part of radix tree implementation but introduces a contrib module
> > bench_radix_tree, a tool for radix tree performance benchmarking. It
> > measures loading and lookup performance of both the radix tree and a
> > flat array.
>
> Hi Masahiko, I've been using these benchmarks, along with my own variations, to try various things that I've
> mentioned. I'm long overdue for an update, but the picture is not yet complete.

Thanks!

> For now, I have two questions that I can't figure out on my own:
>
> 1. There seems to be some non-obvious limit on the number of keys that are loaded (or at least what the numbers
> report). This is independent of the number of tids per block. Example below:
>
> john=# select * from bench_shuffle_search(0, 8*1000*1000);
> NOTICE:  num_keys = 8000000, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 = 250000, n256 = 981
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> ---------+------------------+---------------------+------------+---------------+--------------+-----------------
>  8000000 |        268435456 |            48000000 |        661 |            29 |          276 |             389
>
> john=# select * from bench_shuffle_search(0, 9*1000*1000);
> NOTICE:  num_keys = 8388608, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 = 262144, n256 = 1028
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> ---------+------------------+---------------------+------------+---------------+--------------+-----------------
>  8388608 |        276824064 |            54000000 |        718 |            33 |          311 |             446
>
> The array is the right size, but nkeys hasn't kept pace. Can you reproduce this? Attached is the patch I'm using to
> show the stats when running the test. (Side note: The numbers look unfavorable for radix tree because I'm using 1
> tid per block here.)

Yes, I can reproduce this. In tid_to_key_off() we need to cast to
uint64 when packing offset number and block number:

   tid_i = ItemPointerGetOffsetNumber(tid);
   tid_i |= ItemPointerGetBlockNumber(tid) << shift;
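
That is, the block number needs to be widened before the shift, along the lines of:

   tid_i = ItemPointerGetOffsetNumber(tid);
   tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;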

>
> 2. I found that bench_shuffle_search() is much *faster* for traditional binary search on an array than
> bench_seq_search(). I've found this to be true in every case. This seems counterintuitive to me -- any idea why this
> is? Example:
>
> john=# select * from bench_seq_search(0, 1000000);
> NOTICE:  num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> ---------+------------------+---------------------+------------+---------------+--------------+-----------------
>  1000000 |         10199040 |           180000000 |        168 |           106 |          827 |            3348
>
> john=# select * from bench_shuffle_search(0, 1000000);
> NOTICE:  num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> ---------+------------------+---------------------+------------+---------------+--------------+-----------------
>  1000000 |         10199040 |           180000000 |        171 |           107 |          827 |            1400
>

Ugh, in shuffle_itemptrs(), we shuffled itemptrs instead of itemptr:

    for (int i = 0; i < nitems - 1; i++)
    {
        int j = shuffle_randrange(&state, i, nitems - 1);
        ItemPointerData t = itemptrs[j];

        itemptrs[j] = itemptrs[i];
        itemptrs[i] = t;
    }

With the fix, the results on my environment were:

postgres(1:4093192)=# select * from bench_seq_search(0, 10000000);
2022-10-07 16:57:03.124 JST [4093192] LOG:  num_keys = 10000000, height = 3, n4 = 0, n16 = 1, n32 = 312500, n128 = 0, n256 = 1226
  nkeys   | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
----------+------------------+---------------------+------------+---------------+--------------+-----------------
 10000000 |        101826560 |          1800000000 |        846 |           486 |         6096 |           21128
(1 row)

Time: 28975.566 ms (00:28.976)
postgres(1:4093192)=# select * from bench_shuffle_search(0, 10000000);
2022-10-07 16:57:37.476 JST [4093192] LOG:  num_keys = 10000000, height = 3, n4 = 0, n16 = 1, n32 = 312500, n128 = 0, n256 = 1226
  nkeys   | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
----------+------------------+---------------------+------------+---------------+--------------+-----------------
 10000000 |        101826560 |          1800000000 |        845 |           484 |        32700 |          152583
(1 row)

I've attached a patch to fix them. Also, I realized that bsearch()
could be optimized out so I added code to prevent it:

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
The following is not quite a full review, but has plenty to think about. There is too much to cover at once, and I have to start somewhere...

My main concerns are that internal APIs:

1. are difficult to follow
2. lead to poor branch prediction and too many function calls

Some of the measurements are picking on the SIMD search code, but I go into details in order to demonstrate how a regression there can go completely unnoticed. Hopefully the broader themes are informative.

On Fri, Oct 7, 2022 at 3:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> [fixed benchmarks]

Thanks for that! Now I can show clear results on some aspects in a simple way. The attached patches (apply on top of v6) are not intended to be incorporated as-is quite yet, but do point the way to some reorganization that I think is necessary. I've done some testing on loading, but will leave it out for now in the interest of length.


0001-0003 are your performance test fix and some small conveniences for testing. Binary search is turned off, for example, because we know it already. And the sleep call is so I can run perf in a different shell session, on only the search portion.

Note the v6 test loads all block numbers in the range. Since the test item ids are all below 64 (reasonable), there are always 32 leaf chunks, so all the leaves are node32 and completely full. This had the effect of never taking the byte-wise loop in the proposed pg_lsearch function. These two aspects make this an easy case for the branch predictor:

john=# select * from bench_seq_search(0, 1*1000*1000);
NOTICE:  num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE:  sleeping for 2 seconds...
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |         10199040 |           180000000 |        167 |             0 |          822 |               0

     1,470,141,841      branches:u                                                  
            63,693      branch-misses:u           #    0.00% of all branches  

john=# select * from bench_shuffle_search(0, 1*1000*1000);
NOTICE:  num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE:  sleeping for 2 seconds...
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |         10199040 |           180000000 |        168 |             0 |         2174 |               0

     1,470,142,569      branches:u                                                  
        15,023,983      branch-misses:u           #    1.02% of all branches


0004 randomizes block selection in the load part of the search test so that each block has a 50% chance of being loaded.  Note that now we have many node16s where we had none before. Although node 16 and node32 appear to share the same path in the switch statement of rt_node_search(), the chunk comparison and node_get_values() calls each must go through different branches. The shuffle case is most affected, but even the sequential case slows down. (The leaves are less full -> there are more of them, so memory use is larger, but it shouldn't matter much, in the sequential case at least)

john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE:  sleeping for 2 seconds...
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         14893056 |           179937720 |        173 |             0 |          907 |               0

     1,684,114,926      branches:u                                                  
         1,989,901      branch-misses:u           #    0.12% of all branches

john=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE:  sleeping for 2 seconds...
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         14893056 |           179937720 |        173 |             0 |         2890 |               0

     1,684,115,844      branches:u                                                  
        34,215,740      branch-misses:u           #    2.03% of all branches


0005 replaces pg_lsearch with a branch-free SIMD search. Note that it retains full portability and gains predictable performance. For demonstration, it's used on all three linear-search types. Although I'm sure it'd be way too slow for node4, this benchmark hardly has any so it's ok.

john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE:  sleeping for 2 seconds...
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         14893056 |           179937720 |        176 |             0 |          867 |               0

     1,469,540,357      branches:u                                                  
            96,678      branch-misses:u           #    0.01% of all branches  

john=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE:  sleeping for 2 seconds...
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         14893056 |           179937720 |        171 |             0 |         2530 |               0

     1,469,540,533      branches:u                                                  
        15,019,975      branch-misses:u           #    1.02% of all branches


0006 removes node16, and 0007 avoids a function call to introspect node type. 0006 is really to make 0007 simpler to code. The crucial point here is that calling out to rt_node_get_values/children() to figure out what type we are is costly. With these patches, searching an unevenly populated load is the same or faster than the original sequential load, despite taking twice as much memory. (And, as I've noted before, decoupling size class from node kind would win the memory back.)

john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE:  num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
NOTICE:  sleeping for 2 seconds...
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         20381696 |           179937720 |        171 |             0 |          717 |               0

     1,349,614,294      branches:u                                                  
             1,313      branch-misses:u           #    0.00% of all branches  

john=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE:  num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
NOTICE:  sleeping for 2 seconds...
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         20381696 |           179937720 |        172 |             0 |         2202 |               0

     1,349,614,741      branches:u                                                  
            30,592      branch-misses:u           #    0.00% of all branches  

Expanding this point, once a path branches based on node kind, there should be no reason to ever forget the kind. The abstractions in v6 have disadvantages. I understand the reasoning -- to reduce duplication of code. However, done this way, less code in the text editor leads to *more* code (i.e. costly function calls and branches) on the machine level.

I haven't looked at insert/load performance carefully, but it's clear it suffers from the same amnesia. prepare_node_for_insert() branches based on the kind. If it must call rt_node_grow(), that function has no idea where it came from and must branch again. When prepare_node_for_insert() returns we again have no idea what the kind is, so must branch again. And if we are one of the three linear-search nodes, we later do another function call, where we encounter a 5-way jump table because the caller could be anything at all.

Some of this could be worked around with always-inline functions to which we pass a const node kind, and let the compiler get rid of the branches etc. But many cases are probably not even worth doing that. For example, I don't think prepare_node_for_insert() is a useful abstraction to begin with. It returns an index, but only for linear nodes. Lookup nodes get a return value of zero. There is not enough commonality here.
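
To illustrate the always-inline idea (all names hypothetical): if each call site passes its kind as a constant, the compiler can delete the dead branches in every inlined copy:

static inline int
rt_node_find_slot(rt_node *node, const int kind, uint8 chunk)
{
    if (kind == RT_NODE_KIND_LINEAR)
        return rt_linear_search_chunk(node, chunk); /* hypothetical helper */
    else if (kind == RT_NODE_KIND_INDIRECT)
        return ((rt_node_indirect *) node)->slot_idx[chunk];
    else    /* RT_NODE_KIND_DIRECT: the chunk is the slot */
        return chunk;
}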

Along the same lines, there are a number of places that have branches as a consequence of treating inner nodes and leaves with the same api:

rt_node_iterate_next
chunk_array_node_get_slot
node_128/256_get_slot
rt_node_search

I'm leaning towards splitting these out into specialized functions for each inner and leaf. This is a bit painful for the last one, but perhaps if we are resigned to templating the shared-mem case, maybe we can template some of the inner/leaf stuff. Something to think about for later, but for now I believe we have to accept some code duplication as a prerequisite for decent performance as well as readability.

For the next steps, we need to proceed cautiously because there is a lot in the air at the moment. Here are some aspects I would find desirable. If there are impracticalities I haven't thought of, we can discuss further. I don't pretend to know the practical consequences of every change I mention.

- If you have started coding the shared memory case, I'd advise to continue so we can see what that looks like. If that has not gotten beyond the design stage, I'd like to first see an attempt at tearing down some of the clumsier abstractions in the current patch.
- As a "smoke test", there should ideally be nothing as general as rt_node_get_children/values(). We should ideally always know what kind we are if we found out earlier.
- For distinguishing between linear nodes, perhaps some always-inline functions can help hide details. But at the same time, trying to treat them the same is not always worthwhile.
- Start to separate treatment of inner/leaves and see how it goes.
- I firmly believe we only need 4 node *kinds*, and later we can decouple the size classes as a separate concept. I'm willing to put serious time into that once the broad details are right. I will also investigate pointer tagging if we can confirm that can work similarly for dsa pointers.

Regarding size class decoupling, I'll respond to a point made earlier:

On Fri, Sep 30, 2022 at 10:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> With this idea, we can just repalloc() to grow to the larger size in a
> pair but I'm slightly concerned that the more size class we use, the
> more frequent the node needs to grow.

Well, yes, but that's orthogonal. For example, v6 has 5 node kinds. Imagine that we have 4 node kinds, but the SIMD node kind used 2 size classes. Then the nodes would grow at *exactly* the same frequency as they do today. I listed many ways a size class could fit into a power-of-two (and there are more), but we have a choice in how many to actually use. It's a trade off between memory usage and complexity.

> If we want to support node
> shrink, the deletion is also affected.

Not necessarily. We don't have to shrink at the same granularity as growing. My evidence is simple: we don't shrink at all now. :-)

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Mon, Oct 10, 2022 at 12:16 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> Thanks for that! Now I can show clear results on some aspects in a simple way. The attached patches (apply on top of v6) 

Forgot the patchset...

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
Hi,

On Mon, Oct 10, 2022 at 2:16 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> The following is not quite a full review, but has plenty to think about. There is too much to cover at once, and I
> have to start somewhere...
>
> My main concerns are that internal APIs:
>
> 1. are difficult to follow
> 2. lead to poor branch prediction and too many function calls
>
> Some of the measurements are picking on the SIMD search code, but I go into details in order to demonstrate how a
> regression there can go completely unnoticed. Hopefully the broader themes are informative.
>
> On Fri, Oct 7, 2022 at 3:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > [fixed benchmarks]
>
> Thanks for that! Now I can show clear results on some aspects in a simple way. The attached patches (apply on top of
> v6) are not intended to be incorporated as-is quite yet, but do point the way to some reorganization that I think is
> necessary. I've done some testing on loading, but will leave it out for now in the interest of length.
>
>
> 0001-0003 are your performance test fix and some small conveniences for testing. Binary search is turned off, for
> example, because we know it already. And the sleep call is so I can run perf in a different shell session, on only
> the search portion.
>
> Note the v6 test loads all block numbers in the range. Since the test item ids are all below 64 (reasonable), there
> are always 32 leaf chunks, so all the leaves are node32 and completely full. This had the effect of never taking the
> byte-wise loop in the proposed pg_lsearch function. These two aspects make this an easy case for the branch
> predictor:
>
> john=# select * from bench_seq_search(0, 1*1000*1000);
> NOTICE:  num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
> NOTICE:  sleeping for 2 seconds...
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> ---------+------------------+---------------------+------------+---------------+--------------+-----------------
>  1000000 |         10199040 |           180000000 |        167 |             0 |          822 |               0
>
>      1,470,141,841      branches:u
>             63,693      branch-misses:u           #    0.00% of all branches
>
> john=# select * from bench_shuffle_search(0, 1*1000*1000);
> NOTICE:  num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
> NOTICE:  sleeping for 2 seconds...
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> ---------+------------------+---------------------+------------+---------------+--------------+-----------------
>  1000000 |         10199040 |           180000000 |        168 |             0 |         2174 |               0
>
>      1,470,142,569      branches:u
>         15,023,983      branch-misses:u           #    1.02% of all branches
>
>
> 0004 randomizes block selection in the load part of the search test so that each block has a 50% chance of being
> loaded. Note that now we have many node16s where we had none before. Although node 16 and node32 appear to share the
> same path in the switch statement of rt_node_search(), the chunk comparison and node_get_values() calls each must go
> through different branches. The shuffle case is most affected, but even the sequential case slows down. (The leaves
> are less full -> there are more of them, so memory use is larger, but it shouldn't matter much, in the sequential
> case at least)
>
> john=# select * from bench_seq_search(0, 2*1000*1000);
> NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
> NOTICE:  sleeping for 2 seconds...
>  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> --------+------------------+---------------------+------------+---------------+--------------+-----------------
>  999654 |         14893056 |           179937720 |        173 |             0 |          907 |               0
>
>      1,684,114,926      branches:u
>          1,989,901      branch-misses:u           #    0.12% of all branches
>
> john=# select * from bench_shuffle_search(0, 2*1000*1000);
> NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
> NOTICE:  sleeping for 2 seconds...
>  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> --------+------------------+---------------------+------------+---------------+--------------+-----------------
>  999654 |         14893056 |           179937720 |        173 |             0 |         2890 |               0
>
>      1,684,115,844      branches:u
>         34,215,740      branch-misses:u           #    2.03% of all branches
>
>
> 0005 replaces pg_lsearch with a branch-free SIMD search. Note that it retains full portability and gains predictable
> performance. For demonstration, it's used on all three linear-search types. Although I'm sure it'd be way too slow
> for node4, this benchmark hardly has any so it's ok.
>
> john=# select * from bench_seq_search(0, 2*1000*1000);
> NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
> NOTICE:  sleeping for 2 seconds...
>  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> --------+------------------+---------------------+------------+---------------+--------------+-----------------
>  999654 |         14893056 |           179937720 |        176 |             0 |          867 |               0
>
>      1,469,540,357      branches:u
>             96,678      branch-misses:u           #    0.01% of all branches
>
> john=# select * from bench_shuffle_search(0, 2*1000*1000);
> NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
> NOTICE:  sleeping for 2 seconds...
>  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> --------+------------------+---------------------+------------+---------------+--------------+-----------------
>  999654 |         14893056 |           179937720 |        171 |             0 |         2530 |               0
>
>      1,469,540,533      branches:u
>         15,019,975      branch-misses:u           #    1.02% of all branches
>
>
> 0006 removes node16, and 0007 avoids a function call to introspect node type. 0006 is really to make 0007 simpler to code. The crucial point here is that calling out to rt_node_get_values/children() to figure out what type we are is costly. With these patches, searching an unevenly populated load is the same or faster than the original sequential load, despite taking twice as much memory. (And, as I've noted before, decoupling size class from node kind would win the memory back.)
>
> john=# select * from bench_seq_search(0, 2*1000*1000);
> NOTICE:  num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
> NOTICE:  sleeping for 2 seconds...
>  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> --------+------------------+---------------------+------------+---------------+--------------+-----------------
>  999654 |         20381696 |           179937720 |        171 |             0 |          717 |               0
>
>      1,349,614,294      branches:u
>              1,313      branch-misses:u           #    0.00% of all branches
>
> john=# select * from bench_shuffle_search(0, 2*1000*1000);
> NOTICE:  num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
> NOTICE:  sleeping for 2 seconds...
>  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> --------+------------------+---------------------+------------+---------------+--------------+-----------------
>  999654 |         20381696 |           179937720 |        172 |             0 |         2202 |               0
>
>      1,349,614,741      branches:u
>             30,592      branch-misses:u           #    0.00% of all branches
>
> Expanding this point, once a path branches based on node kind, there should be no reason to ever forget the kind. The abstractions in v6 have disadvantages. I understand the reasoning -- to reduce duplication of code. However, done this way, less code in the text editor leads to *more* code (i.e. costly function calls and branches) on the machine level.

Right. When updating the patch from v4 to v5, I eliminated the
duplication of code between each node type as much as possible, which
in turn produced more code on the machine level. The results of your
experiment clearly showed the bad side of this work. FWIW I've also
confirmed your changes in my environment (I've added a third argument
to turn on and off the randomized block selection proposed in the
0004 patch):

* w/o patches
postgres(1:361692)=# select * from bench_seq_search(0, 1 * 1000 * 1000, false);
2022-10-14 11:33:15.460 JST [361692] LOG:  num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE:  sleeping for 2 seconds...
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |         10199040 |           180000000 |         87 |               |          462 |
(1 row)

1590104944      branches:u                #    3.430 G/sec
          65957      branch-misses:u           #    0.00% of all branches

postgres(1:361692)=# select * from bench_seq_search(0, 2 * 1000 * 1000, true);
2022-10-14 11:33:28.934 JST [361692] LOG:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE:  sleeping for 2 seconds...
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         14893056 |           179937720 |         91 |               |          497 |
(1 row)

1748249456      branches:u                #    3.506 G/sec
        481074      branch-misses:u           #    0.03% of all branches

postgres(1:361692)=# select * from bench_shuffle_search(0, 1 * 1000 * 1000, false);
2022-10-14 11:33:38.378 JST [361692] LOG:  num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE:  sleeping for 2 seconds...
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |         10199040 |           180000000 |         86 |               |         1290 |
(1 row)

1590105370      branches:u                #    1.231 G/sec
   15039443      branch-misses:u           #    0.95% of all branches

Time: 4166.346 ms (00:04.166)
postgres(1:361692)=# select * from bench_shuffle_search(0, 2 * 1000 * 1000, true);
2022-10-14 11:33:51.556 JST [361692] LOG:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE:  sleeping for 2 seconds...
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         14893056 |           179937720 |         90 |               |         1536 |
(1 row)

1748250497      branches:u                #    1.137 G/sec
    28125016      branch-misses:u           #    1.61% of all branches

* w/ all patches
postgres(1:360358)=# select * from bench_seq_search(0, 1 * 1000 * 1000, false);
2022-10-14 11:29:27.232 JST [360358] LOG:  num_keys = 1000000, height = 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE:  sleeping for 2 seconds...
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |         10199040 |           180000000 |         81 |               |          432 |
(1 row)

1380062209      branches:u                #    3.185 G/sec
            1066      branch-misses:u           #    0.00% of all branches

postgres(1:360358)=# select * from bench_seq_search(0, 2 * 1000 * 1000, true);
2022-10-14 11:29:46.380 JST [360358] LOG:  num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
NOTICE:  sleeping for 2 seconds...
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         20381696 |           179937720 |         88 |               |          438 |
(1 row)

1379640815      branches:u                #    3.133 G/sec
           1332      branch-misses:u           #    0.00% of all branches

postgres(1:360358)=# select * from bench_shuffle_search(0, 1 * 1000 * 1000, false);
2022-10-14 11:30:00.943 JST [360358] LOG:  num_keys = 1000000, height = 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE:  sleeping for 2 seconds...
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |         10199040 |           180000000 |         81 |               |          994 |
(1 row)

1380062386      branches:u                #    1.386 G/sec
          18368      branch-misses:u           #    0.00% of all branches

postgres(1:360358)=# select * from bench_shuffle_search(0, 2 * 1000 * 1000, true);
2022-10-14 11:30:15.944 JST [360358] LOG:  num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
NOTICE:  sleeping for 2 seconds...
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         20381696 |           179937720 |         88 |               |         1098 |
(1 row)

1379641503      branches:u                #    1.254 G/sec
          18973      branch-misses:u           #    0.00% of all branches

> I haven't looked at insert/load performance carefully, but it's clear it suffers from the same amnesia. prepare_node_for_insert() branches based on the kind. If it must call rt_node_grow(), that function has no idea where it came from and must branch again. When prepare_node_for_insert() returns we again have no idea what the kind is, so must branch again. And if we are one of the three linear-search nodes, we later do another function call, where we encounter a 5-way jump table because the caller could be anything at all.
>
> Some of this could be worked around with always-inline functions to which we pass a const node kind, and let the compiler get rid of the branches etc. But many cases are probably not even worth doing that. For example, I don't think prepare_node_for_insert() is a useful abstraction to begin with. It returns an index, but only for linear nodes. Lookup nodes get a return value of zero. There is not enough commonality here.

Agreed.

>
> Along the same lines, there are a number of places that have branches as a consequence of treating inner nodes and leaves with the same api:
>
> rt_node_iterate_next
> chunk_array_node_get_slot
> node_128/256_get_slot
> rt_node_search
>
> I'm leaning towards splitting these out into specialized functions for each inner and leaf. This is a bit painful for the last one, but perhaps if we are resigned to templating the shared-mem case, maybe we can template some of the inner/leaf stuff. Something to think about for later, but for now I believe we have to accept some code duplication as a prerequisite for decent performance as well as readability.

Agreed.

>
> For the next steps, we need to proceed cautiously because there is a lot in the air at the moment. Here are some aspects I would find desirable. If there are impracticalities I haven't thought of, we can discuss further. I don't pretend to know the practical consequences of every change I mention.
>
> - If you have started coding the shared memory case, I'd advise to continue so we can see what that looks like. If that has not gotten beyond the design stage, I'd like to first see an attempt at tearing down some of the clumsier abstractions in the current patch.
> - As a "smoke test", there should ideally be nothing as general as rt_node_get_children/values(). We should ideally always know what kind we are if we found out earlier.
> - For distinguishing between linear nodes, perhaps some always-inline functions can help hide details. But at the same time, trying to treat them the same is not always worthwhile.
> - Start to separate treatment of inner/leaves and see how it goes.

Since I've not started coding the shared memory case seriously, I'm
going to start with eliminating abstractions and splitting the
treatment of inner and leaf nodes.
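
To make the direction concrete, here is a rough sketch of the split I
have in mind (the signatures are illustrative, not final):

/* inner nodes hand back a child pointer, leaf nodes a value */
static bool rt_node_search_inner(rt_node *node, uint64 key,
                                 rt_node **child_p);
static bool rt_node_search_leaf(rt_node *node, uint64 key,
                                uint64 *value_p);

This way, neither path needs to check at runtime whether the slots
hold child pointers or values.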

> - I firmly believe we only need 4 node *kinds*, and later we can decouple the size classes as a separate concept. I'm willing to put serious time into that once the broad details are right. I will also investigate pointer tagging if we can confirm that can work similarly for dsa pointers.

I'll keep 4 node kinds, and we can later try to introduce size classes
into each node kind.

>
> Regarding size class decoupling, I'll respond to a point made earlier:
>
> On Fri, Sep 30, 2022 at 10:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > With this idea, we can just repalloc() to grow to the larger size in a
> > pair, but I'm slightly concerned that the more size classes we use, the
> > more frequently the node needs to grow.
>
> Well, yes, but that's orthogonal. For example, v6 has 5 node kinds. Imagine that we have 4 node kinds, but the SIMD node kind used 2 size classes. Then the nodes would grow at *exactly* the same frequency as they do today. I listed many ways a size class could fit into a power-of-two (and there are more), but we have a choice in how many to actually use. It's a trade-off between memory usage and complexity.

Agreed.
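
To make sure I understand the idea, decoupled size classes could look
something like the following (purely illustrative; the names and
capacities here are made up):

typedef struct rt_size_class_elem
{
    uint8       node_kind;  /* which search code handles this class */
    int         fanout;     /* capacity, compared directly with count */
} rt_size_class_elem;

static const rt_size_class_elem rt_size_class_info[] = {
    {RT_NODE_KIND_4, 4},
    {RT_NODE_KIND_32, 16},      /* smaller class of the same kind */
    {RT_NODE_KIND_32, 32},      /* growing 16 -> 32 is just a repalloc() */
    {RT_NODE_KIND_128, 128},
    {RT_NODE_KIND_256, 256},
};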

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Oct 14, 2022 at 4:12 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi,
>
> On Mon, Oct 10, 2022 at 2:16 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > The following is not quite a full review, but has plenty to think about. There is too much to cover at once, and I have to start somewhere...
> >
> > My main concerns are that internal APIs:
> >
> > 1. are difficult to follow
> > 2. lead to poor branch prediction and too many function calls
> >
> > Some of the measurements are picking on the SIMD search code, but I go into details in order to demonstrate how a regression there can go completely unnoticed. Hopefully the broader themes are informative.
> >
> > On Fri, Oct 7, 2022 at 3:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > [fixed benchmarks]
> >
> > Thanks for that! Now I can show clear results on some aspects in a simple way. The attached patches (apply on top of v6) are not intended to be incorporated as-is quite yet, but do point the way to some reorganization that I think is necessary. I've done some testing on loading, but will leave it out for now in the interest of length.
> >
> >
> > 0001-0003 are your performance test fix and some small conveniences for testing. Binary search is turned off, for example, because we know it already. And the sleep call is so I can run perf in a different shell session, on only the search portion.
> >
> > Note the v6 test loads all block numbers in the range. Since the test item ids are all below 64 (reasonable), there are always 32 leaf chunks, so all the leaves are node32 and completely full. This had the effect of never taking the byte-wise loop in the proposed pg_lsearch function. These two aspects make this an easy case for the branch predictor:
> >
> > john=# select * from bench_seq_search(0, 1*1000*1000);
> > NOTICE:  num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
> > NOTICE:  sleeping for 2 seconds...
> >   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> > ---------+------------------+---------------------+------------+---------------+--------------+-----------------
> >  1000000 |         10199040 |           180000000 |        167 |             0 |          822 |               0
> >
> >      1,470,141,841      branches:u
> >             63,693      branch-misses:u           #    0.00% of all branches
> >
> > john=# select * from bench_shuffle_search(0, 1*1000*1000);
> > NOTICE:  num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
> > NOTICE:  sleeping for 2 seconds...
> >   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> > ---------+------------------+---------------------+------------+---------------+--------------+-----------------
> >  1000000 |         10199040 |           180000000 |        168 |             0 |         2174 |               0
> >
> >      1,470,142,569      branches:u
> >         15,023,983      branch-misses:u           #    1.02% of all branches
> >
> >
> > 0004 randomizes block selection in the load part of the search test so that each block has a 50% chance of being loaded. Note that now we have many node16s where we had none before. Although node16 and node32 appear to share the same path in the switch statement of rt_node_search(), the chunk comparison and node_get_values() calls each must go through different branches. The shuffle case is most affected, but even the sequential case slows down. (The leaves are less full -> there are more of them, so memory use is larger, but it shouldn't matter much, in the sequential case at least.)
> >
> > john=# select * from bench_seq_search(0, 2*1000*1000);
> > NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
> > NOTICE:  sleeping for 2 seconds...
> >  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> > --------+------------------+---------------------+------------+---------------+--------------+-----------------
> >  999654 |         14893056 |           179937720 |        173 |             0 |          907 |               0
> >
> >      1,684,114,926      branches:u
> >          1,989,901      branch-misses:u           #    0.12% of all branches
> >
> > john=# select * from bench_shuffle_search(0, 2*1000*1000);
> > NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
> > NOTICE:  sleeping for 2 seconds...
> >  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> > --------+------------------+---------------------+------------+---------------+--------------+-----------------
> >  999654 |         14893056 |           179937720 |        173 |             0 |         2890 |               0
> >
> >      1,684,115,844      branches:u
> >         34,215,740      branch-misses:u           #    2.03% of all branches
> >
> >
> > 0005 replaces pg_lsearch with a branch-free SIMD search. Note that it retains full portability and gains predictable performance. For demonstration, it's used on all three linear-search types. Although I'm sure it'd be way too slow for node4, this benchmark hardly has any so it's ok.
> >
> > john=# select * from bench_seq_search(0, 2*1000*1000);
> > NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
> > NOTICE:  sleeping for 2 seconds...
> >  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> > --------+------------------+---------------------+------------+---------------+--------------+-----------------
> >  999654 |         14893056 |           179937720 |        176 |             0 |          867 |               0
> >
> >      1,469,540,357      branches:u
> >             96,678      branch-misses:u           #    0.01% of all branches
> >
> > john=# select * from bench_shuffle_search(0, 2*1000*1000);
> > NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
> > NOTICE:  sleeping for 2 seconds...
> >  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> > --------+------------------+---------------------+------------+---------------+--------------+-----------------
> >  999654 |         14893056 |           179937720 |        171 |             0 |         2530 |               0
> >
> >      1,469,540,533      branches:u
> >         15,019,975      branch-misses:u           #    1.02% of all branches
> >
> >
> > 0006 removes node16, and 0007 avoids a function call to introspect node type. 0006 is really to make 0007 simpler to code. The crucial point here is that calling out to rt_node_get_values/children() to figure out what type we are is costly. With these patches, searching an unevenly populated load is the same or faster than the original sequential load, despite taking twice as much memory. (And, as I've noted before, decoupling size class from node kind would win the memory back.)
> >
> > john=# select * from bench_seq_search(0, 2*1000*1000);
> > NOTICE:  num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
> > NOTICE:  sleeping for 2 seconds...
> >  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> > --------+------------------+---------------------+------------+---------------+--------------+-----------------
> >  999654 |         20381696 |           179937720 |        171 |             0 |          717 |               0
> >
> >      1,349,614,294      branches:u
> >              1,313      branch-misses:u           #    0.00% of all branches
> >
> > john=# select * from bench_shuffle_search(0, 2*1000*1000);
> > NOTICE:  num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
> > NOTICE:  sleeping for 2 seconds...
> >  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> > --------+------------------+---------------------+------------+---------------+--------------+-----------------
> >  999654 |         20381696 |           179937720 |        172 |             0 |         2202 |               0
> >
> >      1,349,614,741      branches:u
> >             30,592      branch-misses:u           #    0.00% of all branches
> >
> > Expanding this point, once a path branches based on node kind, there should be no reason to ever forget the kind. The abstractions in v6 have disadvantages. I understand the reasoning -- to reduce duplication of code. However, done this way, less code in the text editor leads to *more* code (i.e. costly function calls and branches) on the machine level.
>
> Right. When updating the patch from v4 to v5, I eliminated the
> duplication of code between each node type as much as possible, which
> in turn produced more code on the machine level. The results of your
> experiment clearly showed the bad side of this work. FWIW I've also
> confirmed your changes in my environment (I've added a third argument
> to turn on and off the randomized block selection proposed in the
> 0004 patch):
>
> * w/o patches
> postgres(1:361692)=# select * from bench_seq_search(0, 1 * 1000 * 1000, false);
> 2022-10-14 11:33:15.460 JST [361692] LOG:  num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
> NOTICE:  sleeping for 2 seconds...
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> ---------+------------------+---------------------+------------+---------------+--------------+-----------------
>  1000000 |         10199040 |           180000000 |         87 |               |          462 |
> (1 row)
>
> 1590104944      branches:u                #    3.430 G/sec
>           65957      branch-misses:u           #    0.00% of all branches
>
> postgres(1:361692)=# select * from bench_seq_search(0, 2 * 1000 * 1000, true);
> 2022-10-14 11:33:28.934 JST [361692] LOG:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
> NOTICE:  sleeping for 2 seconds...
>  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> --------+------------------+---------------------+------------+---------------+--------------+-----------------
>  999654 |         14893056 |           179937720 |         91 |               |          497 |
> (1 row)
>
> 1748249456      branches:u                #    3.506 G/sec
>         481074      branch-misses:u           #    0.03% of all branches
>
> postgres(1:361692)=# select * from bench_shuffle_search(0, 1 * 1000 * 1000, false);
> 2022-10-14 11:33:38.378 JST [361692] LOG:  num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
> NOTICE:  sleeping for 2 seconds...
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> ---------+------------------+---------------------+------------+---------------+--------------+-----------------
>  1000000 |         10199040 |           180000000 |         86 |               |         1290 |
> (1 row)
>
> 1590105370      branches:u                #    1.231 G/sec
>    15039443      branch-misses:u           #    0.95% of all branches
>
> Time: 4166.346 ms (00:04.166)
> postgres(1:361692)=# select * from bench_shuffle_search(0, 2 * 1000 * 1000, true);
> 2022-10-14 11:33:51.556 JST [361692] LOG:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
> NOTICE:  sleeping for 2 seconds...
>  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> --------+------------------+---------------------+------------+---------------+--------------+-----------------
>  999654 |         14893056 |           179937720 |         90 |               |         1536 |
> (1 row)
>
> 1748250497      branches:u                #    1.137 G/sec
>     28125016      branch-misses:u           #    1.61% of all branches
>
> * w/ all patches
> postgres(1:360358)=# select * from bench_seq_search(0, 1 * 1000 * 1000, false);
> 2022-10-14 11:29:27.232 JST [360358] LOG:  num_keys = 1000000, height = 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
> NOTICE:  sleeping for 2 seconds...
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> ---------+------------------+---------------------+------------+---------------+--------------+-----------------
>  1000000 |         10199040 |           180000000 |         81 |               |          432 |
> (1 row)
>
> 1380062209      branches:u                #    3.185 G/sec
>             1066      branch-misses:u           #    0.00% of all branches
>
> postgres(1:360358)=# select * from bench_seq_search(0, 2 * 1000 * 1000, true);
> 2022-10-14 11:29:46.380 JST [360358] LOG:  num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
> NOTICE:  sleeping for 2 seconds...
>  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> --------+------------------+---------------------+------------+---------------+--------------+-----------------
>  999654 |         20381696 |           179937720 |         88 |               |          438 |
> (1 row)
>
> 1379640815      branches:u                #    3.133 G/sec
>            1332      branch-misses:u           #    0.00% of all branches
>
> postgres(1:360358)=# select * from bench_shuffle_search(0, 1 * 1000 * 1000, false);
> 2022-10-14 11:30:00.943 JST [360358] LOG:  num_keys = 1000000, height = 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
> NOTICE:  sleeping for 2 seconds...
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> ---------+------------------+---------------------+------------+---------------+--------------+-----------------
>  1000000 |         10199040 |           180000000 |         81 |               |          994 |
> (1 row)
>
> 1380062386      branches:u                #    1.386 G/sec
>           18368      branch-misses:u           #    0.00% of all branches
>
> postgres(1:360358)=# select * from bench_shuffle_search(0, 2 * 1000 * 1000, true);
> 2022-10-14 11:30:15.944 JST [360358] LOG:  num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
> NOTICE:  sleeping for 2 seconds...
>  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> --------+------------------+---------------------+------------+---------------+--------------+-----------------
>  999654 |         20381696 |           179937720 |         88 |               |         1098 |
> (1 row)
>
> 1379641503      branches:u                #    1.254 G/sec
>           18973      branch-misses:u           #    0.00% of all branches
>
> > I haven't looked at insert/load performance carefully, but it's clear it suffers from the same amnesia. prepare_node_for_insert() branches based on the kind. If it must call rt_node_grow(), that function has no idea where it came from and must branch again. When prepare_node_for_insert() returns we again have no idea what the kind is, so must branch again. And if we are one of the three linear-search nodes, we later do another function call, where we encounter a 5-way jump table because the caller could be anything at all.
> >
> > Some of this could be worked around with always-inline functions to which we pass a const node kind, and let the compiler get rid of the branches etc. But many cases are probably not even worth doing that. For example, I don't think prepare_node_for_insert() is a useful abstraction to begin with. It returns an index, but only for linear nodes. Lookup nodes get a return value of zero. There is not enough commonality here.
>
> Agreed.
>
> >
> > Along the same lines, there are a number of places that have branches as a consequence of treating inner nodes and leaves with the same api:
> >
> > rt_node_iterate_next
> > chunk_array_node_get_slot
> > node_128/256_get_slot
> > rt_node_search
> >
> > I'm leaning towards splitting these out into specialized functions for each inner and leaf. This is a bit painful for the last one, but perhaps if we are resigned to templating the shared-mem case, maybe we can template some of the inner/leaf stuff. Something to think about for later, but for now I believe we have to accept some code duplication as a prerequisite for decent performance as well as readability.
>
> Agreed.
>
> >
> > For the next steps, we need to proceed cautiously because there is a lot in the air at the moment. Here are some aspects I would find desirable. If there are impracticalities I haven't thought of, we can discuss further. I don't pretend to know the practical consequences of every change I mention.
> >
> > - If you have started coding the shared memory case, I'd advise to continue so we can see what that looks like. If that has not gotten beyond the design stage, I'd like to first see an attempt at tearing down some of the clumsier abstractions in the current patch.
> > - As a "smoke test", there should ideally be nothing as general as rt_node_get_children/values(). We should ideally always know what kind we are if we found out earlier.
> > - For distinguishing between linear nodes, perhaps some always-inline functions can help hide details. But at the same time, trying to treat them the same is not always worthwhile.
> > - Start to separate treatment of inner/leaves and see how it goes.
>
> Since I've not started coding the shared memory case seriously, I'm
> going to start with eliminating abstractions and splitting the
> treatment of inner and leaf nodes.

I've attached updated PoC patches for discussion and cfbot. From the
previous version, I mainly changed the following things:

* Separate treatment of inner and leaf nodes
* Pack both the node kind and node count into a uint16 value.
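
For illustration, the packing looks roughly like this ("meta" is a
placeholder name for the packed field, not the actual one in the
patch):

#define RT_NODE_KIND_MASK       0x00FF
#define RT_NODE_COUNT_SHIFT     8

#define NODE_GET_KIND(n)    ((uint8) ((n)->meta & RT_NODE_KIND_MASK))
#define NODE_GET_COUNT(n)   ((int) ((n)->meta >> RT_NODE_COUNT_SHIFT))

that is, the kind lives in the low byte and the count in the high
byte.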

I've also made a change to the functions in the bench_radix_tree test
module: the third argument of bench_seq/shuffle_search() is a flag to
turn on and off the randomized block selection. The results of
performance tests in my environment are:

postgres(1:1665989)=# select * from bench_seq_search(0, 1* 1000 * 1000, false);
2022-10-24 14:29:40.705 JST [1665989] LOG:  num_keys = 1000000, height = 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |          9871104 |           180000000 |         65 |               |          248 |
(1 row)

postgres(1:1665989)=# select * from bench_seq_search(0, 2* 1000 * 1000, true);
2022-10-24 14:29:47.999 JST [1665989] LOG:  num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         19680736 |           179937720 |         71 |               |          237 |
(1 row)

postgres(1:1665989)=# select * from bench_shuffle_search(0, 1 * 1000 * 1000, false);
2022-10-24 14:29:55.955 JST [1665989] LOG:  num_keys = 1000000, height = 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |          9871104 |           180000000 |         65 |               |          641 |
(1 row)

postgres(1:1665989)=# select * from bench_shuffle_search(0, 2 * 1000 * 1000, true);
2022-10-24 14:30:04.140 JST [1665989] LOG:  num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         19680736 |           179937720 |         71 |               |          654 |
(1 row)

I've not done the SIMD part seriously yet. But overall the performance
seems good so far. If we agree with the current approach, I think we
can proceed with the verification of decoupling node sizes from node
kind. And I'll investigate DSA support.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Oct 24, 2022 at 12:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> I've attached updated PoC patches for discussion and cfbot. From the
> previous version, I mainly changed the following things:
>
> * Separate treatment of inner and leaf nodes

Overall, this looks much better!

> * Pack both the node kind and node count into a uint16 value.

For this, I did mention a bitfield earlier as something we "could" do, but it wasn't clear we should. After looking again at the node types, I must not have thought through this at all. Storing one byte instead of four for the full enum is a good step, but saving one more byte usually doesn't buy anything because of padding, with a few exceptions like this example:

node4:   4 +  4           +  4*8 =   40
node4:   5 +  4+(7)       +  4*8 =   48 bytes

Even there, I'd rather not spend the extra cycles to access the members. And with my idea of decoupling size classes from kind, the variable-sized kinds will require another byte to store "capacity". Then, even if the kind gets encoded in a pointer tag, we'll still have 5 bytes in the base type. So I think we should assume 5 bytes from the start. (Might be 6 temporarily if I work on size decoupling first).

(Side note, if you have occasion to use bitfields again in the future, C99 has syntactic support for them, so no need to write your own shifting/masking code).
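
For example, something like this (the field names are hypothetical):

typedef struct rt_node_meta
{
    unsigned int kind:8,        /* node kind */
                 count:8;       /* number of children */
} rt_node_meta;

and the compiler emits all the masking for you.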

> I've not done the SIMD part seriously yet. But overall the performance
> seems good so far. If we agree with the current approach, I think we
> can proceed with the verification of decoupling node sizes from node
> kind. And I'll investigate DSA support.

Sounds good. I have some additional comments about v7, and after these are addressed, we can proceed independently with the above two items. Seeing the DSA work will also inform me how invasive pointer tagging will be. There will still be some performance tuning and cosmetic work, but it's getting closer.

-------------------------
0001:

+#ifndef USE_NO_SIMD
+#include "port/pg_bitutils.h"
+#endif

Leftover from an earlier version?

+static inline int vector8_find(const Vector8 v, const uint8 c);
+static inline int vector8_find_ge(const Vector8 v, const uint8 c);

Leftovers, causing compiler warnings. (Also see new variable shadow warning)

+#else /* USE_NO_SIMD */
+ Vector8 r = 0;
+ uint8 *rp = (uint8 *) &r;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ rp[i] = Min(((const uint8 *) &v1)[i], ((const uint8 *) &v2)[i]);
+
+ return r;
+#endif

As I mentioned a couple versions ago, this style is really awkward, and potential non-SIMD callers will be better off writing their own byte-wise loop rather than using this API. Especially since the "min" function exists only as a workaround for lack of unsigned comparison in (at least) SSE2. There is one existing function in this file with that idiom for non-assert code (for completeness), but even there, inputs of current interest to us use the uint64 algorithm.

0002:

+ /* XXX: should not to use vector8_highbit_mask */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));

Hmm?

+/*
+ * Return index of the first element in chunks in the given node that is greater
+ * than or equal to 'key'.  Return -1 if there is no such element.
+ */
+static inline int
+node_32_search_ge(rt_node_base_32 *node, uint8 chunk)

The caller must now have logic for inserting at the end:

+ int insertpos = node_32_search_ge((rt_node_base_32 *) n32, chunk);
+ int16 count = NODE_GET_COUNT(n32);
+
+ if (insertpos < 0)
+ insertpos = count; /* insert to the tail */

It would be a bit more clear if node_*_search_ge() always returns the position we need (see the prototype for example). In fact, these functions are probably better named node*_get_insertpos().
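
A scalar sketch of what I mean (assuming a chunks[] array as in the
patch):

static inline int
node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
{
    int         count = NODE_GET_COUNT(node);
    int         idx;

    /* index of the first chunk >= the search key, or count for the tail */
    for (idx = 0; idx < count; idx++)
    {
        if (node->chunks[idx] >= chunk)
            break;
    }
    return idx;
}

Then the caller can use the return value unconditionally.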

+ if (likely(NODE_HAS_FREE_SLOT(n128)))
+ {
+ node_inner_128_insert(n128, chunk, child);
+ break;
+ }
+
+ /* grow node from 128 to 256 */

We want all the node-growing code to be pushed down to the bottom so that all branches of the hot path are close together. This provides better locality for the CPU frontend. Looking at the assembly, the above doesn't have the desired effect, so we need to write like this (also see prototype):

if (unlikely( ! has-free-slot))
  grow-node;
else
{
  ...;
  break;
}
/* FALLTHROUGH */

+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+   rt_node    *child;
+
+   if (NODE_IS_LEAF(node))
+     break;
+
+   if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+     child = rt_node_add_new_child(tree, parent, node, key);
+
+   Assert(child);
+
+   parent = node;
+   node = child;
+   shift -= RT_NODE_SPAN;
+ }

Note that if we have to call rt_node_add_new_child(), each successive loop iteration must search it and find nothing there (the prototype had a separate function to handle this). Maybe it's not that critical yet, but something to keep in mind as we proceed. Maybe a comment about it to remind us.

+ /* there is no key to delete */
+ if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, NULL))
+   return false;
+
+ /* Update the statistics */
+ tree->num_keys--;
+
+ /*
+  * Delete the key from the leaf node and recursively delete the key in
+  * inner nodes if necessary.
+  */
+ Assert(NODE_IS_LEAF(stack[level]));
+ while (level >= 0)
+ {
+   rt_node    *node = stack[level--];
+
+   if (NODE_IS_LEAF(node))
+     rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+   else
+     rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+
+   /* If the node didn't become empty, we stop deleting the key */
+   if (!NODE_IS_EMPTY(node))
+     break;
+
+   /* The node became empty */
+   rt_free_node(tree, node);
+ }

Here we call rt_node_search_leaf() twice -- once to check for existence, and once to delete. All three search calls are inlined, so this wastes space. Let's try to delete the leaf, return if not found, otherwise handle the leaf bookkeeping and loop over the inner nodes. This might require some duplication of code.
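
Roughly like this (untested; just to show the shape I mean):

if (!rt_node_search_leaf(stack[level], key, RT_ACTION_DELETE, NULL))
    return false;               /* no key to delete */

tree->num_keys--;

/* free emptied nodes from the leaf upward */
while (level >= 0 && NODE_IS_EMPTY(stack[level]))
{
    rt_free_node(tree, stack[level]);
    if (--level >= 0)
        rt_node_search_inner(stack[level], key, RT_ACTION_DELETE, NULL);
}
return true;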

+ndoe_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)

Spelling

+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+             uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}

gcc generates better code with something like this (but not hard-coded) at the top:

    if (count > 4)
        pg_unreachable();

This would have to change when we implement shrinking of nodes, but might still be useful.
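
Concretely, I'm thinking of something like this, where "fanout" is the
capacity of the source node rather than a hard-coded constant:

static inline void
chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
                          uint8 *dst_chunks, rt_node **dst_children,
                          int count, int fanout)
{
    /* when inlined with a constant fanout, gcc can assume a small count */
    if (count > fanout)
        pg_unreachable();

    memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
    memcpy(dst_children, src_children, sizeof(rt_node *) * count);
}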

+ if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p))
+   return false;
+
+ return true;

Maybe just "return rt_node_search_leaf(...)" ?

--

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Oct 26, 2022 at 8:06 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Mon, Oct 24, 2022 at 12:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > I've attached updated PoC patches for discussion and cfbot. From the
> > previous version, I mainly changed the following things:
> >

Thank you for the comments!

> > * Separate treatment of inner and leaf nodes
>
> Overall, this looks much better!
>
> > * Pack both the node kind and node count into a uint16 value.
>
> For this, I did mention a bitfield earlier as something we "could" do, but it wasn't clear we should. After looking again at the node types, I must not have thought through this at all. Storing one byte instead of four for the full enum is a good step, but saving one more byte usually doesn't buy anything because of padding, with a few exceptions like this example:
>
> node4:   4 +  4           +  4*8 =   40
> node4:   5 +  4+(7)       +  4*8 =   48 bytes
>
> Even there, I'd rather not spend the extra cycles to access the members. And with my idea of decoupling size classes from kind, the variable-sized kinds will require another byte to store "capacity". Then, even if the kind gets encoded in a pointer tag, we'll still have 5 bytes in the base type. So I think we should assume 5 bytes from the start. (Might be 6 temporarily if I work on size decoupling first).

True. I'm going to start with 6 bytes and will consider reducing it to
5 bytes. Encoding the kind in a pointer tag could be tricky given DSA
support, so currently I'm thinking of packing the node kind and node
capacity class into a uint8.

>
> (Side note, if you have occasion to use bitfields again in the future, C99 has syntactic support for them, so no need to write your own shifting/masking code).

Thanks!

>
> > I've not done the SIMD part seriously yet. But overall the performance
> > seems good so far. If we agree with the current approach, I think we
> > can proceed with the verification of decoupling node sizes from node
> > kind. And I'll investigate DSA support.
>
> Sounds good. I have some additional comments about v7, and after these are addressed, we can proceed independently with the above two items. Seeing the DSA work will also inform me how invasive pointer tagging will be. There will still be some performance tuning and cosmetic work, but it's getting closer.
>

I've made some progress on investigating DSA support. I've written a
draft patch for that and the regression tests passed. I'll share it as
a separate patch for discussion with the v8 radix tree patch.

While implementing DSA support, I realized that we may not need
pointer tagging to distinguish between a backend-local address and a
dsa_pointer. In order to get a backend-local address from a
dsa_pointer, we need to pass the dsa_area, like:

node = dsa_get_address(tree->dsa, node_dp);

As shown above, the dsa area used by the shared radix tree is stored
in the radix_tree struct, so we can tell whether the radix tree is
shared by checking (tree->dsa == NULL). That is, if it's shared we
treat the pointer to a radix tree node as a dsa_pointer, and if not,
as a backend-local pointer. We don't need to encode anything in the
pointer.
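
To illustrate, the accessor could look like this (a rough sketch; the
union layout is just one way to do it):

typedef union rt_node_ptr
{
    rt_node    *local;          /* backend-local radix tree */
    dsa_pointer dp;             /* shared radix tree */
} rt_node_ptr;

static inline rt_node *
rt_ptr_get_node(radix_tree *tree, rt_node_ptr ptr)
{
    if (tree->dsa == NULL)
        return ptr.local;

    return (rt_node *) dsa_get_address(tree->dsa, ptr.dp);
}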

> -------------------------
> 0001:
>
> +#ifndef USE_NO_SIMD
> +#include "port/pg_bitutils.h"
> +#endif
>
> Leftover from an earlier version?
>
> +static inline int vector8_find(const Vector8 v, const uint8 c);
> +static inline int vector8_find_ge(const Vector8 v, const uint8 c);
>
> Leftovers, causing compiler warnings. (Also see new variable shadow warning)

Will fix.

>
> +#else /* USE_NO_SIMD */
> + Vector8 r = 0;
> + uint8 *rp = (uint8 *) &r;
> +
> + for (Size i = 0; i < sizeof(Vector8); i++)
> + rp[i] = Min(((const uint8 *) &v1)[i], ((const uint8 *) &v2)[i]);
> +
> + return r;
> +#endif
>
> As I mentioned a couple versions ago, this style is really awkward, and potential non-SIMD callers will be better off writing their own byte-wise loop rather than using this API. Especially since the "min" function exists only as a workaround for lack of unsigned comparison in (at least) SSE2. There is one existing function in this file with that idiom for non-assert code (for completeness), but even there, inputs of current interest to us use the uint64 algorithm.

Agreed. Will remove non-SIMD code.

>
> 0002:
>
> + /* XXX: should not to use vector8_highbit_mask */
> + bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
>
> Hmm?

It's my outdated memo, will remove.

>
> +/*
> + * Return index of the first element in chunks in the given node that is greater
> + * than or equal to 'key'.  Return -1 if there is no such element.
> + */
> +static inline int
> +node_32_search_ge(rt_node_base_32 *node, uint8 chunk)
>
> The caller must now have logic for inserting at the end:
>
> + int insertpos = node_32_search_ge((rt_node_base_32 *) n32, chunk);
> + int16 count = NODE_GET_COUNT(n32);
> +
> + if (insertpos < 0)
> + insertpos = count; /* insert to the tail */
>
> It would be a bit more clear if node_*_search_ge() always returns the position we need (see the prototype for example). In fact, these functions are probably better named node*_get_insertpos().

Agreed.

>
> + if (likely(NODE_HAS_FREE_SLOT(n128)))
> + {
> + node_inner_128_insert(n128, chunk, child);
> + break;
> + }
> +
> + /* grow node from 128 to 256 */
>
> We want all the node-growing code to be pushed down to the bottom so that all branches of the hot path are close together. This provides better locality for the CPU frontend. Looking at the assembly, the above doesn't have the desired effect, so we need to write like this (also see prototype):
>
> if (unlikely( ! has-free-slot))
>   grow-node;
> else
> {
>   ...;
>   break;
> }
> /* FALLTHROUGH */

Good point. Will change.
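
For instance, the node128 insert case would become something like this
(rt_node_grow_inner_128_to_256() is a hypothetical name):

if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
{
    /* cold path at the bottom: grow node from 128 to 256 */
    rt_node_grow_inner_128_to_256(tree, parent, node, key);
}
else
{
    node_inner_128_insert(n128, chunk, child);
    break;
}
/* FALLTHROUGH */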

>
> + /* Descend the tree until a leaf node */
> + while (shift >= 0)
> + {
> +   rt_node    *child;
> +
> +   if (NODE_IS_LEAF(node))
> +     break;
> +
> +   if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
> +     child = rt_node_add_new_child(tree, parent, node, key);
> +
> +   Assert(child);
> +
> +   parent = node;
> +   node = child;
> +   shift -= RT_NODE_SPAN;
> + }
>
> Note that if we have to call rt_node_add_new_child(), each successive loop iteration must search it and find nothing there (the prototype had a separate function to handle this). Maybe it's not that critical yet, but something to keep in mind as we proceed. Maybe a comment about it to remind us.

Agreed. Currently rt_extend() is used to add upper nodes, but we
probably need another function to add lower nodes for this case.

>
> + /* there is no key to delete */
> + if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, NULL))
> +   return false;
> +
> + /* Update the statistics */
> + tree->num_keys--;
> +
> + /*
> +  * Delete the key from the leaf node and recursively delete the key in
> +  * inner nodes if necessary.
> +  */
> + Assert(NODE_IS_LEAF(stack[level]));
> + while (level >= 0)
> + {
> +   rt_node    *node = stack[level--];
> +
> +   if (NODE_IS_LEAF(node))
> +     rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
> +   else
> +     rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
> +
> +   /* If the node didn't become empty, we stop deleting the key */
> +   if (!NODE_IS_EMPTY(node))
> +     break;
> +
> +   /* The node became empty */
> +   rt_free_node(tree, node);
> + }
>
> Here we call rt_node_search_leaf() twice -- once to check for existence, and once to delete. All three search calls are inlined, so this wastes space. Let's try to delete the leaf, return if not found, otherwise handle the leaf bookkeeping and loop over the inner nodes. This might require some duplication of code.

Agreed.

>
> +ndoe_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
>
> Spelling

Will fix.

>
> +static inline void
> +chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
> +             uint8 *dst_chunks, rt_node **dst_children, int count)
> +{
> + memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
> + memcpy(dst_children, src_children, sizeof(rt_node *) * count);
> +}
>
> gcc generates better code with something like this (but not hard-coded) at the top:
>
>     if (count > 4)
>         pg_unreachable();

Agreed.

>
> This would have to change when we implement shrinking of nodes, but might still be useful.
>
> + if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p))
> +   return false;
> +
> + return true;
>
> Maybe just "return rt_node_search_leaf(...)" ?

Agreed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Oct 27, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> True. I'm going to start with 6 bytes and will consider reducing it to
> 5 bytes.

Okay, let's plan on 6 for now, so we have the worst-case sizes up front. As discussed, I will attempt the size class decoupling after v8 and see how it goes.

> Encoding the kind in a pointer tag could be tricky given DSA

If it turns out to be unworkable, that's life. If it's just tricky, that can certainly be put off for future work. I hope to at least test it out with local memory.

> support so currently I'm thinking to pack the node kind and node
> capacity classes to uint8.

That won't work, if we need 128 for capacity, leaving no bits left. I want the capacity to be a number we can directly compare with the count (we won't ever need to store 256 because that node will never grow). Also, further to my last message, we need to access the kind quickly, without more cycles.

> I've made some progress on investigating DSA support. I've written
> draft patch for that and regression tests passed. I'll share it as a
> separate patch for discussion with v8 radix tree patch.

Great!

> While implementing DSA support, I realized that we may not need to use
> pointer tagging to distinguish between backend-local address or
> dsa_pointer. In order to get a backend-local address from dsa_pointer,
> we need to pass dsa_area like:

I was not clear -- when I see how much code changes to accommodate DSA pointers, I imagine I will pretty much know the places that would be affected by tagging the pointer with the node kind.

Speaking of tests, there is currently no Meson support, but tests pass because this library is not used anywhere in the backend yet, and apparently the CI Meson builds don't know to run the regression test? That will need to be done too. However, it's okay to keep the benchmarking module in autoconf, since it won't be committed.

> > +static inline void
> > +chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
> > +             uint8 *dst_chunks, rt_node **dst_children, int count)
> > +{
> > + memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
> > + memcpy(dst_children, src_children, sizeof(rt_node *) * count);
> > +}
> >
> > gcc generates better code with something like this (but not hard-coded) at the top:
> >
> >     if (count > 4)
> >         pg_unreachable();

Actually it just now occurred to me there's a bigger issue here: *We* know this code can only get here iff count==4, so why doesn't the compiler know that? I believe it boils down to

static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {

In the assembly, I see it checks if there is room in the node by doing a runtime lookup in this array, which is not constant. This might not be important just yet, because I want to base the check on the proposed node capacity instead, but I mention it as a reminder to us to make sure we take all opportunities for the compiler to propagate constants.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Oct 27, 2022 at 12:21 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Thu, Oct 27, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > True. I'm going to start with 6 bytes and will consider reducing it to
> > 5 bytes.
>
> Okay, let's plan on 6 for now, so we have the worst-case sizes up front. As discussed, I will attempt the size class decoupling after v8 and see how it goes.
>
> > Encoding the kind in a pointer tag could be tricky given DSA
>
> If it turns out to be unworkable, that's life. If it's just tricky, that can certainly be put off for future work. I hope to at least test it out with local memory.
>
> > support so currently I'm thinking to pack the node kind and node
> > capacity classes to uint8.
>
> That won't work, if we need 128 for capacity, leaving no bits left. I want the capacity to be a number we can directly compare with the count (we won't ever need to store 256 because that node will never grow). Also, further to my last message, we need to access the kind quickly, without more cycles.

Understood.

>
> > I've made some progress on investigating DSA support. I've written
> > draft patch for that and regression tests passed. I'll share it as a
> > separate patch for discussion with v8 radix tree patch.
>
> Great!
>
> > While implementing DSA support, I realized that we may not need to use
> > pointer tagging to distinguish between backend-local address or
> > dsa_pointer. In order to get a backend-local address from dsa_pointer,
> > we need to pass dsa_area like:
>
> I was not clear -- when I see how much code changes to accommodate DSA pointers, I imagine I will pretty much know the places that would be affected by tagging the pointer with the node kind.
>
> Speaking of tests, there is currently no Meson support, but tests pass because this library is not used anywhere in the backend yet, and apparently the CI Meson builds don't know to run the regression test? That will need to be done too. However, it's okay to keep the benchmarking module in autoconf, since it won't be committed.

Updated to support Meson.

>
> > > +static inline void
> > > +chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
> > > +             uint8 *dst_chunks, rt_node **dst_children, int count)
> > > +{
> > > + memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
> > > + memcpy(dst_children, src_children, sizeof(rt_node *) * count);
> > > +}
> > >
> > > gcc generates better code with something like this (but not hard-coded) at the top:
> > >
> > >     if (count > 4)
> > >         pg_unreachable();
>
> Actually it just now occurred to me there's a bigger issue here: *We* know this code can only get here iff count==4, so why doesn't the compiler know that? I believe it boils down to
>
> static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
>
> In the assembly, I see it checks if there is room in the node by doing a runtime lookup in this array, which is not constant. This might not be important just yet, because I want to base the check on the proposed node capacity instead, but I mention it as a reminder to us to make sure we take all opportunities for the compiler to propagate constants.

I've attached v8 patches. The 0001, 0002, and 0003 patches incorporate
the comments I got so far. The 0004 patch is a DSA support patch for
PoC.

In the 0004 patch, the basic idea is to use rt_node_ptr in all inner
nodes to point to their children, and we use rt_node_ptr as either
rt_node* or dsa_pointer depending on whether the radix tree is shared
or not (i.e., by checking radix_tree->dsa == NULL). Regarding
performance, I've added another boolean argument to
bench_seq/shuffle_search(), specifying whether to use the shared radix
tree or not. Here are benchmark results in my environment:

select * from bench_seq_search(0, 1 * 1000 * 1000, false, false);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_search_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |          9871240 |           180000000 |         67 |               |          241 |
(1 row)

select * from bench_seq_search(0, 1 * 1000 * 1000, false, true);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_search_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |         14680064 |           180000000 |         81 |               |          483 |
(1 row)

select * from bench_seq_search(0, 2 * 1000 * 1000, true, false);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_search_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
  999654 |         19680872 |           179937720 |         74 |               |          235 |
(1 row)

select * from bench_seq_search(0, 2 * 1000 * 1000, true, true);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_search_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
  999654 |         23068672 |           179937720 |         86 |               |          445 |
(1 row)

select * from bench_shuffle_search(0, 1 * 1000 * 1000, false, false);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_search_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |          9871240 |           180000000 |         67 |               |          640 |
(1 row)

select * from bench_shuffle_search(0, 1 * 1000 * 1000, false, true);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_search_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |         14680064 |           180000000 |         81 |               |         1002 |
(1 row)

select * from bench_shuffle_search(0, 2 * 1000 * 1000, true, false);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_search_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
  999654 |         19680872 |           179937720 |         74 |               |          697 |
(1 row)

select * from bench_shuffle_search(0, 2 * 1000 * 1000, true, true);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_search_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
  999654 |         23068672 |           179937720 |         86 |               |         1030 |
(1 row)

In non-shared radix tree cases (the fourth argument is false), I don't see a visible performance degradation. On the other hand, in shared radix tree cases (the fourth argument is true), I see visible overhead because of dsa_get_address().

Please note that the current shared radix tree implementation doesn't support any locking, so it cannot be read while being written by someone. Also, only one process can iterate over the shared radix tree. When it comes to parallel vacuum, these don't become restrictions, as the leader process writes the radix tree while scanning the heap and the radix tree is read by multiple processes while vacuuming indexes. And only the leader process can do heap vacuum, by iterating over the key-value pairs in the radix tree. If we want to use it for other cases too, we would need to support locking, RCU, or something similar.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Oct 31, 2022 at 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I've attached v8 patches. 0001, 0002, and 0003 patches incorporated
> the comments I got so far. 0004 patch is a DSA support patch for PoC.

Thanks for the new patchset. This is not a full review, but I have some comments:

0001 and 0002 look okay on a quick scan -- I will use this as a base for further work that we discussed. However, before I do so I'd like to request another revision regarding the following:

> In the 0004 patch, the basic idea is to use rt_node_ptr in all inner nodes to point to their children, and we use rt_node_ptr as either rt_node * or dsa_pointer depending on whether the radix tree is shared or not (i.e., by checking radix_tree->dsa == NULL).

0004: Looks like a good start, but this patch has a large number of changes like these, making it hard to read:

- if (found && child_p)
- *child_p = child;
+ if (found && childp_p)
+ *childp_p = childp;
...
  rt_node_inner_32 *new32;
+ rt_node_ptr new32p;
 
  /* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
-  RT_NODE_KIND_32);
+ new32p = rt_copy_node(tree, (rt_node *) n4, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) node_ptr_get_local(tree, new32p);

It's difficult to keep in my head what all the variables refer to. I thought a bit about how to split this patch up to make this easier to read. Here's what I came up with:

typedef struct rt_node_ptr
{
  uintptr_t encoded;
  rt_node * decoded;
} rt_node_ptr;

Note that there is nothing about "dsa or local". That's deliberate. That way, we can use the "encoded" field for a tagged pointer as well, as I hope we can do (at least for local pointers) in the future. So an intermediate patch would have "static inline void" functions  node_ptr_encode() and  node_ptr_decode(), which would only copy from one member to another. I suspect that: 1. The actual DSA changes will be *much* smaller and easier to reason about. 2. Experimenting with tagged pointers will be easier.
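(To illustrate, a minimal sketch of those helpers under this scheme; the bodies are only a guess at the intent:)

static inline void
node_ptr_encode(rt_node_ptr *ptr)
{
    /* for now, just mirror the decoded pointer; tag bits can come later */
    ptr->encoded = (uintptr_t) ptr->decoded;
}

static inline void
node_ptr_decode(rt_node_ptr *ptr)
{
    /* the inverse: recover the plain pointer from the encoded form */
    ptr->decoded = (rt_node *) ptr->encoded;
}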

Also, quick question: 0004 has a new function rt_node_update_inner() -- is that necessary because of DSA, or does it ideally belong in 0002? What's the reason for it?

> Regarding the performance, I've added another boolean argument to bench_seq/shuffle_search(), specifying whether to use the shared radix tree or not. Here are benchmark results in my environment:

> [...]

> In non-shared radix tree cases (the fourth argument is false), I don't see a visible performance degradation. On the other hand, in shared radix tree cases (the fourth argument is true), I see visible overhead because of dsa_get_address().

Thanks, this is useful.

> Please note that the current shared radix tree implementation doesn't support any locking, so it cannot be read while being written by someone.

I think at the very least we need a global lock to enforce this.

> Also, only one process can iterate over the shared radix tree. When it comes to parallel vacuum, these don't become restrictions, as the leader process writes the radix tree while scanning the heap and the radix tree is read by multiple processes while vacuuming indexes. And only the leader process can do heap vacuum, by iterating over the key-value pairs in the radix tree. If we want to use it for other cases too, we would need to support locking, RCU, or something similar.

A useful exercise here is to think about what we'd need to do parallel heap pruning. We don't need to go that far for v16 of course, but what's the simplest thing we can do to make that possible? Other use cases can change to more sophisticated schemes if need be.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Nov 3, 2022 at 1:59 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Mon, Oct 31, 2022 at 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I've attached v8 patches. 0001, 0002, and 0003 patches incorporated
> > the comments I got so far. 0004 patch is a DSA support patch for PoC.
>
> Thanks for the new patchset. This is not a full review, but I have some comments:
>
> 0001 and 0002 look okay on a quick scan -- I will use this as a base for further work that we discussed. However, before I do so I'd like to request another revision regarding the following:
>
> > In the 0004 patch, the basic idea is to use rt_node_ptr in all inner nodes to point to their children, and we use rt_node_ptr as either rt_node * or dsa_pointer depending on whether the radix tree is shared or not (i.e., by checking radix_tree->dsa == NULL).
>

Thank you for the comments!

> 0004: Looks like a good start, but this patch has a large number of changes like these, making it hard to read:
>
> - if (found && child_p)
> - *child_p = child;
> + if (found && childp_p)
> + *childp_p = childp;
> ...
>   rt_node_inner_32 *new32;
> + rt_node_ptr new32p;
>
>   /* grow node from 4 to 32 */
> - new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
> -  RT_NODE_KIND_32);
> + new32p = rt_copy_node(tree, (rt_node *) n4, RT_NODE_KIND_32);
> + new32 = (rt_node_inner_32 *) node_ptr_get_local(tree, new32p);
>
> It's difficult to keep in my head what all the variables refer to. I thought a bit about how to split this patch up to make this easier to read. Here's what I came up with:
>
> typedef struct rt_node_ptr
> {
>   uintptr_t encoded;
>   rt_node * decoded;
> } rt_node_ptr;
>
> Note that there is nothing about "dsa or local". That's deliberate. That way, we can use the "encoded" field for a tagged pointer as well, as I hope we can do (at least for local pointers) in the future. So an intermediate patch would have "static inline void" functions node_ptr_encode() and node_ptr_decode(), which would only copy from one member to another. I suspect that: 1. The actual DSA changes will be *much* smaller and easier to reason about. 2. Experimenting with tagged pointers will be easier.

Good idea. Will try in the next version patch.

>
> Also, quick question: 0004 has a new function rt_node_update_inner() -- is that necessary because of DSA, or does it ideally belong in 0002? What's the reason for it?

Oh, this was needed at one point when I was initially writing the DSA support, but thinking about it again now, I think we can remove it and use rt_node_insert_inner() with parent = NULL instead.

>
> > Regarding the performance, I've added another boolean argument to bench_seq/shuffle_search(), specifying whether to use the shared radix tree or not. Here are benchmark results in my environment:
>
> > [...]
>
> > In non-shared radix tree cases (the fourth argument is false), I don't see a visible performance degradation. On the other hand, in shared radix tree cases (the fourth argument is true), I see visible overhead because of dsa_get_address().
>
> Thanks, this is useful.
>
> > Please note that the current shared radix tree implementation doesn't support any locking, so it cannot be read while being written by someone.
>
> I think at the very least we need a global lock to enforce this.
>
> > Also, only one process can iterate over the shared radix tree. When it comes to parallel vacuum, these don't become restrictions, as the leader process writes the radix tree while scanning the heap and the radix tree is read by multiple processes while vacuuming indexes. And only the leader process can do heap vacuum, by iterating over the key-value pairs in the radix tree. If we want to use it for other cases too, we would need to support locking, RCU, or something similar.
>
> A useful exercise here is to think about what we'd need to do parallel heap pruning. We don't need to go that far for v16 of course, but what's the simplest thing we can do to make that possible? Other use cases can change to more sophisticated schemes if need be.

For parallel heap pruning, multiple workers will insert key-value pairs into the radix tree concurrently. The simplest solution would be a single lock to protect writes, but the performance will not be good. Another solution would be to divide the table into multiple ranges so that keys derived from TIDs don't conflict with each other, and have parallel workers process one or more ranges. That way, parallel vacuum workers can build *sub-trees* and the leader process can merge them. In the use cases of lazy vacuum, since the write phase and read phase are separated, the readers don't need to worry about concurrent updates.

I've attached a draft patch for lazy vacuum integration that can be applied on top of the v8 patches. The patch adds a new module called TIDStore, an efficient storage for TIDs backed by the radix tree. Lazy vacuum and parallel vacuum use it instead of a TID array. The patch also introduces rt_detach(), which was missed in the 0002 patch. It's a very rough patch but I hope it helps in considering lazy vacuum integration, radix tree APIs, and shared radix tree functionality.
There are some TODOs:

* We need to reset the TIDStore and therefore reset the radix tree. It can easily be done by using MemoryContextReset() in non-shared radix tree cases, but in the shared case, we need either to free all radix tree nodes recursively or introduce a way to release all allocated DSA memory.

* We need to limit the size of TIDStore (mainly radix_tree) in
maintenance_work_mem.

* We need to change the counter-based information in pg_stat_progress_vacuum such as max_dead_tuples and num_dead_tuples. I think it would be better to show the maximum number of bytes we can use to collect TIDs, and its current usage, instead.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Fri, Nov 4, 2022 at 10:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> For parallel heap pruning, multiple workers will insert key-value pairs into the radix tree concurrently. The simplest solution would be a single lock to protect writes, but the performance will not be good. Another solution would be to divide the table into multiple ranges so that keys derived from TIDs don't conflict with each other, and have parallel workers process one or more ranges. That way, parallel vacuum workers can build *sub-trees* and the leader process can merge them. In the use cases of lazy vacuum, since the write phase and read phase are separated, the readers don't need to worry about concurrent updates.

It's a good idea to use ranges for a different reason -- readahead. See commit 56788d2156fc3, which aimed to improve readahead for sequential scans. It might work to use that as a model: Each worker prunes a range of 64 pages, keeping the dead tids in a local array. At the end of the range: lock the tid store, enter the tids into the store, unlock, free the local array, and get the next range from the leader. It's possible contention won't be too bad, and I suspect using small local arrays as-we-go would be faster and use less memory than merging multiple sub-trees at the end.
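(To make that concrete, a rough sketch of the per-worker loop under those assumptions; get_next_range(), prune_page(), tidstore_add_tids(), and the other names here are placeholders, not proposed interfaces:)

/* hypothetical per-worker loop for parallel heap pruning */
BlockNumber start,
            end;

while (get_next_range(leader_state, &start, &end)) /* e.g. 64 pages */
{
    ItemPointerData *dead_tids;
    int         num_dead = 0;

    dead_tids = palloc(sizeof(ItemPointerData) * MaxHeapTuplesPerPage * (end - start));

    /* prune each page in the range, collecting dead TIDs locally */
    for (BlockNumber blk = start; blk < end; blk++)
        num_dead += prune_page(rel, blk, &dead_tids[num_dead]);

    /* enter the collected TIDs into the shared store under a lock */
    LWLockAcquire(tid_store_lock, LW_EXCLUSIVE);
    tidstore_add_tids(tid_store, dead_tids, num_dead);
    LWLockRelease(tid_store_lock);

    pfree(dead_tids);
}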

> I've attached a draft patch for lazy vacuum integration that can be applied on top of the v8 patches. The patch adds a new module called TIDStore, an efficient storage for TIDs backed by the radix tree. Lazy vacuum and parallel vacuum use it instead of a TID array. The patch also introduces rt_detach(), which was missed in the 0002 patch. It's a very rough patch but I hope it helps in considering lazy vacuum integration, radix tree APIs, and shared radix tree functionality.

It does help, good to see this.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Sat, Nov 5, 2022 at 6:23 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Fri, Nov 4, 2022 at 10:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > For parallel heap pruning, multiple workers will insert key-value pairs into the radix tree concurrently. The simplest solution would be a single lock to protect writes, but the performance will not be good. Another solution would be to divide the table into multiple ranges so that keys derived from TIDs don't conflict with each other, and have parallel workers process one or more ranges. That way, parallel vacuum workers can build *sub-trees* and the leader process can merge them. In the use cases of lazy vacuum, since the write phase and read phase are separated, the readers don't need to worry about concurrent updates.
>
> It's a good idea to use ranges for a different reason -- readahead. See commit 56788d2156fc3, which aimed to improve readahead for sequential scans. It might work to use that as a model: Each worker prunes a range of 64 pages, keeping the dead tids in a local array. At the end of the range: lock the tid store, enter the tids into the store, unlock, free the local array, and get the next range from the leader. It's possible contention won't be too bad, and I suspect using small local arrays as-we-go would be faster and use less memory than merging multiple sub-trees at the end.

Seems like a promising idea. I think it might work well even in the current parallel vacuum (i.e., single writer). I mean, I think we can have a single lwlock for shared cases in the first version. If the overhead of acquiring the lwlock per insertion of a key-value pair is not negligible, we might want to try this idea.

Apart from that, I'm going to incorporate the comments on the 0004 patch and try pointer tagging.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Peter Geoghegan
Date:
On Fri, Nov 4, 2022 at 8:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> For parallel heap pruning, multiple workers will insert key-value pairs into the radix tree concurrently. The simplest solution would be a single lock to protect writes, but the performance will not be good. Another solution would be to divide the table into multiple ranges so that keys derived from TIDs don't conflict with each other, and have parallel workers process one or more ranges. That way, parallel vacuum workers can build *sub-trees* and the leader process can merge them. In the use cases of lazy vacuum, since the write phase and read phase are separated, the readers don't need to worry about concurrent updates.

I think that the VM snapshot concept can eventually be used to
implement parallel heap pruning. Since every page that will become a
scanned_pages is known right from the start with VM snapshots, it will
be relatively straightforward to partition these pages into distinct
ranges with an equal number of pages, one per worker planned. The VM
snapshot structure can also be used for I/O prefetching, which will be
more important with parallel heap pruning (and with aio).

Working off of an immutable structure that describes which pages to
process right from the start is naturally easy to work with, in
general. We can "reorder work" flexibly (i.e. process individual
scanned_pages in any order that is convenient). Another example is
"changing our mind" about advancing relfrozenxid when it turns out
that we maybe should have decided to do that at the start of VACUUM
[1]. Maybe the specific "changing our mind" idea will turn out to not
be a very useful idea, but it is at least an interesting and thought
provoking concept.

[1] https://postgr.es/m/CAH2-WzkQ86yf==mgAF=cQ0qeLRWKX3htLw9Qo+qx3zbwJJkPiQ@mail.gmail.com
-- 
Peter Geoghegan



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Nov 8, 2022 at 11:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sat, Nov 5, 2022 at 6:23 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> >
> > On Fri, Nov 4, 2022 at 10:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > For parallel heap pruning, multiple workers will insert key-value pairs into the radix tree concurrently. The simplest solution would be a single lock to protect writes, but the performance will not be good. Another solution would be to divide the table into multiple ranges so that keys derived from TIDs don't conflict with each other, and have parallel workers process one or more ranges. That way, parallel vacuum workers can build *sub-trees* and the leader process can merge them. In the use cases of lazy vacuum, since the write phase and read phase are separated, the readers don't need to worry about concurrent updates.
> >
> > It's a good idea to use ranges for a different reason -- readahead. See commit 56788d2156fc3, which aimed to improve readahead for sequential scans. It might work to use that as a model: Each worker prunes a range of 64 pages, keeping the dead tids in a local array. At the end of the range: lock the tid store, enter the tids into the store, unlock, free the local array, and get the next range from the leader. It's possible contention won't be too bad, and I suspect using small local arrays as-we-go would be faster and use less memory than merging multiple sub-trees at the end.
>
> Seems like a promising idea. I think it might work well even in the current parallel vacuum (i.e., single writer). I mean, I think we can have a single lwlock for shared cases in the first version. If the overhead of acquiring the lwlock per insertion of a key-value pair is not negligible, we might want to try this idea.
>
> Apart from that, I'm going to incorporate the comments on the 0004 patch and try pointer tagging.

I'd like to share some progress on this work.

The 0004 patch is a new patch supporting pointer tagging of the node kind. Also, it introduces the rt_node_ptr we discussed, so that internal functions use it rather than having two arguments for encoded and decoded pointers. With this intermediate patch, the DSA support patch became more readable and understandable. We can probably make it even smaller if we move the change that separates the control object from radix_tree to the main patch (0002). The patch still needs to be polished but I'd like to check if this idea is worthwhile. If we agree on this direction, this patch will be merged into the main radix tree implementation patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Nov 14, 2022 at 3:44 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> The 0004 patch is a new patch supporting pointer tagging of the node kind. Also, it introduces the rt_node_ptr we discussed, so that internal functions use it rather than having two arguments for encoded and decoded pointers. With this intermediate patch, the DSA support patch became more readable and understandable. We can probably make it even smaller if we move the change that separates the control object from radix_tree to the main patch (0002). The patch still needs to be polished but I'd like to check if this idea is worthwhile. If we agree on this direction, this patch will be merged into the main radix tree implementation patch.

Thanks for the new patch set. I've taken a very brief look at 0004 and I think the broad outlines are okay. As you say it needs polish, but before going further, I'd like to do some experiments of my own as I mentioned earlier:

- See how much performance we actually gain from tagging the node kind.
- Try additional size classes while keeping the node kinds to only four.
- Optimize node128 insert. 
- Try templating out the differences between local and shared memory. With local memory, the node-pointer struct would be a union, for example. Templating would also reduce branches and re-simplify some internal APIs, but it's likely that would also make the TID store and/or vacuum more complex, because at least some external functions would be duplicated.

I'll set the patch to "waiting on author", but in this case the author is me.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Nov 14, 2022 at 10:00 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Mon, Nov 14, 2022 at 3:44 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > The 0004 patch is a new patch supporting pointer tagging of the node kind. Also, it introduces the rt_node_ptr we discussed, so that internal functions use it rather than having two arguments for encoded and decoded pointers. With this intermediate patch, the DSA support patch became more readable and understandable. We can probably make it even smaller if we move the change that separates the control object from radix_tree to the main patch (0002). The patch still needs to be polished but I'd like to check if this idea is worthwhile. If we agree on this direction, this patch will be merged into the main radix tree implementation patch.
>
> Thanks for the new patch set. I've taken a very brief look at 0004 and I think the broad outlines are okay. As you say it needs polish, but before going further, I'd like to do some experiments of my own as I mentioned earlier:
>
> - See how much performance we actually gain from tagging the node kind.
> - Try additional size classes while keeping the node kinds to only four.
> - Optimize node128 insert.
> - Try templating out the differences between local and shared memory. With local memory, the node-pointer struct would be a union, for example. Templating would also reduce branches and re-simplify some internal APIs, but it's likely that would also make the TID store and/or vacuum more complex, because at least some external functions would be duplicated.

Thanks! Please let me know if there is something I can help with.

In the meantime, I'd like to make some progress on the vacuum integration and on improving the test coverage.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Thanks! Please let me know if there is something I can help with.

I didn't get very far because the tests fail on 0004 in rt_verify_node:

TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:


On Wed, Nov 16, 2022 at 11:46 AM John Naylor <john.naylor@enterprisedb.com> wrote:
>
>
> On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Thanks! Please let me know if there is something I can help with.
>
> I didn't get very far because the tests fail on 0004 in rt_verify_node:
>
> TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242

Actually I do want to offer some general advice. Upthread I recommended a purely refactoring patch that added the node-pointer struct but did nothing else, so that the DSA changes would be smaller. 0004 attempted pointer tagging in the same commit, which makes it no longer a purely refactoring patch, so that 1) makes it harder to tell what part caused the bug and 2) obscures what is necessary for DSA pointers and what was additionally necessary for pointer tagging. Shared memory support is a prerequisite for a shippable feature, but pointer tagging is (hopefully) a performance optimization. Let's keep them separate.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Nov 16, 2022 at 1:46 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
>
> On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Thanks! Please let me know if there is something I can help with.
>
> I didn't get very far because the tests fail on 0004 in rt_verify_node:
>
> TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242

Which tests do you use to get this assertion failure? I've confirmed
there is a bug in 0005 patch but without it, "make check-world"
passed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Nov 16, 2022 at 2:17 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
>
>
> On Wed, Nov 16, 2022 at 11:46 AM John Naylor <john.naylor@enterprisedb.com> wrote:
> >
> >
> > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > Thanks! Please let me know if there is something I can help with.
> >
> > I didn't get very far because the tests fail on 0004 in rt_verify_node:
> >
> > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
>
> Actually I do want to offer some general advice. Upthread I recommended a purely refactoring patch that added the node-pointer struct but did nothing else, so that the DSA changes would be smaller. 0004 attempted pointer tagging in the same commit, which makes it no longer a purely refactoring patch, so that 1) makes it harder to tell what part caused the bug and 2) obscures what is necessary for DSA pointers and what was additionally necessary for pointer tagging. Shared memory support is a prerequisite for a shippable feature, but pointer tagging is (hopefully) a performance optimization. Let's keep them separate.

Totally agreed. I'll separate them in the next version patch. Thank
you for your advice.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Wed, Nov 16, 2022 at 12:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Nov 16, 2022 at 1:46 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> >
> > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > Thanks! Please let me know if there is something I can help with.
> >
> > I didn't get very far because the tests fail on 0004 in rt_verify_node:
> >
> > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
>
> Which tests do you use to get this assertion failure? I've confirmed
> there is a bug in 0005 patch but without it, "make check-world"
> passed.

Hmm, I started over and rebuilt and it didn't reproduce. Not sure what happened, sorry for the noise.

I'm attaching a test I wrote to stress test branch prediction in search, and while trying it out I found two possible issues.

It's based on the random int load test, but tests search speed. Run like this:

select * from bench_search_random_nodes(10 * 1000 * 1000)

It also takes some care to include all the different node kinds, restricting the possible keys by AND-ing with a filter. Here's a simple demo:

filter = ((uint64)1<<40)-1;
LOG:  num_keys = 9999967, height = 4, n4 = 17513814, n32 = 6320, n128 = 62663, n256 = 3130

Just using random integers leads to >99% using the smallest node. I wanted to get close to having the same number of each, but that's difficult while still using random inputs. I ended up using

filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF)

which gives

LOG:  num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 182670, n256 = 1024

Which seems okay for the task. One puzzling thing I found while trying various filters is that sometimes the reported tree height would change. For example:

filter = (((uint64) 1<<32) | (0xFF<<24));
LOG:  num_keys = 9999944, height = 7, n4 = 47515559, n32 = 6209, n128 = 62632, n256 = 3161

1) Any idea why the tree height would be reported as 7 here? I didn't expect that.

2) It seems that 0004 actually causes a significant slowdown in this test (as in the attached, using the second filter above and with turboboost disabled):

v9 0003: 2062 2051 2050
v9 0004: 2346 2316 2321

That means my idea for the pointer struct might have some problems, at least as currently implemented. Maybe in the course of separating out and polishing that piece, an inefficiency will fall out. Or, it might be another reason to template local and shared separately. Not sure yet. I also haven't tried to adjust this test for the shared memory case.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Nov 16, 2022 at 4:39 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
>
> On Wed, Nov 16, 2022 at 12:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Nov 16, 2022 at 1:46 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > >
> > > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > Thanks! Please let me know if there is something I can help with.
> > >
> > > I didn't get very far because the tests fail on 0004 in rt_verify_node:
> > >
> > > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
> >
> > Which tests do you use to get this assertion failure? I've confirmed
> > there is a bug in 0005 patch but without it, "make check-world"
> > passed.
>
> Hmm, I started over and rebuilt and it didn't reproduce. Not sure what happened, sorry for the noise.

Good to know. No problem.

> I'm attaching a test I wrote to stress test branch prediction in search, and while trying it out I found two possible issues.

Thank you for testing!

>
> It's based on the random int load test, but tests search speed. Run like this:
>
> select * from bench_search_random_nodes(10 * 1000 * 1000)
>
> It also takes some care to include all the different node kinds, restricting the possible keys by AND-ing with a filter. Here's a simple demo:
>
> filter = ((uint64)1<<40)-1;
> LOG:  num_keys = 9999967, height = 4, n4 = 17513814, n32 = 6320, n128 = 62663, n256 = 3130
>
> Just using random integers leads to >99% using the smallest node. I wanted to get close to having the same number of each, but that's difficult while still using random inputs. I ended up using
>
> filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF)
>
> which gives
>
> LOG:  num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 182670, n256 = 1024
>
> Which seems okay for the task. One puzzling thing I found while trying various filters is that sometimes the reported tree height would change. For example:
>
> filter = (((uint64) 1<<32) | (0xFF<<24));
> LOG:  num_keys = 9999944, height = 7, n4 = 47515559, n32 = 6209, n128 = 62632, n256 = 3161
>
> 1) Any idea why the tree height would be reported as 7 here? I didn't expect that.

In my environment, (0xFF<<24) is 0xFFFFFFFFFF000000, not 0xFF000000.
It seems the filter should be (((uint64) 1<<32) | ((uint64)
0xFF<<24)).
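(For the record, the trap here is that 0xFF << 24 is computed in signed int, and the negative result sign-extends when converted to uint64 for the bitwise OR:)

uint64      bad = (((uint64) 1 << 32) | (0xFF << 24));           /* 0xFFFFFFFFFF000000 */
uint64      good = (((uint64) 1 << 32) | ((uint64) 0xFF << 24)); /* 0x00000001FF000000 */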

>
> 2) It seems that 0004 actually causes a significant slowdown in this test (as in the attached, using the second filter above and with turboboost disabled):
>
> v9 0003: 2062 2051 2050
> v9 0004: 2346 2316 2321
>
> That means my idea for the pointer struct might have some problems, at least as currently implemented. Maybe in the course of separating out and polishing that piece, an inefficiency will fall out. Or, it might be another reason to template local and shared separately. Not sure yet. I also haven't tried to adjust this test for the shared memory case.

I'll also run the test on my environment and do the investigation tomorrow.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Wed, Sep 28, 2022 at 1:18 PM I wrote:

> Along those lines, one thing I've been thinking about is the number of size classes. There is a tradeoff between memory efficiency and number of branches when searching/inserting. My current thinking is there is too much coupling between size class and data type. Each size class currently uses a different data type and a different algorithm to search and set it, which in turn requires another branch. We've found that a larger number of size classes leads to poor branch prediction [1] and (I imagine) code density.
>
> I'm thinking we can use "flexible array members" for the values/pointers, and keep the rest of the control data in the struct the same. That way, we never have more than 4 actual "kinds" to code and branch on. As a bonus, when migrating a node to a larger size class of the same kind, we can simply repalloc() to the next size.

While the most important challenge right now is how to best represent and organize the shared memory case, I wanted to get the above idea working and out of the way, to be saved for a future time. I've attached a rough implementation (applies on top of v9 0003) that splits node32 into 2 size classes. They both share the exact same base data type and hence the same search/set code, so the number of "kind"s is still four, but here there are five "size classes", so a new case in the "unlikely" node-growing path. The smaller instance of node32 is a "node15", because that's currently 160 bytes, corresponding to one of the DSA size classes. This idea can be applied to any other node except the max size, as we see fit. (Adding a singleton size class would bring it back in line with the prototype, at least as far as memory consumption.)
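(A sketch of that layout idea, using the inner node32 as an example; this is illustrative only, and the actual definitions in the patch may differ:)

/* one base type serves both the "node15" and "node32" size classes */
typedef struct rt_node_inner_32
{
    rt_node     base;
    uint8       chunks[32];     /* the search/set code is shared */
    /* 15 or 32 elements, depending on the allocated size class */
    rt_node    *children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_32;

/* allocation size for a given size class's fanout */
#define NODE_INNER_32_SIZE(fanout) \
    (offsetof(rt_node_inner_32, children) + (fanout) * sizeof(rt_node *))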

One issue with this patch: The "fanout" member is a uint8, so it can't hold 256 for the largest node kind. That's not an issue in practice, since we never need to grow it, and we only compare that value with the count in an Assert(), so I just set it to zero. That does break an invariant, so it's not great. We could use 2 bytes to be strictly correct in all cases, but that limits what we can do with the smallest node kind.

In the course of working on this, I encountered a pain point. Since it's impossible to repalloc in slab, we have to do alloc/copy/free ourselves. That's fine, but the current coding makes too many assumptions about the use cases: rt_alloc_node and rt_copy_node are too entangled with each other and do too much work unrelated to what the names imply. I seem to remember an earlier version had something like rt_node_copy_common that did only...copying. That was much easier to reason about. In 0002 I resorted to doing my own allocation to show what I really want to do, because the new use case doesn't need zeroing and setting values. It only needs to...allocate (and increase the stats counter if built that way).

Future optimization work while I'm thinking of it: rt_alloc_node should be always-inlined and the memset done separately (i.e. not *AllocZero). That way the compiler should be able to generate more efficient zeroing code for smaller nodes. I'll test the numbers on this sometime in the future.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Nov 17, 2022 at 12:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Nov 16, 2022 at 4:39 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> >
> > On Wed, Nov 16, 2022 at 12:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Wed, Nov 16, 2022 at 1:46 PM John Naylor
> > > <john.naylor@enterprisedb.com> wrote:
> > > >
> > > >
> > > > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > > Thanks! Please let me know if there is something I can help with.
> > > >
> > > > I didn't get very far because the tests fail on 0004 in rt_verify_node:
> > > >
> > > > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
> > >
> > > Which tests do you use to get this assertion failure? I've confirmed
> > > there is a bug in 0005 patch but without it, "make check-world"
> > > passed.
> >
> > Hmm, I started over and rebuilt and it didn't reproduce. Not sure what happened, sorry for the noise.
>
> Good to know. No problem.
>
> > I'm attaching a test I wrote to stress test branch prediction in search, and while trying it out I found two possible issues.
>
> Thank you for testing!
>
> >
> > It's based on the random int load test, but tests search speed. Run like this:
> >
> > select * from bench_search_random_nodes(10 * 1000 * 1000)
> >
> > It also takes some care to include all the different node kinds, restricting the possible keys by AND-ing with a filter. Here's a simple demo:
> >
> > filter = ((uint64)1<<40)-1;
> > LOG:  num_keys = 9999967, height = 4, n4 = 17513814, n32 = 6320, n128 = 62663, n256 = 3130
> >
> > Just using random integers leads to >99% using the smallest node. I wanted to get close to having the same number of each, but that's difficult while still using random inputs. I ended up using
> >
> > filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF)
> >
> > which gives
> >
> > LOG:  num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 182670, n256 = 1024
> >
> > Which seems okay for the task. One puzzling thing I found while trying various filters is that sometimes the reported tree height would change. For example:
> >
> > filter = (((uint64) 1<<32) | (0xFF<<24));
> > LOG:  num_keys = 9999944, height = 7, n4 = 47515559, n32 = 6209, n128 = 62632, n256 = 3161
> >
> > 1) Any idea why the tree height would be reported as 7 here? I didn't expect that.
>
> In my environment, (0xFF<<24) is 0xFFFFFFFFFF000000, not 0xFF000000.
> It seems the filter should be (((uint64) 1<<32) | ((uint64)
> 0xFF<<24)).
>
> >
> > 2) It seems that 0004 actually causes a significant slowdown in this test (as in the attached, using the second filter above and with turboboost disabled):
> >
> > v9 0003: 2062 2051 2050
> > v9 0004: 2346 2316 2321
> >
> > That means my idea for the pointer struct might have some problems, at least as currently implemented. Maybe in the course of separating out and polishing that piece, an inefficiency will fall out. Or, it might be another reason to template local and shared separately. Not sure yet. I also haven't tried to adjust this test for the shared memory case.
>
> I'll also run the test on my environment and do the investigation tomorrow.
>

FYI I've not tested the patch you shared today, but here are the benchmark results I got with the v9 patch in my environment (I used the second filter). I split the 0004 patch into two patches: a pure refactoring patch to introduce rt_node_ptr, and a patch to do pointer tagging.

v9 0003 patch        : 1113 1114 1114
introduce rt_node_ptr: 1127 1128 1128
pointer tagging      : 1085 1087 1086 (equivalent to 0004 patch)

In my environment, rt_node_ptr seemed to add some overhead but pointer tagging had performance benefits. I'm not sure why the results are different from yours. The radix tree stats show the same as in your tests.

=# select * from bench_search_random_nodes(10 * 1000 * 1000);
2022-11-18 22:18:21.608 JST [3913544] LOG:  num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 182670, n256 = 1024

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Fri, Nov 18, 2022 at 8:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> FYI I've not tested the patch you shared today, but here are the benchmark results I got with the v9 patch in my environment (I used the second filter). I split the 0004 patch into two patches: a pure refactoring patch to introduce rt_node_ptr, and a patch to do pointer tagging.
>
> v9 0003 patch        : 1113 1114 1114
> introduce rt_node_ptr: 1127 1128 1128
> pointer tagging      : 1085 1087 1086 (equivalent to 0004 patch)
>
> In my environment, rt_node_ptr seemed to add some overhead but pointer tagging had performance benefits. I'm not sure why the results are different from yours. The radix tree stats show the same as in your tests.

There is less than a 2% difference from the median set of results, so it's hard to distinguish from noise. I did a fresh rebuild and retested with the same results: about a 15% slowdown in v9 0004. That's strange.

On Wed, Nov 16, 2022 at 10:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> > filter = (((uint64) 1<<32) | (0xFF<<24));
> > LOG:  num_keys = 9999944, height = 7, n4 = 47515559, n32 = 6209, n128 = 62632, n256 = 3161
> >
> > 1) Any idea why the tree height would be reported as 7 here? I didn't expect that.
>
> In my environment, (0xFF<<24) is 0xFFFFFFFFFF000000, not 0xFF000000.
> It seems the filter should be (((uint64) 1<<32) | ((uint64)
> 0xFF<<24)).

Ugh, sign extension, brain fade on my part. Thanks, I'm glad there was a straightforward explanation.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Fri, Nov 18, 2022 at 8:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Nov 17, 2022 at 12:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Nov 16, 2022 at 4:39 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:

> > > That means my idea for the pointer struct might have some problems, at least as currently implemented. Maybe in the course of separating out and polishing that piece, an inefficiency will fall out. Or, it might be another reason to template local and shared separately. Not sure yet. I also haven't tried to adjust this test for the shared memory case.

Digging a bit deeper, I see a flaw in my benchmark: Even though the total distribution of node kinds is decently even, the pattern that the benchmark sees is not terribly random:

         3,343,352      branch-misses:u                  #    0.85% of all branches        
       393,204,959      branches:u

Recall a previous benchmark [1] where the leaf node was about half node16 and half node32. Randomizing the leaf node between the two caused branch misses to go from 1% to 2%, causing a noticeable slowdown. Maybe in this new benchmark, each level has a skewed distribution of nodes, giving a smart branch predictor something to work with. We will need a way to efficiently generate keys that lead to a relatively unpredictable distribution of node kinds, as seen by a searcher. Especially in the leaves (or just above the leaves), since those are less likely to be cached.

> > I'll also run the test on my environment and do the investigation tomorrow.
> >
>
> FYI I've not tested the patch you shared today, but here are the benchmark results I got with the v9 patch in my environment (I used the second filter). I split the 0004 patch into two patches: a pure refactoring patch to introduce rt_node_ptr, and a patch to do pointer tagging.

Would you be able to share the refactoring patch? And a fix for the failing tests? I'm thinking I want to try the templating approach fairly soon.

[1] https://www.postgresql.org/message-id/CAFBsxsFEVckVzsBsfgGzGR4Yz%3DJp%3DUxOtjYvTjOz6fOoLXtOig%40mail.gmail.com

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Fri, Nov 18, 2022 at 2:48 PM I wrote:
> One issue with this patch: The "fanout" member is a uint8, so it can't hold 256 for the largest node kind. That's not an issue in practice, since we never need to grow it, and we only compare that value with the count in an Assert(), so I just set it to zero. That does break an invariant, so it's not great. We could use 2 bytes to be strictly correct in all cases, but that limits what we can do with the smallest node kind.

Thinking about this part, there's an easy resolution -- use a different macro for fixed- and variable-sized node kinds to determine if there is a free slot.
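(Roughly like this, as a sketch; the macro names here are made up, though rt_node_kind_info appears upthread:)

/* variable-sized kinds consult the fanout stored in the node... */
#define VAR_NODE_HAS_FREE_SLOT(node) \
    ((node)->base.n.count < (node)->base.n.fanout)

/* ...while fixed-sized kinds compare against the kind's table entry,
 * so the largest kind never needs to store 256 in its uint8 fanout */
#define FIXED_NODE_HAS_FREE_SLOT(node, kind) \
    ((node)->base.n.count < rt_node_kind_info[kind].fanout)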

Also, I wanted to share some results of adjusting the boundary between the two smallest node kinds. In the hackish attached patch, I modified the fixed height search benchmark to search a small (within L1 cache) tree thousands of times. For the first set I modified node4's maximum fanout and filled it up. For the second, I set node4's fanout to 1, which causes 2+ to spill to node32 (actually the partially-filled node15 size class as demoed earlier).

node4:

NOTICE:  num_keys = 16, height = 3, n4 = 15, n15 = 0, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      2 |    16 |            16520 |          0 |            3

NOTICE:  num_keys = 81, height = 3, n4 = 40, n15 = 0, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      3 |    81 |            16456 |          0 |           17

NOTICE:  num_keys = 256, height = 3, n4 = 85, n15 = 0, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      4 |   256 |            16456 |          0 |           89

NOTICE:  num_keys = 625, height = 3, n4 = 156, n15 = 0, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      5 |   625 |            16488 |          0 |          327


node32:

NOTICE:  num_keys = 16, height = 3, n4 = 0, n15 = 15, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      2 |    16 |            16488 |          0 |            5
(1 row)

NOTICE:  num_keys = 81, height = 3, n4 = 0, n15 = 40, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      3 |    81 |            16520 |          0 |           28

NOTICE:  num_keys = 256, height = 3, n4 = 0, n15 = 85, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      4 |   256 |            16408 |          0 |           79

NOTICE:  num_keys = 625, height = 3, n4 = 0, n15 = 156, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      5 |   625 |            24616 |          0 |          199

In this test, node32 seems slightly faster than node4 with 4 elements, at the cost of more memory. 

Assuming the smallest node is fixed size (i.e. fanout/capacity member not part of the common set, so only part of variable-sized nodes), a fanout of 3 has a nice property: no wasted padding space:

node4: 5 + 4+(7) + 4*8 = 48 bytes
node3: 5 + 3     + 3*8 = 32

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Nov 21, 2022 at 3:43 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Fri, Nov 18, 2022 at 8:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Nov 17, 2022 at 12:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Wed, Nov 16, 2022 at 4:39 PM John Naylor
> > > <john.naylor@enterprisedb.com> wrote:
>
> > > > That means my idea for the pointer struct might have some problems, at least as currently implemented. Maybe in the course of separating out and polishing that piece, an inefficiency will fall out. Or, it might be another reason to template local and shared separately. Not sure yet. I also haven't tried to adjust this test for the shared memory case.
>
> Digging a bit deeper, I see a flaw in my benchmark: Even though the total distribution of node kinds is decently even, the pattern that the benchmark sees is not terribly random:
>
>          3,343,352      branch-misses:u                  #    0.85% of all branches
>        393,204,959      branches:u
>
> Recall a previous benchmark [1] where the leaf node was about half node16 and half node32. Randomizing the leaf node between the two caused branch misses to go from 1% to 2%, causing a noticeable slowdown. Maybe in this new benchmark, each level has a skewed distribution of nodes, giving a smart branch predictor something to work with. We will need a way to efficiently generate keys that lead to a relatively unpredictable distribution of node kinds, as seen by a searcher. Especially in the leaves (or just above the leaves), since those are less likely to be cached.
>
> > > I'll also run the test on my environment and do the investigation tomorrow.
> > >
> >
> > FYI I've not tested the patch you shared today but here are the
> > benchmark results I did with the v9 patch in my environment (I used
> > the second filter). I splitted 0004 patch into two patches: a patch
> > for pure refactoring patch to introduce rt_node_ptr and a patch to do
> > pointer tagging.
>
> Would you be able to share the refactoring patch? And a fix for the failing tests? I'm thinking I want to try the templating approach fairly soon.
>

Sure. I've attached the v10 patches. 0004 is the pure refactoring patch and 0005 introduces the pointer tagging.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Nov 21, 2022 at 4:20 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
>
> On Fri, Nov 18, 2022 at 2:48 PM I wrote:
> > One issue with this patch: The "fanout" member is a uint8, so it can't hold 256 for the largest node kind. That's not an issue in practice, since we never need to grow it, and we only compare that value with the count in an Assert(), so I just set it to zero. That does break an invariant, so it's not great. We could use 2 bytes to be strictly correct in all cases, but that limits what we can do with the smallest node kind.
>
> Thinking about this part, there's an easy resolution -- use a different macro for fixed- and variable-sized node kinds to determine if there is a free slot.
>
> Also, I wanted to share some results of adjusting the boundary between the two smallest node kinds. In the hackish attached patch, I modified the fixed height search benchmark to search a small (within L1 cache) tree thousands of times. For the first set I modified node4's maximum fanout and filled it up. For the second, I set node4's fanout to 1, which causes 2+ to spill to node32 (actually the partially-filled node15 size class as demoed earlier).
>
> node4:
>
> NOTICE:  num_keys = 16, height = 3, n4 = 15, n15 = 0, n32 = 0, n128 = 0, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
>       2 |    16 |            16520 |          0 |            3
>
> NOTICE:  num_keys = 81, height = 3, n4 = 40, n15 = 0, n32 = 0, n128 = 0, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
>       3 |    81 |            16456 |          0 |           17
>
> NOTICE:  num_keys = 256, height = 3, n4 = 85, n15 = 0, n32 = 0, n128 = 0, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
>       4 |   256 |            16456 |          0 |           89
>
> NOTICE:  num_keys = 625, height = 3, n4 = 156, n15 = 0, n32 = 0, n128 = 0, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
>       5 |   625 |            16488 |          0 |          327
>
>
> node32:
>
> NOTICE:  num_keys = 16, height = 3, n4 = 0, n15 = 15, n32 = 0, n128 = 0, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
>       2 |    16 |            16488 |          0 |            5
> (1 row)
>
> NOTICE:  num_keys = 81, height = 3, n4 = 0, n15 = 40, n32 = 0, n128 = 0, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
>       3 |    81 |            16520 |          0 |           28
>
> NOTICE:  num_keys = 256, height = 3, n4 = 0, n15 = 85, n32 = 0, n128 = 0, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
>       4 |   256 |            16408 |          0 |           79
>
> NOTICE:  num_keys = 625, height = 3, n4 = 0, n15 = 156, n32 = 0, n128 = 0, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
>       5 |   625 |            24616 |          0 |          199
>
> In this test, node32 seems slightly faster than node4 with 4 elements, at the cost of more memory.
>
> Assuming the smallest node is fixed size (i.e. fanout/capacity member not part of the common set, so only part of
> variable-sized nodes), 3 has a nice property: no wasted padding space:
>
> node4: 5 + 4+(7) + 4*8 = 48 bytes
> node3: 5 + 3     + 3*8 = 32

IIUC if we store the fanout member only in variable-sized nodes,
rt_node has only count, shift, and chunk, so 4 bytes in total. If so,
the size of node3 (i.e. a fixed-sized node) is (4 + 3 + (1) + 3*8)? The
size doesn't change, but there is 1 byte of padding space.

Also, even if we make node3 a variable-sized node, a size class of 1
for node3 could be a good choice, since it also doesn't need padding
space and could be a good alternative to path compression.

node3        : 5 + 3 + 3*8 = 32 bytes
size class 1 : 5 + 3 + 1*8 = 16 bytes
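To make the arithmetic concrete, here is a sketch of the layout this
implies, with the 5-byte header fields laid out flat and no pointer
tagging assumed; the type and field names are illustrative, not from
the patch:

typedef struct rt_node rt_node;     /* generic node type */

typedef struct rt_node_3
{
    /* 5-byte common header */
    uint16  count;                  /* bytes 0-1 */
    uint8   shift;                  /* byte  2 */
    uint8   chunk;                  /* byte  3 */
    uint8   fanout;                 /* byte  4 */

    /* node3-specific part */
    uint8   chunks[3];              /* bytes 5-7, filling the slack */
    rt_node *children[3];           /* byte 8, 3*8 = 24 on 64-bit */
} rt_node_3;                        /* 32 bytes, no padding */

A size class of 1 would allocate the same shape but with room for only
children[1], giving 8 + 8 = 16 bytes.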

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Nov 21, 2022 at 3:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Nov 21, 2022 at 4:20 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:

> > Assuming the smallest node is fixed size (i.e. fanout/capacity member not part of the common set, so only part of variable-sized nodes), 3 has a nice property: no wasted padding space:
> >
> > node4: 5 + 4+(7) + 4*8 = 48 bytes
> > node3: 5 + 3     + 3*8 = 32
>
> IIUC if we store the fanout member only in variable-sized nodes,
> rt_node has only count, shift, and chunk, so 4 bytes in total. If so,
> the size of node3 (i.e. a fixed-sized node) is (4 + 3 + (1) + 3*8)? The
> size doesn't change, but there is 1 byte of padding space.

I forgot to mention I'm assuming no pointer-tagging for this exercise. You've demonstrated it can be done in a small amount of code, and I hope we can demonstrate a speedup in search. Just in case there is some issue with portability, valgrind, or some other obstacle, I'm being pessimistic in my calculations.

> Also, even if we make node3 a variable-sized node, a size class of 1
> for node3 could be a good choice, since it also doesn't need padding
> space and could be a good alternative to path compression.
>
> node3         :  5 + 3 + 3*8 = 32 bytes
> size class 1 : 5 + 3 + 1*8 = 16 bytes

Precisely! I have that scenario in my notes as well -- it's quite compelling.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
On 2022-11-21 17:06:56 +0900, Masahiko Sawada wrote:
> Sure. I've attached the v10 patches. 0004 is the pure refactoring
> patch and 0005 introduces the pointer tagging.

This failed on cfbot, with so many crashes that the VM ran out of disk for
core dumps. This happened during testing with 32-bit, so there's probably
something broken around that.

https://cirrus-ci.com/task/4635135954386944

A failure is e.g. at:
https://api.cirrus-ci.com/v1/artifact/task/4635135954386944/testrun/build-32/testrun/adminpack/regress/log/initdb.log

performing post-bootstrap initialization ... ../src/backend/lib/radixtree.c:1696:21: runtime error: member access
within misaligned address 0x590faf74 for type 'struct radix_tree_control', which requires 8 byte alignment
0x590faf74: note: pointer points here
  90 11 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
              ^
==55813==Using libbacktrace symbolizer.
    #0 0x56dcc274 in rt_create ../src/backend/lib/radixtree.c:1696
    #1 0x56953d1b in tidstore_create ../src/backend/access/common/tidstore.c:57
    #2 0x56a1ca4f in dead_items_alloc ../src/backend/access/heap/vacuumlazy.c:3109
    #3 0x56a2219f in heap_vacuum_rel ../src/backend/access/heap/vacuumlazy.c:539
    #4 0x56cb77ed in table_relation_vacuum ../src/include/access/tableam.h:1681
    #5 0x56cb77ed in vacuum_rel ../src/backend/commands/vacuum.c:2062
    #6 0x56cb9a16 in vacuum ../src/backend/commands/vacuum.c:472
    #7 0x56cba904 in ExecVacuum ../src/backend/commands/vacuum.c:272
    #8 0x5711b6d0 in standard_ProcessUtility ../src/backend/tcop/utility.c:866
    #9 0x5711bdeb in ProcessUtility ../src/backend/tcop/utility.c:530
    #10 0x5711759f in PortalRunUtility ../src/backend/tcop/pquery.c:1158
    #11 0x57117cb8 in PortalRunMulti ../src/backend/tcop/pquery.c:1315
    #12 0x571183d2 in PortalRun ../src/backend/tcop/pquery.c:791
    #13 0x57111049 in exec_simple_query ../src/backend/tcop/postgres.c:1238
    #14 0x57113f9c in PostgresMain ../src/backend/tcop/postgres.c:4551
    #15 0x5711463d in PostgresSingleUserMain ../src/backend/tcop/postgres.c:4028
    #16 0x56df4672 in main ../src/backend/main/main.c:197
    #17 0xf6ad8e45 in __libc_start_main (/lib/i386-linux-gnu/libc.so.6+0x1ae45)
    #18 0x5691d0f0 in _start (/tmp/cirrus-ci-build/build-32/tmp_install/usr/local/pgsql/bin/postgres+0x3040f0)

Aborted (core dumped)
child process exited with exit code 134
initdb: data directory "/tmp/cirrus-ci-build/build-32/testrun/adminpack/regress/tmp_check/data" not removed at user's
request



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Nov 21, 2022 at 6:30 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Mon, Nov 21, 2022 at 3:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Nov 21, 2022 at 4:20 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
>
> > > Assuming the smallest node is fixed size (i.e. fanout/capacity member not part of the common set, so only part of
> > > variable-sized nodes), 3 has a nice property: no wasted padding space:
> > >
> > > node4: 5 + 4+(7) + 4*8 = 48 bytes
> > > node3: 5 + 3     + 3*8 = 32
> >
> > IIUC if we store the fanout member only in variable-sized nodes,
> > rt_node has only count, shift, and chunk, so 4 bytes in total. If so,
> > the size of node3 (i.e. a fixed-sized node) is (4 + 3 + (1) + 3*8)? The
> > size doesn't change, but there is 1 byte of padding space.
>
> I forgot to mention I'm assuming no pointer-tagging for this exercise. You've demonstrated it can be done in a small
> amount of code, and I hope we can demonstrate a speedup in search. Just in case there is some issue with portability,
> valgrind, or some other obstacle, I'm being pessimistic in my calculations.
>
> > Also, even if we make node3 a variable-sized node, a size class of 1
> > for node3 could be a good choice, since it also doesn't need padding
> > space and could be a good alternative to path compression.
> >
> > node3         :  5 + 3 + 3*8 = 32 bytes
> > size class 1 : 5 + 3 + 1*8 = 16 bytes
>
> Precisely! I have that scenario in my notes as well -- it's quite compelling.

So it seems that there are two candidates for the rt_node structure: (1)
all nodes except for node256 are variable-size nodes and use pointer
tagging, and (2) node32 and node128 are variable-sized nodes and do
not use pointer tagging (the fanout member is part of only these two
node kinds). rt_node can be 5 bytes in both cases. But before going to
this step, I started to verify the idea of variable-size nodes by using
a 6-byte rt_node. We can adjust the node kinds and size classes later.

In this verification, I made all nodes except node256 variable-sized,
and the sizes are:

radix tree node 1   : 6 + 4 + (6) + 1*8 = 24 bytes
radix tree node 4   : 6 + 4 + (6) + 4*8 = 48
radix tree node 15  : 6 + 32 + (2) + 15*8 = 160
radix tree node 32  : 6 + 32 + (2) + 32*8 = 296
radix tree node 61  : inner 6 + 256 + (2) + 61*8 = 752, leaf 6 + 256 + (2) + 16 + 61*8 = 768
radix tree node 128 : inner 6 + 256 + (2) + 128*8 = 1288, leaf 6 + 256 + (2) + 16 + 128*8 = 1304
radix tree node 256 : inner 6 + (2) + 256*8 = 2056, leaf 6 + (2) + 32 + 256*8 = 2088
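For reference, the 6-byte header behind these numbers would look
something like the following; the composition is inferred from the
arithmetic above (without pointer tagging the node kind has to live in
the node itself, hence one byte more than the 5-byte variant):

typedef struct rt_node
{
    uint16  count;      /* current number of children */
    uint8   shift;      /* bit position of this level's chunk in the key */
    uint8   chunk;      /* key byte this node occupies in its parent */
    uint8   fanout;     /* capacity of this node's size class */
    uint8   kind;       /* which of the node kinds this is */
} rt_node;              /* 6 bytes */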

I did some performance tests against two radix trees: a radix tree
supporting only fixed-size nodes (i.e. applying patches up to 0003), and
a radix tree supporting variable-size nodes (i.e. applying all
attached patches). Also, I changed the bench_search_random_nodes()
function so that we can specify the filter via a function argument.
Here are the results:

* Query
select * from bench_seq_search(0, 1*1000*1000, false)

* Fixed-size
NOTICE:  num_keys = 1000000, height = 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |          9871216 |                     |         67 |               |          212 |
(1 row)

* Variable-size
NOTICE:  num_keys = 1000000, height = 2, n1 = 0, n4 = 0, n15 = 0, n32 = 31251, n61 = 0, n128 = 1, n256 = 122
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |          9871280 |                     |         74 |               |          212 |
(1 row)

---
* Query
select * from bench_seq_search(0, 2*1000*1000, true)
NOTICE:  num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
* Fixed-size
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         19680848 |                     |         74 |               |          201 |
(1 row)

* Variable-size
NOTICE:  num_keys = 999654, height = 2, n1 = 0, n4 = 1, n15 = 26951, n32 = 35548, n61 = 1, n128 = 0, n256 = 245
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         16009040 |                     |         85 |               |          201 |
(1 row)

---
* Query
select * from bench_search_random_nodes(10 * 1000 * 1000, '0x7F07FF00FF')

* Fixed-size
NOTICE:  num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 182670, n256 = 1024
 mem_allocated | search_ms
---------------+-----------
     343001456 |      1151
(1 row)

* Variable-size
NOTICE:  num_keys = 9291812, height = 4, n1 = 262144, n4 = 0, n15 = 138, n32 = 79465, n61 = 182665, n128 = 5, n256 = 1024
 mem_allocated | search_ms
---------------+-----------
     230504328 |      1077
(1 row)

---
* Query
select * from bench_search_random_nodes(10 * 1000 * 1000, '0xFFFF0000003F')
* Fixed-size
NOTICE:  num_keys = 3807650, height = 5, n4 = 196608, n32 = 0, n128 = 65536, n256 = 257
 mem_allocated | search_ms
---------------+-----------
      99911920 |       632
(1 row)
* Variable-size
NOTICE:  num_keys = 3807650, height = 5, n1 = 196608, n4 = 0, n15 = 0, n32 = 0, n61 = 61747, n128 = 3789, n256 = 257
 mem_allocated | search_ms
---------------+-----------
      64045688 |       554
(1 row)

Overall, the idea of variable-sized nodes is good: smaller size
without losing search performance. I'm going to check the load
performance as well.

I've attached the patches I used for the verification. I don't include
patches for pointer tagging, DSA support, and vacuum integration since
I'm investigating the issue on cfbot that Andres reported. Also, I've
modified tests to improve the test coverage.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> So it seems that there are two candidates for the rt_node structure: (1)
> all nodes except for node256 are variable-size nodes and use pointer
> tagging, and (2) node32 and node128 are variable-sized nodes and do
> not use pointer tagging (the fanout member is part of only these two
> node kinds). rt_node can be 5 bytes in both cases. But before going to
> this step, I started to verify the idea of variable-size nodes by using
> a 6-byte rt_node. We can adjust the node kinds and size classes later.

First, I'm glad you picked up the size class concept and expanded it. (I have some comments about some internal APIs below.)

Let's leave the pointer tagging piece out until the main functionality is committed. We have all the prerequisites in place, except for a benchmark random enough to demonstrate benefit. I'm still not quite satisfied with how the shared memory coding looked, and that is the only sticky problem we still have, IMO. The rest is "just work".

That said, (1) and (2) above are still relevant -- variable sizing any given node is optional, and we can refine as needed.

> Overall, the idea of variable-sized nodes is good: smaller size
> without losing search performance.

Good.

> I'm going to check the load
> performance as well.

Part of that is this, which gets called a lot more now, when node1 expands:

+ if (inner)
+ newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_kind_info[kind].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_kind_info[kind].leaf_size);

Since memset for expanding size class is now handled separately, these can use the non-zeroing versions. When compiling MemoryContextAllocZero, the compiler has no idea how big the size is, so it assumes the worst and optimizes for large sizes. On x86-64, that means using "rep stos", which calls microcode found in the CPU's ROM. This is slow for small sizes. The "init" function should be always inline with const parameters where possible. That way, memset can compile to a single instruction for the smallest node kind. (More on alloc/init below)
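To illustrate (a sketch only, with made-up sizes standing in for the
rt_node_kind_info lookup): if the size table is const and the init
function is inlined at a call site with a constant kind, the memset
length folds to a compile-time constant and the compiler emits a few
plain stores instead of "rep stos":

static const Size rt_inner_size[] = {24, 296, 1288, 2056};  /* example values */

static inline void
rt_init_node(rt_node *node, uint8 kind)
{
    /* a constant 'kind' at the call site makes this length constant too */
    memset(node, 0, rt_inner_size[kind]);
    node->kind = kind;      /* field name illustrative */
}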

Note, there is a wrinkle: As currently written, inner_node128 searches the child pointers for NULL when inserting, so when expanding from partial to full size class, the new node must be zeroed. (Worth fixing in the short term; I thought of this while writing the proof-of-concept for size classes, but didn't mention it.) Medium term, rather than special-casing this, I actually want to rewrite the inner node128 to be more similar to the leaf, with an "isset" array, but accessed and tested differently. I guarantee it's *really* slow to load now (maybe somewhat true even for leaves), but I'll leave the details for later. Regarding the node128 leaf, note that it's slightly larger than a DSA size class, and we can trim it to fit:

node61:  6 + 256+(2) +16 +  61*8 =  768
node125: 6 + 256+(2) +16 + 125*8 = 1280

> I've attached the patches I used for the verification. I don't include
> patches for pointer tagging, DSA support, and vacuum integration since
> I'm investigating the issue on cfbot that Andres reported. Also, I've
> modified tests to improve the test coverage.

Sounds good. For v12, I think size classes have proven themselves, so v11's 0002/4/5 can be squashed. Plus, some additional comments:

+/* Return a new and initialized node */
+static rt_node *
+rt_alloc_init_node(radix_tree *tree, uint8 kind, uint8 shift, uint8 chunk, bool inner)
+{
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, kind, inner);
+ rt_init_node(newnode, kind, shift, chunk, inner);
+
+ return newnode;
+}

I don't see the point of a function that just calls two functions.

+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static rt_node *
+rt_grow_node(radix_tree *tree, rt_node *node, int new_kind)
+{
+ rt_node    *newnode;
+
+ newnode = rt_alloc_init_node(tree, new_kind, node->shift, node->chunk,
+ node->shift > 0);
+ newnode->count = node->count;
+
+ return newnode;
+}

This, in turn, just calls a function that does _almost_ everything, and additionally must set one member. This function should really be alloc-node + init-node + copy-common, where copy-common is like in the prototype:
+ newnode->node_shift = oldnode->node_shift;
+ newnode->node_chunk = oldnode->node_chunk;
+ newnode->count = oldnode->count;

And init-node should really be just memset + set kind + set initial fanout. It has no business touching "shift" and "chunk". The callers rt_new_root, rt_set_extend, and rt_extend set some values of their own anyway, so let them set those, too -- it might even improve readability.

-       if (n32->base.n.fanout == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+       if (NODE_NEEDS_TO_GROW_CLASS(n32, RT_CLASS_32_PARTIAL))

This macro doesn't really improve readability -- it obscures what is being tested, and the name implies the "else" branch means "node doesn't need to grow class", which is false. If we want to simplify expressions in this block, I think it'd be more effective to improve the lines that follow:

+ memcpy(new32, n32, rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size);
+ new32->base.n.fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;

Maybe we can have const variables old_size and new_fanout to break out the array lookup? While I'm thinking of it, these arrays should be const so the compiler can avoid runtime lookups. Speaking of...

+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+  uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+ /* For better code generation */
+ if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ pg_unreachable();
+
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}

When I looked at this earlier, I somehow didn't go far enough -- why are we passing the runtime count in the first place? This function can only be called if count == rt_size_class_info[RT_CLASS_4_FULL].fanout. The last parameter to memcpy should evaluate to a compile-time constant, right? Even when we add node shrinking in the future, the constant should be correct, IIUC?
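In other words, something like this sketch, where the count comes from
the (const) size class table rather than a parameter; the exact names
are assumptions:

static inline void
chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
                          uint8 *dst_chunks, rt_node **dst_children)
{
    /* only called when a node4 is full, so the count is a constant */
    const int count = 4;    /* rt_size_class_info[RT_CLASS_4_FULL].fanout */

    memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
    memcpy(dst_children, src_children, sizeof(rt_node *) * count);
}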

- .fanout = 256,
+ /* technically it's 256, but we can't store that in a uint8,
+  and this is the max size class so it will never grow */
+ .fanout = 0,

- Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+ Assert(((rt_node *) n256)->fanout == 0);
+ Assert(chunk_exists || ((rt_node *) n256)->count < 256);

These hacks were my work, but I think we can improve that by having two versions of NODE_HAS_FREE_SLOT -- one for fixed- and one for variable-sized nodes. For that to work, in "init-node" we'd need a branch to set fanout to zero for node256. That should be fine -- it already has to branch for memset'ing node128's indexes to 0xFF.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:


On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> [v11]

There is one more thing that just now occurred to me: In expanding the use of size classes, that makes rebasing and reworking the shared memory piece more work than it should be. That's important because there are still some open questions about the design around shared memory. To keep unnecessary churn to a minimum, perhaps we should limit size class expansion to just one (or 5 total size classes) for the near future?

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
While creating a benchmark for inserting into node128-inner, I found a bug. If a caller deletes from a node128, the slot index is set to invalid, but the child pointer is still valid. Do that a few times, and every child pointer is valid, even if no slot index points to it. When the next inserter comes along, something surprising happens. This function:

/* Return an unused slot in node-128 */
static int
node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
{
  int slotpos = 0;

  Assert(!NODE_IS_LEAF(node));
  while (node_inner_128_is_slot_used(node, slotpos))
    slotpos++;

  return slotpos;
}

...passes an integer to this function, whose parameter is a uint8:

/* Is the slot in the node used? */
static inline bool
node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
{
  Assert(!NODE_IS_LEAF(node));
  return (node->children[slot] != NULL);
}

...so instead of growing the node unnecessarily or segfaulting, it enters an infinite loop doing this:

add     eax, 1
movzx   ecx, al
cmp     QWORD PTR [rbx+264+rcx*8], 0
jne     .L147

The fix is easy enough -- set the child pointer to null upon deletion, but I'm somewhat astonished that the regression tests didn't hit this. I do still intend to replace this code with something faster, but before I do so the tests should probably exercise the deletion paths more. Since VACUUM

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
> The fix is easy enough -- set the child pointer to null upon deletion, but I'm somewhat astonished that the regression tests didn't hit this. I do still intend to replace this code with something faster, but before I do so the tests should probably exercise the deletion paths more. Since VACUUM

Oops. I meant to finish with "Since VACUUM doesn't perform deletion we didn't have an opportunity to detect this during that operation."

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
There are a few things up in the air, so I'm coming back to this list to summarize and add a recent update:

On Mon, Nov 14, 2022 at 7:59 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> - See how much performance we actually gain from tagging the node kind.

Needs a benchmark that has enough branch mispredicts and L2/3 misses to show a benefit. Otherwise either neutral or worse in its current form, depending on compiler(?). Put off for later.

> - Try additional size classes while keeping the node kinds to only four.

This is relatively simple and effective. If only one additional size class (total 5) is coded as a placeholder, I imagine it will be easier to rebase shared memory logic than using this technique everywhere possible.

> - Optimize node128 insert.

I've attached a rough start at this. The basic idea is borrowed from our bitmapset nodes, so we can iterate over and operate on word-sized (32- or 64-bit) types at a time, rather than bytes. To make this easier, I've moved some of the lower-level macros and types from bitmapset.h/.c to pg_bitutils.h. That's probably going to need a separate email thread to resolve the coding style clash this causes, so that can be put off for later. This is not meant to be included in the next patchset.  For demonstration purposes, I get these results with a function that repeatedly deletes the last value from a mostly-full node128 leaf and re-inserts it:
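For illustration, the word-at-a-time slot search looks roughly like this
(a sketch in the style of bitmapset.c, using the existing
pg_rightmost_one_pos64() from pg_bitutils.h; other names are
illustrative):

#define BITS_PER_BITMAPWORD 64

/* return the index of the first zero bit, i.e. the first unused slot */
static int
find_unused_slot(const uint64 *isset, int nwords)
{
    for (int i = 0; i < nwords; i++)
    {
        uint64  inverse = ~isset[i];

        if (inverse != 0)
            return i * BITS_PER_BITMAPWORD + pg_rightmost_one_pos64(inverse);
    }
    return -1;              /* node is full and must grow */
}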

select * from bench_node128_load(120);

v11

NOTICE:  num_keys = 14400, height = 1, n1 = 0, n4 = 0, n15 = 0, n32 = 0, n61 = 0, n128 = 121, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
    120 | 14400 |           208304 |               56

v11 + 0006 addendum

NOTICE:  num_keys = 14400, height = 1, n1 = 0, n4 = 0, n15 = 0, n32 = 0, n61 = 0, n128 = 121, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
    120 | 14400 |           208816 |               34

I didn't test inner nodes, but I imagine the difference is bigger. This bitmap style should also be used for the node256-leaf isset array simply to be consistent and avoid needing single-use macros, but that has not been done yet. It won't make a difference for performance because there is no iteration there.

> - Try templating out the differences between local and shared memory.

I hope to start this sometime after the crashes on 32-bit are resolved.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Nov 25, 2022 at 5:00 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > So it seems that there are two candidates for the rt_node structure: (1)
> > all nodes except for node256 are variable-size nodes and use pointer
> > tagging, and (2) node32 and node128 are variable-sized nodes and do
> > not use pointer tagging (the fanout member is part of only these two
> > node kinds). rt_node can be 5 bytes in both cases. But before going to
> > this step, I started to verify the idea of variable-size nodes by using
> > a 6-byte rt_node. We can adjust the node kinds and size classes later.
>
> First, I'm glad you picked up the size class concept and expanded it. (I have some comments about some internal APIs
> below.)
>
> Let's leave the pointer tagging piece out until the main functionality is committed. We have all the prerequisites in
> place, except for a benchmark random enough to demonstrate benefit. I'm still not quite satisfied with how the shared
> memory coding looked, and that is the only sticky problem we still have, IMO. The rest is "just work".
>
> That said, (1) and (2) above are still relevant -- variable sizing any given node is optional, and we can refine as
> needed.
>
> > Overall, the idea of variable-sized nodes is good: smaller size
> > without losing search performance.
>
> Good.
>
> > I'm going to check the load
> > performance as well.
>
> Part of that is this, which gets called a lot more now, when node1 expands:
>
> + if (inner)
> +     newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
> +                                                  rt_node_kind_info[kind].inner_size);
> + else
> +     newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
> +                                                  rt_node_kind_info[kind].leaf_size);
>
> Since memset for expanding size class is now handled separately, these can use the non-zeroing versions. When
> compiling MemoryContextAllocZero, the compiler has no idea how big the size is, so it assumes the worst and optimizes
> for large sizes. On x86-64, that means using "rep stos", which calls microcode found in the CPU's ROM. This is slow for
> small sizes. The "init" function should be always inline with const parameters where possible. That way, memset can
> compile to a single instruction for the smallest node kind. (More on alloc/init below)

Right. I forgot to update it.

>
> Note, there is a wrinkle: As currently written, inner_node128 searches the child pointers for NULL when inserting, so
> when expanding from partial to full size class, the new node must be zeroed. (Worth fixing in the short term; I thought
> of this while writing the proof-of-concept for size classes, but didn't mention it.) Medium term, rather than
> special-casing this, I actually want to rewrite the inner node128 to be more similar to the leaf, with an "isset"
> array, but accessed and tested differently. I guarantee it's *really* slow to load now (maybe somewhat true even for
> leaves), but I'll leave the details for later.

Agreed. I'll start with zeroing out the node when expanding from
partial to full size.

> Regarding node128 leaf, note that it's slightly larger than a DSA size class, and we can trim it to fit:
>
> node61:  6 + 256+(2) +16 +  61*8 =  768
> node125: 6 + 256+(2) +16 + 125*8 = 1280

Agreed, changed.

>
> > I've attached the patches I used for the verification. I don't include
> > patches for pointer tagging, DSA support, and vacuum integration since
> > I'm investigating the issue on cfbot that Andres reported. Also, I've
> > modified tests to improve the test coverage.
>
> Sounds good. For v12, I think size classes have proven themselves, so v11's 0002/4/5 can be squashed. Plus, some
> additional comments:
>
> +/* Return a new and initialized node */
> +static rt_node *
> +rt_alloc_init_node(radix_tree *tree, uint8 kind, uint8 shift, uint8 chunk, bool inner)
> +{
> + rt_node *newnode;
> +
> + newnode = rt_alloc_node(tree, kind, inner);
> + rt_init_node(newnode, kind, shift, chunk, inner);
> +
> + return newnode;
> +}
>
> I don't see the point of a function that just calls two functions.

Removed.

>
> +/*
> + * Create a new node with 'new_kind' and the same shift, chunk, and
> + * count of 'node'.
> + */
> +static rt_node *
> +rt_grow_node(radix_tree *tree, rt_node *node, int new_kind)
> +{
> + rt_node    *newnode;
> +
> + newnode = rt_alloc_init_node(tree, new_kind, node->shift, node->chunk,
> + node->shift > 0);
> + newnode->count = node->count;
> +
> + return newnode;
> +}
>
> This, in turn, just calls a function that does _almost_ everything, and additionally must set one member. This
> function should really be alloc-node + init-node + copy-common, where copy-common is like in the prototype:
> + newnode->node_shift = oldnode->node_shift;
> + newnode->node_chunk = oldnode->node_chunk;
> + newnode->count = oldnode->count;
>
> And init-node should really be just memset + set kind + set initial fanout. It has no business touching "shift" and
> "chunk". The callers rt_new_root, rt_set_extend, and rt_extend set some values of their own anyway, so let them set
> those, too -- it might even improve readability.
>
> -       if (n32->base.n.fanout == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
> +       if (NODE_NEEDS_TO_GROW_CLASS(n32, RT_CLASS_32_PARTIAL))

Agreed.

>
> This macro doesn't really improve readability -- it obscures what is being tested, and the name implies the "else"
> branch means "node doesn't need to grow class", which is false. If we want to simplify expressions in this block, I
> think it'd be more effective to improve the lines that follow:
>
> + memcpy(new32, n32, rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size);
> + new32->base.n.fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
>
> Maybe we can have const variables old_size and new_fanout to break out the array lookup? While I'm thinking of it,
> these arrays should be const so the compiler can avoid runtime lookups. Speaking of...
>
> +/* Copy both chunks and children/values arrays */
> +static inline void
> +chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
> +  uint8 *dst_chunks, rt_node **dst_children, int count)
> +{
> + /* For better code generation */
> + if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
> + pg_unreachable();
> +
> + memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
> + memcpy(dst_children, src_children, sizeof(rt_node *) * count);
> +}
>
> When I looked at this earlier, I somehow didn't go far enough -- why are we passing the runtime count in the first
> place? This function can only be called if count == rt_size_class_info[RT_CLASS_4_FULL].fanout. The last parameter to
> memcpy should evaluate to a compile-time constant, right? Even when we add node shrinking in the future, the constant
> should be correct, IIUC?

Right. We don't need to pass count to these functions.

>
> - .fanout = 256,
> + /* technically it's 256, but we can't store that in a uint8,
> +  and this is the max size class so it will never grow */
> + .fanout = 0,
>
> - Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
> + Assert(((rt_node *) n256)->fanout == 0);
> + Assert(chunk_exists || ((rt_node *) n256)->count < 256);
>
> These hacks were my work, but I think we can improve that by having two versions of NODE_HAS_FREE_SLOT -- one for
> fixed- and one for variable-sized nodes. For that to work, in "init-node" we'd need a branch to set fanout to zero for
> node256. That should be fine -- it already has to branch for memset'ing node128's indexes to 0xFF.

Since the node has a fanout member regardless of whether it is
fixed-sized or variable-sized, only node256 is the special case where
the fanout stored in the node doesn't match its actual fanout. I think
if we want to have two versions of NODE_HAS_FREE_SLOT, we can have one
for node256 and one for the other classes. Thoughts? In your idea, for
NODE_HAS_FREE_SLOT for fixed-sized nodes, you meant like the
following?

#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
  (node->base.n.count < rt_size_class_info[class].fanout)

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Nov 25, 2022 at 6:47 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
>
>
> On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > [v11]
>
> There is one more thing that just now occurred to me: In expanding the use of size classes, that makes rebasing and
> reworking the shared memory piece more work than it should be. That's important because there are still some open
> questions about the design around shared memory. To keep unnecessary churn to a minimum, perhaps we should limit size
> class expansion to just one (or 5 total size classes) for the near future?

Makes sense. We can add size classes once we have a good design and
implementation around shared memory.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Nov 29, 2022 at 1:36 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> While creating a benchmark for inserting into node128-inner, I found a bug. If a caller deletes from a node128, the
> slot index is set to invalid, but the child pointer is still valid. Do that a few times, and every child pointer is
> valid, even if no slot index points to it. When the next inserter comes along, something surprising happens. This
> function:
>
> /* Return an unused slot in node-128 */
> static int
> node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
> {
>   int slotpos = 0;
>
>   Assert(!NODE_IS_LEAF(node));
>   while (node_inner_128_is_slot_used(node, slotpos))
>     slotpos++;
>
>   return slotpos;
> }
>
> ...passes an integer to this function, whose parameter is a uint8:
>
> /* Is the slot in the node used? */
> static inline bool
> node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
> {
>   Assert(!NODE_IS_LEAF(node));
>   return (node->children[slot] != NULL);
> }
>
> ...so instead of growing the node unnecessarily or segfaulting, it enters an infinite loop doing this:
>
> add     eax, 1
> movzx   ecx, al
> cmp     QWORD PTR [rbx+264+rcx*8], 0
> jne     .L147
>
> The fix is easy enough -- set the child pointer to null upon deletion,

Good catch!

> but I'm somewhat astonished that the regression tests didn't hit this. I do still intend to replace this code with
> something faster, but before I do so the tests should probably exercise the deletion paths more. Since VACUUM

Indeed, there are some tests for deletion, but all of them delete all
keys in the node, so we end up deleting the node itself. I've added
tests that repeat deletion and insertion, as well as additional
assertions.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Nov 23, 2022 at 2:10 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2022-11-21 17:06:56 +0900, Masahiko Sawada wrote:
> > Sure. I've attached the v10 patches. 0004 is the pure refactoring
> > patch and 0005 introduces the pointer tagging.
>
> This failed on cfbot, with so many crashes that the VM ran out of disk for
> core dumps. This happened during testing with 32-bit, so there's probably
> something broken around that.
>
> https://cirrus-ci.com/task/4635135954386944
>
> A failure is e.g. at:
> https://api.cirrus-ci.com/v1/artifact/task/4635135954386944/testrun/build-32/testrun/adminpack/regress/log/initdb.log
>
> performing post-bootstrap initialization ... ../src/backend/lib/radixtree.c:1696:21: runtime error: member access
> within misaligned address 0x590faf74 for type 'struct radix_tree_control', which requires 8 byte alignment
> 0x590faf74: note: pointer points here
>   90 11 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
>               ^

The radix_tree_control struct has two pg_atomic_uint64 variables, and
the alignment assertion in pg_atomic_init_u64() failed:

static inline void
pg_atomic_init_u64(volatile pg_atomic_uint64 *ptr, uint64 val)
{
    /*
     * Can't necessarily enforce alignment - and don't need it - when using
     * the spinlock based fallback implementation. Therefore only assert when
     * not using it.
     */
#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
    AssertPointerAlignment(ptr, 8);
#endif
    pg_atomic_init_u64_impl(ptr, val);
}

I've investigated this issue and have a question about using atomic
variables on palloc'ed memory. In non-parallel vacuum cases,
radix_tree_control is allocated via aset.c. IIUC, on 32-bit machines
the memory allocated by aset.c is 4-byte aligned, so these atomic
variables are not always 8-byte aligned. Is there any way to enforce
8-byte-aligned memory allocations on 32-bit machines?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Wed, Nov 30, 2022 at 11:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I've investigated this issue and have a question about using atomic
> variables on palloc'ed memory. In non-parallel vacuum cases,
> radix_tree_control is allocated via aset.c. IIUC, on 32-bit machines
> the memory allocated by aset.c is 4-byte aligned, so these atomic
> variables are not always 8-byte aligned. Is there any way to enforce
> 8-byte-aligned memory allocations on 32-bit machines?

The bigger question in my mind is: Why is there an atomic variable in backend-local memory?

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Wed, Nov 30, 2022 at 2:28 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Nov 25, 2022 at 5:00 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:

> > These hacks were my work, but I think we can improve that by having two versions of NODE_HAS_FREE_SLOT -- one for fixed- and one for variable-sized nodes. For that to work, in "init-node" we'd need a branch to set fanout to zero for node256. That should be fine -- it already has to branch for memset'ing node128's indexes to 0xFF.
>
> Since the node has a fanout member regardless of whether it is
> fixed-sized or variable-sized

As currently coded, yes. But that's not strictly necessary, I think.

>, only node256 is the special case where the fanout stored in
> the node doesn't match its actual fanout. I think if we
> want to have two versions of NODE_HAS_FREE_SLOT, we can have one for
> node256 and one for the other classes. Thoughts? In your idea, for
> NODE_HAS_FREE_SLOT for fixed-sized nodes, you meant like the
> following?
>
> #define FIXED_NODE_HAS_FREE_SLOT(node, class) \
>   (node->base.n.count < rt_size_class_info[class].fanout)

Right, and the other one could be VAR_NODE_...
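i.e., something along these lines (a sketch; the full spelling of the
second macro's name is an assumption, since it's abbreviated above):

#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
    ((node)->base.n.count < rt_size_class_info[class].fanout)

#define VAR_NODE_HAS_FREE_SLOT(node) \
    ((node)->base.n.count < (node)->base.n.fanout)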

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Dec 1, 2022 at 4:00 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
>
> On Wed, Nov 30, 2022 at 11:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I've investigated this issue and have a question about using atomic
> > variables on palloc'ed memory. In non-parallel vacuum cases,
> > radix_tree_control is allocated via aset.c. IIUC, on 32-bit machines
> > the memory allocated by aset.c is 4-byte aligned, so these atomic
> > variables are not always 8-byte aligned. Is there any way to enforce
> > 8-byte-aligned memory allocations on 32-bit machines?
>
> The bigger question in my mind is: Why is there an atomic variable in backend-local memory?

Because I use the same radix_tree and radix_tree_control structs for
non-parallel and parallel vacuum. Therefore, radix_tree_control is
allocated in DSM for parallel-vacuum cases or in backend-local memory
for non-parallel vacuum cases.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:


On Thu, Dec 1, 2022 at 3:03 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Dec 1, 2022 at 4:00 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> >
> > The bigger question in my mind is: Why is there an atomic variable in backend-local memory?
>
> Because I use the same radix_tree and radix_tree_control structs for
> non-parallel and parallel vacuum. Therefore, radix_tree_control is
> allocated in DSM for parallel-vacuum cases or in backend-local memory
> for non-parallel vacuum cases.

Ok, that could be yet another reason to compile local- and shared-memory functionality separately, but now I'm wondering why there are atomic variables at all, since there isn't yet any locking support.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Nov 30, 2022 at 2:51 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> There are a few things up in the air, so I'm coming back to this list to summarize and add a recent update:
>
> On Mon, Nov 14, 2022 at 7:59 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> >
> > - See how much performance we actually gain from tagging the node kind.
>
> Needs a benchmark that has enough branch mispredicts and L2/3 misses to show a benefit. Otherwise either neutral or
> worse in its current form, depending on compiler(?). Put off for later.
>
> > - Try additional size classes while keeping the node kinds to only four.
>
> This is relatively simple and effective. If only one additional size class (total 5) is coded as a placeholder, I
> imagine it will be easier to rebase shared memory logic than using this technique everywhere possible.
>
> > - Optimize node128 insert.
>
> I've attached a rough start at this. The basic idea is borrowed from our bitmapset nodes, so we can iterate over and
> operate on word-sized (32- or 64-bit) types at a time, rather than bytes.

Thanks! I think this is a good idea.

> To make this easier, I've moved some of the lower-level macros and types from bitmapset.h/.c to pg_bitutils.h. That's
> probably going to need a separate email thread to resolve the coding style clash this causes, so that can be put off
> for later.

Agreed. Since tidbitmap.c also has WORDNUM(x) and BITNUM(x), it could
use the versions moved out of bitmapset.h as well.

> This is not meant to be included in the next patchset.  For demonstration purposes, I get these results with a
> function that repeatedly deletes the last value from a mostly-full node128 leaf and re-inserts it:
>
> select * from bench_node128_load(120);
>
> v11
>
> NOTICE:  num_keys = 14400, height = 1, n1 = 0, n4 = 0, n15 = 0, n32 = 0, n61 = 0, n128 = 121, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
> --------+-------+------------------+------------------
>     120 | 14400 |           208304 |               56
>
> v11 + 0006 addendum
>
> NOTICE:  num_keys = 14400, height = 1, n1 = 0, n4 = 0, n15 = 0, n32 = 0, n61 = 0, n128 = 121, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
> --------+-------+------------------+------------------
>     120 | 14400 |           208816 |               34
>
> I didn't test inner nodes, but I imagine the difference is bigger. This bitmap style should also be used for the
> node256-leaf isset array simply to be consistent and avoid needing single-use macros, but that has not been done yet.
> It won't make a difference for performance because there is no iteration there.


After updating the patch set according to recent comments, I've also
done the same test in my environment and got similarly good results.

w/o 0006 addendum patch

NOTICE:  num_keys = 14400, height = 1, n4 = 0, n15 = 0, n32 = 0, n125 = 121, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
    120 | 14400 |           204424 |               29
(1 row)

w/ 0006 addendum patch

NOTICE:  num_keys = 14400, height = 1, n4 = 0, n15 = 0, n32 = 0, n125 = 121, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
    120 | 14400 |           204936 |               18
(1 row)

> > - Try templating out the differences between local and shared memory.
>
> I hope to start this sometime after the crashes on 32-bit are resolved.

I've attached updated patches that incorporate all the comments I got
so far, as well as fixes for compiler warnings. I included your
bitmapword patch as 0004 for benchmarking. Also, I reverted the change
around pg_atomic_uint64: since we don't support any locking, as you
mentioned, and if we have a single lwlock to protect the radix tree,
we don't need pg_atomic_uint64 just for max_val and num_keys.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Fri, Dec 2, 2022 at 11:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > On Mon, Nov 14, 2022 at 7:59 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > >
> > > - Optimize node128 insert.
> >
> > I've attached a rough start at this. The basic idea is borrowed from our bitmapset nodes, so we can iterate over and operate on word-sized (32- or 64-bit) types at a time, rather than bytes.
>
> Thanks! I think this is a good idea.
>
> > To make this easier, I've moved some of the lower-level macros and types from bitmapset.h/.c to pg_bitutils.h. That's probably going to need a separate email thread to resolve the coding style clash this causes, so that can be put off for later.

I started a separate thread [1], and 0002 comes from feedback on that. There is a FIXME about using WORDNUM and BITNUM, at least with that spelling. I'm putting that off to ease rebasing the rest as v13 -- getting some CI testing with 0002 seems like a good idea. There are no other changes yet. Next, I will take a look at templating local vs. shared memory. I might try basing that on the styles of both v12 and v8, and see which one works best with templating.

[1]  https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Dec 6, 2022 at 7:32 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Fri, Dec 2, 2022 at 11:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > > On Mon, Nov 14, 2022 at 7:59 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > > >
> > > > - Optimize node128 insert.
> > >
> > > I've attached a rough start at this. The basic idea is borrowed from our bitmapset nodes, so we can iterate over
> > > and operate on word-sized (32- or 64-bit) types at a time, rather than bytes.
> >
> > Thanks! I think this is a good idea.
> >
> > > To make this easier, I've moved some of the lower-level macros and types from bitmapset.h/.c to pg_bitutils.h.
> > > That's probably going to need a separate email thread to resolve the coding style clash this causes, so that can
> > > be put off for later.
>
> I started a separate thread [1], and 0002 comes from feedback on that. There is a FIXME about using WORDNUM and
> BITNUM, at least with that spelling. I'm putting that off to ease rebasing the rest as v13 -- getting some CI testing
> with 0002 seems like a good idea. There are no other changes yet. Next, I will take a look at templating local vs.
> shared memory. I might try basing that on the styles of both v12 and v8, and see which one works best with templating.

Thank you so much!

In the meanwhile, I've been working on vacuum integration. There are
two things I'd like to discuss some time:

The first is the minimum value of maintenance_work_mem, 1 MB. Since the
initial DSA segment size is 1MB (DSA_INITIAL_SEGMENT_SIZE), parallel
vacuum with the radix tree cannot work with the minimum
maintenance_work_mem. We will need to increase it to 4MB or so. Maybe
we can start a new thread for that.

The second is how to limit the size of the radix tree to
maintenance_work_mem. I think that it's tricky to estimate the maximum
number of keys in the radix tree that fit in maintenance_work_mem. The
radix tree size varies depending on the key distribution. The next
idea I considered was how to limit the size when inserting a key. In
order to strictly limit the radix tree size, we probably have to
change rt_set() so that it breaks off and returns false if the radix
tree size is about to exceed the memory limit when we allocate a new
node or grow a node kind/class. Ideally, I'd like to control the size
outside of the radix tree (e.g., in TIDStore), since such a check
could introduce overhead to rt_set(), but we probably need to add that
logic to the radix tree.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Fri, Dec 9, 2022 at 8:20 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> In the meanwhile, I've been working on vacuum integration. There are
> two things I'd like to discuss some time:
>
> The first is the minimum of maintenance_work_mem, 1 MB. Since the
> initial DSA segment size is 1MB (DSA_INITIAL_SEGMENT_SIZE), parallel
> vacuum with radix tree cannot work with the minimum
> maintenance_work_mem. It will need to increase it to 4MB or so. Maybe
> we can start a new thread for that.

I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in more detail what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well* IIUC, just do some useful work and not fail).

> The second is how to limit the size of the radix tree to
> maintenance_work_mem. I think that it's tricky to estimate the maximum
> number of keys in the radix tree that fit in maintenance_work_mem. The
> radix tree size varies depending on the key distribution. The next
> idea I considered was how to limit the size when inserting a key. In
> order to strictly limit the radix tree size, probably we have to
> change the rt_set so that it breaks off and returns false if the radix
> tree size is about to exceed the memory limit when we allocate a new
> node or grow a node kind/class.

That seems complex, fragile, and wrong scope.

> Ideally, I'd like to control the size
> outside of radix tree (e.g. TIDStore) since it could introduce
> overhead to rt_set() but probably we need to add such logic in radix
> tree.

Does the TIDStore have the ability to ask the DSA (or slab context) to see how big it is? If a new segment has been allocated that brings us to the limit, we can stop when we discover that fact. In the local case with slab blocks, it won't be on nice neat boundaries, but we could check if we're within the largest block size (~64kB) of overflow.

Remember when we discussed how we might approach parallel pruning? I envisioned a local array of a few dozen kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always doing the same thing is simpler anyway). When the array fills up enough so that the next heap page *could* overflow it: Stop, insert into the store, and check the store's memory usage before continuing.
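A sketch of that control flow, run once per heap page during the scan;
every name here is a placeholder except dsa_get_total_size(), which is
the existing DSA API:

/* flush before the next page could overflow the local array */
if (buf->num_tids + MaxHeapTuplesPerPage > buf->max_tids)
{
    tidstore_add_tids(store, buf->tids, buf->num_tids);
    buf->num_tids = 0;

    /* suspend the scan for index/heap vacuuming once over budget */
    if (dsa_get_total_size(store->area) > limit_bytes)
        stop_and_do_index_vacuum();
}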

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Dec 9, 2022 at 5:53 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
>
> On Fri, Dec 9, 2022 at 8:20 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > In the meanwhile, I've been working on vacuum integration. There are
> > two things I'd like to discuss some time:
> >
> > The first is the minimum of maintenance_work_mem, 1 MB. Since the
> > initial DSA segment size is 1MB (DSA_INITIAL_SEGMENT_SIZE), parallel
> > vacuum with radix tree cannot work with the minimum
> > maintenance_work_mem. It will need to increase it to 4MB or so. Maybe
> > we can start a new thread for that.
>
> I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in more detail
> what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well* IIUC, just do some
> useful work and not fail).

The minimum requirement is 2MB. In the PoC patch, TIDStore checks how
big the radix tree is using dsa_get_total_size(). If the size returned
by dsa_get_total_size() (+ some memory used by TIDStore meta
information) exceeds maintenance_work_mem, lazy vacuum starts to do
index vacuum and heap vacuum. However, when allocating DSA memory for
radix_tree_control at creation, we allocate 1MB
(DSA_INITIAL_SEGMENT_SIZE) of DSM memory and take the memory required
for radix_tree_control from it. dsa_get_total_size() returns 1MB even
if no TIDs have been collected.

>
> > The second is how to limit the size of the radix tree to
> > maintenance_work_mem. I think that it's tricky to estimate the maximum
> > number of keys in the radix tree that fit in maintenance_work_mem. The
> > radix tree size varies depending on the key distribution. The next
> > idea I considered was how to limit the size when inserting a key. In
> > order to strictly limit the radix tree size, probably we have to
> > change the rt_set so that it breaks off and returns false if the radix
> > tree size is about to exceed the memory limit when we allocate a new
> > node or grow a node kind/class.
>
> That seems complex, fragile, and wrong scope.
>
> > Ideally, I'd like to control the size
> > outside of radix tree (e.g. TIDStore) since it could introduce
> > overhead to rt_set() but probably we need to add such logic in radix
> > tree.
>
> Does the TIDStore have the ability to ask the DSA (or slab context) to see how big it is?

Yes, TIDStore can check it using dsa_get_total_size().

> If a new segment has been allocated that brings us to the limit, we can stop when we discover that fact. In the local
> case with slab blocks, it won't be on nice neat boundaries, but we could check if we're within the largest block size
> (~64kB) of overflow.
>
> Remember when we discussed how we might approach parallel pruning? I envisioned a local array of a few dozen
> kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always doing the
> same thing is simpler anyway). When the array fills up enough so that the next heap page *could* overflow it: Stop,
> insert into the store, and check the store's memory usage before continuing.

Right, I think it's no problem in slab cases. In DSA cases, the new
segment size follows a geometric series that approximately doubles the
total storage each time we create a new segment. This behavior comes
from the fact that the underlying DSM system isn't designed for large
numbers of segments.


Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Fri, Dec 9, 2022 at 8:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Dec 9, 2022 at 5:53 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> >

> > I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in more detail what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well* IIUC, just do some useful work and not fail).
>
> The minimum requirement is 2MB. In the PoC patch, TIDStore checks how
> big the radix tree is using dsa_get_total_size(). If the size returned
> by dsa_get_total_size() (+ some memory used by TIDStore meta
> information) exceeds maintenance_work_mem, lazy vacuum starts to do
> index vacuum and heap vacuum. However, when allocating DSA memory for
> radix_tree_control at creation, we allocate 1MB
> (DSA_INITIAL_SEGMENT_SIZE) of DSM memory and take the memory required
> for radix_tree_control from it. dsa_get_total_size() returns 1MB even
> if no TIDs have been collected.

2MB makes sense.

If the metadata is small, it seems counterproductive to count it towards the total. We want the decision to be driven by blocks allocated. I have an idea on that below.

> > Remember when we discussed how we might approach parallel pruning? I envisioned a local array of a few dozen kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always doing the same thing is simpler anyway). When the array fills up enough so that the next heap page *could* overflow it: Stop, insert into the store, and check the store's memory usage before continuing.
>
> Right, I think it's no problem in slab cases. In DSA cases, the new
> segment size follows a geometric series that approximately doubles the
> total storage each time we create a new segment. This behavior comes
> from the fact that the underlying DSM system isn't designed for large
> numbers of segments.

And taking a look, the size of a new segment can get quite large. It seems we could test if the total DSA area allocated is greater than half of maintenance_work_mem. If that parameter is a power of two (common) and >=8MB, then the area will contain just under a power of two the last time it passes the test. The next segment will bring it to about 3/4 full, like this:

maintenance work mem = 256MB, so stop if we go over 128MB:

2*(1+2+4+8+16+32) = 126MB -> keep going
126MB + 64 = 190MB        -> stop

That would be a simple way to be conservative with the memory limit. The unfortunate aspect is that the last segment would be mostly wasted, but it's paradise compared to the pessimistically-sized single array we have now (even with Peter G.'s VM snapshot informing the allocation size, I imagine).
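
In code, the conservative rule would be just (a sketch; the stop
function is hand-waving):

    if (dsa_get_total_size(area) > (Size) maintenance_work_mem * 1024 / 2)
        stop_collecting_dead_tuples();      /* hypothetical */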

And as for minimum possible maintenance work mem, I think this would work with 2MB, if the community is okay with technically going over the limit by a few bytes of overhead if a buildfarm animal is set to that value. I imagine it would never go over the limit for realistic (and even most unrealistic) values. Even with a VM snapshot page in memory and small local arrays of TIDs, I think with this scheme we'll be well under the limit.

After this feature is complete, I think we should consider a follow-on patch to get rid of vacuum_work_mem, since it would no longer be needed.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Dec 12, 2022 at 7:14 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
>
> On Fri, Dec 9, 2022 at 8:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Dec 9, 2022 at 5:53 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > >
>
> > > I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in more detail what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well* IIUC, just do some useful work and not fail).
> >
> > The minimum requirement is 2MB. In the PoC patch, TIDStore checks how big
> > the radix tree is using dsa_get_total_size(). If the size returned by
> > dsa_get_total_size() (+ some memory used by TIDStore meta information)
> > exceeds maintenance_work_mem, lazy vacuum starts to do index vacuum
> > and heap vacuum. However, when allocating DSA memory for
> > radix_tree_control at creation, we allocate 1MB
> > (DSA_INITIAL_SEGMENT_SIZE) DSM memory and use memory required for
> > radix_tree_control from it. dsa_get_total_size() returns 1MB even if
> > there is no TID collected.
>
> 2MB makes sense.
>
> If the metadata is small, it seems counterproductive to count it towards the total. We want the decision to be driven by blocks allocated. I have an idea on that below.
>
> > > Remember when we discussed how we might approach parallel pruning? I envisioned a local array of a few dozen kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always doing the same thing is simpler anyway). When the array fills up enough so that the next heap page *could* overflow it: Stop, insert into the store, and check the store's memory usage before continuing.
> >
> > Right, I think it's no problem in slab cases. In DSA cases, the new
> > segment size follows a geometric series that approximately doubles the
> > total storage each time we create a new segment. This behavior comes
> > from the fact that the underlying DSM system isn't designed for large
> > numbers of segments.
>
> And taking a look, the size of a new segment can get quite large. It seems we could test if the total DSA area allocated is greater than half of maintenance_work_mem. If that parameter is a power of two (common) and >=8MB, then the area will contain just under a power of two the last time it passes the test. The next segment will bring it to about 3/4 full, like this:
>
> maintenance work mem = 256MB, so stop if we go over 128MB:
>
> 2*(1+2+4+8+16+32) = 126MB -> keep going
> 126MB + 64 = 190MB        -> stop
>
> That would be a simple way to be conservative with the memory limit. The unfortunate aspect is that the last segment would be mostly wasted, but it's paradise compared to the pessimistically-sized single array we have now (even with Peter G.'s VM snapshot informing the allocation size, I imagine).

Right. In this case, even if we allocate 64MB, we will use only 2088
bytes at maximum. So I think the memory space used for vacuum is
practically limited to half.

>
> And as for minimum possible maintenance work mem, I think this would work with 2MB, if the community is okay with technically going over the limit by a few bytes of overhead if a buildfarm animal is set to that value. I imagine it would never go over the limit for realistic (and even most unrealistic) values. Even with a VM snapshot page in memory and small local arrays of TIDs, I think with this scheme we'll be well under the limit.

Looking at other code using DSA such as tidbitmap.c and nodeHash.c, it
seems that they look only at memory that is actually dsa_allocate'd.
To be exact, we estimate the number of hash buckets based on work_mem
(and hash_mem_multiplier) and use it as the upper limit. So I've
confirmed that the result of dsa_get_total_size() could exceed the
limit. I'm not sure it's a known and legitimate usage. If we can
follow such usage, we can probably track how much dsa_allocate'd
memory is used in the radix tree. Templating whether or not to count
the memory usage might help avoid the overhead.
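
For example, the counting could be compiled in or out like this (the
macro and field names are only for illustration):

#ifdef RT_MEASURE_MEMORY
#define RT_MEM_ADD(tree, size) ((tree)->mem_used += (size))
#else
#define RT_MEM_ADD(tree, size) ((void) 0)
#endif

so that callers that don't need the accounting pay no overhead for it.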

> After this feature is complete, I think we should consider a follow-on patch to get rid of vacuum_work_mem, since it would no longer be needed.

I think you meant autovacuum_work_mem. Yes, I also think we can get rid of it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Dec 13, 2022 at 1:04 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Dec 12, 2022 at 7:14 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> >
> > On Fri, Dec 9, 2022 at 8:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Fri, Dec 9, 2022 at 5:53 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > > >
> >
> > > > I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in more detail what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well* IIUC, just do some useful work and not fail).
> > >
> > > The minimum requirement is 2MB. In the PoC patch, TIDStore checks how big
> > > the radix tree is using dsa_get_total_size(). If the size returned by
> > > dsa_get_total_size() (+ some memory used by TIDStore meta information)
> > > exceeds maintenance_work_mem, lazy vacuum starts to do index vacuum
> > > and heap vacuum. However, when allocating DSA memory for
> > > radix_tree_control at creation, we allocate 1MB
> > > (DSA_INITIAL_SEGMENT_SIZE) DSM memory and use memory required for
> > > radix_tree_control from it. dsa_get_total_size() returns 1MB even if
> > > there is no TID collected.
> >
> > 2MB makes sense.
> >
> > If the metadata is small, it seems counterproductive to count it towards the total. We want the decision to be driven by blocks allocated. I have an idea on that below.
> >
> > > > Remember when we discussed how we might approach parallel pruning? I envisioned a local array of a few dozen kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always doing the same thing is simpler anyway). When the array fills up enough so that the next heap page *could* overflow it: Stop, insert into the store, and check the store's memory usage before continuing.
> > >
> > > Right, I think it's no problem in slab cases. In DSA cases, the new
> > > segment size follows a geometric series that approximately doubles the
> > > total storage each time we create a new segment. This behavior comes
> > > from the fact that the underlying DSM system isn't designed for large
> > > numbers of segments.
> >
> > And taking a look, the size of a new segment can get quite large. It seems we could test if the total DSA area allocated is greater than half of maintenance_work_mem. If that parameter is a power of two (common) and >=8MB, then the area will contain just under a power of two the last time it passes the test. The next segment will bring it to about 3/4 full, like this:
> >
> > maintenance work mem = 256MB, so stop if we go over 128MB:
> >
> > 2*(1+2+4+8+16+32) = 126MB -> keep going
> > 126MB + 64 = 190MB        -> stop
> >
> > That would be a simple way to be conservative with the memory limit. The unfortunate aspect is that the last segment would be mostly wasted, but it's paradise compared to the pessimistically-sized single array we have now (even with Peter G.'s VM snapshot informing the allocation size, I imagine).
>
> Right. In this case, even if we allocate 64MB, we will use only 2088
> bytes at maximum. So I think the memory space used for vacuum is
> practically limited to half.
>
> >
> > And as for minimum possible maintenance work mem, I think this would work with 2MB, if the community is okay with technically going over the limit by a few bytes of overhead if a buildfarm animal is set to that value. I imagine it would never go over the limit for realistic (and even most unrealistic) values. Even with a VM snapshot page in memory and small local arrays of TIDs, I think with this scheme we'll be well under the limit.
>
> Looking at other code using DSA such as tidbitmap.c and nodeHash.c, it
> seems that they look only at memory that is actually dsa_allocate'd.
> To be exact, we estimate the number of hash buckets based on work_mem
> (and hash_mem_multiplier) and use it as the upper limit. So I've
> confirmed that the result of dsa_get_total_size() could exceed the
> limit. I'm not sure it's a known and legitimate usage. If we can
> follow such usage, we can probably track how much dsa_allocate'd
> memory is used in the radix tree.

I've experimented with this idea. The newly added 0008 patch changes
the radix tree so that it counts the memory usage for both local and
shared cases. As shown below, there is an overhead for that:

w/o 0008 patch

=# select * from bench_load_random_int(1000000)
NOTICE:  num_keys = 1000000, height = 7, n4 = 4970924, n15 = 38277,
n32 = 27205, n125 = 0, n256 = 257
 mem_allocated | load_ms
---------------+---------
     298453544 |     282
(1 row)

w/ 0008 patch

=# select * from bench_load_random_int(1000000)
NOTICE:  num_keys = 1000000, height = 7, n4 = 4970924, n15 = 38277,
n32 = 27205, n125 = 0, n256 = 257
 mem_allocated | load_ms
---------------+---------
     293603184 |     297
(1 row)

Although it adds some overhead, I think this idea is straightforward
and the most practical for users. And it seems to be consistent with
other components using DSA. We can improve this part in the future for
better memory control, for example, by introducing slab-like DSA
memory management.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Dec 19, 2022 at 4:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Dec 13, 2022 at 1:04 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Dec 12, 2022 at 7:14 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > >
> > > On Fri, Dec 9, 2022 at 8:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Fri, Dec 9, 2022 at 5:53 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > > > >
> > >
> > > > > I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in more detail what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well* IIUC, just do some useful work and not fail).
> > > >
> > > > The minimum requirement is 2MB. In the PoC patch, TIDStore checks how big
> > > > the radix tree is using dsa_get_total_size(). If the size returned by
> > > > dsa_get_total_size() (+ some memory used by TIDStore meta information)
> > > > exceeds maintenance_work_mem, lazy vacuum starts to do index vacuum
> > > > and heap vacuum. However, when allocating DSA memory for
> > > > radix_tree_control at creation, we allocate 1MB
> > > > (DSA_INITIAL_SEGMENT_SIZE) DSM memory and use memory required for
> > > > radix_tree_control from it. dsa_get_total_size() returns 1MB even if
> > > > there is no TID collected.
> > >
> > > 2MB makes sense.
> > >
> > > If the metadata is small, it seems counterproductive to count it towards the total. We want the decision to be driven by blocks allocated. I have an idea on that below.
> > >
> > > > > Remember when we discussed how we might approach parallel pruning? I envisioned a local array of a few dozen kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always doing the same thing is simpler anyway). When the array fills up enough so that the next heap page *could* overflow it: Stop, insert into the store, and check the store's memory usage before continuing.
> > > >
> > > > Right, I think it's no problem in slab cases. In DSA cases, the new
> > > > segment size follows a geometric series that approximately doubles the
> > > > total storage each time we create a new segment. This behavior comes
> > > > from the fact that the underlying DSM system isn't designed for large
> > > > numbers of segments.
> > >
> > > And taking a look, the size of a new segment can get quite large. It seems we could test if the total DSA area allocated is greater than half of maintenance_work_mem. If that parameter is a power of two (common) and >=8MB, then the area will contain just under a power of two the last time it passes the test. The next segment will bring it to about 3/4 full, like this:
> > >
> > > maintenance work mem = 256MB, so stop if we go over 128MB:
> > >
> > > 2*(1+2+4+8+16+32) = 126MB -> keep going
> > > 126MB + 64 = 190MB        -> stop
> > >
> > > That would be a simple way to be conservative with the memory limit. The unfortunate aspect is that the last segment would be mostly wasted, but it's paradise compared to the pessimistically-sized single array we have now (even with Peter G.'s VM snapshot informing the allocation size, I imagine).
> >
> > Right. In this case, even if we allocate 64MB, we will use only 2088
> > bytes at maximum. So I think the memory space used for vacuum is
> > practically limited to half.
> >
> > >
> > > And as for minimum possible maintenance work mem, I think this would work with 2MB, if the community is okay with technically going over the limit by a few bytes of overhead if a buildfarm animal is set to that value. I imagine it would never go over the limit for realistic (and even most unrealistic) values. Even with a VM snapshot page in memory and small local arrays of TIDs, I think with this scheme we'll be well under the limit.
> >
> > Looking at other code using DSA such as tidbitmap.c and nodeHash.c, it
> > seems that they look only at memory that is actually dsa_allocate'd.
> > To be exact, we estimate the number of hash buckets based on work_mem
> > (and hash_mem_multiplier) and use it as the upper limit. So I've
> > confirmed that the result of dsa_get_total_size() could exceed the
> > limit. I'm not sure it's a known and legitimate usage. If we can
> > follow such usage, we can probably track how much dsa_allocate'd
> > memory is used in the radix tree.
>
> I've experimented with this idea. The newly added 0008 patch changes
> the radix tree so that it counts the memory usage for both local and
> shared cases.

I've attached updated version patches to make cfbot happy.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Mon, Dec 19, 2022 at 2:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Dec 13, 2022 at 1:04 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> > Looking at other code using DSA such as tidbitmap.c and nodeHash.c, it
> > seems that they look only at memory that is actually dsa_allocate'd.
> > To be exact, we estimate the number of hash buckets based on work_mem
> > (and hash_mem_multiplier) and use it as the upper limit. So I've
> > confirmed that the result of dsa_get_total_size() could exceed the
> > limit. I'm not sure it's a known and legitimate usage. If we can
> > follow such usage, we can probably track how much dsa_allocate'd
> > memory is used in the radix tree.
>
> I've experimented with this idea. The newly added 0008 patch changes
> the radix tree so that it counts the memory usage for both local and
> shared cases. As shown below, there is an overhead for that:
>
> w/o 0008 patch
>      298453544 |     282

> w/ 0008 patch
>      293603184 |     297

This adds about as much overhead as the improvement I measured in the v4 slab allocator patch. That's not acceptable, and is exactly what Andres warned about in

https://www.postgresql.org/message-id/20220704211822.kfxtzpcdmslzm2dy%40awork3.anarazel.de

I'm guessing the hash join case can afford to be precise about memory because it must spill to disk when exceeding work_mem. We don't have that design constraint.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Dec 20, 2022 at 3:09 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
>
> On Mon, Dec 19, 2022 at 2:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Dec 13, 2022 at 1:04 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > > Looking at other code using DSA such as tidbitmap.c and nodeHash.c, it
> > > seems that they look only at memory that is actually dsa_allocate'd.
> > > To be exact, we estimate the number of hash buckets based on work_mem
> > > (and hash_mem_multiplier) and use it as the upper limit. So I've
> > > confirmed that the result of dsa_get_total_size() could exceed the
> > > limit. I'm not sure it's a known and legitimate usage. If we can
> > > follow such usage, we can probably track how much dsa_allocate'd
> > > memory is used in the radix tree.
> >
> > I've experimented with this idea. The newly added 0008 patch changes
> > the radix tree so that it counts the memory usage for both local and
> > shared cases. As shown below, there is an overhead for that:
> >
> > w/o 0008 patch
> >      298453544 |     282
>
> > w/ 0008 patch
> >      293603184 |     297
>
> This adds about as much overhead as the improvement I measured in the v4 slab allocator patch.

Oh, yes, that's bad.

> https://www.postgresql.org/message-id/20220704211822.kfxtzpcdmslzm2dy%40awork3.anarazel.de
>
> I'm guessing the hash join case can afford to be precise about memory because it must spill to disk when exceeding work_mem. We don't have that design constraint.

You mean that the memory used by the radix tree should be limited not
by the amount of memory actually used, but by the amount of memory
allocated? In other words, it checks by MemoryContextMemAllocated() in
the local cases and by dsa_get_total_size() in the shared case.
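
That is, something like this (field names invented for illustration):

    Size    allocated;

    if (TidStoreIsShared(ts))
        allocated = dsa_get_total_size(ts->area);
    else
        allocated = MemoryContextMemAllocated(ts->context, true);

    if (allocated > limit_bytes)
        /* trigger index vacuum and heap vacuum */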

The idea of using up to half of maintenance_work_mem might be a good
idea compared to the current flat-array solution. But since it only
uses half, I'm concerned that there will be users who double their
maintenance_work_mem. When it is improved, the user needs to restore
maintenance_work_mem again.

A better solution would be to have a slab-like DSA that allocates
dynamic shared memory in fixed-length large segments. However, the
downside would be that as the segment size gets large, we need to
increase maintenance_work_mem as well. Also, this patch set is already
getting bigger and more complicated, so I don't think it's a good idea
to add more.

If we limit the memory usage by checking the amount of memory actually
used, we can use SlabStats() for the local case. Since DSA doesn't
have such functionality for now, we would need to add it. Or we can
track it in the radix tree only in the shared case.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Wed, Dec 21, 2022 at 3:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Dec 20, 2022 at 3:09 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:

> > https://www.postgresql.org/message-id/20220704211822.kfxtzpcdmslzm2dy%40awork3.anarazel.de
> >
> > I'm guessing the hash join case can afford to be precise about memory because it must spill to disk when exceeding workmem. We don't have that design constraint.
>
> You mean that the memory used by the radix tree should be limited not
> by the amount of memory actually used, but by the amount of memory
> allocated? In other words, it checks by MemoryContextMemAllocated() in
> the local cases and by dsa_get_total_size() in the shared case.

I mean, if this patch set uses 10x less memory than v15 (not always, but easy to find cases where it does), and if it's also expensive to track memory use precisely, then we don't have an incentive to track memory precisely. Even if we did, we don't want to assume that every future caller of radix tree is willing to incur that cost.

> The idea of using up to half of maintenance_work_mem might be a good
> idea compared to the current flat-array solution. But since it only
> uses half, I'm concerned that there will be users who double their
> maintenance_work_mem. When it is improved, the user needs to restore
> maintenance_work_mem again.

I find it useful to step back and look at the usage patterns:

Autovacuum: Limiting the memory allocated by vacuum is important, since there are multiple workers and they can run at any time (possibly most of the time). This case will not use parallel index vacuum, so will use slab, where the quick estimation of memory taken by the context is not terribly far off, so we can afford to be more optimistic here.

Manual vacuum: The default configuration assumes we want to finish as soon as possible (vacuum_cost_delay is zero). Parallel index vacuum can be used. My experience leads me to believe users are willing to use a lot of memory to make manual vacuum finish as quickly as possible, and are disappointed to learn that even if maintenance work mem is 10GB, vacuum can only use 1GB.

So I don't believe anyone will have to double maintenance work mem after upgrading (even with pessimistic accounting) because we'll be both
- much more efficient with memory on average
- free from the 1GB cap

That said, it's possible 50% is too pessimistic -- a 75% threshold will bring us very close to powers of two for example:

2*(1+2+4+8+16+32+64+128) + 256 = 766MB (74.8% of 1GB) -> keep going
766 + 256 = 1022MB -> stop

I'm not sure if that calculation could cause going over the limit, or how common that would be.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Dec 22, 2022 at 7:24 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
>
> On Wed, Dec 21, 2022 at 3:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Dec 20, 2022 at 3:09 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
>
> > > https://www.postgresql.org/message-id/20220704211822.kfxtzpcdmslzm2dy%40awork3.anarazel.de
> > >
> > > I'm guessing the hash join case can afford to be precise about memory because it must spill to disk when exceeding work_mem. We don't have that design constraint.
> >
> > You mean that the memory used by the radix tree should be limited not
> > by the amount of memory actually used, but by the amount of memory
> > allocated? In other words, it checks by MemoryContextMemAllocated() in
> > the local cases and by dsa_get_total_size() in the shared case.
>
> I mean, if this patch set uses 10x less memory than v15 (not always, but easy to find cases where it does), and if it's also expensive to track memory use precisely, then we don't have an incentive to track memory precisely. Even if we did, we don't want to assume that every future caller of radix tree is willing to incur that cost.

Understood.

>
> > The idea of using up to half of maintenance_work_mem might be a good
> > idea compared to the current flat-array solution. But since it only
> > uses half, I'm concerned that there will be users who double their
> > maintenance_work_mem. When it is improved, the user needs to restore
> > maintenance_work_mem again.
>
> I find it useful to step back and look at the usage patterns:
>
> Autovacuum: Limiting the memory allocated by vacuum is important, since there are multiple workers and they can run at any time (possibly most of the time). This case will not use parallel index vacuum, so will use slab, where the quick estimation of memory taken by the context is not terribly far off, so we can afford to be more optimistic here.
>
> Manual vacuum: The default configuration assumes we want to finish as soon as possible (vacuum_cost_delay is zero). Parallel index vacuum can be used. My experience leads me to believe users are willing to use a lot of memory to make manual vacuum finish as quickly as possible, and are disappointed to learn that even if maintenance work mem is 10GB, vacuum can only use 1GB.

Agreed.

> So I don't believe anyone will have to double maintenance work mem after upgrading (even with pessimistic accounting) because we'll be both
> - much more efficient with memory on average
> - free from the 1GB cap

Makes sense.

>
> That said, it's possible 50% is too pessimistic -- a 75% threshold will bring us very close to powers of two for example:
>
> 2*(1+2+4+8+16+32+64+128) + 256 = 766MB (74.8% of 1GB) -> keep going
> 766 + 256 = 1022MB -> stop
>
> I'm not sure if that calculation could cause going over the limit, or how common that would be.
>

If the value is a power of 2, it seems to work perfectly fine. But for
example if it's 700MB, the total memory exceeds the limit:

2*(1+2+4+8+16+32+64+128) = 510MB (72.8% of 700MB) -> keep going
510 + 256 = 766MB -> stop, but it exceeds the limit.

In a bigger case, if it's 11000MB,

2*(1+2+...+2048) = 8190MB (74.4%)
8190 + 4096 = 12286MB

That being said, I don't think these are common cases. So the 75%
threshold seems to work fine in most cases.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:


On Thu, Dec 22, 2022 at 10:00 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> If the value is a power of 2, it seems to work perfectly fine. But for
> example if it's 700MB, the total memory exceeds the limit:
>
> 2*(1+2+4+8+16+32+64+128) = 510MB (72.8% of 700MB) -> keep going
> 510 + 256 = 766MB -> stop but it exceeds the limit.
>
> In a bigger case, if it's 11000MB,
>
> 2*(1+2+...+2048) = 8190MB (74.4%)
> 8190 + 4096 = 12286MB
>
> That being said, I don't think these are common cases. So the 75%
> threshold seems to work fine in most cases.

Thinking some more, I agree this doesn't have large practical risk, but thinking from the point of view of the community, being loose with memory limits by up to 10% is not a good precedent.

Perhaps we can be clever and use 75% when the limit is a power of two and 50% otherwise. I'm skeptical of trying to be clever, and I just thought of an additional concern: we're assuming the growth behavior of new DSA segment sizes, which could possibly change. Given how allocators are typically coded, though, it seems safe to assume that they'll at most double in size.
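
(Concretely, the clever version would be something like:

    Size    limit = (Size) maintenance_work_mem * 1024;
    Size    threshold;

    if ((limit & (limit - 1)) == 0)     /* power of two */
        threshold = limit - limit / 4;  /* 75% */
    else
        threshold = limit / 2;          /* 50% */

...but as I said, I'm not sure the cleverness is worth it.)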

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
I wrote:

> - Try templating out the differences between local and shared memory.

Here is a brief progress report before Christmas vacation.

I thought the best way to approach this was to go "inside out", that is, start with the modest goal of reducing duplicated code for v16.

0001-0005 are copies from v13.

0006 whacks around the rt_node_insert_inner function to reduce the "surface area" as far as symbols and casts. This includes replacing the goto with an extra "unlikely" branch.

0007 removes the STRICT pragma for one of our benchmark functions that crept in somewhere -- it should use the default and not just return NULL instantly.

0008 further whacks around the node-growing code in rt_node_insert_inner to remove casts. When growing the size class within the same kind, we have no need for a "new32" (etc) variable. Also, to keep from getting confused about what an assert build verifies at the end, add a "newnode" variable and assign it to "node" as soon as possible.

0009 uses the bitmap logic from 0004 for node256 also. There is no performance reason for this, because there is no iteration needed, but it's good for simplicity and consistency.

0010 and 0011 template a common implementation for both leaf and inner nodes for searching and inserting.

0012: While at it, I couldn't resist using this technique to separate out delete from search, which makes sense and might give a small performance boost (at least on less capable hardware). I haven't got to the iteration functions, but they should be straightforward.

There is more that could be done here, but I didn't want to get too ahead of myself. For example, it's possible that struct members "children" and "values" are names that don't need to be distinguished. Making them the same would reduce code like

+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = value;
+#else
+ n32->children[insertpos] = child;
+#endif

...but there could be downsides and I don't want to distract from the goal of dealing with shared memory.
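
(For example, if both members had a common name -- "slots", say -- the
branches above would collapse to

+ n32->slots[insertpos] = slot;

with no #ifdef. Again, just a sketch.)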

The tests pass, but it's not impossible that there is a new bug somewhere.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Dec 23, 2022 at 8:47 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> I wrote:
>
> > - Try templating out the differences between local and shared memory.
>
> Here is a brief progress report before Christmas vacation.

Thanks!

>
> I thought the best way to approach this was to go "inside out", that is, start with the modest goal of reducing duplicated code for v16.
>
> 0001-0005 are copies from v13.
>
> 0006 whacks around the rt_node_insert_inner function to reduce the "surface area" as far as symbols and casts. This includes replacing the goto with an extra "unlikely" branch.
>
> 0007 removes the STRICT pragma for one of our benchmark functions that crept in somewhere -- it should use the default and not just return NULL instantly.
>
> 0008 further whacks around the node-growing code in rt_node_insert_inner to remove casts. When growing the size class within the same kind, we have no need for a "new32" (etc) variable. Also, to keep from getting confused about what an assert build verifies at the end, add a "newnode" variable and assign it to "node" as soon as possible.
>
> 0009 uses the bitmap logic from 0004 for node256 also. There is no performance reason for this, because there is no iteration needed, but it's good for simplicity and consistency.

These 4 patches make sense to me. We can merge them into the 0002 patch,
and I'll do similar changes for the leaf-node functions as well.

> 0010 and 0011 template a common implementation for both leaf and inner nodes for searching and inserting.
>
> 0012: While at it, I couldn't resist using this technique to separate out delete from search, which makes sense and might give a small performance boost (at least on less capable hardware). I haven't got to the iteration functions, but they should be straightforward.

Cool!

>
> There is more that could be done here, but I didn't want to get too ahead of myself. For example, it's possible that struct members "children" and "values" are names that don't need to be distinguished. Making them the same would reduce code like
>
> +#ifdef RT_NODE_LEVEL_LEAF
> + n32->values[insertpos] = value;
> +#else
> + n32->children[insertpos] = child;
> +#endif
>
> ...but there could be downsides and I don't want to distract from the goal of dealing with shared memory.

With these patches, some functions in radixtree.h include header
files, radixtree_xxx_impl.h, that contain the function bodies. What do
you think about how we can expand this template method to deal with
DSA memory? I imagine we would include, say, radixtree_template.h with
some macros to use the radix tree, like we do with simplehash.h, and
radixtree_template.h would in turn include the xxx_impl.h files for
some internal functions.
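
For example, instantiation could look something like this (all macro
names are just a sketch, mimicking simplehash.h):

#define RT_PREFIX shared_rt
#define RT_SHMEM
#define RT_DECLARE
#define RT_DEFINE
#define RT_VALUE_TYPE uint64
#include "lib/radixtree_template.h"

which would generate shared_rt_create(), shared_rt_set(), and so on.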

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Tue, Dec 27, 2022 at 12:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Dec 23, 2022 at 8:47 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:

> These 4 patches make sense to me. We can merge them into 0002 patch

Okay, then I'll squash them when I post my next patch.

> and I'll do similar changes for functions for leaf nodes as well.

I assume you meant something else? -- some of the differences between inner and leaf are already abstracted away.

In any case, some things are still half-baked, so please wait until my next patch before doing work on these files.

Also, CI found a bug on 32-bit -- I know what I missed and will fix next week.

> > 0010 and 0011 template a common implementation for both leaf and inner nodes for searching and inserting.
> >
> > 0012: While at it, I couldn't resist using this technique to separate out delete from search, which makes sense and might give a small performance boost (at least on less capable hardware). I haven't got to the iteration functions, but they should be straightforward.

Two things came to mind since I posted this, which I'll make clear next patch:
- A good compiler will get rid of branches when inlining, so maybe no difference in code generation, but it still looks nicer this way.
- Delete should really use its own template, because it only _accidentally_ looks like search because we don't yet shrink nodes.

> What do you
> think about how we can expand this template method to deal with DSA
> memory? I imagined that we load say radixtree_template.h with some
> macros to use the radix tree like we do for simplehash.h. And
> radixtree_template.h further loads xxx_impl.h files for some internal
> functions.

Right, I was thinking the same. I wanted to start small and look for opportunities to shrink the code footprint.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Dec 27, 2022 at 2:24 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Tue, Dec 27, 2022 at 12:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Dec 23, 2022 at 8:47 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
>
> > These 4 patches make sense to me. We can merge them into 0002 patch
>
> Okay, then I'll squash them when I post my next patch.
>
> > and I'll do similar changes for functions for leaf nodes as well.
>
> I assume you meant something else? -- some of the differences between inner and leaf are already abstracted away.

Right. If we template these routines, I won't need that.

>
> In any case, some things are still half-baked, so please wait until my next patch before doing work on these files.
>
> Also, CI found a bug on 32-bit -- I know what I missed and will fix next week.

Thanks!

>
> > > 0010 and 0011 template a common implementation for both leaf and inner nodes for searching and inserting.
> > >
> > > 0012: While at it, I couldn't resist using this technique to separate out delete from search, which makes sense and might give a small performance boost (at least on less capable hardware). I haven't got to the iteration functions, but they should be straightforward.
>
> Two things came to mind since I posted this, which I'll make clear next patch:
> - A good compiler will get rid of branches when inlining, so maybe no difference in code generation, but it still looks nicer this way.
> - Delete should really use its own template, because it only _accidentally_ looks like search because we don't yet shrink nodes.

Okay.

>
> > What do you
> > think about how we can expand this template method to deal with DSA
> > memory? I imagined that we load say radixtree_template.h with some
> > macros to use the radix tree like we do for simplehash.h. And
> > radixtree_template.h further loads xxx_impl.h files for some internal
> > functions.
>
> Right, I was thinking the same. I wanted to start small and look for opportunities to shrink the code footprint.

Thank you for your confirmation!

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
> [working on templating]

In the end, I decided to base my effort on v8, and not v12 (based on one of my less-well-thought-out ideas). The latter was a good experiment, but it did not lead to an increase in readability as I had hoped. The attached v17 is still rough, but it's in good enough shape to evaluate a mostly-complete templating implementation.

Part of what I didn't like about v8 was distinctions like "node" vs "nodep", which hinder readability. I've used "allocnode" for some cases where it makes sense, which is translated to "newnode" for the local pointer. Some places I just gave up and used "nodep" for parameters like in v8, just to get it done. We can revisit naming later.

Not done yet:

- get_handle() is not implemented
- rt_attach is defined but unused
- grow_node_kind() was hackishly removed, but could be turned into a macro (or function that writes to 2 pointers)
- node_update_inner() is back, now that we can share a template with "search". Seems easier to read, and I suspect this is easier for the compiler.
- the value type should really be a template macro, but is still hard-coded to uint64
- I think it's okay if the key is hard coded for PG16: If some use case needs more than uint64, we could consider "single-value leaves" with varlen keys as a template option.
- benchmark tests not updated

v13-0007 had some changes to the regression tests, but I haven't included those. The tests from v13-0003 do pass, both locally and shared. I quickly hacked together changing shared/local tests by hand (need to recompile), but it would be good for maintainability if tests could run once each with local and shmem, but use the same "expected" test output.

Also, I didn't look to see if there were any changes in v14/15 that didn't have to do with precise memory accounting.

At this point, Masahiko, I'd appreciate your feedback on whether this is an improvement at all (or at least a good base for improvement), especially for integrating with the TID store. I think there are some advantages to the template approach. One possible disadvantage is needing separate functions for each local and shared memory.

If we go this route, I do think the TID store should invoke the template as static functions. I'm not quite comfortable with a global function that may not fit well with future use cases.

One review point I'll mention: Somehow I didn't notice there is no use for the "chunk" field in the rt_node type -- it's only set to zero and copied when growing. What is the purpose? Removing it would allow the smallest node to take up only 32 bytes with a fanout of 3, by eliminating padding.

Also, v17-0005 has an optimization/simplification for growing into node125 (my version needs an assertion or fallback, but works well now), found by another reading of Andres' prototype. There is a lot of good engineering there; we should try to preserve it.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Jan 9, 2023 at 5:59 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> > [working on templating]
>
> In the end, I decided to base my effort on v8, and not v12 (based on one of my less-well-thought-out ideas). The latter was a good experiment, but it did not lead to an increase in readability as I had hoped. The attached v17 is still rough, but it's in good enough shape to evaluate a mostly-complete templating implementation.

I really appreciate your work!

>
> v13-0007 had some changes to the regression tests, but I haven't included those. The tests from v13-0003 do pass, both locally and shared. I quickly hacked together changing shared/local tests by hand (need to recompile), but it would be good for maintainability if tests could run once each with local and shmem, but use the same "expected" test output.

Agreed.

> Also, I didn't look to see if there were any changes in v14/15 that didn't have to do with precise memory accounting.
>
> At this point, Masahiko, I'd appreciate your feedback on whether this is an improvement at all (or at least a good base for improvement), especially for integrating with the TID store. I think there are some advantages to the template approach. One possible disadvantage is needing separate functions for each local and shared memory.
>
> If we go this route, I do think the TID store should invoke the template as static functions. I'm not quite comfortable with a global function that may not fit well with future use cases.

It looks like no problem in terms of vacuum integration, although I've
not fully tested it yet. TID store uses the radix tree as the main
storage, and with the template radix tree, the data types for shared
and non-shared will be different. TID store can have a union for the
radix tree, and the structure would look like the following:

/* Per-backend state for a TidStore */
struct TidStore
{
    /*
     * Control object. This is allocated in DSA area 'area' in the shared
     * case, otherwise in backend-local memory.
     */
    TidStoreControl *control;

    /* Storage for TIDs */
    union
    {
        local_radix_tree    *local;
        shared_radix_tree   *shared;
    }           tree;

    /* DSA area for TidStore if used */
    dsa_area    *area;
};

In the TID store functions, we need to call either the local or shared
radix tree functions depending on whether the TID store is shared.
We need an if-branch for each key-value pair insertion, but I think it
would not be a big performance problem in TID store use cases, since
vacuum is an I/O-intensive operation in many cases. Overall, I think
there is no problem, and I'll investigate it in depth.
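
For example, the set function would branch like this (function names
are tentative):

static void
tidstore_set(TidStore *ts, uint64 key, uint64 val)
{
    if (ts->area != NULL)
        shared_radix_tree_set(ts->tree.shared, key, val);
    else
        local_radix_tree_set(ts->tree.local, key, val);
}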

Apart from that, I've been considering the lock support for the shared
radix tree. As we discussed before, the current usage (i.e., only
parallel index vacuum) doesn't require locking support at all, so it
would be enough to have a single lock for simplicity. If we want to
use the shared radix tree for other use cases such as parallel heap
vacuum or a replacement of the hash table for shared buffers, we would
need better lock support. For example, if we want to support
Optimistic Lock Coupling[1], we would need to change not only the node
structure but also the logic, which would probably widen the gap
between the code for the non-shared and shared radix tree. In that
case, once we have a better radix tree optimized for the shared case,
perhaps we can replace the templated shared radix tree with it. I'd
like to hear your opinion on this line.
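
With the single-lock approach, the shared write path would simply be
(a sketch, assuming the lock lives in the shared control object):

    LWLockAcquire(&tree->ctl->lock, LW_EXCLUSIVE);
    shared_radix_tree_set(tree, key, val);
    LWLockRelease(&tree->ctl->lock);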

>
> One review point I'll mention: Somehow I didn't notice there is no use for the "chunk" field in the rt_node type -- it's only set to zero and copied when growing. What is the purpose? Removing it would allow the smallest node to take up only 32 bytes with a fanout of 3, by eliminating padding.

Oh, I didn't notice that. The chunk field was originally used when
redirecting the child pointer in the parent node from the old to the
new (grown) node. When redirecting the pointer, since the
corresponding chunk surely exists in the parent, we can skip the
existence check. Currently we use RT_NODE_UPDATE_INNER() for that (see
RT_REPLACE_NODE()), but having a dedicated function to update the
existing chunk and child pointer might improve performance. Or
reducing the node size by getting rid of the chunk field might be
better.

> Also, v17-0005 has an optimization/simplification for growing into node125 (my version needs an assertion or fallback, but works well now), found by another reading of Andres' prototype. There is a lot of good engineering there; we should try to preserve it.

Agreed.

Regards,

[1] https://db.in.tum.de/~leis/papers/artsync.pdf

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> It looks no problem in terms of vacuum integration, although I've not
> fully tested yet. TID store uses the radix tree as the main storage,
> and with the template radix tree, the data types for shared and
> non-shared will be different. TID store can have an union for the
> radix tree and the structure would be like follows:

>     /* Storage for TIDs */
>     union
>     {
>         local_radix_tree    *local;
>         shared_radix_tree   *shared;
>     }           tree;

We could possibly go back to using a common data type for this, but with unused fields in each setting, as before. We would have to be more careful of things like the 32-bit crash from a few weeks ago.

> In the functions of TID store, we need to call either local or shared
> radix tree functions depending on whether TID store is shared or not.
> We need if-branch for each key-value pair insertion, but I think it
> would not be a big performance problem in TID store use cases, since
> vacuum is an I/O intensive operation in many cases.

Also, the branch will be easily predicted. That was still true in earlier patches, but with many more branches and fatter code paths.

> Overall, I think
> there is no problem and I'll investigate it in depth.

Okay, great. If the separate-functions approach turns out to be ugly, we can always go back to the branching approach for shared memory. I think we'll want to keep this as a template overall, at least to allow different value types and to ease adding variable-length keys if someone finds a need.

> Apart from that, I've been considering the lock support for shared
> radix tree. As we discussed before, the current usage (i.e, only
> parallel index vacuum) doesn't require locking support at all, so it
> would be enough to have a single lock for simplicity.

Right, that should be enough for PG16.

> If we want to
> use the shared radix tree for other use cases such as the parallel
> heap vacuum or the replacement of the hash table for shared buffers,
> we would need better lock support.

For future parallel pruning, I still think a global lock is "probably" fine if the workers buffer in local arrays. Highly concurrent applications will need additional work, of course.
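
(A sketch of that buffering, with the size and helper names invented:

    ItemPointerData buffer[4096];   /* about 24kB of TIDs */
    int     nbuffered = 0;

    /* after pruning each heap page */
    if (nbuffered + MaxHeapTuplesPerPage > lengthof(buffer))
    {
        tidstore_add_tids(ts, buffer, nbuffered);   /* one lock cycle */
        nbuffered = 0;
    }
)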

> For example, if we want to support
> Optimistic Lock Coupling[1], 

Interesting, from the same authors!

> we would need to change not only the node
> structure but also the logic. Which probably leads to widen the gap
> between the code for non-shared and shared radix tree. In this case,
> once we have a better radix tree optimized for shared case, perhaps we
> can replace the templated shared radix tree with it. I'd like to hear
> your opinion on this line.

I'm not in a position to speculate on how best to do scalable concurrency, much less how it should coexist with the local implementation. It's interesting that their "ROWEX" scheme gives up maintaining order in the linear nodes.

> > One review point I'll mention: Somehow I didn't notice there is no use for the "chunk" field in the rt_node type -- it's only set to zero and copied when growing. What is the purpose? Removing it would allow the smallest node to take up only 32 bytes with a fanout of 3, by eliminating padding.
>
> Oh, I didn't notice that. The chunk field was originally used when
> redirecting the child pointer in the parent node from old to new
> (grown) node. When redirecting the pointer, since the corresponding
> chunk surely exists on the parent we can skip existence checks.
> Currently we use RT_NODE_UPDATE_INNER() for that (see
> RT_REPLACE_NODE()) but having a dedicated function to update the
> existing chunk and child pointer might improve the performance. Or
> reducing the node size by getting rid of the chunk field might be
> better.

I see. IIUC from a brief re-reading of the code, saving that chunk would only save us from re-loading "parent->shift" from L1 cache and shifting the key. The cycles spent doing that seem small compared to the rest of the work involved in growing a node. Expressions like "if (idx < 0) return false;" refer to an asserts-only variable, so in production builds, I would hope that branch gets elided (I haven't checked).

I'm quite keen on making the smallest node padding-free (since we don't yet have path compression or lazy path expansion), and this seems the way to get there.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

I wrote:

> I see. IIUC from a brief re-reading of the code, saving that chunk would only save us from re-loading "parent->shift" from L1 cache and shifting the key. The cycles spent doing that seem small compared to the rest of the work involved in growing a node. Expressions like "if (idx < 0) return false;" return to an asserts-only variable, so in production builds, I would hope that branch gets elided (I haven't checked).

On further reflection, this is completely false and I'm not sure what I was thinking. However, for the update-inner case maybe we can assert that we found a valid slot.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Jan 11, 2023 at 12:13 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > It looks no problem in terms of vacuum integration, although I've not
> > fully tested yet. TID store uses the radix tree as the main storage,
> > and with the template radix tree, the data types for shared and
> > non-shared will be different. TID store can have an union for the
> > radix tree and the structure would be like follows:
>
> >     /* Storage for TIDs */
> >     union
> >     {
> >         local_radix_tree    *local;
> >         shared_radix_tree   *shared;
> >     }           tree;
>
> We could possibly go back to using a common data type for this, but with unused fields in each setting, as before. We would have to be more careful of things like the 32-bit crash from a few weeks ago.

One idea to have a common data type without unused fields is to use
radix_tree as a base class. We cast it to radix_tree_shared or
radix_tree_local depending on the is_shared flag in radix_tree. For
instance, we could have something like this (based on the non-template
version):

struct radix_tree
{
    bool    is_shared;
    MemoryContext context;
};

typedef struct rt_shared
{
    rt_handle   handle;
    uint32      magic;

    /* Root node */
    dsa_pointer root;

    uint64      max_val;
    uint64      num_keys;

    /* needs an LWLock */

    /* statistics */
#ifdef RT_DEBUG
    int32       cnt[RT_SIZE_CLASS_COUNT];
#endif
} rt_shared;

typedef struct radix_tree_shared
{
    radix_tree rt;

    rt_shared *shared;
    dsa_area *area;
} radix_tree_shared;

typedef struct radix_tree_local
{
    radix_tree rt;

    uint64  max_val;
    uint64  num_keys;

    rt_node *root;

    /* used only when the radix tree is private */
    MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
    MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];

    /* statistics */
#ifdef RT_DEBUG
    int32       cnt[RT_SIZE_CLASS_COUNT];
#endif
} radix_tree_local;
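
Shared functions would then downcast after checking the flag, e.g.:

static void
shared_rt_set(radix_tree *tree, uint64 key, uint64 value)
{
    radix_tree_shared *stree = (radix_tree_shared *) tree;

    Assert(tree->is_shared);
    /* ... operate on stree->shared, allocating from stree->area ... */
}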

>
> > In the functions of TID store, we need to call either local or shared
> > radix tree functions depending on whether TID store is shared or not.
> > We need if-branch for each key-value pair insertion, but I think it
> > would not be a big performance problem in TID store use cases, since
> > vacuum is an I/O intensive operation in many cases.
>
> Also, the branch will be easily predicted. That was still true in earlier patches, but with many more branches and fatter code paths.
>
> > Overall, I think
> > there is no problem and I'll investigate it in depth.
>
> Okay, great. If the separate-functions approach turns out to be ugly, we can always go back to the branching approach for shared memory. I think we'll want to keep this as a template overall, at least to allow different value types and to ease adding variable-length keys if someone finds a need.

I agree to keep this as a template. From the vacuum integration
perspective, it would be better if we can use a common data type for
shared and local. It makes sense to have different data types if the
radix trees have different value types.

>
> > Apart from that, I've been considering the lock support for shared
> > radix tree. As we discussed before, the current usage (i.e, only
> > parallel index vacuum) doesn't require locking support at all, so it
> > would be enough to have a single lock for simplicity.
>
> Right, that should be enough for PG16.
>
> > If we want to
> > use the shared radix tree for other use cases such as the parallel
> > heap vacuum or the replacement of the hash table for shared buffers,
> > we would need better lock support.
>
> For future parallel pruning, I still think a global lock is "probably" fine if the workers buffer in local arrays. Highly concurrent applications will need additional work, of course.
>
> > For example, if we want to support
> > Optimistic Lock Coupling[1],
>
> Interesting, from the same authors!

+1

>
> > we would need to change not only the node
> > structure but also the logic. Which probably leads to widen the gap
> > between the code for non-shared and shared radix tree. In this case,
> > once we have a better radix tree optimized for shared case, perhaps we
> > can replace the templated shared radix tree with it. I'd like to hear
> > your opinion on this line.
>
> I'm not in a position to speculate on how best to do scalable concurrency, much less how it should coexist with the local implementation. It's interesting that their "ROWEX" scheme gives up maintaining order in the linear nodes.

>
> > > One review point I'll mention: Somehow I didn't notice there is no use for the "chunk" field in the rt_node type -- it's only set to zero and copied when growing. What is the purpose? Removing it would allow the smallest node to take up only 32 bytes with a fanout of 3, by eliminating padding.
> >
> > Oh, I didn't notice that. The chunk field was originally used when
> > redirecting the child pointer in the parent node from old to new
> > (grown) node. When redirecting the pointer, since the corresponding
> > chunk surely exists on the parent we can skip existence checks.
> > Currently we use RT_NODE_UPDATE_INNER() for that (see
> > RT_REPLACE_NODE()) but having a dedicated function to update the
> > existing chunk and child pointer might improve the performance. Or
> > reducing the node size by getting rid of the chunk field might be
> > better.
>
> I see. IIUC from a brief re-reading of the code, saving that chunk would only save us from re-loading "parent->shift" from L1 cache and shifting the key. The cycles spent doing that seem small compared to the rest of the work involved in growing a node. Expressions like "if (idx < 0) return false;" refer to an asserts-only variable, so in production builds, I would hope that branch gets elided (I haven't checked).
>
> I'm quite keen on making the smallest node padding-free (since we don't yet have path compression or lazy path expansion), and this seems the way to get there.

Okay, let's get rid of that in v18.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Jan 12, 2023 at 12:44 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Jan 11, 2023 at 12:13 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> I agree to keep this as a template.

Okay, I'll squash the previous patch and work on cleaning up the internals. I'll keep the external APIs the same so that your work on vacuum integration can be easily rebased on top of that, and we can work independently.

> From the vacuum integration
> perspective, it would be better if we can use a common data type for
> shared and local. It makes sense to have different data types if the
> radix trees have different value types.

I agree it would be better, all else being equal. I have some further thoughts below.

> > > It looks no problem in terms of vacuum integration, although I've not
> > > fully tested yet. TID store uses the radix tree as the main storage,
> > > and with the template radix tree, the data types for shared and
> > > non-shared will be different. TID store can have a union for the
> > > radix tree and the structure would be as follows:
> >
> > >     /* Storage for Tids */
> > >     union tree
> > >     {
> > >         local_radix_tree    *local;
> > >         shared_radix_tree   *shared;
> > >     };
> >
> > We could possibly go back to using a common data type for this, but with unused fields in each setting, as before. We would have to be more careful of things like the 32-bit crash from a few weeks ago.
>
> One idea to have a common data type without unused fields is to use
> radix_tree as a base class. We cast it to radix_tree_shared or
> radix_tree_local depending on the is_shared flag in radix_tree. For
> instance, we could have something like this (based on the non-template version):

> struct radix_tree
> {
>     bool    is_shared;
>     MemoryContext context;
> };

That could work in principle. My first impression is, just a memory context is not much of a base class. Also, casts can creep into a large number of places.

Another thought came to mind: I'm guessing the TID store is unusual -- meaning most uses of radix tree will only need one kind of memory (local/shared). I could be wrong about that, and it _is_ a guess about the future. If true, then it makes more sense that only code that needs both memory kinds should be responsible for keeping them separate.

The template might be easier for future use cases if shared memory were all-or-nothing, meaning either

- completely different functions and types depending on RT_SHMEM
- use branches (like v8)

The union sounds like a good thing to try, but do whatever seems right.
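For illustration, the dispatch could look roughly like this (hypothetical names, modeled on the union above; the actual patch may differ):

    /* TidStore holding either a local or a shared radix tree */
    typedef struct TidStore
    {
        bool    is_shared;      /* which arm of the union is valid */
        union
        {
            local_radix_tree   *local;
            shared_radix_tree  *shared;
        }       tree;
    } TidStore;

    /* callers dispatch on the flag instead of on distinct TidStore types */
    static bool
    tidstore_lookup(TidStore *ts, uint64 key, uint64 *value)
    {
        if (ts->is_shared)
            return shared_rt_search(ts->tree.shared, key, value);
        else
            return local_rt_search(ts->tree.local, key, value);
    }

That keeps the two generated types out of the callers' sight, at the cost of one branch per call.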

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Jan 12, 2023 at 5:21 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Thu, Jan 12, 2023 at 12:44 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Jan 11, 2023 at 12:13 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > > On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > I agree to keep this as a template.
>
> Okay, I'll squash the previous patch and work on cleaning up the internals. I'll keep the external APIs the same so
that your work on vacuum integration can be easily rebased on top of that, and we can work independently.

Thanks!

>
> > From the vacuum integration
> > perspective, it would be better if we can use a common data type for
> > shared and local. It makes sense to have different data types if the
> > radix trees have different value types.
>
> I agree it would be better, all else being equal. I have some further thoughts below.
>
> > > > It looks no problem in terms of vacuum integration, although I've not
> > > > fully tested yet. TID store uses the radix tree as the main storage,
> > > > and with the template radix tree, the data types for shared and
> > > > non-shared will be different. TID store can have a union for the
> > > > radix tree and the structure would be as follows:
> > >
> > > >     /* Storage for Tids */
> > > >     union tree
> > > >     {
> > > >         local_radix_tree    *local;
> > > >         shared_radix_tree   *shared;
> > > >     };
> > >
> > > We could possibly go back to using a common data type for this, but with unused fields in each setting, as
before. We would have to be more careful of things like the 32-bit crash from a few weeks ago.
> >
> > One idea to have a common data type without unused fields is to use
> > radix_tree as a base class. We cast it to radix_tree_shared or
> > radix_tree_local depending on the is_shared flag in radix_tree. For
> > instance, we could have something like this (based on the non-template version):
>
> > struct radix_tree
> > {
> >     bool    is_shared;
> >     MemoryContext context;
> > };
>
> That could work in principle. My first impression is, just a memory context is not much of a base class. Also, casts
can creep into a large number of places.
>
> Another thought came to mind: I'm guessing the TID store is unusual -- meaning most uses of radix tree will only need
one kind of memory (local/shared). I could be wrong about that, and it _is_ a guess about the future. If true, then it
makes more sense that only code that needs both memory kinds should be responsible for keeping them separate.

True.

>
> The template might be easier for future use cases if shared memory were all-or-nothing, meaning either
>
> - completely different functions and types depending on RT_SHMEM
> - use branches (like v8)
>
> The union sounds like a good thing to try, but do whatever seems right.

I've implemented the idea of using union. Let me share WIP code for
discussion, I've attached three patches that can be applied on top of
v17-0009 patch. v17-0010 implements missing shared memory support
functions such as RT_DETACH and RT_GET_HANDLE, and some fixes.
v17-0011 patch adds TidStore, and v17-0012 patch is the vacuum
integration.

Overall, TidStore implementation with the union idea doesn't look so
ugly to me. But I got many compiler warnings about unused radix tree
functions like:

tidstore.c:99:19: warning: 'shared_rt_delete' defined but not used
[-Wunused-function]

I'm not sure there is a convenient way to suppress this warning but
one idea is to have some macros to specify what operations are
enabled/declared.
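For example, something along these lines (the RT_USE_DELETE name and the header path are just assumptions, loosely following the simplehash.h style):

    /* caller, before including the template: */
    #define RT_PREFIX shared_rt
    #define RT_SHMEM
    #define RT_USE_DELETE       /* opt in to the delete function */
    #include "lib/radixtree.h"

    /* inside the template, optional functions are emitted only on request: */
    #ifdef RT_USE_DELETE
    RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
    #endif

Then TidStore would simply not define RT_USE_DELETE, and the unused functions would never be generated.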

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Dec 23, 2022 at 4:33 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
>
>
> On Thu, Dec 22, 2022 at 10:00 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > If the value is a power of 2, it seems to work perfectly fine. But for
> > example if it's 700MB, the total memory exceeds the limit:
> >
> > 2*(1+2+4+8+16+32+64+128) = 510MB (72.8% of 700MB) -> keep going
> > 510 + 256 = 766MB -> stop but it exceeds the limit.
> >
> > In a more bigger case, if it's 11000MB,
> >
> > 2*(1+2+...+2048) = 8190MB (74.4%)
> > 8190 + 4096 = 12286MB
> >
> > That being said, I don't think these are common cases. So the 75%
> > threshold seems to work fine in most cases.
>
> Thinking some more, I agree this doesn't have large practical risk, but thinking from the point of view of the
community, being loose with memory limits by up to 10% is not a good precedent.

Agreed.

> Perhaps we can be clever and use 75% when the limit is a power of two and 50% otherwise. I'm skeptical of trying to
be clever, and I just thought of an additional concern: We're assuming behavior of the growth in size of new DSA
segments, which could possibly change. Given how allocators are typically coded, though, it seems safe to assume that
they'll at most double in size.

Sounds good to me.

I've written a simple script to simulate the DSA memory usage and the
limit. The 75% limit works fine for power-of-two cases, and we can use
a 60% limit for other cases (it seems we could use up to about 66%,
but I used 60% for safety). It would be best if we could mathematically
prove it, but I could prove only the power-of-two cases. Still, the
script practically shows that the 60% threshold works for these cases.
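To illustrate the idea, the core of the simulation is something like this toy program (this is not the attached script; it just assumes DSA allocates two segments of each size before doubling):

    #include <stdio.h>

    int
    main(void)
    {
        long    limit = 700;        /* memory limit, in MB */
        double  threshold = 0.60;   /* stop when usage crosses this */
        long    total = 0;
        long    seg = 1;            /* first segment size, in MB */
        int     at_size = 0;

        while (total <= limit * threshold)
        {
            total += seg;           /* "allocate" the next segment */
            if (++at_size == 2)     /* two segments per size, then double */
            {
                seg *= 2;
                at_size = 0;
            }
        }

        /* with limit=700 this stops at 510MB (72.9%), i.e. under the limit */
        printf("stopped at %ldMB (%.1f%% of %ldMB)\n",
               total, 100.0 * total / limit, limit);
        return 0;
    }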

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Jan 12, 2023 at 9:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Jan 12, 2023 at 5:21 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > Okay, I'll squash the previous patch and work on cleaning up the internals. I'll keep the external APIs the same so that your work on vacuum integration can be easily rebased on top of that, and we can work independently.

There were some conflicts with HEAD, so to keep the CF bot busy, I've quickly put together v18. I still have a lot of cleanup work to do, but this is enough for now.

0003 contains all v17 local-memory coding squashed together.

0004 perf test not updated but it doesn't build by default so it's fine for now

0005 removes node.chunk as discussed, but does not change node4 fanout yet.

0006 is a small cleanup regarding setting node fanout.

0007 squashes my shared memory work with Masahiko's fixes from the addendum v17-0010.

0008 turns the existence checks in RT_NODE_UPDATE_INNER into Asserts, as discussed.

0009/0010 are just copies of Masahiko's v17 addendum v17-0011/12, but the latter rebased over recent variable renaming (it's possible I missed something, so worth checking).

> I've implemented the idea of using union. Let me share WIP code for
> discussion, I've attached three patches that can be applied on top of

Seems fine as far as the union goes. Let's go ahead with this, and make progress on locking etc.

> Overall, TidStore implementation with the union idea doesn't look so
> ugly to me. But I got many compiler warning about unused radix tree
> functions like:
>
> tidstore.c:99:19: warning: 'shared_rt_delete' defined but not used
> [-Wunused-function]
>
> I'm not sure there is a convenient way to suppress this warning but
> one idea is to have some macros to specify what operations are
> enabled/declared.

That sounds like a good idea. It's also worth wondering if we even need RT_NUM_ENTRIES at all, since the caller is capable of keeping track of that if necessary. It's also misnamed, since it's concerned with the number of keys. The vacuum case cares about the number of TIDs, not the number of (encoded) keys. Even if we ever (say) changed the key to blocknumber and value to Bitmapset, the number of keys might not be interesting. It sounds like we should at least make the delete functionality optional. (Side note on optional functions: if an implementation didn't care about iteration or its order, we could optimize insertion into linear nodes)

Since this is WIP, you may already have some polish in mind, so I won't go over the patches in detail, but I wanted to ask about a few things (numbers referring to v17 addendum, not v18):

0011

+ * 'num_tids' is the number of Tids stored so far. 'max_byte' is the maximum
+ * bytes a TidStore can use. These two fields are commonly used in both
+ * non-shared case and shared case.
+ */
+ uint32 num_tids;

uint32 is how we store the block number, so this is too small and will wrap around on overflow. int64 seems better.

+ * We calculate the maximum bytes for the TidStore in different ways
+ * for non-shared case and shared case. Please refer to the comment
+ * TIDSTORE_MEMORY_DEDUCT for details.
+ */

Maybe the #define and comment should be close to here.

+ * Destroy a TidStore, returning all memory. The caller must be certain that
+ * no other backend will attempt to access the TidStore before calling this
+ * function. Other backend must explicitly call tidstore_detach to free up
+ * backend-local memory associated with the TidStore. The backend that calls
+ * tidstore_destroy must not call tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)

If not addressed by next patch, need to phrase comment with FIXME or TODO about making certain.

+ * Add Tids on a block to TidStore. The caller must ensure the offset numbers
+ * in 'offsets' are ordered in ascending order.

Must? What happens otherwise?

+ uint64 last_key = PG_UINT64_MAX;

I'm having some difficulty understanding this sentinel and how it's used.

@@ -1039,11 +1040,18 @@ lazy_scan_heap(LVRelState *vacrel)
  if (prunestate.has_lpdead_items)
  {
  Size freespace;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
 
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ result = tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+  buf, &vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ tidstore_end_iterate(iter);
 
  /* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);

This part only runs "if (vacrel->nindexes == 0)", so seems like unneeded complexity. It arises because lazy_scan_prune() populates the tid store even if no index vacuuming happens. Perhaps the caller of lazy_scan_prune() could pass the deadoffsets array, and upon returning, either populate the store or call lazy_vacuum_heap_page(), as needed. It's quite possible I'm missing some detail, so some description of the design choices made would be helpful.

On Mon, Jan 16, 2023 at 9:53 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> I've written a simple script to simulate the DSA memory usage and the
> limit. The 75% limit works fine for a power of two cases, and we can
> use the 60% limit for other cases (it seems we can use up to about 66%
> but used 60% for safety). It would be best if we can mathematically
> prove it but I could prove only the power of two cases. But the script
> practically shows the 60% threshold would work for these cases.

Okay. It's worth highlighting this in the comments, and also the fact that it depends on internal details of how DSA increases segment size.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Jan 16, 2023 at 2:02 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Thu, Jan 12, 2023 at 9:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Jan 12, 2023 at 5:21 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > > Okay, I'll squash the previous patch and work on cleaning up the internals. I'll keep the external APIs the same
so that your work on vacuum integration can be easily rebased on top of that, and we can work independently.
>
> There were some conflicts with HEAD, so to keep the CF bot busy, I've quickly put together v18. I still have a lot of
cleanup work to do, but this is enough for now.

Thanks! cfbot complains about some warnings but these are expected
(due to unused delete routines etc). But one reported error[1] might
be related to the 0002 patch?

[05:44:11.759] "link" /MACHINE:x64
/OUT:src/test/modules/test_radixtree/test_radixtree.dll
src/test/modules/test_radixtree/test_radixtree.dll.p/win32ver.res
src/test/modules/test_radixtree/test_radixtree.dll.p/test_radixtree.c.obj
"/nologo" "/release" "/nologo" "/DEBUG"
"/PDB:src/test\modules\test_radixtree\test_radixtree.pdb" "/DLL"
"/IMPLIB:src/test\modules\test_radixtree\test_radixtree.lib"
"/INCREMENTAL:NO" "/STACK:4194304" "/NOEXP" "/DEBUG:FASTLINK"
"/NOIMPLIB" "C:/cirrus/build/src/backend/postgres.exe.lib"
"wldap32.lib" "c:/openssl/1.1/lib/libssl.lib"
"c:/openssl/1.1/lib/libcrypto.lib" "ws2_32.lib" "kernel32.lib"
"user32.lib" "gdi32.lib" "winspool.lib" "shell32.lib" "ole32.lib"
"oleaut32.lib" "uuid.lib" "comdlg32.lib" "advapi32.lib"
[05:44:11.819] test_radixtree.c.obj : error LNK2001: unresolved
external symbol pg_popcount64
[05:44:11.819] src\test\modules\test_radixtree\test_radixtree.dll :
fatal error LNK1120: 1 unresolved externals

> 0003 contains all v17 local-memory coding squashed together.

+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, therefore there are duplication codes. While this sometimes makes the
+ * code maintenance tricky, this reduces branch prediction misses when judging
+ * whether the node is a inner node of a leaf node.

This comment seems to be out-of-date since we made it a template.

---
+#ifndef RT_COMMON
+#define RT_COMMON

What are we using this macro RT_COMMON for?

---
The following macros are defined but not undefined in radixtree.h:

RT_MAKE_PREFIX
RT_MAKE_NAME
RT_MAKE_NAME_
RT_SEARCH
UINT64_FORMAT_HEX
RT_NODE_SPAN
RT_NODE_MAX_SLOTS
RT_CHUNK_MASK
RT_MAX_SHIFT
RT_MAX_LEVEL
RT_NODE_125_INVALID_IDX
RT_GET_KEY_CHUNK
BM_IDX
BM_BIT
RT_NODE_KIND_4
RT_NODE_KIND_32
RT_NODE_KIND_125
RT_NODE_KIND_256
RT_NODE_KIND_COUNT
RT_PTR_LOCAL
RT_PTR_ALLOC
RT_INVALID_PTR_ALLOC
NODE_SLAB_BLOCK_SIZE

> 0004 perf test not updated but it doesn't build by default so it's fine for now

Okay.

> 0005 removes node.chunk as discussed, but does not change node4 fanout yet.

LGTM.

> 0006 is a small cleanup regarding setting node fanout.

LGTM.

> 0007 squashes my shared memory work with Masahiko's fixes from the addendum v17-0010.

+        /* XXX: do we need to set a callback on exit to detach dsa? */

In the current shared radix tree design, it's the caller's
responsibility to create (or attach to) a DSA area and pass it to
RT_CREATE() or RT_ATTACH(). It enables us to use one DSA area not only
for the radix tree but also for other data, which is more flexible.
The caller needs to detach from the DSA area itself, so I think we
don't need to set a callback here for that.

---
+        dsa_free(tree->dsa, tree->ctl->handle); // XXX
+        //dsa_detach(tree->dsa);

Similar to above, I think we should not detach from the DSA area here.

Given that the DSA area used by the radix tree could also be used by
other data, I think that in RT_FREE() we need to free each radix tree
node allocated in DSA. In lazy vacuum, we check the memory usage
instead of the number of TIDs and need to reset the TidStore after an
index scan. So it does RT_FREE() and dsa_trim() to return DSM segments
to the OS. I've implemented rt_free_recurse() for this purpose in the
v15 version patch.

--
-        Assert(tree->root);
+        //Assert(tree->ctl->root);

I think we don't need this assertion in the first place. We check it
at the beginning of the function.

---

+#ifdef RT_NODE_LEVEL_LEAF
+        Assert(NODE_IS_LEAF(node));
+#else
+        Assert(!NODE_IS_LEAF(node));
+#endif
+

I think we can move this change to 0003 patch.

> 0008 turns the existence checks in RT_NODE_UPDATE_INNER into Asserts, as discussed.

LGTM.

>
> 0009/0010 are just copies of Masahiko's v17 addendum v17-0011/12, but the latter rebased over recent variable
renaming (it's possible I missed something, so worth checking).
>
> > I've implemented the idea of using union. Let me share WIP code for
> > discussion, I've attached three patches that can be applied on top of
>
> Seems fine as far as the union goes. Let's go ahead with this, and make progress on locking etc.

+1

>
> > Overall, TidStore implementation with the union idea doesn't look so
> > ugly to me. But I got many compiler warning about unused radix tree
> > functions like:
> >
> > tidstore.c:99:19: warning: 'shared_rt_delete' defined but not used
> > [-Wunused-function]
> >
> > I'm not sure there is a convenient way to suppress this warning but
> > one idea is to have some macros to specify what operations are
> > enabled/declared.
>
> That sounds like a good idea. It's also worth wondering if we even need RT_NUM_ENTRIES at all, since the caller is
capable of keeping track of that if necessary. It's also misnamed, since it's concerned with the number of keys. The
vacuum case cares about the number of TIDs, not the number of (encoded) keys. Even if we ever (say) changed the key to
blocknumber and value to Bitmapset, the number of keys might not be interesting.

Right. In fact, TidStore doesn't use RT_NUM_ENTRIES.

> It sounds like we should at least make the delete functionality optional. (Side note on optional functions: if an
implementation didn't care about iteration or its order, we could optimize insertion into linear nodes)

Agreed.

>
> Since this is WIP, you may already have some polish in mind, so I won't go over the patches in detail, but I wanted
to ask about a few things (numbers referring to v17 addendum, not v18):
>
> 0011
>
> + * 'num_tids' is the number of Tids stored so far. 'max_byte' is the maximum
> + * bytes a TidStore can use. These two fields are commonly used in both
> + * non-shared case and shared case.
> + */
> + uint32 num_tids;
>
> uint32 is how we store the block number, so this is too small and will wrap around on overflow. int64 seems better.

Agreed, will fix.

>
> + * We calculate the maximum bytes for the TidStore in different ways
> + * for non-shared case and shared case. Please refer to the comment
> + * TIDSTORE_MEMORY_DEDUCT for details.
> + */
>
> Maybe the #define and comment should be close to here.

Will fix.

>
> + * Destroy a TidStore, returning all memory. The caller must be certain that
> + * no other backend will attempt to access the TidStore before calling this
> + * function. Other backend must explicitly call tidstore_detach to free up
> + * backend-local memory associated with the TidStore. The backend that calls
> + * tidstore_destroy must not call tidstore_detach.
> + */
> +void
> +tidstore_destroy(TidStore *ts)
>
> If not addressed by next patch, need to phrase comment with FIXME or TODO about making certain.

Will fix.

>
> + * Add Tids on a block to TidStore. The caller must ensure the offset numbers
> + * in 'offsets' are ordered in ascending order.
>
> Must? What happens otherwise?

It ends up missing TIDs by overwriting the same key with different
values. Is it better to have a bool argument, say need_sort, to sort
the given array if the caller wants?

>
> + uint64 last_key = PG_UINT64_MAX;
>
> I'm having some difficulty understanding this sentinel and how it's used.

Will improve the logic.
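To sketch the intended logic (with hypothetical helper names and constants; the actual patch differs in detail): offsets on one block can span more than one radix tree key, so we accumulate all bits for the current key and insert once, when the key changes. That only works if the input is sorted:

    #define TS_OFFSET_NBITS 11      /* bits for the offset number */
    #define TS_VALUE_NBITS  6       /* 2^6 = 64 bits per value */

    static void
    tidstore_add_block(rt_radix_tree *tree, BlockNumber blkno,
                       OffsetNumber *offsets, int num_offsets)
    {
        uint64  last_key = PG_UINT64_MAX;   /* sentinel: no pending key */
        uint64  bitmap = 0;

        for (int i = 0; i < num_offsets; i++)
        {
            uint64  tid_i = ((uint64) blkno << TS_OFFSET_NBITS) | offsets[i];
            uint64  key = tid_i >> TS_VALUE_NBITS;

            if (key != last_key)
            {
                if (last_key != PG_UINT64_MAX)
                    rt_set(tree, last_key, bitmap); /* flush previous key */
                last_key = key;
                bitmap = 0;
            }
            bitmap |= UINT64CONST(1) << (tid_i & ((1 << TS_VALUE_NBITS) - 1));
        }
        if (last_key != PG_UINT64_MAX)
            rt_set(tree, last_key, bitmap);
    }

With unsorted offsets we could revisit a key after flushing it, and the second rt_set() would overwrite the first bitmap -- exactly the lost-TID case above.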

>
> @@ -1039,11 +1040,18 @@ lazy_scan_heap(LVRelState *vacrel)
>   if (prunestate.has_lpdead_items)
>   {
>   Size freespace;
> + TidStoreIter *iter;
> + TidStoreIterResult *result;
>
> - lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
> + iter = tidstore_begin_iterate(vacrel->dead_items);
> + result = tidstore_iterate_next(iter);
> + lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
> +  buf, &vmbuffer);
> + Assert(!tidstore_iterate_next(iter));
> + tidstore_end_iterate(iter);
>
>   /* Forget the LP_DEAD items that we just vacuumed */
> - dead_items->num_items = 0;
> + tidstore_reset(dead_items);
>
> This part only runs "if (vacrel->nindexes == 0)", so seems like unneeded complexity. It arises because
lazy_scan_prune() populates the tid store even if no index vacuuming happens. Perhaps the caller of lazy_scan_prune()
could pass the deadoffsets array, and upon returning, either populate the store or call lazy_vacuum_heap_page(), as
needed. It's quite possible I'm missing some detail, so some description of the design choices made would be helpful.

I agree that we don't need complexity here. I'll try this idea.

>
> On Mon, Jan 16, 2023 at 9:53 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > I've written a simple script to simulate the DSA memory usage and the
> > limit. The 75% limit works fine for a power of two cases, and we can
> > use the 60% limit for other cases (it seems we can use up to about 66%
> > but used 60% for safety). It would be best if we can mathematically
> > prove it but I could prove only the power of two cases. But the script
> > practically shows the 60% threshold would work for these cases.
>
> Okay. It's worth highlighting this in the comments, and also the fact that it depends on internal details of how DSA
increases segment size.

Agreed.

Since it seems you're working on another cleanup, I can address the
above comments after your work is completed. But I'm also fine with
including them in your cleanup work.

Regards,

[1] https://cirrus-ci.com/task/5078505327689728

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jan 16, 2023 at 2:02 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >

> Thanks! cfbot complains about some warnings but these are expected
> (due to unused delete routines etc). But one reported error[1] might
> be related to the 0002 patch?

> [05:44:11.819] test_radixtree.c.obj : error LNK2001: unresolved
> external symbol pg_popcount64
> [05:44:11.819] src\test\modules\test_radixtree\test_radixtree.dll :
> fatal error LNK1120: 1 unresolved externals

Yeah, I'm not sure what's causing that. Since that comes from a debugging function, we could work around it, but it would be nice to understand why, so I'll probably have to experiment on my CI repo.

> ---
> +#ifndef RT_COMMON
> +#define RT_COMMON
>
> What are we using this macro RT_COMMON for?

It was a quick way to define some things only once, so they probably all showed up in the list of things you found not undefined. It's different from the style of simplehash.h, which is to have a local name and #undef for every single thing. simplehash.h is a precedent, so I'll change it to match. I'll take a look at your list, too.
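Roughly, that convention is that the end of the header undefines every parameter and generated name, so the template can be included again with a different prefix. A sketch, using names from your list:

    /* at the end of radixtree.h, mirroring simplehash.h */
    #undef RT_PREFIX
    #undef RT_SCOPE
    #undef RT_NODE_SPAN
    #undef RT_NODE_MAX_SLOTS
    #undef RT_GET_KEY_CHUNK
    #undef RT_PTR_LOCAL
    #undef RT_PTR_ALLOC
    /* ...and so on for every macro the template defines */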

> > + * Add Tids on a block to TidStore. The caller must ensure the offset numbers
> > + * in 'offsets' are ordered in ascending order.
> >
> > Must? What happens otherwise?
>
> It ends up missing TIDs by overwriting the same key with different
> values. Is it better to have a bool argument, say need_sort, to sort
> the given array if the caller wants?

> Since it seems you're working on another cleanup, I can address the
> above comments after your work is completed. But I'm also fine with
> including them in your cleanup work.

I think we can work mostly simultaneously, if you work on tid store and vacuum, and I work on the template. We can always submit a full patchset including each other's latest work. That will catch rebase issues sooner.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jan 16, 2023 at 2:02 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:

Attached is an update that mostly has the modest goal of getting CI green again. v19-0003 has squashed the entire radix tree template from previously. I've kept out the perf test module for now -- still needs updating.

> > [05:44:11.819] test_radixtree.c.obj : error LNK2001: unresolved
> > external symbol pg_popcount64
> > [05:44:11.819] src\test\modules\test_radixtree\test_radixtree.dll :
> > fatal error LNK1120: 1 unresolved externals
>
> Yeah, I'm not sure what's causing that. Since that comes from a debugging function, we could work around it, but it would be nice to understand why, so I'll probably have to experiment on my CI repo.

I'm still confused by this error, because it only occurs in the test module. I successfully built with just 0002 in CI, so elsewhere bmw_* symbols resolve just fine on all platforms. I've worked around the error in v19-0004 by using the general-purpose pg_popcount() function. We only need to count bits in assert builds, so it doesn't matter a whole lot.

> +        /* XXX: do we need to set a callback on exit to detach dsa? */
>
> In the current shared radix tree design, it's the caller's
> responsibility to create (or attach to) a DSA area and pass it to
> RT_CREATE() or RT_ATTACH(). It enables us to use one DSA area not only
> for the radix tree but also for other data, which is more flexible.
> The caller needs to detach from the DSA area itself, so I think we
> don't need to set a callback here for that.
>
> ---
> +        dsa_free(tree->dsa, tree->ctl->handle); // XXX
> +        //dsa_detach(tree->dsa);
>
> Similar to above, I think we should not detach from the DSA area here.
>
> Given that the DSA area used by the radix tree could also be used by
> other data, I think that in RT_FREE() we need to free each radix tree
> node allocated in DSA. In lazy vacuum, we check the memory usage
> instead of the number of TIDs and need to reset the TidStore after an
> index scan. So it does RT_FREE() and dsa_trim() to return DSM segments
> to the OS. I've implemented rt_free_recurse() for this purpose in the
> v15 version patch.
>
> --
> -        Assert(tree->root);
> +        //Assert(tree->ctl->root);
>
> I think we don't need this assertion in the first place. We check it
> at the beginning of the function.

I've removed these in v19-0006.

> > That sounds like a good idea. It's also worth wondering if we even need RT_NUM_ENTRIES at all, since the caller is capable of keeping track of that if necessary. It's also misnamed, since it's concerned with the number of keys. The vacuum case cares about the number of TIDs, not the number of (encoded) keys. Even if we ever (say) changed the key to blocknumber and value to Bitmapset, the number of keys might not be interesting.
>
> Right. In fact, TidStore doesn't use RT_NUM_ENTRIES.

I've moved it to the test module, which uses it extensively. There, it's clearer what the name is for, so I didn't change the name.

> > It sounds like we should at least make the delete functionality optional. (Side note on optional functions: if an implementation didn't care about iteration or its order, we could optimize insertion into linear nodes)
>
> Agreed.

Done in v19-0007.

v19-0009 is just a rebase over some more vacuum cleanups.

I'll continue working on internals cleanup.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jan 16, 2023 at 2:02 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:

> > + * Add Tids on a block to TidStore. The caller must ensure the offset numbers
> > + * in 'offsets' are ordered in ascending order.
> >
> > Must? What happens otherwise?
>
> It ends up missing TIDs by overwriting the same key with different
> values. Is it better to have a bool argument, say need_sort, to sort
> the given array if the caller wants?

Now that I've studied it some more, I see what's happening: We need all bits set in the "value" before we insert it, since it would be too expensive to retrieve the current value, add one bit, and put it back. Also, as a consequence of the encoding, part of the tid is in the key, and part in the value. It makes more sense now, but it needs more than zero comments.

As for the order, I don't think it's the responsibility of the caller to guess if it needs sorting -- if unordered offsets lead to data loss, this function needs to take care of it.
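A worked example may help future readers (assuming 11 offset bits and 64-bit values, so the low 6 bits select a bit within the value):

    /*
     * tid = (block 10, offset 5)
     * tid_int = (10 << 11) | 5  = 20485
     * key     = 20485 >> 6      = 320
     * bit     = 20485 & 63      = 5
     *
     * All TIDs whose tid_int falls in [320 * 64, 320 * 64 + 63] share
     * key 320, so their bits must be assembled into one 64-bit value
     * before that key is inserted.
     */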

> > + uint64 last_key = PG_UINT64_MAX;
> >
> > I'm having some difficulty understanding this sentinel and how it's used.
>
> Will improve the logic.

Part of the problem is the English language: "last" can mean "previous" or "at the end", so maybe some name changes would help.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Jan 17, 2023 at 8:06 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Jan 16, 2023 at 2:02 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
>
> Attached is an update that mostly has the modest goal of getting CI green again. v19-0003 has squashed the entire
radix tree template from previously. I've kept out the perf test module for now -- still needs updating.
>
> > > [05:44:11.819] test_radixtree.c.obj : error LNK2001: unresolved
> > > external symbol pg_popcount64
> > > [05:44:11.819] src\test\modules\test_radixtree\test_radixtree.dll :
> > > fatal error LNK1120: 1 unresolved externals
> >
> > Yeah, I'm not sure what's causing that. Since that comes from a debugging function, we could work around it, but it
would be nice to understand why, so I'll probably have to experiment on my CI repo.
>
> I'm still confused by this error, because it only occurs in the test module. I successfully built with just 0002 in
CI, so elsewhere bmw_* symbols resolve just fine on all platforms. I've worked around the error in v19-0004 by
using the general-purpose pg_popcount() function. We only need to count bits in assert builds, so it doesn't matter a
whole lot.

I spent today investigating this issue and found that on Windows,
libpgport_src.a is not linked when building code outside of
src/backend unless it is linked explicitly. It's not a problem on
Linux etc., but the linker raises a fatal error on Windows. I'm not
sure of the right way to fix it, but the attached patch resolved the
issue on cfbot. It seems not to be related to the 0002 patch but
rather to be intended behavior or a problem in meson, so we can
discuss it on a separate thread.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Jan 17, 2023 at 8:06 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Jan 16, 2023 at 2:02 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
>
> Attached is an update that mostly has the modest goal of getting CI green again. v19-0003 has squashed the entire
radix tree template from previously. I've kept out the perf test module for now -- still needs updating.
>
> > > [05:44:11.819] test_radixtree.c.obj : error LNK2001: unresolved
> > > external symbol pg_popcount64
> > > [05:44:11.819] src\test\modules\test_radixtree\test_radixtree.dll :
> > > fatal error LNK1120: 1 unresolved externals
> >
> > Yeah, I'm not sure what's causing that. Since that comes from a debugging function, we could work around it, but it
would be nice to understand why, so I'll probably have to experiment on my CI repo.
>
> I'm still confused by this error, because it only occurs in the test module. I successfully built with just 0002 in
CI, so elsewhere bmw_* symbols resolve just fine on all platforms. I've worked around the error in v19-0004 by
using the general-purpose pg_popcount() function. We only need to count bits in assert builds, so it doesn't matter a
whole lot.
>
> > +        /* XXX: do we need to set a callback on exit to detach dsa? */
> >
> > In the current shared radix tree design, it's the caller's
> > responsibility to create (or attach to) a DSA area and pass it to
> > RT_CREATE() or RT_ATTACH(). It enables us to use one DSA area not only
> > for the radix tree but also for other data, which is more flexible.
> > The caller needs to detach from the DSA area itself, so I think we
> > don't need to set a callback here for that.
> >
> > ---
> > +        dsa_free(tree->dsa, tree->ctl->handle); // XXX
> > +        //dsa_detach(tree->dsa);
> >
> > Similar to above, I think we should not detach from the DSA area here.
> >
> > Given that the DSA area used by the radix tree could also be used by
> > other data, I think that in RT_FREE() we need to free each radix tree
> > node allocated in DSA. In lazy vacuum, we check the memory usage
> > instead of the number of TIDs and need to reset the TidStore after an
> > index scan. So it does RT_FREE() and dsa_trim() to return DSM segments
> > to the OS. I've implemented rt_free_recurse() for this purpose in the
> > v15 version patch.
> >
> > --
> > -        Assert(tree->root);
> > +        //Assert(tree->ctl->root);
> >
> > I think we don't need this assertion in the first place. We check it
> > at the beginning of the function.
>
> I've removed these in v19-0006.
>
> > > That sounds like a good idea. It's also worth wondering if we even need RT_NUM_ENTRIES at all, since the caller
is capable of keeping track of that if necessary. It's also misnamed, since it's concerned with the number of keys. The
vacuum case cares about the number of TIDs, not the number of (encoded) keys. Even if we ever (say) changed the key to
blocknumber and value to Bitmapset, the number of keys might not be interesting.
> >
> > Right. In fact, TidStore doesn't use RT_NUM_ENTRIES.
>
> I've moved it to the test module, which uses it extensively. There, it's clearer what the name is for, so I didn't
change the name.
>
> > > It sounds like we should at least make the delete functionality optional. (Side note on optional functions: if an
implementation didn't care about iteration or its order, we could optimize insertion into linear nodes)
> >
> > Agreed.
>
> Done in v19-0007.
>
> v19-0009 is just a rebase over some more vacuum cleanups.

Thank you for updating the patches!

I've attached new version patches. There is no change from the v19
patch for 0001 through 0006. The 0004, 0005 and 0006 patches look good
to me; we can merge them into the 0003 patch.

0007 patch fixes functions that are defined when RT_DEBUG. These
functions might be removed before commit, but they are useful at
least during development. 0008 patch fixes a bug in
RT_CHUNK_VALUES_ARRAY_SHIFT() and adds tests for that. 0009 patch
fixes the cfbot issue by linking pgport_srv. 0010 patch adds
RT_FREE_RECURSE() to free all radix tree nodes allocated in DSA. 0011
patch updates copyright etc. 0012 and 0013 patches are updated patches
that incorporate all comments I got so far.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jan 16, 2023 at 2:02 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:

In v21, all of your v20 improvements to the radix tree template and test have been squashed into 0003, with one exception: v20-0010 (recursive freeing of shared mem), which I've attached separately (for flexibility) as v21-0006. I believe one of your earlier patches had a new DSA function for freeing memory more quickly -- was there a problem with that approach? I don't recall where that discussion went.

> + * XXX: Most functions in this file have two variants for inner nodes and leaf
> + * nodes, therefore there are duplication codes. While this sometimes makes the
> + * code maintenance tricky, this reduces branch prediction misses when judging
> + * whether the node is a inner node of a leaf node.
>
> This comment seems to be out-of-date since we made it a template.

Done in 0020, along with a bunch of other comment editing.

> The following macros are defined but not undefined in radixtree.h:

Fixed in v21-0018.

Also:

0007 makes the value type configurable. Some debug functionality still assumes integer type, but I think the rest is agnostic.
0010 turns node4 into node3, as discussed, going from 48 bytes to 32.
0012 adopts the benchmark module to the template, and adds meson support (builds with warnings, but okay because not meant for commit).

The rest are cleanups, small refactorings, and more comment rewrites. I've kept them separate for visibility. Next patch can squash them unless there is any discussion.

> > uint32 is how we store the block number, so this is too small and will wrap around on overflow. int64 seems better.
>
> Agreed, will fix.

Great, but it's now uint64, not int64. All the large counters in struct LVRelState, for example, are signed integers, as is the usual practice. Unsigned ints are "usually" for things like bit patterns and where explicit wraparound is desired. There's probably more that can be done here to change to signed types, but I think it's still a bit early to get to that level of nitpicking. (Soon, I hope :-) )

> > + * We calculate the maximum bytes for the TidStore in different ways
> > + * for non-shared case and shared case. Please refer to the comment
> > + * TIDSTORE_MEMORY_DEDUCT for details.
> > + */
> >
> > Maybe the #define and comment should be close to here.
>
> Will fix.

For this, I intended that "here" meant "in or just above the function".

+#define TIDSTORE_LOCAL_MAX_MEMORY_DEDUCT (1024L * 70) /* 70kB */
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO_PO2 (float) 0.75
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO (float) 0.6

These symbols are used only once, in tidstore_create(), and are difficult to read. That function has few comments. The symbols have several paragraphs, but they are far away. It might be better for readability to just hard-code numbers in the function, with the explanation about the numbers near where they are used.

> > + * Destroy a TidStore, returning all memory. The caller must be certain that
> > + * no other backend will attempt to access the TidStore before calling this
> > + * function. Other backend must explicitly call tidstore_detach to free up
> > + * backend-local memory associated with the TidStore. The backend that calls
> > + * tidstore_destroy must not call tidstore_detach.
> > + */
> > +void
> > +tidstore_destroy(TidStore *ts)
> >
> > If not addressed by next patch, need to phrase comment with FIXME or TODO about making certain.
>
> Will fix.

Did anything change here? There is also this, in the template, which I'm not sure has been addressed:

 * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
 * has the local pointers to nodes, rather than RT_PTR_ALLOC.
 * We need either a safeguard to disallow other processes to begin the iteration
 * while one process is doing or to allow multiple processes to do the iteration.

> > This part only runs "if (vacrel->nindexes == 0)", so seems like unneeded complexity. It arises because lazy_scan_prune() populates the tid store even if no index vacuuming happens. Perhaps the caller of lazy_scan_prune() could pass the deadoffsets array, and upon returning, either populate the store or call lazy_vacuum_heap_page(), as needed. It's quite possible I'm missing some detail, so some description of the design choices made would be helpful.
>
> I agree that we don't need complexity here. I'll try this idea.

Keeping the offsets array in the prunestate seems to work out well.

Some other quick comments on tid store and vacuum, not comprehensive. Let me know if I've misunderstood something:

TID store:

+ * XXXXXXXX XXXYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYuuuu
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit

I was confused for a while, and I realized the bits are in reverse order from how they are usually pictured (high on left, low on the right).

+ * 11 bits enough for the offset number, because MaxHeapTuplesPerPage < 2^11
+ * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with

+ * XXX: if we want to support non-heap table AM that want to use the full
+ * range of possible offset numbers, we'll need to reconsider
+ * TIDSTORE_OFFSET_NBITS value.

Would it be worth it (or possible) to calculate constants based on compile-time block size? And/or have a fallback for other table AMs? Since this file is in access/common, the intention is to allow general-purpose, I imagine.

+typedef dsa_pointer tidstore_handle;

It's not clear why we need a typedef here, since here:

+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
...
+ control = handle;

...there is a differently-named dsa_pointer variable that just gets the function parameter.

+/* Return the maximum memory TidStore can use */
+uint64
+tidstore_max_memory(TidStore *ts)

size_t is more suitable for memory.

+ /*
+ * Since the shared radix tree supports concurrent insert,
+ * we don't need to acquire the lock.
+ */

Hmm? IIUC, the caller only acquires the lock after returning from here, to update statistics. Why is it safe to insert with no lock? Am I missing something?

VACUUM integration:

-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2

Seems like unnecessary churn? It is still all about dead items, after all. I understand using "DSA" for the LWLock, since that matches surrounding code.

+#define HAS_LPDEAD_ITEMS(state) (((state).lpdead_items) > 0)

This macro helps the patch readability in some places, but I'm not sure it helps readability of the file as a whole. The following is in the patch and seems perfectly clear without the macro:

- if (lpdead_items > 0)
+ if (prunestate->lpdead_items > 0)

About shared memory: I have some mild reservations about the naming of the "control object", which may be in shared memory. Is that an established term? (If so, disregard the rest): It seems backwards -- the thing in shared memory is the actual tree itself. The thing in backend-local memory has the "handle", and that's how we control the tree. I don't have a better naming scheme, though, and might not be that important. (Added a WIP comment)

Now might be a good time to look at earlier XXX comments and come up with a plan to address them.

That's all I have for now.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Dilip Kumar
Date:
On Mon, Jan 23, 2023 at 6:00 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> Attached is a rebase to fix conflicts from recent commits.

I have reviewed the v22-0022* patch and have some comments.

1.
>It also changes to the column names max_dead_tuples and num_dead_tuples and to
>show the progress information in bytes.

I think this statement needs to be rephrased.

2.

/*
 *    vac_tid_reaped() -- is a particular tid deletable?
 *
 *        This has the right signature to be an IndexBulkDeleteCallback.
 *
 *        Assumes dead_items array is sorted (in ascending TID order).
 */

I think this comment 'Assumes dead_items array is sorted' is not valid anymore.

3.

We are changing the min value of 'maintenance_work_mem' to 2MB. Should
we do the same for 'autovacuum_work_mem'?

4.
+
+    /* collected LP_DEAD items including existing LP_DEAD items */
+    int            lpdead_items;
+    OffsetNumber    deadoffsets[MaxHeapTuplesPerPage];

We are actually collecting dead offsets, but the variable name says
'lpdead_items' instead of something like 'ndeadoffsets' or
'num_deadoffsets'. And the comment also says dead items.

5.
/*
 *    lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
 *                          vacrel->dead_items array.
 *
 * Caller must have an exclusive buffer lock on the buffer (though a full
 * cleanup lock is also acceptable).  vmbuffer must be valid and already have
 * a pin on blkno's visibility map page.
 *
 * index is an offset into the vacrel->dead_items array for the first listed
 * LP_DEAD item on the page.  The return value is the first index immediately
 * after all LP_DEAD items for the same page in the array.
 */

This comment needs to be changed as this is referring to the
'vacrel->dead_items array' which no longer exists.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Jan 23, 2023 at 8:20 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Jan 16, 2023 at 2:02 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
>
> In v21, all of your v20 improvements to the radix tree template and test have been squashed into 0003, with one
exception: v20-0010 (recursive freeing of shared mem), which I've attached separately (for flexibility) as v21-0006. I
believe one of your earlier patches had a new DSA function for freeing memory more quickly -- was there a problem with
that approach? I don't recall where that discussion went.

Hmm, I don't remember proposing such a patch, either.

One idea to address it would be to pass shared memory to RT_CREATE()
and create a DSA area dedicated to the radix tree in place. We would
return the created DSA area along with the radix tree so that the
caller can use it (e.g., for dsa_get_handle(), dsa_pin(), and
dsa_pin_mapping() etc.). In RT_FREE(), we would just detach from the
DSA area. A downside of this idea would be that a dedicated DSA area
is always required for each radix tree.
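The signature might look something like this (hypothetical, mirroring dsa_create_in_place()):

    /* create the radix tree and its dedicated DSA area in place */
    RT_SCOPE RT_RADIX_TREE *
    RT_CREATE(MemoryContext ctx, void *place, size_t size,
              int tranche_id, dsa_area **area_p);

    /* RT_FREE() would then just dsa_detach() the returned area */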

Another idea would be to allocate a big enough DSA area and carve out
small chunks of memory for nodes from there. But that would introduce
more complexity, so I prefer to avoid it.

FYI the current design is inspired by dshash.c. In dshash_destroy(),
we dsa_free() each element allocated by dshash.c.

>
> > + * XXX: Most functions in this file have two variants for inner nodes and leaf
> > + * nodes, therefore there are duplication codes. While this sometimes makes the
> > + * code maintenance tricky, this reduces branch prediction misses when judging
> > + * whether the node is a inner node of a leaf node.
> >
> > This comment seems to be out-of-date since we made it a template.
>
> Done in 0020, along with a bunch of other comment editing.
>
> > The following macros are defined but not undefined in radixtree.h:
>
> Fixed in v21-0018.
>
> Also:
>
> 0007 makes the value type configurable. Some debug functionality still assumes integer type, but I think the rest is
agnostic.

radixtree_search_impl.h still assumes that the value type is an
integer type as follows:

#ifdef RT_NODE_LEVEL_LEAF
    RT_VALUE_TYPE       value = 0;

    Assert(RT_NODE_IS_LEAF(node));
#else

Also, I think if we make the value type configurable, it's better to
pass a pointer to the value to RT_SET() instead of copying the
value, since the value size could be large.
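That is, something like this (a hypothetical signature, not the current one):

    /* pass large values by pointer rather than by copy */
    RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key,
                         RT_VALUE_TYPE *value_p);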

> 0010 turns node4 into node3, as discussed, going from 48 bytes to 32.
> 0012 adapts the benchmark module to the template, and adds meson support (builds with warnings, but okay because not
meant for commit).
>
> The rest are cleanups, small refactorings, and more comment rewrites. I've kept them separate for visibility. Next
patch can squash them unless there is any discussion.

0008 patch

        for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
-               fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize
%zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+               fprintf(stderr, "%s\tinner_size %zu\tleaf_size %zu\t%zu\n",
                                RT_SIZE_CLASS_INFO[i].name,
                                RT_SIZE_CLASS_INFO[i].inner_size,
-                               RT_SIZE_CLASS_INFO[i].inner_blocksize,
-                               RT_SIZE_CLASS_INFO[i].leaf_size,
-                               RT_SIZE_CLASS_INFO[i].leaf_blocksize);
+                               RT_SIZE_CLASS_INFO[i].leaf_size);

There is an additional '%zu' at the end of the format string.

---
0011 patch

+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ *    statments.

typo: s/statments/statements/

The rest look good to me. I'll incorporate these fixes in the next
version patch.

>
> > > uint32 is how we store the block number, so this is too small and will wrap around on overflow. int64 seems better.
> >
> > Agreed, will fix.
>
> Great, but it's now uint64, not int64. All the large counters in struct LVRelState, for example, are signed integers,
as is the usual practice. Unsigned ints are "usually" for things like bit patterns and where explicit wraparound is
desired. There's probably more that can be done here to change to signed types, but I think it's still a bit early to
get to that level of nitpicking. (Soon, I hope :-) )

Agreed. I'll change it in the next version patch.

>
> > > + * We calculate the maximum bytes for the TidStore in different ways
> > > + * for non-shared case and shared case. Please refer to the comment
> > > + * TIDSTORE_MEMORY_DEDUCT for details.
> > > + */
> > >
> > > Maybe the #define and comment should be close to here.
> >
> > Will fix.
>
> For this, I intended that "here" meant "in or just above the function".
>
> +#define TIDSTORE_LOCAL_MAX_MEMORY_DEDUCT (1024L * 70) /* 70kB */
> +#define TIDSTORE_SHARED_MAX_MEMORY_RATIO_PO2 (float) 0.75
> +#define TIDSTORE_SHARED_MAX_MEMORY_RATIO (float) 0.6
>
> These symbols are used only once, in tidstore_create(), and are difficult to read. That function has few comments.
The symbols have several paragraphs, but they are far away. It might be better for readability to just hard-code numbers
in the function, with the explanation about the numbers near where they are used.

Agreed, will fix.

>
> > > + * Destroy a TidStore, returning all memory. The caller must be certain that
> > > + * no other backend will attempt to access the TidStore before calling this
> > > + * function. Other backend must explicitly call tidstore_detach to free up
> > > + * backend-local memory associated with the TidStore. The backend that calls
> > > + * tidstore_destroy must not call tidstore_detach.
> > > + */
> > > +void
> > > +tidstore_destroy(TidStore *ts)
> > >
> > > If not addressed by next patch, need to phrase comment with FIXME or TODO about making certain.
> >
> > Will fix.
>
> Did anything change here?

Oops, the fix is missing from the patch for some reason. I'll fix it.

> There is also this, in the template, which I'm not sure has been addressed:
>
>  * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
>  * has the local pointers to nodes, rather than RT_PTR_ALLOC.
>  * We need either a safeguard to disallow other processes to begin the iteration
>  * while one process is doing or to allow multiple processes to do the iteration.

It's not addressed yet. I think adding a safeguard is better for the
first version. A simple solution is to add a flag, say iter_active, to
ensure that only one process can iterate at a time. What do you think?
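Sketch of what I have in mind (the field and function names here are assumptions):

    static void
    rt_begin_iterate_guard(RT_RADIX_TREE *tree)
    {
        LWLockAcquire(&tree->ctl->lock, LW_EXCLUSIVE);
        if (tree->ctl->iter_active)
            elog(ERROR, "radix tree iteration is already in progress");
        /* on elog(ERROR), error cleanup releases the lock */
        tree->ctl->iter_active = true;
        LWLockRelease(&tree->ctl->lock);
    }

    static void
    rt_end_iterate_guard(RT_RADIX_TREE *tree)
    {
        LWLockAcquire(&tree->ctl->lock, LW_EXCLUSIVE);
        tree->ctl->iter_active = false;
        LWLockRelease(&tree->ctl->lock);
    }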

>
> > > This part only runs "if (vacrel->nindexes == 0)", so seems like unneeded complexity. It arises because
lazy_scan_prune() populates the tid store even if no index vacuuming happens. Perhaps the caller of lazy_scan_prune()
could pass the deadoffsets array, and upon returning, either populate the store or call lazy_vacuum_heap_page(), as
needed. It's quite possible I'm missing some detail, so some description of the design choices made would be helpful.
> >
> > I agree that we don't need complexity here. I'll try this idea.
>
> Keeping the offsets array in the prunestate seems to work out well.
>
> Some other quick comments on tid store and vacuum, not comprehensive. Let me know if I've misunderstood something:
>
> TID store:
>
> + * XXXXXXXX XXXYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYuuuu
> + *
> + * X = bits used for offset number
> + * Y = bits used for block number
> + * u = unused bit
>
> I was confused for a while, and I realized the bits are in reverse order from how they are usually pictured (high on
left, low on the right).

I borrowed it from ginpostinglist.c but it seems better to write it in
the common order.

>
> + * 11 bits enough for the offset number, because MaxHeapTuplesPerPage < 2^11
> + * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with
>
> + * XXX: if we want to support non-heap table AM that want to use the full
> + * range of possible offset numbers, we'll need to reconsider
> + * TIDSTORE_OFFSET_NBITS value.
>
> Would it be worth it (or possible) to calculate constants based on compile-time block size? And/or have a fallback
for other table AMs? Since this file is in access/common, the intention is to allow general-purpose use, I imagine.

I think we can pass the maximum offset number to tidstore_create()
and calculate these values.
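Something like this, perhaps (names are assumptions; pg_ceil_log2_32() is from pg_bitutils.h):

    #include "port/pg_bitutils.h"

    /* derive offset bits at create time from the caller's maximum offset */
    static inline int
    tidstore_offset_nbits(OffsetNumber max_off)
    {
        /* enough bits to represent offsets 1 .. max_off */
        return pg_ceil_log2_32((uint32) max_off + 1);
    }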

>
> +typedef dsa_pointer tidstore_handle;
>
> It's not clear why we need a typedef here, since here:
>
> +tidstore_attach(dsa_area *area, tidstore_handle handle)
> +{
> + TidStore *ts;
> + dsa_pointer control;
> ...
> + control = handle;
>
> ...there is a differently-named dsa_pointer variable that just gets the function parameter.

I guess one reason is to improve compatibility; the typedef hides the
actual type of the handle, which could help in some cases, for example
if we ever need to change it. dshash.c uses the same idea. Another
reason would be to improve readability.

>
> +/* Return the maximum memory TidStore can use */
> +uint64
> +tidstore_max_memory(TidStore *ts)
>
> size_t is more suitable for memory.

Will fix.

>
> + /*
> + * Since the shared radix tree supports concurrent insert,
> + * we don't need to acquire the lock.
> + */
>
> Hmm? IIUC, the caller only acquires the lock after returning from here, to update statistics. Why is it safe to
insert with no lock? Am I missing something?

You're right. I was missing something. The lock should be taken before
adding key-value pairs.
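
Roughly like this, I think (the control struct and function names are
approximate, following the patch):

	/* take the lock before touching the shared tree, not after */
	LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);

	shared_rt_set(ts->tree.shared, key, &val);	/* add the key-value pair */
	ts->control->num_tids += num_offsets;		/* update statistics */

	LWLockRelease(&ts->control->lock);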

>
> VACUUM integration:
>
> -#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
> +#define PARALLEL_VACUUM_KEY_DSA 2
>
> Seems like unnecessary churn? It is still all about dead items, after all. I understand using "DSA" for the LWLock,
since that matches surrounding code.

Agreed, will remove.

>
> +#define HAS_LPDEAD_ITEMS(state) (((state).lpdead_items) > 0)
>
> This macro helps the patch readability in some places, but I'm not sure it helps readability of the file as a whole.
The following is in the patch and seems perfectly clear without the macro:
>
> - if (lpdead_items > 0)
> + if (prunestate->lpdead_items > 0)

Will remove the macro.

>
> About shared memory: I have some mild reservations about the naming of the "control object", which may be in shared
memory. Is that an established term? (If so, disregard the rest): It seems backwards -- the thing in shared memory is
the actual tree itself. The thing in backend-local memory has the "handle", and that's how we control the tree. I don't
have a better naming scheme, though, and might not be that important. (Added a WIP comment)

That seems a valid concern. I borrowed the "control object" from
dshash.c but it supports only shared cases. The fact that the radix
tree supports both local and shared seems to introduce this confusion.
I came up with other names such as RT_RADIX_TREE_CORE or
RT_RADIX_TREE_ROOT  but not sure these are better than the current
one.

>
> Now might be a good time to look at earlier XXX comments and come up with a plan to address them.

Agreed.

Other XXX comments that are not mentioned yet are:

+   /* XXX: memory context support */
+   tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));

I'm not sure we really need memory context support for RT_ATTACH()
since in the shared case, we allocate backend-local memory only for
RT_RADIX_TREE.

---
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+   // XXX is this necessary?
+   Size        total = sizeof(RT_RADIX_TREE);

Regarding this, I followed intset_memory_usage(). But in the radix
tree, RT_RADIX_TREE is very small so probably we can ignore it.

---
+/* XXX For display, assumes value type is numeric */
+static void
+RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)

I think we can display values in hex encoded format but given the
value could be large, we don't necessarily need to display actual
values. Displaying the tree structure and chunks would be helpful for
debugging the radix tree.

---
There is no XXX comment but I'll try to add lock support in the next
version patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Wed, Jan 25, 2023 at 8:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jan 23, 2023 at 8:20 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Mon, Jan 16, 2023 at 2:02 PM John Naylor
> > > <john.naylor@enterprisedb.com> wrote:
> >
> > In v21, all of your v20 improvements to the radix tree template and test have been squashed into 0003, with one exception: v20-0010 (recursive freeing of shared mem), which I've attached separately (for flexibility) as v21-0006. I believe one of your earlier patches had a new DSA function for freeing memory more quickly -- was there a problem with that approach? I don't recall where that discussion went.
>
> Hmm, I don't remember proposing such a patch, either.

I went looking, and it turns out I remembered wrong, sorry.

> One idea to address it would be to pass a shared memory area to
> RT_CREATE() and create a DSA area dedicated to the radix tree in
> place. We should return the created DSA area along with the radix tree
> so that the caller can use it (e.g., for dsa_get_handle(), dsa_pin(),
> and dsa_pin_mapping() etc). In RT_FREE(), we just detach from the DSA
> area. A downside of this idea would be that one DSA area only for a
> radix tree is always required.
>
> Another idea would be to allocate a big enough DSA area and
> quarry small chunks of memory for nodes from there. But it would
> introduce additional complexity, so I prefer to avoid it.
>
> FYI the current design is inspired by dshash.c. In dshash_destroy(),
> we dsa_free() each element allocated by dshash.c.

Okay, thanks for the info.

> > 0007 makes the value type configurable. Some debug functionality still assumes integer type, but I think the rest is agnostic.
>
> radixtree_search_impl.h still assumes that the value type is an
> integer type as follows:
>
> #ifdef RT_NODE_LEVEL_LEAF
>     RT_VALUE_TYPE       value = 0;
>
>     Assert(RT_NODE_IS_LEAF(node));
> #else
>
> Also, I think if we make the value type configurable, it's better to
> pass the pointer of the value to RT_SET() instead of copying the
> values since the value size could be large.

Thanks, I will remove the assignment and look into pass-by-reference.

> Oops, the fix is missing from the patch for some reason. I'll fix it.
>
> > There is also this, in the template, which I'm not sure has been addressed:
> >
> >  * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
> >  * has the local pointers to nodes, rather than RT_PTR_ALLOC.
> >  * We need either a safeguard to disallow other processes to begin the iteration
> >  * while one process is doing or to allow multiple processes to do the iteration.
>
> It's not addressed yet. I think adding a safeguard is better for the
> first version. A simple solution is to add a flag, say iter_active, to
> allow only one process to enable the iteration. What do you think?

I don't quite have enough info to offer an opinion, but this sounds like a different form of locking. I'm sure it's come up before, but could you describe why iteration is different from other operations, regarding concurrency?

> > Would it be worth it (or possible) to calculate constants based on compile-time block size? And/or have a fallback for other table AMs? Since this file is in access/common, the intention is to allow general-purpose, I imagine.
>
> I think we can pass the maximum offset numbers to tidstore_create()
> and calculate these values.

That would work easily for vacuumlazy.c, since it's in the "heap" subdir so we know the max possible offset. I haven't looked at vacuumparallel.c, but I can tell it is not in a heap-specific directory, so I don't know how easy that would be to pass along the right value.

> > About shared memory: I have some mild reservations about the naming of the "control object", which may be in shared memory. Is that an established term? (If so, disregard the rest): It seems backwards -- the thing in shared memory is the actual tree itself. The thing in backend-local memory has the "handle", and that's how we control the tree. I don't have a better naming scheme, though, and might not be that important. (Added a WIP comment)
>
> That seems a valid concern. I borrowed the "control object" from
> dshash.c but it supports only shared cases. The fact that the radix
> tree supports both local and shared seems to introduce this confusion.
> I came up with other names such as RT_RADIX_TREE_CORE or
> RT_RADIX_TREE_ROOT  but not sure these are better than the current
> one.

Okay, if dshash uses it, we have some precedent.

> > Now might be a good time to look at earlier XXX comments and come up with a plan to address them.
>
> Agreed.
>
> Other XXX comments that are not mentioned yet are:
>
> +   /* XXX: memory context support */
> +   tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
>
> I'm not sure we really need memory context support for RT_ATTACH()
> since in the shared case, we allocate backend-local memory only for
> RT_RADIX_TREE.

Okay, we can remove this.

> ---
> +RT_SCOPE uint64
> +RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
> +{
> +   // XXX is this necessary?
> +   Size        total = sizeof(RT_RADIX_TREE);
>
> Regarding this, I followed intset_memory_usage(). But in the radix
> tree, RT_RADIX_TREE is very small so probably we can ignore it.

That was more a note to myself that I forgot about, so here is my reasoning: In the shared case, we just overwrite that initial total, but for the local case we add to it. A future reader could think this is inconsistent and needs to be fixed. Since we deduct from the guc limit to guard against worst-case re-allocation, and that deduction is not very precise (nor needs to be), I agree we should just forget about tiny sizes like this in both cases.

> ---
> +/* XXX For display, assumes value type is numeric */
> +static void
> +RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
>
> I think we can display values in hex encoded format but given the
> value could be large, we don't necessarily need to display actual
> values. Displaying the tree structure and chunks would be helpful for
> debugging the radix tree.

Okay, I can try that unless you do it first.

> There is no XXX comment but I'll try to add lock support in the next
> version patch.

Since there were calls to LWLockAcquire/Release in the last version, I'm a bit confused by this. Perhaps for the next patch, the email should contain a few sentences describing how locking is intended to work, including for iteration.

Hmm, I wonder if we need to use the isolation tester. It's both a blessing and a curse that the first client of this data structure is tid lookup. It's a blessing because it doesn't present a highly-concurrent workload mixing reads and writes and so simple locking is adequate. It's a curse because to test locking and have any chance of finding bugs, we can't rely on vacuum to tell us that because (as you've said) it might very well work fine with no locking at all. So we must come up with test cases ourselves.


--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Tue, Jan 24, 2023 at 1:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jan 23, 2023 at 6:00 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > Attached is a rebase to fix conflicts from recent commits.
>
> I have reviewed v22-0022* patch and I have some comments.
>
> 1.
> >It also changes to the column names max_dead_tuples and num_dead_tuples and to
> >show the progress information in bytes.
>
> I think this statement needs to be rephrased.

Could you be more specific?

> 3.
>
> We are changing the min value of 'maintenance_work_mem' to 2MB. Should
> we do the same for the 'autovacuum_work_mem'?

Yes, we should change that, too. We've discussed previously that autovacuum_work_mem is possibly rendered unnecessary by this work, but we agreed that that should be a separate thread, and it needs additional testing to verify.

I agree with your other comments.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Jan 26, 2023 at 3:54 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Wed, Jan 25, 2023 at 8:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Jan 23, 2023 at 8:20 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > > On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Mon, Jan 16, 2023 at 2:02 PM John Naylor
> > > > <john.naylor@enterprisedb.com> wrote:
> > >
> > > In v21, all of your v20 improvements to the radix tree template and test have been squashed into 0003, with one
exception: v20-0010 (recursive freeing of shared mem), which I've attached separately (for flexibility) as v21-0006. I
believe one of your earlier patches had a new DSA function for freeing memory more quickly -- was there a problem with
that approach? I don't recall where that discussion went.
> >
> > Hmm, I don't remember proposing such a patch, either.
>
> I went looking, and it turns out I remembered wrong, sorry.
>
> > One idea to address it would be to pass a shared memory area to
> > RT_CREATE() and create a DSA area dedicated to the radix tree in
> > place. We should return the created DSA area along with the radix tree
> > so that the caller can use it (e.g., for dsa_get_handle(), dsa_pin(),
> > and dsa_pin_mapping() etc). In RT_FREE(), we just detach from the DSA
> > area. A downside of this idea would be that one DSA area only for a
> > radix tree is always required.
> >
> > Another idea would be to allocate a big enough DSA area and
> > quarry small chunks of memory for nodes from there. But it would
> > introduce additional complexity, so I prefer to avoid it.
> >
> > FYI the current design is inspired by dshash.c. In dshash_destroy(),
> > we dsa_free() each element allocated by dshash.c.
>
> Okay, thanks for the info.
>
> > > 0007 makes the value type configurable. Some debug functionality still assumes integer type, but I think the rest
is agnostic.
> >
> > radixtree_search_impl.h still assumes that the value type is an
> > integer type as follows:
> >
> > #ifdef RT_NODE_LEVEL_LEAF
> >     RT_VALUE_TYPE       value = 0;
> >
> >     Assert(RT_NODE_IS_LEAF(node));
> > #else
> >
> > Also, I think if we make the value type configurable, it's better to
> > pass the pointer of the value to RT_SET() instead of copying the
> > values since the value size could be large.
>
> Thanks, I will remove the assignment and look into pass-by-reference.
>
> > Oops, the fix is missing from the patch for some reason. I'll fix it.
> >
> > > There is also this, in the template, which I'm not sure has been addressed:
> > >
> > >  * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
> > >  * has the local pointers to nodes, rather than RT_PTR_ALLOC.
> > >  * We need either a safeguard to disallow other processes to begin the iteration
> > >  * while one process is doing or to allow multiple processes to do the iteration.
> >
> > It's not addressed yet. I think adding a safeguard is better for the
> > first version. A simple solution is to add a flag, say iter_active, to
> > allow only one process to enable the iteration. What do you think?
>
> I don't quite have enough info to offer an opinion, but this sounds like a different form of locking. I'm sure it's
come up before, but could you describe why iteration is different from other operations, regarding concurrency?

I think that we need to prevent concurrent updates (RT_SET() and
RT_DELETE()) during the iteration to get the consistent result through
the whole iteration operation. Unlike other operations such as
RT_SET(), we cannot expect that a job doing something for each
key-value pair in the radix tree completes in a short time, so we
cannot keep holding the radix tree lock until the end of the
iteration. So the idea is that we set iter_active to true (with the
lock in exclusive mode), and prevent concurrent updates when the flag
is true.
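
In pseudo-C, the protocol would be roughly the following (iter_active
and the ERROR behavior are exactly the points up for discussion, so
this is only a sketch; ctl is the shared control struct):

	/* RT_BEGIN_ITERATE: claim the iteration */
	LWLockAcquire(&tree->ctl->lock, LW_EXCLUSIVE);
	if (tree->ctl->iter_active)
		elog(ERROR, "concurrent iteration is not supported");
	tree->ctl->iter_active = true;
	LWLockRelease(&tree->ctl->lock);

	/* RT_SET and RT_DELETE: refuse updates while someone is iterating */
	LWLockAcquire(&tree->ctl->lock, LW_EXCLUSIVE);
	if (tree->ctl->iter_active)
		elog(ERROR, "cannot modify radix tree during iteration");
	/* ... do the update ... */
	LWLockRelease(&tree->ctl->lock);

	/* RT_END_ITERATE: release the claim */
	LWLockAcquire(&tree->ctl->lock, LW_EXCLUSIVE);
	tree->ctl->iter_active = false;
	LWLockRelease(&tree->ctl->lock);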

>
> > > Would it be worth it (or possible) to calculate constants based on compile-time block size? And/or have a
fallback for other table AMs? Since this file is in access/common, the intention is to allow general-purpose, I imagine.
> >
> > I think we can pass the maximum offset numbers to tidstore_create()
> > and calculate these values.
>
> That would work easily for vacuumlazy.c, since it's in the "heap" subdir so we know the max possible offset. I
haven't looked at vacuumparallel.c, but I can tell it is not in a heap-specific directory, so I don't know how easy that
would be to pass along the right value.

I think the user (e.g., vacuumlazy.c) can pass the maximum offset
number to the parallel vacuum.

>
> > > About shared memory: I have some mild reservations about the naming of the "control object", which may be in
shared memory. Is that an established term? (If so, disregard the rest): It seems backwards -- the thing in shared
memory is the actual tree itself. The thing in backend-local memory has the "handle", and that's how we control the
tree. I don't have a better naming scheme, though, and might not be that important. (Added a WIP comment)
> >
> > That seems a valid concern. I borrowed the "control object" from
> > dshash.c but it supports only shared cases. The fact that the radix
> > tree supports both local and shared seems to introduce this confusion.
> > I came up with other names such as RT_RADIX_TREE_CORE or
> > RT_RADIX_TREE_ROOT  but not sure these are better than the current
> > one.
>
> Okay, if dshash uses it, we have some precedent.
>
> > > Now might be a good time to look at earlier XXX comments and come up with a plan to address them.
> >
> > Agreed.
> >
> > Other XXX comments that are not mentioned yet are:
> >
> > +   /* XXX: memory context support */
> > +   tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
> >
> > I'm not sure we really need memory context support for RT_ATTACH()
> > since in the shared case, we allocate backend-local memory only for
> > RT_RADIX_TREE.
>
> Okay, we can remove this.
>
> > ---
> > +RT_SCOPE uint64
> > +RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
> > +{
> > +   // XXX is this necessary?
> > +   Size        total = sizeof(RT_RADIX_TREE);
> >
> > Regarding this, I followed intset_memory_usage(). But in the radix
> > tree, RT_RADIX_TREE is very small so probably we can ignore it.
>
> That was more a note to myself that I forgot about, so here is my reasoning: In the shared case, we just overwrite
that initial total, but for the local case we add to it. A future reader could think this is inconsistent and needs to
be fixed. Since we deduct from the guc limit to guard against worst-case re-allocation, and that deduction is not very
precise (nor needs to be), I agree we should just forget about tiny sizes like this in both cases.

Thanks for your explanation, agreed.

>
> > ---
> > +/* XXX For display, assumes value type is numeric */
> > +static void
> > +RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
> >
> > I think we can display values in hex encoded format but given the
> > value could be large, we don't necessarily need to display actual
> > values. Displaying the tree structure and chunks would be helpful for
> > debugging the radix tree.
>
> Okay, I can try that unless you do it first.
>
> > There is no XXX comment but I'll try to add lock support in the next
> > version patch.
>
> Since there were calls to LWLockAcquire/Release in the last version, I'm a bit confused by this. Perhaps for the next
patch, the email should contain a few sentences describing how locking is intended to work, including for iteration.

The lock I'm thinking of adding is a simple readers-writer lock. This
lock is used for concurrent radix tree operations except for the
iteration. For operations concurrent to the iteration, I used a flag
for the reason I mentioned above.

>
> Hmm, I wonder if we need to use the isolation tester. It's both a blessing and a curse that the first client of this
data structure is tid lookup. It's a blessing because it doesn't present a highly-concurrent workload mixing reads and
writes and so simple locking is adequate. It's a curse because to test locking and have any chance of finding bugs, we
can't rely on vacuum to tell us that because (as you've said) it might very well work fine with no locking at all. So we
must come up with test cases ourselves.

Using the isolation tester to test locking seems like a good idea. We
can include it in test_radixtree. But given that the locking in the
radix tree is very simple, the test case would be very simple. It may
be controversial whether it's worth adding such testing by adding both
the new test module and test cases.

I'm working on the fixes I mentioned in the previous email and am going
to share the updated patch today. Please hold off on making these fixes
yourself, if that's okay.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Jan 26, 2023 at 5:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I'm working on the fixes I mentioned in the previous email and am going
> to share the updated patch today. Please hold off on making these fixes
> yourself, if that's okay.
>

I've attached updated version patches. As we agreed, I've merged your
changes in v22 into the main (0003) patch, but I still kept the patch
for recursively freeing nodes separate, as we might need more
discussion. In the attached v23, patches 0006 through 0016 are fixes and
improvements for the radix tree. I've incorporated all the comments I
got, unless I'm missing something.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Jan 26, 2023 at 3:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Jan 26, 2023 at 3:54 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:

> I think that we need to prevent concurrent updates (RT_SET() and
> RT_DELETE()) during the iteration to get the consistent result through
> the whole iteration operation. Unlike other operations such as
> RT_SET(), we cannot expect that a job doing something for each
> key-value pair in the radix tree completes in a short time, so we
> cannot keep holding the radix tree lock until the end of the
> iteration.

This sounds like a performance concern, rather than a correctness concern, is that right? If so, I don't think we should worry too much about optimizing simple locking, because it will *never* be fast enough for highly-concurrent read-write workloads anyway, and anyone interested in those workloads will have to completely replace the locking scheme, possibly using one of the ideas in the last ART paper you mentioned.

The first implementation should be simple, easy to test/verify, easy to understand, and easy to replace. As much as possible anyway.

> So the idea is that we set iter_active to true (with the
> lock in exclusive mode), and prevent concurrent updates when the flag
> is true.

...by throwing elog(ERROR)? I'm not so sure users of this API would prefer that to waiting.

> > Since there were calls to LWLockAcquire/Release in the last version, I'm a bit confused by this. Perhaps for the next patch, the email should contain a few sentences describing how locking is intended to work, including for iteration.
>
> The lock I'm thinking of adding is a simple readers-writer lock. This
> lock is used for concurrent radix tree operations except for the
> iteration. For operations concurrent to the iteration, I used a flag
> for the reason I mentioned above.

This doesn't tell me anything -- we already agreed on "simple reader-writer lock", months ago I believe. And I only have a vague idea about the tradeoffs made regarding iteration.

+ * WIP: describe about how locking works.

A first draft of what is intended for this WIP would be a good start. This WIP is from v23-0016, which contains no comments and a one-line commit message. I'd rather not try closely studying that patch (or how it works with 0011) until I have a clearer understanding of what requirements are assumed, what trade-offs are considered, and how it should be tested.

[thinks some more...] Is there an API-level assumption that hasn't been spelled out? Would it help to have a parameter for whether the iteration function wants to reserve the privilege to perform writes? It could take the appropriate lock at the start, and there could then be multiple read-only iterators, but only one read/write iterator. Note, I'm just guessing here, and I don't want to make things more difficult for future improvements.
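
To make that guess concrete, the API shape could be something like the
following (entirely hypothetical, not in any posted patch):

	RT_ITER *
	RT_BEGIN_ITERATE(RT_RADIX_TREE *tree, bool want_write)
	{
		RT_ITER    *iter = palloc0(sizeof(RT_ITER));

		/*
		 * Many read-only iterators may coexist; a read/write iterator
		 * excludes everyone else. RT_END_ITERATE would release the lock.
		 */
		LWLockAcquire(&tree->ctl->lock, want_write ? LW_EXCLUSIVE : LW_SHARED);
		iter->tree = tree;
		return iter;
	}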

> > Hmm, I wonder if we need to use the isolation tester. It's both a blessing and a curse that the first client of this data structure is tid lookup. It's a blessing because it doesn't present a highly-concurrent workload mixing reads and writes and so simple locking is adequate. It's a curse because to test locking and have any chance of finding bugs, we can't rely on vacuum to tell us that because (as you've said) it might very well work fine with no locking at all. So we must come up with test cases ourselves.
>
> Using the isolation tester to test locking seems like a good idea. We
> can include it in test_radixtree. But given that the locking in the
> radix tree is very simple, the test case would be very simple. It may
> be controversial whether it's worth adding such testing by adding both
> the new test module and test cases.

I mean that the isolation tester (or something else) would contain test cases. I didn't mean to imply redundant testing.

> I think the user (e.g., vacuumlazy.c) can pass the maximum offset
> number to the parallel vacuum.

Okay, sounds good.

Most of v23's cleanups/fixes in the radix template look good to me, although I didn't read the debugging code very closely. There is one exception:

0006 - I've never heard of memset'ing a variable to avoid "variable unused" compiler warnings, and it seems strange. It turns out we don't actually need this variable in the first place. The attached .txt patch removes the local variable and just writes to the passed pointer. This required callers to initialize a couple of their own variables, but only child pointers, at least on gcc 12. And I will work later on making "value" in the public API a pointer.

0017 - I haven't taken a close look at the new changes, but I did notice this some time ago:

+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+ else
+ return sizeof(TidStore) + sizeof(TidStore) +
+ local_rt_memory_usage(ts->tree.local);

There is repetition in the else branch.
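
Presumably the intent was a single sizeof(TidStore) in each branch:

	if (TidStoreIsShared(ts))
		return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
	else
		return sizeof(TidStore) + local_rt_memory_usage(ts->tree.local);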

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Sat, Jan 28, 2023 at 8:33 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Thu, Jan 26, 2023 at 3:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Jan 26, 2023 at 3:54 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
>
> > I think that we need to prevent concurrent updates (RT_SET() and
> > RT_DELETE()) during the iteration to get the consistent result through
> > the whole iteration operation. Unlike other operations such as
> > RT_SET(), we cannot expect that a job doing something for each
> > key-value pair in the radix tree completes in a short time, so we
> > cannot keep holding the radix tree lock until the end of the
> > iteration.
>
> This sounds like a performance concern, rather than a correctness concern, is that right? If so, I don't think we
should worry too much about optimizing simple locking, because it will *never* be fast enough for highly-concurrent
read-write workloads anyway, and anyone interested in those workloads will have to completely replace the locking
scheme, possibly using one of the ideas in the last ART paper you mentioned.
>
> The first implementation should be simple, easy to test/verify, easy to understand, and easy to replace. As much as
possible anyway.

Yes, but if a concurrent writer waits for another process to finish
the iteration, it ends up waiting on a lwlock, which is not
interruptible.

>
> > So the idea is that we set iter_active to true (with the
> > lock in exclusive mode), and prevent concurrent updates when the flag
> > is true.
>
> ...by throwing elog(ERROR)? I'm not so sure users of this API would prefer that to waiting.

Right. I think if we want to wait rather than raise an ERROR, the waiter
should wait in an interruptible way, for example, on a condition
variable. I took a simpler approach in the v22 patch.

...but looking at dshash.c, dshash_seq_next() seems to return an entry
while holding a lwlock on the partition. My assumption might be wrong.

>
> > > Since there were calls to LWLockAcquire/Release in the last version, I'm a bit confused by this. Perhaps for the
next patch, the email should contain a few sentences describing how locking is intended to work, including for
iteration.
> >
> > The lock I'm thinking of adding is a simple readers-writer lock. This
> > lock is used for concurrent radix tree operations except for the
> > iteration. For operations concurrent to the iteration, I used a flag
> > for the reason I mentioned above.
>
> This doesn't tell me anything -- we already agreed on "simple reader-writer lock", months ago I believe. And I only
have a vague idea about the tradeoffs made regarding iteration.
>
> + * WIP: describe about how locking works.
>
> A first draft of what is intended for this WIP would be a good start. This WIP is from v23-0016, which contains no
comments and a one-line commit message. I'd rather not try closely studying that patch (or how it works with 0011) until
I have a clearer understanding of what requirements are assumed, what trade-offs are considered, and how it should be
tested.
>
> [thinks some more...] Is there an API-level assumption that hasn't been spelled out? Would it help to have a
parameter for whether the iteration function wants to reserve the privilege to perform writes? It could take the
appropriate lock at the start, and there could then be multiple read-only iterators, but only one read/write iterator.
Note, I'm just guessing here, and I don't want to make things more difficult for future improvements.

Seems a good idea. Given the use case for parallel heap vacuum, it
would be a good idea to support having multiple read-only iterators. The
iteration of the v22 is read-only, so if we want to support read-write
iterator, we would need to support a function that modifies the
current key-value returned by the iteration.

>
> > > Hmm, I wonder if we need to use the isolation tester. It's both a blessing and a curse that the first client of
this data structure is tid lookup. It's a blessing because it doesn't present a highly-concurrent workload mixing reads
and writes and so simple locking is adequate. It's a curse because to test locking and have any chance of finding bugs,
we can't rely on vacuum to tell us that because (as you've said) it might very well work fine with no locking at all. So
we must come up with test cases ourselves.
> >
> > Using the isolation tester to test locking seems like a good idea. We
> > can include it in test_radixtree. But given that the locking in the
> > radix tree is very simple, the test case would be very simple. It may
> > be controversial whether it's worth adding such testing by adding both
> > the new test module and test cases.
>
> I mean that the isolation tester (or something else) would contain test cases. I didn't mean to imply redundant
testing.

Okay, understood.

>
> > I think the user (e.g., vacuumlazy.c) can pass the maximum offset
> > number to the parallel vacuum.
>
> Okay, sounds good.
>
> Most of v23's cleanups/fixes in the radix template look good to me, although I didn't read the debugging code very
closely. There is one exception:
>
> 0006 - I've never heard of memset'ing a variable to avoid "variable unused" compiler warnings, and it seems strange.
It turns out we don't actually need this variable in the first place. The attached .txt patch removes the local variable
and just writes to the passed pointer. This required callers to initialize a couple of their own variables, but only
child pointers, at least on gcc 12.

Agreed with the attached patch.

>  And I will work later on making "value" in the public API a pointer.

Thanks!

>
> 0017 - I haven't taken a close look at the new changes, but I did notice this some time ago:
>
> + if (TidStoreIsShared(ts))
> + return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
> + else
> + return sizeof(TidStore) + sizeof(TidStore) +
> + local_rt_memory_usage(ts->tree.local);
>
> There is repetition in the else branch.

Agreed, will remove.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Dilip Kumar
Date:
On Thu, Jan 26, 2023 at 12:39 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
>
> On Tue, Jan 24, 2023 at 1:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jan 23, 2023 at 6:00 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > > Attached is a rebase to fix conflicts from recent commits.
> >
> > I have reviewed v22-0022* patch and I have some comments.
> >
> > 1.
> > >It also changes to the column names max_dead_tuples and num_dead_tuples and to
> > >show the progress information in bytes.
> >
> > I think this statement needs to be rephrased.
>
> Could you be more specific?

I mean the below statement in the commit message doesn't look
grammatically correct to me.

"It also changes to the column names max_dead_tuples and
num_dead_tuples and to show the progress information in bytes."

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Sun, Jan 29, 2023 at 9:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sat, Jan 28, 2023 at 8:33 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:

> > The first implementation should be simple, easy to test/verify, easy to understand, and easy to replace. As much as possible anyway.
>
> Yes, but if a concurrent writer waits for another process to finish
> the iteration, it ends up waiting on a lwlock, which is not
> interruptible.
>
> >
> > > So the idea is that we set iter_active to true (with the
> > > lock in exclusive mode), and prevent concurrent updates when the flag
> > > is true.
> >
> > ...by throwing elog(ERROR)? I'm not so sure users of this API would prefer that to waiting.
>
> Right. I think if we want to wait rather than raise an ERROR, the waiter
> should wait in an interruptible way, for example, on a condition
> variable. I took a simpler approach in the v22 patch.
>
> ...but looking at dshash.c, dshash_seq_next() seems to return an entry
> while holding a lwlock on the partition. My assumption might be wrong.

Using partitions there makes holding a lock less painful on average, I imagine, but I don't know the details there.

If we make it clear that the first committed version is not (yet) designed for high concurrency with mixed read-write workloads, I think waiting (as a protocol) is fine. If waiting is a problem for some use case, at that point we should just go all the way and replace the locking entirely. In fact, it might be good to spell this out in the top-level comment and include a link to the second ART paper.

> > [thinks some more...] Is there an API-level assumption that hasn't been spelled out? Would it help to have a parameter for whether the iteration function wants to reserve the privilege to perform writes? It could take the appropriate lock at the start, and there could then be multiple read-only iterators, but only one read/write iterator. Note, I'm just guessing here, and I don't want to make things more difficult for future improvements.
>
> Seems a good idea. Given the use case for parallel heap vacuum, it
> would be a good idea to support having multiple read-only iterators. The
> iteration of the v22 is read-only, so if we want to support read-write
> iterator, we would need to support a function that modifies the
> current key-value returned by the iteration.

Okay, so updating during iteration is not currently supported. It could in the future, but I'd say that can also wait for fine-grained concurrency support. Intermediate-term, we should at least make it straightforward to support:

1) parallel heap vacuum  -> multiple read-only iterators
2) parallel heap pruning -> multiple writers

It may or may not be worth it for someone to actually start either of those projects, and there are other ways to improve vacuum that may be more pressing. That said, it seems the tid store with global locking would certainly work fine for #1 and maybe "not too bad" for #2. #2 can also mitigate waiting by using larger batching, or the leader process could "pre-warm" the tid store with zero-values using block numbers from the visibility map.
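
For what it's worth, the "pre-warm" idea could look roughly like this
in the leader, before workers start pruning (a sketch only --
VM_ALL_VISIBLE() is the existing visibility map test, while
tidstore_prewarm_block() is a hypothetical helper that inserts a
zero-valued entry for the block):

	Buffer		vmbuffer = InvalidBuffer;

	for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
	{
		/* all-visible pages cannot contain dead tuples; skip them */
		if (!VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
			tidstore_prewarm_block(ts, blkno);
	}

The idea being that workers would then mostly overwrite existing
entries rather than grow the tree under the exclusive lock.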

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Jan 30, 2023 at 1:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Jan 26, 2023 at 12:39 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> >
> > On Tue, Jan 24, 2023 at 1:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Mon, Jan 23, 2023 at 6:00 PM John Naylor
> > > <john.naylor@enterprisedb.com> wrote:
> > > >
> > > > Attached is a rebase to fix conflicts from recent commits.
> > >
> > > I have reviewed v22-0022* patch and I have some comments.
> > >
> > > 1.
> > > >It also changes to the column names max_dead_tuples and num_dead_tuples and to
> > > >show the progress information in bytes.
> > >
> > > I think this statement needs to be rephrased.
> >
> > Could you be more specific?
>
> I mean the below statement in the commit message doesn't look
> grammatically correct to me.
>
> "It also changes to the column names max_dead_tuples and
> num_dead_tuples and to show the progress information in bytes."
>

I've changed the commit message in the v23 patch. Please check it.
Other comments are also incorporated in the v23 patch. Thank you for
the comments!

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Jan 30, 2023 at 1:31 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Sun, Jan 29, 2023 at 9:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Sat, Jan 28, 2023 at 8:33 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
>
> > > The first implementation should be simple, easy to test/verify, easy to understand, and easy to replace. As much
as possible anyway.
> >
> > Yes, but if a concurrent writer waits for another process to finish
> > the iteration, it ends up waiting on a lwlock, which is not
> > interruptible.
> >
> > >
> > > > So the idea is that we set iter_active to true (with the
> > > > lock in exclusive mode), and prevent concurrent updates when the flag
> > > > is true.
> > >
> > > ...by throwing elog(ERROR)? I'm not so sure users of this API would prefer that to waiting.
> >
> > Right. I think if we want to wait rather than an ERROR, the waiter
> > should wait in an interruptible way, for example, a condition
> > variable. I did a simpler way in the v22 patch.
> >
> > ...but looking at dshash.c, dshash_seq_next() seems to return an entry
> > while holding a lwlock on the partition. My assumption might be wrong.
>
> Using partitions there makes holding a lock less painful on average, I imagine, but I don't know the details there.
>
> If we make it clear that the first committed version is not (yet) designed for high concurrency with mixed read-write
workloads, I think waiting (as a protocol) is fine. If waiting is a problem for some use case, at that point we should
just go all the way and replace the locking entirely. In fact, it might be good to spell this out in the top-level
comment and include a link to the second ART paper.

Agreed. Will update the comments.

>
> > > [thinks some more...] Is there an API-level assumption that hasn't been spelled out? Would it help to have a
parameterfor whether the iteration function wants to reserve the privilege to perform writes? It could take the
appropriatelock at the start, and there could then be multiple read-only iterators, but only one read/write iterator.
Note,I'm just guessing here, and I don't want to make things more difficult for future improvements. 
> >
> > Seems a good idea. Given the use case for parallel heap vacuum, it
> > would be a good idea to support having multiple read-only iterators. The
> > iteration of the v22 is read-only, so if we want to support read-write
> > iterator, we would need to support a function that modifies the
> > current key-value returned by the iteration.
>
> Okay, so updating during iteration is not currently supported. It could in the future, but I'd say that can also wait
for fine-grained concurrency support. Intermediate-term, we should at least make it straightforward to support:
>
> 1) parallel heap vacuum  -> multiple read-only iterators
> 2) parallel heap pruning -> multiple writers
>
> It may or may not be worth it for someone to actually start either of those projects, and there are other ways to
improve vacuum that may be more pressing. That said, it seems the tid store with global locking would certainly work
fine for #1 and maybe "not too bad" for #2. #2 can also mitigate waiting by using larger batching, or the leader process
could "pre-warm" the tid store with zero-values using block numbers from the visibility map.

True. Using a larger batching method seems to be worth testing when we
implement the parallel heap pruning.

In the next version patch, I'm going to update the locking support
part and incorporate other comments I got.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Jan 30, 2023 at 11:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jan 30, 2023 at 1:31 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > On Sun, Jan 29, 2023 at 9:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Sat, Jan 28, 2023 at 8:33 PM John Naylor
> > > <john.naylor@enterprisedb.com> wrote:
> >
> > > > The first implementation should be simple, easy to test/verify, easy to understand, and easy to replace. As
much as possible anyway.
> > >
> > > Yes, but if a concurrent writer waits for another process to finish
> > > the iteration, it ends up waiting on a lwlock, which is not
> > > interruptible.
> > >
> > > >
> > > > > So the idea is that we set iter_active to true (with the
> > > > > lock in exclusive mode), and prevent concurrent updates when the flag
> > > > > is true.
> > > >
> > > > ...by throwing elog(ERROR)? I'm not so sure users of this API would prefer that to waiting.
> > >
> > > Right. I think if we want to wait rather than raise an ERROR, the waiter
> > > should wait in an interruptible way, for example, on a condition
> > > variable. I took a simpler approach in the v22 patch.
> > >
> > > ...but looking at dshash.c, dshash_seq_next() seems to return an entry
> > > while holding a lwlock on the partition. My assumption might be wrong.
> >
> > Using partitions there makes holding a lock less painful on average, I imagine, but I don't know the details there.
> >
> > If we make it clear that the first committed version is not (yet) designed for high concurrency with mixed
read-write workloads, I think waiting (as a protocol) is fine. If waiting is a problem for some use case, at that point
we should just go all the way and replace the locking entirely. In fact, it might be good to spell this out in the
top-level comment and include a link to the second ART paper.
>
> Agreed. Will update the comments.
>
> >
> > > > [thinks some more...] Is there an API-level assumption that hasn't been spelled out? Would it help to have a
parameter for whether the iteration function wants to reserve the privilege to perform writes? It could take the
appropriate lock at the start, and there could then be multiple read-only iterators, but only one read/write iterator.
Note, I'm just guessing here, and I don't want to make things more difficult for future improvements.
> > >
> > > Seems a good idea. Given the use case for parallel heap vacuum, it
> > > would be a good idea to support having multiple read-only iterators. The
> > > iteration of the v22 is read-only, so if we want to support read-write
> > > iterator, we would need to support a function that modifies the
> > > current key-value returned by the iteration.
> >
> > Okay, so updating during iteration is not currently supported. It could in the future, but I'd say that can also
wait for fine-grained concurrency support. Intermediate-term, we should at least make it straightforward to support:
> >
> > 1) parallel heap vacuum  -> multiple read-only iterators
> > 2) parallel heap pruning -> multiple writers
> >
> > It may or may not be worth it for someone to actually start either of those projects, and there are other ways to
improve vacuum that may be more pressing. That said, it seems the tid store with global locking would certainly work
fine for #1 and maybe "not too bad" for #2. #2 can also mitigate waiting by using larger batching, or the leader process
could "pre-warm" the tid store with zero-values using block numbers from the visibility map.
>
> True. Using a larger batching method seems to be worth testing when we
> implement the parallel heap pruning.
>
> In the next version patch, I'm going to update the locking support
> part and incorporate other comments I got.
>

I've attached v24 patches. The locking support patch is separated
(0005 patch). Also I kept the updates for TidStore and the vacuum
integration from v23 separate.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Tue, Jan 31, 2023 at 9:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> I've attached v24 patches. The locking support patch is separated
> (0005 patch). Also I kept the updates for TidStore and the vacuum
> integration from v23 separate.

Okay, that's a lot simpler, and closer to what I imagined. For v25, I squashed v24's additions and added a couple of my own. I've kept the CF status at "needs review" because no specific action is required at the moment.

I did start to review the TID store some more, but that's on hold because something else came up: On a lark I decided to re-run some benchmarks to see if anything got lost in converting to a template, and that led me down a rabbit hole -- some good and bad news on that below.

0001:

I removed the uint64 case, as discussed. There is now a brief commit message, but needs to be fleshed out a bit. I took another look at the Arm optimization that Nathan found some months ago, for forming the highbit mask, but that doesn't play nicely with how node32 uses it, so I decided against it. I added a comment to describe the reasoning in case someone else gets a similar idea.

I briefly looked into "separate-commit TODO: move non-SIMD fallbacks to their own header to clean up the #ifdef maze.", but decided it wasn't such a clear win to justify starting the work now. It's still in the back of my mind, but I removed the reminder from the commit message.

0003:

The template now requires the value to be passed as a pointer. That was a pretty trivial change, but affected multiple other patches, so not sent separately. Also adds a forgotten RT_ prefix to the bitmap macros and adds a top comment to the *_impl.h headers. There are some comment fixes. The changes were either trivial or discussed earlier, so also not sent separately.

0004/5: I wanted to measure the load time as well as search time in bench_search_random_nodes(). That's kept separate to make it easier to test other patch versions.

The bad news is that the speed of loading TIDs in bench_seq/shuffle_search() has regressed noticeably. I can't reproduce this in any other bench function and was the reason for writing 0005 to begin with. More confusingly, my efforts to fix this improved *other* functions, but the former didn't budge at all. First the patches:

0006 adds and removes some "inline" declarations (where it made sense), and added some for "pg_noinline" based on Andres' advice some months ago.

0007 removes some dead code. RT_NODE_INSERT_INNER is only called during RT_SET_EXTEND, so it can't possibly find an existing key. This kind of change is much easier with the inner/node cases handled together in a template, as far as being sure of how those cases are different. I thought about trying the search in assert builds and verifying it doesn't exist, but thought yet another #ifdef would be too messy.

v25-addendum-try-no-maintain-order.txt -- It makes optional keeping the key chunks in order for the linear-search nodes. I believe the TID store no longer cares about the ordering, but this is a text file for now because I don't want to clutter the CI with a behavior change. Also, the second ART paper (on concurrency) mentioned that some locking schemes don't allow these arrays to be shifted. So it might make sense to give up entirely on guaranteeing ordered iteration, or at least make it optional as in the patch.
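
The gist of the change, reduced to its essentials (names are
illustrative; the real patch touches more places):

	int			insertpos;

#ifdef RT_KEEP_CHUNKS_ORDERED
	/* find the slot that keeps the chunk array sorted... */
	insertpos = 0;
	while (insertpos < n->count && n->chunks[insertpos] < chunk)
		insertpos++;

	/* ...and shift the tail right to make room */
	memmove(&n->chunks[insertpos + 1], &n->chunks[insertpos],
			(n->count - insertpos) * sizeof(uint8));
	memmove(&n->values[insertpos + 1], &n->values[insertpos],
			(n->count - insertpos) * sizeof(RT_VALUE_TYPE));
#else
	insertpos = n->count;		/* just append -- no memmove needed */
#endif

	n->chunks[insertpos] = chunk;
	n->values[insertpos] = value;
	n->count++;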

Now for some numbers:

========================================
psql -c "select * from bench_search_random_nodes(10*1000*1000)"
(min load time of three)

v15:
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
     334182184 |    3352 |      2073

v25-0005:
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
     331987008 |    3426 |      2126

v25-0006 (inlining or not):
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
     331987008 |    3327 |      2035

v25-0007 (remove dead code):
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
     331987008 |    3313 |      2037

v25-addendum...txt (no ordering):
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
     331987008 |    2762 |      2042

Allowing unordered inserts helps a lot here in loading. That's expected because there are a lot of inserts into the linear nodes. 0006 might help a little.

========================================
psql -c "select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from bench_load_random_int(500 * 1000 * (1+x-x))) a"

v15:
         avg          
----------------------
 207.3000000000000000

v25-0005:
         avg          
----------------------
 190.6000000000000000

v25-0006 (inlining or not):
         avg          
----------------------
 189.3333333333333333

v25-0007 (remove dead code):
         avg          
----------------------
 186.4666666666666667

v25-addendum...txt (no ordering):
         avg          
----------------------
 179.7000000000000000

Most of the improvement from v15 to v25 probably comes from the change from node4 to node3, and this test stresses that node the most. That shows in the total memory used: it goes from 152MB to 132MB. Allowing unordered inserts helps some, the others are not convincing.

========================================
psql -c "select rt_load_ms, rt_search_ms from bench_seq_search(0, 1 * 1000 * 1000)"
(min load time of three)

v15:
 rt_load_ms | rt_search_ms
------------+--------------
        113 |          455

v25-0005:
 rt_load_ms | rt_search_ms
------------+--------------
        135 |          456

v25-0006 (inlining or not):
 rt_load_ms | rt_search_ms
------------+--------------
        136 |          455

v25-0007 (remove dead code):
 rt_load_ms | rt_search_ms
------------+--------------
        135 |          455

v25-addendum...txt (no ordering):
 rt_load_ms | rt_search_ms
------------+--------------
        134 |          455

Note: The regression seems to have started in v17, which is the first with a full template.

Nothing so far has helped here, and previous experience has shown that trying to profile 100ms will not be useful. Instead of putting more effort into diving deeper, it seems a better use of time to write a benchmark that calls the tid store itself. That's more realistic, since this function was intended to test load and search of tids, but the tid store doesn't quite operate so simply anymore. What do you think, Masahiko?

I'm inclined to keep 0006, because it might give a slight boost, and 0007 because it's never a bad idea to remove dead code.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
Hi,

On Tue, Feb 7, 2023 at 6:25 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
>
> On Tue, Jan 31, 2023 at 9:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > I've attached v24 patches. The locking support patch is separated
> > (0005 patch). Also I kept the updates for TidStore and the vacuum
> > integration from v23 separate.
>
> Okay, that's a lot simpler, and closer to what I imagined. For v25, I squashed v24's additions and added a couple
of my own. I've kept the CF status at "needs review" because no specific action is required at the moment.
>
> I did start to review the TID store some more, but that's on hold because something else came up: On a lark I decided
to re-run some benchmarks to see if anything got lost in converting to a template, and that led me down a rabbit hole --
some good and bad news on that below.
>
> 0001:
>
> I removed the uint64 case, as discussed. There is now a brief commit message, but needs to be fleshed out a bit. I
took another look at the Arm optimization that Nathan found some months ago, for forming the highbit mask, but that
doesn't play nicely with how node32 uses it, so I decided against it. I added a comment to describe the reasoning in
case someone else gets a similar idea.
>
> I briefly looked into "separate-commit TODO: move non-SIMD fallbacks to their own header to clean up the #ifdef
maze.",but decided it wasn't such a clear win to justify starting the work now. It's still in the back of my mind, but
Iremoved the reminder from the commit message. 

The changes make sense to me.

>
> 0003:
>
> The template now requires the value to be passed as a pointer. That was a pretty trivial change, but affected
multiple other patches, so not sent separately. Also adds a forgotten RT_ prefix to the bitmap macros and adds a top
comment to the *_impl.h headers. There are some comment fixes. The changes were either trivial or discussed earlier, so
also not sent separately.

Great.

>
> 0004/5: I wanted to measure the load time as well as search time in bench_search_random_nodes(). That's kept separate
to make it easier to test other patch versions.
>
> The bad news is that the speed of loading TIDs in bench_seq/shuffle_search() has regressed noticeably. I can't
reproduce this in any other bench function and was the reason for writing 0005 to begin with. More confusingly, my
efforts to fix this improved *other* functions, but the former didn't budge at all. First the patches:
>
> 0006 adds and removes some "inline" declarations (where it made sense), and added some for "pg_noinline" based on
Andres' advice some months ago.

Agreed.

>
> 0007 removes some dead code. RT_NODE_INSERT_INNER is only called during RT_SET_EXTEND, so it can't possibly find an
existing key. This kind of change is much easier with the inner/node cases handled together in a template, as far as
being sure of how those cases are different. I thought about trying the search in assert builds and verifying it doesn't
exist, but thought yet another #ifdef would be too messy.

Agreed.

>
> v25-addendum-try-no-maintain-order.txt -- It makes optional keeping the key chunks in order for the linear-search
nodes. I believe the TID store no longer cares about the ordering, but this is a text file for now because I don't want
to clutter the CI with a behavior change. Also, the second ART paper (on concurrency) mentioned that some locking
schemes don't allow these arrays to be shifted. So it might make sense to give up entirely on guaranteeing ordered
iteration, or at least make it optional as in the patch.

I think it's still important for lazy vacuum that an iteration over a
TID store returns TIDs in ascending order, because otherwise a heap
vacuum does random writes. That being said, we can have
RT_ITERATE_NEXT() return key-value pairs in sorted order regardless of how
the key chunks are stored in a node.
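
For example, the iterator could pick the smallest chunk greater than
the last one returned instead of walking the array positionally --
something like this (a sketch with illustrative names):

	/* return the array index of the next chunk in key order, or -1 */
	static int
	next_slot_in_key_order(RT_NODE_LINEAR *n, int last_chunk)
	{
		int			best = -1;

		for (int i = 0; i < n->count; i++)
		{
			if ((int) n->chunks[i] > last_chunk &&
				(best == -1 || n->chunks[i] < n->chunks[best]))
				best = i;
		}
		return best;
	}

Each step becomes O(count) instead of O(1), but only within the small
linear-search nodes.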

> ========================================
> psql -c "select rt_load_ms, rt_search_ms from bench_seq_search(0, 1 * 1000 * 1000)"
> (min load time of three)
>
> v15:
>  rt_load_ms | rt_search_ms
> ------------+--------------
>         113 |          455
>
> v25-0005:
>  rt_load_ms | rt_search_ms
> ------------+--------------
>         135 |          456
>
> v25-0006 (inlining or not):
>  rt_load_ms | rt_search_ms
> ------------+--------------
>         136 |          455
>
> v25-0007 (remove dead code):
>  rt_load_ms | rt_search_ms
> ------------+--------------
>         135 |          455
>
> v25-addendum...txt (no ordering):
>  rt_load_ms | rt_search_ms
> ------------+--------------
>         134 |          455
>
> Note: The regression seems to have started in v17, which is the first with a full template.
>
> Nothing so far has helped here, and previous experience has shown that trying to profile 100ms will not be useful.
Instead of putting more effort into diving deeper, it seems a better use of time to write a benchmark that calls the tid
store itself. That's more realistic, since this function was intended to test load and search of tids, but the tid store
doesn't quite operate so simply anymore. What do you think, Masahiko?

Yeah, that's more realistic. TidStore now encodes TIDs slightly
differently from the benchmark test.

I've attached the patch that adds a simple benchmark test using
TidStore. With this test, I got similar trends of results to yours
with gcc, but I've not analyzed them in depth yet.

query: select * from bench_tidstore_load(0, 10 * 1000 * 1000)

v15:
 load_ms
---------
     816

v25-0007 (remove dead code):
 load_ms
---------
     839

v25-addendum...txt (no ordering):
 load_ms
---------
     820

BTW it would be better to remove the RT_DEBUG macro from bench_radix_tree.c.

>
> I'm inclined to keep 0006, because it might give a slight boost, and 0007 because it's never a bad idea to remove
deadcode. 

Yeah, these two changes make sense to me too.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Feb 9, 2023 at 2:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I think it's still important for lazy vacuum that an iteration over a
> TID store returns TIDs in ascending order, because otherwise a heap
> vacuum does random writes. That being said, we can have
> RT_ITERATE_NEXT() return key-value pairs in an order regardless of how
> the key chunks are stored in a node.

Okay, we can keep that possibility in mind if we need to go there.

> > Note: The regression seems to have started in v17, which is the first with a full template.

> > 0007 removes some dead code. RT_NODE_INSERT_INNER is only called during RT_SET_EXTEND, so it can't possibly find an existing key. This kind of change is much easier with the inner/node cases handled together in a template, as far as being sure of how those cases are different. I thought about trying the search in assert builds and verifying it doesn't exist, but thought yet another #ifdef would be too messy.

It just occurred to me that these facts might be related. v17 was the first use of the full template, and I decided then I liked one of your earlier patches where replace_node() calls node_update_inner() better than calling node_insert_inner() with a NULL parent, which was a bit hard to understand. That now-dead code was actually used in the latter case for updating the (original) parent. It's possible that trying to use separate paths contributed to the regression. I'll try the other way and report back.

> I've attached the patch that adds a simple benchmark test using
> TidStore. With this test, I got similar trends of results to yours
> with gcc, but I've not analyzed them in depth yet.

Thanks for that! I'll take a look.

> BTW it would be better to remove the RT_DEBUG macro from bench_radix_tree.c.

Absolutely.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Thu, Feb 9, 2023 at 2:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> query: select * from bench_tidstore_load(0, 10 * 1000 * 1000)
>
> v15:
>  load_ms
> ---------
>      816

How did you build the tid store and test on v15? I first tried to apply v15-0009-PoC-lazy-vacuum-integration.patch, which conflicts with vacuum now, so reset all that, but still getting build errors because the tid store types and functions have changed.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Feb 10, 2023 at 3:51 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
>
> On Thu, Feb 9, 2023 at 2:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > query: select * from bench_tidstore_load(0, 10 * 1000 * 1000)
> >
> > v15:
> >  load_ms
> > ---------
> >      816
>
> How did you build the tid store and test on v15? I first tried to apply v15-0009-PoC-lazy-vacuum-integration.patch,
which conflicts with vacuum now, so reset all that, but still getting build errors because the tid store types and
functions have changed.

I applied v26-0008-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
on top of the v15 radix tree and changed the TidStore so that it uses the
v15 (non-templated) radix tree. That way, we can test TidStore using the
v15 radix tree. I've attached the patch that I applied on top of
v26-0008-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
I didn't get any closer to the radix-tree regression, but I did find some inefficiencies in tidstore_add_tids() that are worth talking about first, addressed in a rough fashion in the attached .txt addendums that I can clean up and incorporate later.

To start, I can reproduce the regression with this test as well:

select * from bench_tidstore_load(0, 10 * 1000 * 1000);

v15 + v26 store + adjustments:
 mem_allocated | load_ms
---------------+---------
      98202152 |    1676

v26 0001-0008
 mem_allocated | load_ms
---------------+---------
      98202032 |    1826

...and reverting to the alternate way to update the parent didn't help:

v26 0001-6, 0008, insert_inner w/ null parent

 mem_allocated | load_ms
---------------+---------
      98202032 |    1825

...and I'm kind of glad that wasn't the problem, because going back to that would be a pain for the shmem case.

Running perf doesn't show anything much different in the proportions (note that rt_set must have been inlined when declared locally in v26):

v15 + v26 store + adjustments:
  65.88%  postgres  postgres             [.] tidstore_add_tids
  10.74%  postgres  postgres             [.] rt_set
   9.20%  postgres  postgres             [.] palloc0
   6.49%  postgres  postgres             [.] rt_node_insert_leaf

v26 0001-0008
  78.50%  postgres  postgres             [.] tidstore_add_tids
   8.88%  postgres  postgres             [.] palloc0
   6.24%  postgres  postgres             [.] local_rt_node_insert_leaf

v2699-0001: The first thing I noticed is that palloc0 is taking way more time than it should, and it's because the compiler doesn't know the values[] array is small. One reason we need to zero the array is to make the algorithm agnostic about what order the offsets come in, as I requested in a previous review. Thinking some more, I was way too paranoid about that. As long as access methods scan the line pointer array in the usual way, maybe we can just assert that the keys we create are in order, and zero any unused array entries as we find them. (I admit I can't actually think of a reason we would ever encounter offsets out of order.) Also, we can keep track of the last key we need to consider for insertion into the radix tree, and ignore the rest. That might shave a few cycles during the exclusive lock when the max offset of an LP_DEAD item < 64 on a given page, which I think would be common in the wild. I also got rid of the special case for non-encoding, since shifting by zero should work the same way. These together led to a nice speedup on the v26 branch:

 mem_allocated | load_ms
---------------+---------
      98202032 |    1386
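
To make the v2699-0001 idea concrete, here is a rough sketch of the
collection step under assumed names (values[] is the array of 64-bit
bitmaps indexed by the upper offset bits; Assert and UINT64CONST as in
c.h); this illustrates the idea, not the patch itself:

static void
collect_offsets(uint64 *values, int *highest_slot,
                const OffsetNumber *offsets, int num_offsets)
{
    *highest_slot = -1;

    for (int i = 0; i < num_offsets; i++)
    {
        OffsetNumber off = offsets[i];
        int     slot = off / 64;

        /* rely on callers scanning the line pointer array in order */
        Assert(i == 0 || off > offsets[i - 1]);

        /* zero unused entries as we find them, instead of palloc0 up front */
        while (*highest_slot < slot)
            values[++(*highest_slot)] = 0;

        values[slot] |= UINT64CONST(1) << (off % 64);
    }

    /* only slots 0 .. *highest_slot need to be inserted into the radix tree */
}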

v2699-0002: The next thing I noticed is forming a full ItemPointer to pass to tid_to_key_off(). That's bad for tidstore_add_tids() because ItemPointerSetBlockNumber() must do this in order to allow the struct to be SHORTALIGN'd:

static inline void
BlockIdSet(BlockIdData *blockId, BlockNumber blockNumber)
{
blockId->bi_hi = blockNumber >> 16;
blockId->bi_lo = blockNumber & 0xffff;
}

Then, tid_to_key_off() calls ItemPointerGetBlockNumber(), which must reverse the above process:

static inline BlockNumber
BlockIdGetBlockNumber(const BlockIdData *blockId)
{
return (((BlockNumber) blockId->bi_hi) << 16) | ((BlockNumber) blockId->bi_lo);
}

There is no reason to do any of this if we're not reading/writing directly to/from an on-disk tid etc. To avoid this, I created a new function encode_key_off() [name could be better], which deals with the raw block number that we already have. Then turn tid_to_key_off() into a wrapper around that, since we still need the full conversion for tidstore_lookup_tid().
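
As a rough illustration of that split (the shift amounts and names here are
assumptions for the sketch, not the patch's actual constants):

/*
 * 9 bits hold any heap offset (MaxHeapTuplesPerPage < 512); the low 6 bits
 * of the combined value index a 64-bit bitmap, and the rest form the key.
 */
static inline uint64
encode_key_off(BlockNumber block, OffsetNumber offset, uint32 *off_lower)
{
    uint64      tid_i = ((uint64) block << 9) | offset;

    *off_lower = tid_i & 63;
    return tid_i >> 6;
}

/* full conversion, still needed by tidstore_lookup_tid() */
static inline uint64
tid_to_key_off(ItemPointer tid, uint32 *off_lower)
{
    return encode_key_off(ItemPointerGetBlockNumber(tid),
                          ItemPointerGetOffsetNumber(tid),
                          off_lower);
}

The hot loop can then pass the block number it already has, bypassing
BlockIdSet()/BlockIdGetBlockNumber() entirely.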

v2699-0003: Get rid of all the remaining special cases for encoding or not. I am unaware of the need to optimize that case or treat it in any way differently. I haven't tested this on an installation with non-default blocksize and didn't measure this separately, but 0002+0003 gives:

 mem_allocated | load_ms
---------------+---------
      98202032 |    1259

If these are acceptable, I can incorporate them into a later patchset. In any case, speeding up tidstore_add_tids() will make any regressions in the backing radix tree more obvious. I will take a look at that next week.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Sat, Feb 11, 2023 at 2:33 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> I didn't get any closer to radix-tree regression,

Me neither. It seems that in v26, inserting chunks into node-32 is
slow but needs more analysis. I'll share if I find something
interesting.

> but I did find some inefficiencies in tidstore_add_tids() that are worth talking about first, addressed in a rough
fashion in the attached .txt addendums that I can clean up and incorporate later.
>
> To start, I can reproduce the regression with this test as well:
>
> select * from bench_tidstore_load(0, 10 * 1000 * 1000);
>
> v15 + v26 store + adjustments:
>  mem_allocated | load_ms
> ---------------+---------
>       98202152 |    1676
>
> v26 0001-0008
>  mem_allocated | load_ms
> ---------------+---------
>       98202032 |    1826
>
> ...and reverting to the alternate way to update the parent didn't help:
>
> v26 0001-6, 0008, insert_inner w/ null parent
>
>  mem_allocated | load_ms
> ---------------+---------
>       98202032 |    1825
>
> ...and I'm kind of glad that wasn't the problem, because going back to that would be a pain for the shmem case.
>
> Running perf doesn't show anything much different in the proportions (note that rt_set must have been inlined when
declared locally in v26):
>
> v15 + v26 store + adjustments:
>   65.88%  postgres  postgres             [.] tidstore_add_tids
>   10.74%  postgres  postgres             [.] rt_set
>    9.20%  postgres  postgres             [.] palloc0
>    6.49%  postgres  postgres             [.] rt_node_insert_leaf
>
> v26 0001-0008
>   78.50%  postgres  postgres             [.] tidstore_add_tids
>    8.88%  postgres  postgres             [.] palloc0
>    6.24%  postgres  postgres             [.] local_rt_node_insert_leaf
>
> v2699-0001: The first thing I noticed is that palloc0 is taking way more time than it should, and it's because the
compiler doesn't know the values[] array is small. One reason we need to zero the array is to make the algorithm
agnostic about what order the offsets come in, as I requested in a previous review. Thinking some more, I was way too
paranoid about that. As long as access methods scan the line pointer array in the usual way, maybe we can just assert
that the keys we create are in order, and zero any unused array entries as we find them. (I admit I can't actually think
of a reason we would ever encounter offsets out of order.)

I can think of one case: traversing a HOT chain could visit
offsets out of order. But fortunately, in the heap case, we prune such
collected TIDs before heap vacuum.

> Also, we can keep track of the last key we need to consider for insertion into the radix tree, and ignore the rest.
That might shave a few cycles during the exclusive lock when the max offset of an LP_DEAD item < 64 on a given page,
which I think would be common in the wild. I also got rid of the special case for non-encoding, since shifting by zero
should work the same way. These together led to a nice speedup on the v26 branch:
>
>  mem_allocated | load_ms
> ---------------+---------
>       98202032 |    1386
>
> v2699-0002: The next thing I noticed is forming a full ItemPointer to pass to tid_to_key_off(). That's bad for
tidstore_add_tids() because ItemPointerSetBlockNumber() must do this in order to allow the struct to be SHORTALIGN'd:
>
> static inline void
> BlockIdSet(BlockIdData *blockId, BlockNumber blockNumber)
> {
> blockId->bi_hi = blockNumber >> 16;
> blockId->bi_lo = blockNumber & 0xffff;
> }
>
> Then, tid_to_key_off() calls ItemPointerGetBlockNumber(), which must reverse the above process:
>
> static inline BlockNumber
> BlockIdGetBlockNumber(const BlockIdData *blockId)
> {
> return (((BlockNumber) blockId->bi_hi) << 16) | ((BlockNumber) blockId->bi_lo);
> }
>
> There is no reason to do any of this if we're not reading/writing directly to/from an on-disk tid etc. To avoid this,
I created a new function encode_key_off() [name could be better], which deals with the raw block number that we already
have. Then turn tid_to_key_off() into a wrapper around that, since we still need the full conversion for
tidstore_lookup_tid().
>
> v2699-0003: Get rid of all the remaining special cases for encoding or not. I am unaware of the need to optimize that
case or treat it in any way differently. I haven't tested this on an installation with non-default blocksize and didn't
measure this separately, but 0002+0003 gives:
>
>  mem_allocated | load_ms
> ---------------+---------
>       98202032 |    1259
>
> If these are acceptable, I can incorporate them into a later patchset.

These are nice improvements! I agree with all changes.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Feb 13, 2023 at 2:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sat, Feb 11, 2023 at 2:33 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > I didn't get any closer to radix-tree regression,
>
> Me neither. It seems that in v26, inserting chunks into node-32 is
> slow but needs more analysis. I'll share if I found something
> interesting.

If that were the case, then the other benchmarks I ran would likely have slowed down as well, but they are the same or faster. There is one microbenchmark I didn't run before: "select * from bench_fixed_height_search(15)" (15 to reduce noise from growing size class, and despite the name it measures load time as well). Trying this now shows no difference: a few runs range 19 to 21ms in each version. That also reinforces that update_inner is fine and that the move to value pointer API didn't regress.

Changing TIDS_PER_BLOCK_FOR_LOAD to 1 to stress the tree more gives (min of 5, perf run separate from measurements):

v15 + v26 store:

 mem_allocated | load_ms
---------------+---------
      98202152 |     553

  19.71%  postgres  postgres             [.] tidstore_add_tids
+ 31.47%  postgres  postgres             [.] rt_set
= 51.18%

  20.62%  postgres  postgres             [.] rt_node_insert_leaf
   6.05%  postgres  postgres             [.] AllocSetAlloc
   4.74%  postgres  postgres             [.] AllocSetFree
   4.62%  postgres  postgres             [.] palloc
   2.23%  postgres  postgres             [.] SlabAlloc

v26:

 mem_allocated | load_ms
---------------+---------
      98202032 |     617

  57.45%  postgres  postgres             [.] tidstore_add_tids

  20.67%  postgres  postgres             [.] local_rt_node_insert_leaf
   5.99%  postgres  postgres             [.] AllocSetAlloc
   3.55%  postgres  postgres             [.] palloc
   3.05%  postgres  postgres             [.] AllocSetFree
   2.05%  postgres  postgres             [.] SlabAlloc

So it seems the store itself got faster when we removed shared memory paths from the v26 store to test it against v15.

I thought to favor the local memory case in the tidstore by controlling inlining -- it's smaller and will be called much more often, so I tried the following (done in 0007)

 #define RT_PREFIX shared_rt
 #define RT_SHMEM
-#define RT_SCOPE static
+#define RT_SCOPE static pg_noinline

That brings it down to

 mem_allocated | load_ms
---------------+---------
      98202032 |     590

That's better, but still not within noise level. Perhaps some slowdown is unavoidable, but it would be nice to understand why.

> I can think that something like traversing a HOT chain could visit
> offsets out of order. But fortunately we prune such collected TIDs
> before heap vacuum in heap case.

Further, currently we *already* assume we populate the tid array in order (for binary search), so we can just continue assuming that (with an assert added since it's more public in this form). I'm not sure why such basic common sense evaded me a few versions ago...

> > If these are acceptable, I can incorporate them into a later patchset.
>
> These are nice improvements! I agree with all changes.

Great, I've squashed these into the tidstore patch (0004). Also added 0005, which is just a simplification.

I squashed the earlier dead code removal into the radix tree patch.

v27-0008 measures tid store iteration performance and adds a stub function to prevent spurious warnings, so the benchmarking module can always be built.

Getting the list of offsets from the old array for a given block is always trivial, but tidstore_iter_extract_tids() is doing a huge amount of unnecessary work when TIDS_PER_BLOCK_FOR_LOAD is 1, enough to exceed the load time:

 mem_allocated | load_ms | iter_ms
---------------+---------+---------
      98202032 |     589 |     915

Fortunately, it's an easy fix, done in 0009.

 mem_allocated | load_ms | iter_ms
---------------+---------+---------
      98202032 |     589 |     153

I'll soon resume more cosmetic review of the tid store, but this is enough to post.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Feb 14, 2023 at 8:24 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Mon, Feb 13, 2023 at 2:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Sat, Feb 11, 2023 at 2:33 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > > I didn't get any closer to radix-tree regression,
> >
> > Me neither. It seems that in v26, inserting chunks into node-32 is
> > slow but needs more analysis. I'll share if I found something
> > interesting.
>
> If that were the case, then the other benchmarks I ran would likely have slowed down as well, but they are the same
or faster. There is one microbenchmark I didn't run before: "select * from bench_fixed_height_search(15)" (15 to reduce
noise from growing size class, and despite the name it measures load time as well). Trying this now shows no difference:
a few runs range 19 to 21ms in each version. That also reinforces that update_inner is fine and that the move to value
pointer API didn't regress.
>
> Changing TIDS_PER_BLOCK_FOR_LOAD to 1 to stress the tree more gives (min of 5, perf run separate from measurements):
>
> v15 + v26 store:
>
>  mem_allocated | load_ms
> ---------------+---------
>       98202152 |     553
>
>   19.71%  postgres  postgres             [.] tidstore_add_tids
> + 31.47%  postgres  postgres             [.] rt_set
> = 51.18%
>
>   20.62%  postgres  postgres             [.] rt_node_insert_leaf
>    6.05%  postgres  postgres             [.] AllocSetAlloc
>    4.74%  postgres  postgres             [.] AllocSetFree
>    4.62%  postgres  postgres             [.] palloc
>    2.23%  postgres  postgres             [.] SlabAlloc
>
> v26:
>
>  mem_allocated | load_ms
> ---------------+---------
>       98202032 |     617
>
>   57.45%  postgres  postgres             [.] tidstore_add_tids
>
>   20.67%  postgres  postgres             [.] local_rt_node_insert_leaf
>    5.99%  postgres  postgres             [.] AllocSetAlloc
>    3.55%  postgres  postgres             [.] palloc
>    3.05%  postgres  postgres             [.] AllocSetFree
>    2.05%  postgres  postgres             [.] SlabAlloc
>
> So it seems the store itself got faster when we removed shared memory paths from the v26 store to test it against
v15.
>
> I thought to favor the local memory case in the tidstore by controlling inlining -- it's smaller and will be called
much more often, so I tried the following (done in 0007)
>
>  #define RT_PREFIX shared_rt
>  #define RT_SHMEM
> -#define RT_SCOPE static
> +#define RT_SCOPE static pg_noinline
>
> That brings it down to
>
>  mem_allocated | load_ms
> ---------------+---------
>       98202032 |     590

The improvement makes sense to me. I've also done the same test
(changing TIDS_PER_BLOCK_FOR_LOAD to 1):

w/o 0007 patch:
 mem_allocated | load_ms | iter_ms
---------------+---------+---------
      98202032 |     334 |     445
(1 row)

w/ 0007 patch:
 mem_allocated | load_ms | iter_ms
---------------+---------+---------
      98202032 |     316 |     434
(1 row)

On the other hand, with TIDS_PER_BLOCK_FOR_LOAD being 30, the load
performance didn't improve:

w/o 0007 patch:
 mem_allocated | load_ms | iter_ms
---------------+---------+---------
      98202032 |     601 |     608
(1 row)

w/ 0007 patch:
 mem_allocated | load_ms | iter_ms
---------------+---------+---------
      98202032 |     610 |     606
(1 row)

That being said, it might be within the noise level, so I agree with the 0007 patch.

> Perhaps some slowdown is unavoidable, but it would be nice to understand why.

True.

>
> > I can think that something like traversing a HOT chain could visit
> > offsets out of order. But fortunately we prune such collected TIDs
> > before heap vacuum in heap case.
>
> Further, currently we *already* assume we populate the tid array in order (for binary search), so we can just
continueassuming that (with an assert added since it's more public in this form). I'm not sure why such basic common
senseevaded me a few versions ago... 

Right. TidStore is implemented not only for heap, so loading
out-of-order TIDs might be important in the future.

> > > If these are acceptable, I can incorporate them into a later patchset.
> >
> > These are nice improvements! I agree with all changes.
>
> Great, I've squashed these into the tidstore patch (0004). Also added 0005, which is just a simplification.
>

I've attached some small patches to improve the radix tree and tidstore:

We have the following WIP comment in test_radixtree:

// WIP: compiles with warnings because rt_attach is defined but not used
// #define RT_SHMEM

How about unsetting RT_SCOPE to suppress warnings for unused rt_attach
and friends?

FYI I've briefly tested the TidStore with blocksize = 32kb, and it
seems to work fine.

> I squashed the earlier dead code removal into the radix tree patch.

Thanks!

>
> v27-0008 measures tid store iteration performance and adds a stub function to prevent spurious warnings, so the
benchmarking module can always be built.
>
> Getting the list of offsets from the old array for a given block is always trivial, but tidstore_iter_extract_tids()
is doing a huge amount of unnecessary work when TIDS_PER_BLOCK_FOR_LOAD is 1, enough to exceed the load time:
>
>  mem_allocated | load_ms | iter_ms
> ---------------+---------+---------
>       98202032 |     589 |     915
>
> Fortunately, it's an easy fix, done in 0009.
>
>  mem_allocated | load_ms | iter_ms
> ---------------+---------+---------
>       98202032 |     589 |     153

Cool!

>
> I'll soon resume more cosmetic review of the tid store, but this is enough to post.

Thanks!

You removed the vacuum integration patch from v27; is there any reason for that?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Feb 14, 2023 at 8:24 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:

> > > I can think that something like traversing a HOT chain could visit
> > > offsets out of order. But fortunately we prune such collected TIDs
> > > before heap vacuum in heap case.
> >
> > Further, currently we *already* assume we populate the tid array in order (for binary search), so we can just continue assuming that (with an assert added since it's more public in this form). I'm not sure why such basic common sense evaded me a few versions ago...
>
> Right. TidStore is implemented not only for heap, so loading
> out-of-order TIDs might be important in the future.

That's what I was probably thinking about some weeks ago, but I'm having a hard time imagining how it would come up, even for something like the conveyor-belt concept.

> We have the following WIP comment in test_radixtree:
>
> // WIP: compiles with warnings because rt_attach is defined but not used
> // #define RT_SHMEM
>
> How about unsetting RT_SCOPE to suppress warnings for unused rt_attach
> and friends?

Sounds good to me, and the other fixes make sense as well.

> FYI I've briefly tested the TidStore with blocksize = 32kb, and it
> seems to work fine.

That was on my list, so great! How about the other end -- nominally we allow 512b. (In practice it won't matter, but this would make sure I didn't mess anything up when forcing all MaxTuplesPerPage to encode.)

> You removed the vacuum integration patch from v27, is there any reason for that?

Just an oversight.

Now for some general comments on the tid store...

+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backend must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)

Do we need to do anything for this todo?

It might help readability to have a concept of "off_upper/off_lower", just so we can describe things more clearly. The key is block + off_upper, and the value is a bitmap of all the off_lower bits. I hinted at that in my addition of encode_key_off(). Along those lines, maybe s/TIDSTORE_OFFSET_MASK/TIDSTORE_OFFSET_LOWER_MASK/. Actually, I'm not even sure the TIDSTORE_ prefix is valuable for these local macros.

The word "value" as a variable name is pretty generic in this context, and it might be better to call it the off_lower_bitmap, at least in some places. The "key" doesn't have a good short term for naming, but in comments we should make sure we're clear it's "block# + off_upper".

I'm not a fan of the name "tid_i", even as a temp variable -- maybe "compressed_tid"?

maybe s/tid_to_key_off/encode_tid/ and s/encode_key_off/encode_block_offset/

It might be worth using typedefs for key and value type. Actually, since key type is fixed for the foreseeable future, maybe the radix tree template should define a key typedef?

The term "result" is probably fine within the tidstore, but as a public name used by vacuum, it's not very descriptive. I don't have a good idea, though.

Some files in backend/access use CamelCase for public functions, although it's not consistent. I think doing that for tidstore would help readability, since they would stand out from rt_* functions and vacuum functions. It's a matter of taste, though.

I don't understand the control flow in tidstore_iterate_next(), or when BlockNumberIsValid() is true. If this is the best way to code this, it needs more commentary.


Some comments on vacuum:

I think we'd better get some real-world testing of this, fairly soon.

I had an idea: If it's not too much effort, it might be worth splitting it into two parts: one that just adds the store (not caring about its memory limits or progress reporting etc). During index scan, check both the new store and the array and log a warning (we don't want to exit or crash, better to try to investigate while live if possible) if the result doesn't match. Then perhaps set up an instance and let something like TPC-C run for a few days. The second patch would just restore the rest of the current patch. That would help reassure us it's working as designed. Soon I plan to do some measurements with vacuuming large tables to get some concrete numbers that the community can get excited about.

We also want to verify that progress reporting works as designed and has no weird corner cases.

  * autovacuum_work_mem) memory space to keep track of dead TIDs.  We initially
...
+ * create a TidStore with the maximum bytes that can be used by the TidStore.

This kind of implies that we allocate the maximum bytes upfront. I think this sentence can be removed. We already mentioned in the previous paragraph that we set an upper bound.

- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items),
+ vacuumed_pages)));

I don't think the format string has to change, since num_tids was changed back to int64 in an earlier patch version?

- * the memory space for storing dead items allocated in the DSM segment.  We
[a lot of whitespace adjustment]
+ * the shared TidStore. We launch parallel worker processes at the start of

The old comment still seems mostly ok? Maybe just s/DSM segment/DSA area/ or something else minor.

- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);

If we're starting from the minimum, "estimate" doesn't really describe it anymore? Maybe "Initial size"?
What does dsa_minimum_size() work out to in practice? 1MB?
Also, I think PARALLEL_VACUUM_KEY_DSA is left over from an earlier patch.


Lastly, on the radix tree:

I find extend, set, and set_extend hard to keep straight when studying the code. Maybe EXTEND -> EXTEND_UP, SET_EXTEND -> EXTEND_DOWN?

RT_ITER_UPDATE_KEY is unused, but I somehow didn't notice when turning it into a template.

+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);

+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)

These comments don't really help readers unfamiliar with the code. The iteration coding in general needs clearer description.

In the test:

+ 4, /* RT_NODE_KIND_4 */

The small size was changed to 3 -- if this test needs to know the max size for each kind (class?), I wonder why it didn't fail. Should it? Maybe we need symbols for the various fanouts.

I also want to mention now that we better decide soon if we want to support shrinking of nodes for v16, even if the tidstore never shrinks. We'll need to do it at some point, but I'm not sure if doing it now would make more work for future changes targeting highly concurrent workloads. If so, doing it now would just be wasted work. On the other hand, someone might have a use that needs deletion before someone else needs concurrency. Just in case, I have a start of node-shrinking logic, but needs some work because we need the (local pointer) parent to update to the new smaller node, just like the growing case.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
Hi,

On 2023-02-16 16:22:56 +0700, John Naylor wrote:
> On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada <sawada.mshk@gmail.com>
> > Right. TidStore is implemented not only for heap, so loading
> > out-of-order TIDs might be important in the future.
> 
> That's what I was probably thinking about some weeks ago, but I'm having a
> hard time imagining how it would come up, even for something like the
> conveyor-belt concept.

We really ought to replace the tid bitmap used for bitmap heap scans. The
hashtable we use is a pretty awful data structure for it. And that's not
filled in-order, for example.

Greetings,

Andres Freund



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Thu, Feb 16, 2023 at 11:44 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-02-16 16:22:56 +0700, John Naylor wrote:
> > On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada <sawada.mshk@gmail.com>
> > > Right. TidStore is implemented not only for heap, so loading
> > > out-of-order TIDs might be important in the future.
> >
> > That's what I was probably thinking about some weeks ago, but I'm having a
> > hard time imagining how it would come up, even for something like the
> > conveyor-belt concept.
>
> We really ought to replace the tid bitmap used for bitmap heap scans. The
> hashtable we use is a pretty awful data structure for it. And that's not
> filled in-order, for example.

I took a brief look at that and agree we should sometime make it work there as well.

v26 tidstore_add_tids() appears to assume that it's only called once per blocknumber. While the order of offsets doesn't matter there for a single block, calling it again with the same block would wipe out the earlier offsets, IIUC. To do an actual "add tid" where the order doesn't matter, it seems we would need to (acquire lock if needed), read the current bitmap and OR in the new bit if it exists, then write it back out.

That sounds slow, so it might still be good for vacuum to call a function that passes a block and an array of offsets that are assumed ordered (as in v28), but with a more accurate name, like tidstore_set_block_offsets().
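
To sketch the difference (the rt_* call shapes and names here are
assumptions, not the template's actual API): a true order-insensitive
"add" pays an extra lookup per tid, while the batch form builds each
bitmap once and sets it unconditionally:

/* order-insensitive add: read the bitmap, OR in the bit, write it back */
static void
tidstore_add_one_tid(TidStore *ts, BlockNumber block, OffsetNumber off)
{
    uint32      off_lower;
    uint64      key = encode_key_off(block, off, &off_lower);
    uint64      bitmap = 0;

    (void) rt_search(ts->tree, key, &bitmap);   /* bitmap stays 0 if absent */
    bitmap |= UINT64CONST(1) << off_lower;
    rt_set(ts->tree, key, &bitmap);
}

/* batch form, as in v28, with offsets[] assumed sorted */
void        tidstore_set_block_offsets(TidStore *ts, BlockNumber block,
                                       OffsetNumber *offsets, int num_offsets);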

--

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Feb 16, 2023 at 6:23 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Feb 14, 2023 at 8:24 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
>
> > > > I can think that something like traversing a HOT chain could visit
> > > > offsets out of order. But fortunately we prune such collected TIDs
> > > > before heap vacuum in heap case.
> > >
> > > Further, currently we *already* assume we populate the tid array in order (for binary search), so we can just
continue assuming that (with an assert added since it's more public in this form). I'm not sure why such basic common
sense evaded me a few versions ago...
> >
> > Right. TidStore is implemented not only for heap, so loading
> > out-of-order TIDs might be important in the future.
>
> That's what I was probably thinking about some weeks ago, but I'm having a hard time imagining how it would come up,
even for something like the conveyor-belt concept.
>
> > We have the following WIP comment in test_radixtree:
> >
> > // WIP: compiles with warnings because rt_attach is defined but not used
> > // #define RT_SHMEM
> >
> > How about unsetting RT_SCOPE to suppress warnings for unused rt_attach
> > and friends?
>
> Sounds good to me, and the other fixes make sense as well.

Thanks, I merged them.

>
> > FYI I've briefly tested the TidStore with blocksize = 32kb, and it
> > seems to work fine.
>
> That was on my list, so great! How about the other end -- nominally we allow 512b. (In practice it won't matter, but
this would make sure I didn't mess anything up when forcing all MaxTuplesPerPage to encode.)

According to the doc, the minimum block size is 1kB. It seems to work
fine with 1kB blocks.

>
> > You removed the vacuum integration patch from v27, is there any reason for that?
>
> Just an oversight.
>
> Now for some general comments on the tid store...
>
> + * TODO: The caller must be certain that no other backend will attempt to
> + * access the TidStore before calling this function. Other backend must
> + * explicitly call tidstore_detach to free up backend-local memory associated
> + * with the TidStore. The backend that calls tidstore_destroy must not call
> + * tidstore_detach.
> + */
> +void
> +tidstore_destroy(TidStore *ts)
>
> Do we need to do anything for this todo?

Since it's practically no problem, I think we can live with it for
now. dshash also has the same todo.

>
> It might help readability to have a concept of "off_upper/off_lower", just so we can describe things more clearly.
The key is block + off_upper, and the value is a bitmap of all the off_lower bits. I hinted at that in my addition of
encode_key_off(). Along those lines, maybe s/TIDSTORE_OFFSET_MASK/TIDSTORE_OFFSET_LOWER_MASK/. Actually, I'm not even
sure the TIDSTORE_ prefix is valuable for these local macros.
>
> The word "value" as a variable name is pretty generic in this context, and it might be better to call it the
off_lower_bitmap, at least in some places. The "key" doesn't have a good short term for naming, but in comments we
should make sure we're clear it's "block# + off_upper".
>
> I'm not a fan of the name "tid_i", even as a temp variable -- maybe "compressed_tid"?
>
> maybe s/tid_to_key_off/encode_tid/ and s/encode_key_off/encode_block_offset/
>
> It might be worth using typedefs for key and value type. Actually, since key type is fixed for the foreseeable
future, maybe the radix tree template should define a key typedef?
>
> The term "result" is probably fine within the tidstore, but as a public name used by vacuum, it's not very
descriptive. I don't have a good idea, though.
>
> Some files in backend/access use CamelCase for public functions, although it's not consistent. I think doing that for
tidstore would help readability, since they would stand out from rt_* functions and vacuum functions. It's a matter of
taste, though.
>
> I don't understand the control flow in tidstore_iterate_next(), or when BlockNumberIsValid() is true. If this is the
best way to code this, it needs more commentary.

The attached 0008 patch addressed all above comments on tidstore.

> Some comments on vacuum:
>
> I think we'd better get some real-world testing of this, fairly soon.
>
> I had an idea: If it's not too much effort, it might be worth splitting it into two parts: one that just adds the
store (not caring about its memory limits or progress reporting etc). During index scan, check both the new store and
the array and log a warning (we don't want to exit or crash, better to try to investigate while live if possible) if the
result doesn't match. Then perhaps set up an instance and let something like TPC-C run for a few days. The second patch
would just restore the rest of the current patch. That would help reassure us it's working as designed.

Yeah, I did a similar thing in an earlier version of the tidstore patch.
Since we're trying to introduce two new components, radix tree and
tidstore, I sometimes find it hard to investigate failures happening
during lazy (parallel) vacuum due to a bug either in tidstore or radix
tree. If there is a bug in lazy vacuum, we cannot even do initdb. So
it might be a good idea to do such checks in USE_ASSERT_CHECKING (or
with another macro, say DEBUG_TIDSTORE) builds. For example, TidStore
stores tids to both the radix tree and the array, and checks whether the
results match on lookup or iteration. It will use more memory but it
would not be a big problem in USE_ASSERT_CHECKING builds. It would
also be great if we could enable such checks on some bf animals.
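
As a sketch of what such a TIDSTORE_DEBUG cross-check could look like
(all names and the rt_search call shape are assumptions for illustration):

#ifdef TIDSTORE_DEBUG
static int
itemptr_cmp(const void *a, const void *b)
{
    return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
}
#endif

bool
tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
{
    uint32      off_lower;
    uint64      key = tid_to_key_off(tid, &off_lower);
    uint64      bitmap = 0;
    bool        found;

    found = rt_search(ts->tree, key, &bitmap) &&
        (bitmap & (UINT64CONST(1) << off_lower)) != 0;

#ifdef TIDSTORE_DEBUG
    /* the debug array mirrors every stored tid, kept in sorted order */
    Assert(found == (bsearch(tid, ts->tids_debug, ts->num_tids_debug,
                             sizeof(ItemPointerData), itemptr_cmp) != NULL));
#endif

    return found;
}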

> Soon I plan to do some measurements with vacuuming large tables to get some concrete numbers that the community can
get excited about.

Thanks!

>
> We also want to verify that progress reporting works as designed and has no weird corner cases.
>
>   * autovacuum_work_mem) memory space to keep track of dead TIDs.  We initially
> ...
> + * create a TidStore with the maximum bytes that can be used by the TidStore.
>
> This kind of implies that we allocate the maximum bytes upfront. I think this sentence can be removed. We already
mentioned in the previous paragraph that we set an upper bound.

Agreed.

>
> - (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
> - vacrel->relname, (long long) index, vacuumed_pages)));
> + (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
> + vacrel->relname, tidstore_num_tids(vacrel->dead_items),
> + vacuumed_pages)));
>
> I don't think the format string has to change, since num_tids was changed back to int64 in an earlier patch version?

I think we need to change the format to INT64_FORMAT.

>
> - * the memory space for storing dead items allocated in the DSM segment.  We
> [a lot of whitespace adjustment]
> + * the shared TidStore. We launch parallel worker processes at the start of
>
> The old comment still seems mostly ok? Maybe just s/DSM segment/DSA area/ or something else minor.
>
> - /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
> - est_dead_items_len = vac_max_items_to_alloc_size(max_items);
> - shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
> + /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
> + shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
>
> If we're starting from the minimum, "estimate" doesn't really describe it anymore? Maybe "Initial size"?
> What does dsa_minimum_size() work out to in practice? 1MB?
> Also, I think PARALLEL_VACUUM_KEY_DSA is left over from an earlier patch.
>

Right. The attached 0009 patch addressed comments on vacuum
integration except for the correctness checking.


> Lastly, on the radix tree:
>
> I find extend, set, and set_extend hard to keep straight when studying the code. Maybe EXTEND -> EXTEND_UP,
SET_EXTEND -> EXTEND_DOWN?
>
> RT_ITER_UPDATE_KEY is unused, but I somehow didn't notice when turning it into a template.

It was used in radixtree_iter_impl.h. But I removed it as it was not necessary.

>
> + /*
> + * Set the node to the node iterator and update the iterator stack
> + * from this node.
> + */
> + RT_UPDATE_ITER_STACK(iter, child, level - 1);
>
> +/*
> + * Update each node_iter for inner nodes in the iterator node stack.
> + */
> +static void
> +RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
>
> These comments don't really help readers unfamiliar with the code. The iteration coding in general needs clearer
description.
>

I agree with all of the above comments. The attached 0007 patch
addressed comments on the radix tree.

> In the test:
>
> + 4, /* RT_NODE_KIND_4 */
>
> The small size was changed to 3 -- if this test needs to know the max size for each kind (class?), I wonder why it
didn't fail. Should it? Maybe we need symbols for the various fanouts.
>

Since this information is only used to determine the number of keys
inserted, it doesn't check the node kind. So we just didn't test
node-3. It might be better to expose and use both RT_SIZE_CLASS and
RT_SIZE_CLASS_INFO.

> I also want to mention now that we better decide soon if we want to support shrinking of nodes for v16, even if the
tidstore never shrinks. We'll need to do it at some point, but I'm not sure if doing it now would make more work for
future changes targeting highly concurrent workloads. If so, doing it now would just be wasted work. On the other hand,
someone might have a use that needs deletion before someone else needs concurrency. Just in case, I have a start of
node-shrinking logic, but needs some work because we need the (local pointer) parent to update to the new smaller node,
just like the growing case.

Thanks, that's also on my todo list. TBH I'm not sure we should
improve deletion at this stage, as there is no use case for deletion
in core. I'd prefer to focus on improving the quality of the
current radix tree and tidstore now, and I think we can support
node-shrinking once we are confident in the current implementation.

On Fri, Feb 17, 2023 at 5:00 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> That sounds slow, so it might still be good for vacuum to call a function that passes a block and an array of offsets
that are assumed ordered (as in v28), but with a more accurate name, like tidstore_set_block_offsets().

tidstore_set_block_offsets() sounds better. I used
TidStoreSetBlockOffsets() in the latest patch set.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Feb 20, 2023 at 2:56 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Feb 16, 2023 at 6:23 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Tue, Feb 14, 2023 at 8:24 PM John Naylor
> > > <john.naylor@enterprisedb.com> wrote:
> >
> > > > > I can think that something like traversing a HOT chain could visit
> > > > > offsets out of order. But fortunately we prune such collected TIDs
> > > > > before heap vacuum in heap case.
> > > >
> > > > Further, currently we *already* assume we populate the tid array in order (for binary search), so we can just
continue assuming that (with an assert added since it's more public in this form). I'm not sure why such basic common
sense evaded me a few versions ago...
> > >
> > > Right. TidStore is implemented not only for heap, so loading
> > > out-of-order TIDs might be important in the future.
> >
> > That's what I was probably thinking about some weeks ago, but I'm having a hard time imagining how it would come
up, even for something like the conveyor-belt concept.
> >
> > > We have the following WIP comment in test_radixtree:
> > >
> > > // WIP: compiles with warnings because rt_attach is defined but not used
> > > // #define RT_SHMEM
> > >
> > > How about unsetting RT_SCOPE to suppress warnings for unused rt_attach
> > > and friends?
> >
> > Sounds good to me, and the other fixes make sense as well.
>
> Thanks, I merged them.
>
> >
> > > FYI I've briefly tested the TidStore with blocksize = 32kb, and it
> > > seems to work fine.
> >
> > That was on my list, so great! How about the other end -- nominally we allow 512b. (In practice it won't matter,
but this would make sure I didn't mess anything up when forcing all MaxTuplesPerPage to encode.)
>
> According to the doc, the minimum block size is 1kB. It seems to work
> fine with 1kB blocks.
>
> >
> > > You removed the vacuum integration patch from v27, is there any reason for that?
> >
> > Just an oversight.
> >
> > Now for some general comments on the tid store...
> >
> > + * TODO: The caller must be certain that no other backend will attempt to
> > + * access the TidStore before calling this function. Other backend must
> > + * explicitly call tidstore_detach to free up backend-local memory associated
> > + * with the TidStore. The backend that calls tidstore_destroy must not call
> > + * tidstore_detach.
> > + */
> > +void
> > +tidstore_destroy(TidStore *ts)
> >
> > Do we need to do anything for this todo?
>
> Since it's practically no problem, I think we can live with it for
> now. dshash also has the same todo.
>
> >
> > It might help readability to have a concept of "off_upper/off_lower", just so we can describe things more clearly.
The key is block + off_upper, and the value is a bitmap of all the off_lower bits. I hinted at that in my addition of
encode_key_off(). Along those lines, maybe s/TIDSTORE_OFFSET_MASK/TIDSTORE_OFFSET_LOWER_MASK/. Actually, I'm not even
sure the TIDSTORE_ prefix is valuable for these local macros.
> >
> > The word "value" as a variable name is pretty generic in this context, and it might be better to call it the
off_lower_bitmap, at least in some places. The "key" doesn't have a good short term for naming, but in comments we
should make sure we're clear it's "block# + off_upper".
> >
> > I'm not a fan of the name "tid_i", even as a temp variable -- maybe "compressed_tid"?
> >
> > maybe s/tid_to_key_off/encode_tid/ and s/encode_key_off/encode_block_offset/
> >
> > It might be worth using typedefs for key and value type. Actually, since key type is fixed for the foreseeable
future, maybe the radix tree template should define a key typedef?
> >
> > The term "result" is probably fine within the tidstore, but as a public name used by vacuum, it's not very
descriptive. I don't have a good idea, though.
> >
> > Some files in backend/access use CamelCase for public functions, although it's not consistent. I think doing that
for tidstore would help readability, since they would stand out from rt_* functions and vacuum functions. It's a matter
of taste, though.
> >
> > I don't understand the control flow in tidstore_iterate_next(), or when BlockNumberIsValid() is true. If this is
the best way to code this, it needs more commentary.
>
> The attached 0008 patch addressed all above comments on tidstore.
>
> > Some comments on vacuum:
> >
> > I think we'd better get some real-world testing of this, fairly soon.
> >
> > I had an idea: If it's not too much effort, it might be worth splitting it into two parts: one that just adds the
store (not caring about its memory limits or progress reporting etc). During index scan, check both the new store and
the array and log a warning (we don't want to exit or crash, better to try to investigate while live if possible) if the
result doesn't match. Then perhaps set up an instance and let something like TPC-C run for a few days. The second patch
would just restore the rest of the current patch. That would help reassure us it's working as designed.
>
> Yeah, I did a similar thing in an earlier version of tidstore patch.
> Since we're trying to introduce two new components: radix tree and
> tidstore, I sometimes find it hard to investigate failures happening
> during lazy (parallel) vacuum due to a bug either in tidstore or radix
> tree. If there is a bug in lazy vacuum, we cannot even do initdb. So
> it might be a good idea to do such checks in USE_ASSERT_CHECKING (or
> with another macro say DEBUG_TIDSTORE) builds. For example, TidStore
> stores tids to both the radix tree and array, and checks if the
> results match when lookup or iteration. It will use more memory but it
> would not be a big problem in USE_ASSERT_CHECKING builds. It would
> also be great if we can enable such checks on some bf animals.

I've tried this idea. Enabling this check on all debug builds (i.e.,
with the USE_ASSERT_CHECKING macro) doesn't seem like a good idea, so I
used a special macro for that, TIDSTORE_DEBUG. I think we can define
this macro on some bf animals (or possibly a new bf animal).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Wed, Feb 22, 2023 at 1:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Feb 20, 2023 at 2:56 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Yeah, I did a similar thing in an earlier version of tidstore patch.

Okay, if you had checks against the old array lookup in development, that gives us better confidence. 

> > Since we're trying to introduce two new components: radix tree and
> > tidstore, I sometimes find it hard to investigate failures happening
> > during lazy (parallel) vacuum due to a bug either in tidstore or radix
> > tree. If there is a bug in lazy vacuum, we cannot even do initdb. So
> > it might be a good idea to do such checks in USE_ASSERT_CHECKING (or
> > with another macro say DEBUG_TIDSTORE) builds. For example, TidStore
> > stores tids to both the radix tree and array, and checks if the
> > results match when lookup or iteration. It will use more memory but it
> > would not be a big problem in USE_ASSERT_CHECKING builds. It would
> > also be great if we can enable such checks on some bf animals.
>
> I've tried this idea. Enabling this check on all debug builds (i.e.,
> with USE_ASSERT_CHECKING macro) seems not a good idea so I use a
> special macro for that, TIDSTORE_DEBUG. I think we can define this
> macro on some bf animals (or possibly a new bf animal).

 I don't think any vacuum calls in regression tests would stress any of this code very much, so it's not worth carrying the old way forward. I was thinking of only doing this as a short-time sanity check for testing a real-world workload.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Feb 22, 2023 at 4:35 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
>
> On Wed, Feb 22, 2023 at 1:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Feb 20, 2023 at 2:56 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > Yeah, I did a similar thing in an earlier version of tidstore patch.
>
> Okay, if you had checks against the old array lookup in development, that gives us better confidence.
>
> > > Since we're trying to introduce two new components: radix tree and
> > > tidstore, I sometimes find it hard to investigate failures happening
> > > during lazy (parallel) vacuum due to a bug either in tidstore or radix
> > > tree. If there is a bug in lazy vacuum, we cannot even do initdb. So
> > > it might be a good idea to do such checks in USE_ASSERT_CHECKING (or
> > > with another macro say DEBUG_TIDSTORE) builds. For example, TidStore
> > > stores tids to both the radix tree and array, and checks if the
> > > results match when lookup or iteration. It will use more memory but it
> > > would not be a big problem in USE_ASSERT_CHECKING builds. It would
> > > also be great if we can enable such checks on some bf animals.
> >
> > I've tried this idea. Enabling this check on all debug builds (i.e.,
> > with USE_ASSERT_CHECKING macro) seems not a good idea so I use a
> > special macro for that, TIDSTORE_DEBUG. I think we can define this
> > macro on some bf animals (or possibly a new bf animal).
>
>  I don't think any vacuum calls in regression tests would stress any of this code very much, so it's not worth
carrying the old way forward. I was thinking of only doing this as a short-time sanity check for testing a real-world
workload.

I guess it would also be helpful at least until the GA release.
People will be able to test them easily on their workloads or their
custom test scenarios.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Wed, Feb 22, 2023 at 3:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Feb 22, 2023 at 4:35 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> >  I don't think any vacuum calls in regression tests would stress any of this code very much, so it's not worth carrying the old way forward. I was thinking of only doing this as a short-time sanity check for testing a real-world workload.
>
> I guess that It would also be helpful at least until the GA release.
> People will be able to test them easily on their workloads or their
> custom test scenarios.

That doesn't seem useful to me. If we've done enough testing to reassure us the new way always gives the same answer, the old way is not needed at commit time. If there is any doubt it will always give the same answer, then the whole patchset won't be committed.

TPC-C was just an example. It should have testing comparing the old and new methods. If you have already done that to some degree, that might be enough. After performance tests, I'll also try some vacuums that use the comparison patch.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
I ran a couple "in situ" tests on server hardware using UUID columns, since they are common in the real world and have bad correlation to heap order, so are a challenge for index vacuum.

=== test 1, delete everything from a small table, with very small maintenance_work_mem:

alter system set shared_buffers ='4GB';
alter system set max_wal_size ='10GB';
alter system set checkpoint_timeout ='30 min';
alter system set autovacuum =off;

-- unrealistically low
alter system set maintenance_work_mem = '32MB';

create table if not exists test (x uuid);
truncate table test;
insert into test (x) select gen_random_uuid() from generate_series(1,50*1000*1000);
create index on test (x);

delete from test;
vacuum (verbose, truncate off) test;
--

master:
INFO:  finished vacuuming "john.naylor.public.test": index scans: 9
system usage: CPU: user: 70.04 s, system: 19.85 s, elapsed: 802.06 s

v29 patch:
INFO:  finished vacuuming "john.naylor.public.test": index scans: 1
system usage: CPU: user: 9.80 s, system: 2.62 s, elapsed: 36.68 s

This is a bit artificial, but it's easy to construct cases where the array leads to multiple index scans but the new tid store can fit everything without breaking a sweat. I didn't save the progress reporting, but v29 was using about 11MB for tid storage.


=== test 2: try to stress tid lookup with production maintenance_work_mem:
1. use unlogged table to reduce noise
2. vacuum freeze first to reduce heap scan time
3. delete some records at the beginning and end of heap to defeat binary search's pre-check

alter system set shared_buffers ='4GB';
alter system set max_wal_size ='10GB';
alter system set checkpoint_timeout ='30 min';
alter system set autovacuum =off;

alter system set maintenance_work_mem = '1GB';

create unlogged table if not exists test (x uuid);
truncate table test;
insert into test (x) select gen_random_uuid() from generate_series(1,1000*1000*1000);
vacuum freeze test;

select pg_size_pretty(pg_table_size('test'));
 pg_size_pretty
----------------
 41 GB

create index on test (x);

select pg_size_pretty(pg_total_relation_size('test'));
 pg_size_pretty
----------------
 71 GB

select max(ctid) from test;
     max      
--------------
 (5405405,75)

delete from test where ctid <  '(100000,0)'::tid;
delete from test where ctid > '(5300000,0)'::tid;

vacuum (verbose, truncate off) test;

both:
INFO:  vacuuming "john.naylor.public.test"
INFO:  finished vacuuming "john.naylor.public.test": index scans: 1
index scan needed: 205406 pages from table (3.80% of total) had 38000000 dead item identifiers removed

--
master:
system usage: CPU: user: 134.32 s, system: 19.24 s, elapsed: 286.14 s

v29 patch:
system usage: CPU: user:  97.71 s, system: 45.78 s, elapsed: 573.94 s

The entire vacuum took 25% less wall clock time. Reminder that this is without wal logging, and also unscientific because only one run.

--
I took 10 seconds of perf data while index vacuuming was going on (showing calls > 2%):

master:
  40.59%  postgres  postgres            [.] vac_cmp_itemptr
  24.97%  postgres  libc-2.17.so        [.] bsearch
   6.67%  postgres  postgres            [.] btvacuumpage
   4.61%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
   3.48%  postgres  postgres            [.] PageIndexMultiDelete
   2.67%  postgres  postgres            [.] vac_tid_reaped
   2.03%  postgres  postgres            [.] compactify_tuples
   2.01%  postgres  libc-2.17.so        [.] __memcpy_ssse3_back

v29 patch:

  29.22%  postgres  postgres            [.] TidStoreIsMember
   9.30%  postgres  postgres            [.] btvacuumpage
   7.76%  postgres  postgres            [.] PageIndexMultiDelete
   6.31%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
   5.60%  postgres  postgres            [.] compactify_tuples
   4.26%  postgres  libc-2.17.so        [.] __memcpy_ssse3_back
   4.12%  postgres  postgres            [.] hash_search_with_hash_value

--
master:
psql -c "select phase, heap_blks_total, heap_blks_scanned, max_dead_tuples, num_dead_tuples from pg_stat_progress_vacuum"
       phase       | heap_blks_total | heap_blks_scanned | max_dead_tuples | num_dead_tuples
-------------------+-----------------+-------------------+-----------------+-----------------
 vacuuming indexes |         5405406 |           5405406 |       178956969 |        38000000

v29 patch:
psql  -c "select phase, heap_blks_total, heap_blks_scanned, max_dead_tuple_bytes, dead_tuple_bytes from pg_stat_progress_vacuum"
       phase       | heap_blks_total | heap_blks_scanned | max_dead_tuple_bytes | dead_tuple_bytes
-------------------+-----------------+-------------------+----------------------+------------------
 vacuuming indexes |         5405406 |           5405406 |           1073670144 |          8678064

Here, the old array pessimistically needs 1GB allocated (as for any table > ~5GB), but only fills 228MB for tid lookup. The patch reports 8.7MB. Tables that only fit, say, 30-50 tuples per page will have less extreme differences in memory use. Same for the case where only a couple dead items occur per page, with many uninteresting pages in between. Even so, the allocation will be much more accurately sized in the patch, especially in non-parallel vacuum.
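
For reference, the arithmetic behind those numbers, at 6 bytes per ItemPointerData:

  178,956,969 tid slots * 6 bytes = 1,073,741,814 bytes (the ~1GB allocated up front)
   38,000,000 dead tids * 6 bytes =   228,000,000 bytes (the ~228MB actually filled)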

There are other cases that could be tested (I mentioned some above), but this is enough to show the improvements possible.

I still need to do some cosmetic follow-up to v29 as well as a status report, and I will try to get back to that soon.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Feb 22, 2023 at 6:55 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
>
> On Wed, Feb 22, 2023 at 3:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Feb 22, 2023 at 4:35 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> >  I don't think any vacuum calls in regression tests would stress any of this code very much, so it's not worth carrying the old way forward. I was thinking of only doing this as a short-time sanity check for testing a real-world workload.
> >
> > I guess it would also be helpful at least until the GA release.
> > People will be able to test them easily on their workloads or their
> > custom test scenarios.
>
> That doesn't seem useful to me. If we've done enough testing to reassure us the new way always gives the same answer, the old way is not needed at commit time. If there is any doubt it will always give the same answer, then the whole patchset won't be committed.

True. Even if we've done enough testing, we cannot claim there is no
bug. My idea is to make the bug investigation easier but on
reflection, it seems not the best idea given this purpose. Instead, it
seems better to add more necessary assertions. What do you think
about the attached patch? Please note that it also includes the
changes for minimum memory requirement.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Feb 23, 2023 at 6:41 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> I ran a couple "in situ" tests on server hardware using UUID columns, since they are common in the real world and have bad correlation to heap order, so are a challenge for index vacuum.

Thank you for the test!

>
> === test 1, delete everything from a small table, with very small maintenance_work_mem:
>
> alter system set shared_buffers ='4GB';
> alter system set max_wal_size ='10GB';
> alter system set checkpoint_timeout ='30 min';
> alter system set autovacuum =off;
>
> -- unrealistically low
> alter system set maintenance_work_mem = '32MB';
>
> create table if not exists test (x uuid);
> truncate table test;
> insert into test (x) select gen_random_uuid() from generate_series(1,50*1000*1000);
> create index on test (x);
>
> delete from test;
> vacuum (verbose, truncate off) test;
> --
>
> master:
> INFO:  finished vacuuming "john.naylor.public.test": index scans: 9
> system usage: CPU: user: 70.04 s, system: 19.85 s, elapsed: 802.06 s
>
> v29 patch:
> INFO:  finished vacuuming "john.naylor.public.test": index scans: 1
> system usage: CPU: user: 9.80 s, system: 2.62 s, elapsed: 36.68 s
>
> This is a bit artificial, but it's easy to construct cases where the array leads to multiple index scans but the new tid store can fit everything without breaking a sweat. I didn't save the progress reporting, but v29 was using about 11MB for tid storage.

Cool.

>
>
> === test 2: try to stress tid lookup with production maintenance_work_mem:
> 1. use unlogged table to reduce noise
> 2. vacuum freeze first to reduce heap scan time
> 3. delete some records at the beginning and end of heap to defeat binary search's pre-check
>
> alter system set shared_buffers ='4GB';
> alter system set max_wal_size ='10GB';
> alter system set checkpoint_timeout ='30 min';
> alter system set autovacuum =off;
>
> alter system set maintenance_work_mem = '1GB';
>
> create unlogged table if not exists test (x uuid);
> truncate table test;
> insert into test (x) select gen_random_uuid() from generate_series(1,1000*1000*1000);
> vacuum freeze test;
>
> select pg_size_pretty(pg_table_size('test'));
>  pg_size_pretty
> ----------------
>  41 GB
>
> create index on test (x);
>
> select pg_size_pretty(pg_total_relation_size('test'));
>  pg_size_pretty
> ----------------
>  71 GB
>
> select max(ctid) from test;
>      max
> --------------
>  (5405405,75)
>
> delete from test where ctid <  '(100000,0)'::tid;
> delete from test where ctid > '(5300000,0)'::tid;
>
> vacuum (verbose, truncate off) test;
>
> both:
> INFO:  vacuuming "john.naylor.public.test"
> INFO:  finished vacuuming "john.naylor.public.test": index scans: 1
> index scan needed: 205406 pages from table (3.80% of total) had 38000000 dead item identifiers removed
>
> --
> master:
> system usage: CPU: user: 134.32 s, system: 19.24 s, elapsed: 286.14 s
>
> v29 patch:
> system usage: CPU: user:  97.71 s, system: 45.78 s, elapsed: 573.94 s

In v29 vacuum took twice as long (286 s vs. 573 s)?

>
> The entire vacuum took 25% less wall clock time. Reminder that this is without wal logging, and also unscientific because only one run.
>
> --
> I took 10 seconds of perf data while index vacuuming was going on (showing calls > 2%):
>
> master:
>   40.59%  postgres  postgres            [.] vac_cmp_itemptr
>   24.97%  postgres  libc-2.17.so        [.] bsearch
>    6.67%  postgres  postgres            [.] btvacuumpage
>    4.61%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
>    3.48%  postgres  postgres            [.] PageIndexMultiDelete
>    2.67%  postgres  postgres            [.] vac_tid_reaped
>    2.03%  postgres  postgres            [.] compactify_tuples
>    2.01%  postgres  libc-2.17.so        [.] __memcpy_ssse3_back
>
> v29 patch:
>
>   29.22%  postgres  postgres            [.] TidStoreIsMember
>    9.30%  postgres  postgres            [.] btvacuumpage
>    7.76%  postgres  postgres            [.] PageIndexMultiDelete
>    6.31%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
>    5.60%  postgres  postgres            [.] compactify_tuples
>    4.26%  postgres  libc-2.17.so        [.] __memcpy_ssse3_back
>    4.12%  postgres  postgres            [.] hash_search_with_hash_value
>
> --
> master:
> psql -c "select phase, heap_blks_total, heap_blks_scanned, max_dead_tuples, num_dead_tuples from pg_stat_progress_vacuum"
>        phase       | heap_blks_total | heap_blks_scanned | max_dead_tuples | num_dead_tuples
> -------------------+-----------------+-------------------+-----------------+-----------------
>  vacuuming indexes |         5405406 |           5405406 |       178956969 |        38000000
>
> v29 patch:
> psql  -c "select phase, heap_blks_total, heap_blks_scanned, max_dead_tuple_bytes, dead_tuple_bytes from pg_stat_progress_vacuum"
>        phase       | heap_blks_total | heap_blks_scanned | max_dead_tuple_bytes | dead_tuple_bytes
> -------------------+-----------------+-------------------+----------------------+------------------
>  vacuuming indexes |         5405406 |           5405406 |           1073670144 |          8678064
>
> Here, the old array pessimistically needs 1GB allocated (as for any table > ~5GB), but only fills 228MB for tid lookup. The patch reports 8.7MB. Tables that only fit, say, 30-50 tuples per page will have less extreme differences in memory use. Same for the case where only a couple dead items occur per page, with many uninteresting pages in between. Even so, the allocation will be much more accurately sized in the patch, especially in non-parallel vacuum.

Agreed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Fri, Feb 24, 2023 at 3:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> In v29 vacuum took twice as long (286 s vs. 573 s)?

Not sure what happened there, and clearly I was looking at the wrong number :/
I scripted the test for reproducibility and ran it three times. Also included some variations (attached):

UUID times look comparable here, so no speedup or regression:

master:
system usage: CPU: user: 216.05 s, system: 35.81 s, elapsed: 634.22 s
system usage: CPU: user: 173.71 s, system: 31.24 s, elapsed: 599.04 s
system usage: CPU: user: 171.16 s, system: 30.21 s, elapsed: 583.21 s

v29:
system usage: CPU: user:  93.47 s, system: 40.92 s, elapsed: 594.10 s
system usage: CPU: user:  99.58 s, system: 44.73 s, elapsed: 606.80 s
system usage: CPU: user:  96.29 s, system: 42.74 s, elapsed: 600.10 s

Then, I tried sequential integers, which is a much more favorable access pattern in general, and the new tid storage shows substantial improvement:

master:
system usage: CPU: user: 100.39 s, system: 7.79 s, elapsed: 121.57 s
system usage: CPU: user: 104.90 s, system: 8.81 s, elapsed: 124.24 s
system usage: CPU: user:  95.04 s, system: 7.55 s, elapsed: 116.44 s

v29:
system usage: CPU: user:  24.57 s, system: 8.53 s, elapsed: 61.07 s
system usage: CPU: user:  23.18 s, system: 8.25 s, elapsed: 58.99 s
system usage: CPU: user:  23.20 s, system: 8.98 s, elapsed: 66.86 s

That's fast enough that I thought an improvement would show up even with standard WAL logging (no separate attachment, since it's a trivial change). Seems a bit faster:

master:
system usage: CPU: user: 152.27 s, system: 11.76 s, elapsed: 216.86 s
system usage: CPU: user: 137.25 s, system: 11.07 s, elapsed: 213.62 s
system usage: CPU: user: 149.48 s, system: 12.15 s, elapsed: 220.96 s

v29:
system usage: CPU: user: 40.88 s, system: 15.99 s, elapsed: 170.98 s
system usage: CPU: user: 41.33 s, system: 15.45 s, elapsed: 166.75 s
system usage: CPU: user: 41.51 s, system: 18.20 s, elapsed: 203.94 s

There is more we could test here, but I feel better about these numbers.

In the next few days, I'll resume style review and list the remaining issues we need to address.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Feb 22, 2023 at 6:55 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > That doesn't seem useful to me. If we've done enough testing to reassure us the new way always gives the same answer, the old way is not needed at commit time. If there is any doubt it will always give the same answer, then the whole patchset won't be committed.

> My idea is to make the bug investigation easier but on
> reflection, it seems not the best idea given this purpose.

My concern with TIDSTORE_DEBUG is that it adds new code that mimics the old tid array. As I've said, that doesn't seem like a good thing to carry forward forevermore, in any form. Plus, comparing new code with new code is not the same thing as comparing existing code with new code. That was my idea upthread.

Maybe the effort my idea requires is too much vs. the likelihood of finding a problem. In any case, it's clear that if I want that level of paranoia, I'm going to have to do it myself.

> What do you think
> about the attached patch? Please note that it also includes the
> changes for minimum memory requirement.

Most of the asserts look logical, or at least harmless.

- int max_off; /* the maximum offset number */
+ OffsetNumber max_off; /* the maximum offset number */

I agree with using the specific type for offsets here, but I'm not sure why this change belongs in this patch. If we decided against the new asserts, this would be easy to lose.

This change, however, defies common sense:

+/*
+ * The minimum amount of memory required by TidStore is 2MB, the current minimum
+ * valid value for the maintenance_work_mem GUC. This is required to allocate the
+ * DSA initial segment, 1MB, and some meta data. This number is applied also to
+ * the local TidStore cases for simplicity.
+ */
+#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */

+ /* Sanity check for the max_bytes */
+ if (max_bytes < TIDSTORE_MIN_MEMORY)
+ elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided",
+ TIDSTORE_MIN_MEMORY, max_bytes);

Aside from the fact that this elog's something that would never get past development, the #define just adds a hard-coded copy of something that is already hard-coded somewhere else, whose size depends on an implementation detail in a third place.

This also assumes that all users of tid store are limited by maintenance_work_mem. Andres thought of an example of some day unifying with tidbitmap.c, and maybe other applications will be limited by work_mem.

But now that I'm looking at the guc tables, I am reminded that work_mem's minimum is 64kB, so this highlights a design problem: There is obviously no requirement that the minimum work_mem has to be >= a single DSA segment, even though operations like parallel hash and parallel bitmap heap scan are limited by work_mem. It would be nice to find out what happens with these parallel features when work_mem is tiny (maybe parallelism is not even considered?). 

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Feb 28, 2023 at 3:42 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
>
> On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Feb 22, 2023 at 6:55 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > > That doesn't seem useful to me. If we've done enough testing to reassure us the new way always gives the same answer, the old way is not needed at commit time. If there is any doubt it will always give the same answer, then the whole patchset won't be committed.
>
> > My idea is to make the bug investigation easier but on
> > reflection, it seems not the best idea given this purpose.
>
> My concern with TIDSTORE_DEBUG is that it adds new code that mimics the old tid array. As I've said, that doesn't seem like a good thing to carry forward forevermore, in any form. Plus, comparing new code with new code is not the same thing as comparing existing code with new code. That was my idea upthread.
>
> Maybe the effort my idea requires is too much vs. the likelihood of finding a problem. In any case, it's clear that if I want that level of paranoia, I'm going to have to do it myself.
>
> > What do you think
> > about the attached patch? Please note that it also includes the
> > changes for minimum memory requirement.
>
> Most of the asserts look logical, or at least harmless.
>
> - int max_off; /* the maximum offset number */
> + OffsetNumber max_off; /* the maximum offset number */
>
> I agree with using the specific type for offsets here, but I'm not sure why this change belongs in this patch. If we decided against the new asserts, this would be easy to lose.

Right. I'll separate this change as a separate patch.

>
> This change, however, defies common sense:
>
> +/*
> + * The minimum amount of memory required by TidStore is 2MB, the current minimum
> + * valid value for the maintenance_work_mem GUC. This is required to allocate the
> + * DSA initial segment, 1MB, and some meta data. This number is applied also to
> + * the local TidStore cases for simplicity.
> + */
> +#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */
>
> + /* Sanity check for the max_bytes */
> + if (max_bytes < TIDSTORE_MIN_MEMORY)
> + elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided",
> + TIDSTORE_MIN_MEMORY, max_bytes);
>
> Aside from the fact that this elog's something that would never get past development, the #define just adds a hard-coded copy of something that is already hard-coded somewhere else, whose size depends on an implementation detail in a third place.
>
> This also assumes that all users of tid store are limited by maintenance_work_mem. Andres thought of an example of some day unifying with tidbitmap.c, and maybe other applications will be limited by work_mem.
>
> But now that I'm looking at the guc tables, I am reminded that work_mem's minimum is 64kB, so this highlights a design problem: There is obviously no requirement that the minimum work_mem has to be >= a single DSA segment, even though operations like parallel hash and parallel bitmap heap scan are limited by work_mem.

Right.

>  It would be nice to find out what happens with these parallel features when work_mem is tiny (maybe parallelism is not even considered?).

IIUC both don't care about the allocated DSA segment size. Parallel
hash accounts for the actual tuple (+ header) size as used memory but
doesn't consider how much DSA segment space is allocated behind it.
Both parallel hash and parallel bitmap scan can work even with
work_mem = 64kB, but when checking the total DSA segment size
allocated during these operations, it was 1MB.

I realized that there is a similar memory limit design issue also in
the non-shared tidstore cases. We deduct 70kB from max_bytes but it
won't work with work_mem = 64kB. Probably we need to reconsider it.
FYI 70kB comes from the maximum slab block size for node256.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Feb 28, 2023 at 10:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Feb 28, 2023 at 3:42 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> >
> > On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Wed, Feb 22, 2023 at 6:55 PM John Naylor
> > > <john.naylor@enterprisedb.com> wrote:
> > > >
> > > > That doesn't seem useful to me. If we've done enough testing to reassure us the new way always gives the same answer, the old way is not needed at commit time. If there is any doubt it will always give the same answer, then the whole patchset won't be committed.
> >
> > > My idea is to make the bug investigation easier but on
> > > reflection, it seems not the best idea given this purpose.
> >
> > My concern with TIDSTORE_DEBUG is that it adds new code that mimics the old tid array. As I've said, that doesn't seem like a good thing to carry forward forevermore, in any form. Plus, comparing new code with new code is not the same thing as comparing existing code with new code. That was my idea upthread.
> >
> > Maybe the effort my idea requires is too much vs. the likelihood of finding a problem. In any case, it's clear that if I want that level of paranoia, I'm going to have to do it myself.
> >
> > > What do you think
> > > about the attached patch? Please note that it also includes the
> > > changes for minimum memory requirement.
> >
> > Most of the asserts look logical, or at least harmless.
> >
> > - int max_off; /* the maximum offset number */
> > + OffsetNumber max_off; /* the maximum offset number */
> >
> > I agree with using the specific type for offsets here, but I'm not sure why this change belongs in this patch. If we decided against the new asserts, this would be easy to lose.
>
> Right. I'll separate this change as a separate patch.
>
> >
> > This change, however, defies common sense:
> >
> > +/*
> > + * The minimum amount of memory required by TidStore is 2MB, the current minimum
> > + * valid value for the maintenance_work_mem GUC. This is required to allocate the
> > + * DSA initial segment, 1MB, and some meta data. This number is applied also to
> > + * the local TidStore cases for simplicity.
> > + */
> > +#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */
> >
> > + /* Sanity check for the max_bytes */
> > + if (max_bytes < TIDSTORE_MIN_MEMORY)
> > + elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided",
> > + TIDSTORE_MIN_MEMORY, max_bytes);
> >
> > Aside from the fact that this elog's something that would never get past development, the #define just adds a hard-coded copy of something that is already hard-coded somewhere else, whose size depends on an implementation detail in a third place.
> >
> > This also assumes that all users of tid store are limited by maintenance_work_mem. Andres thought of an example of some day unifying with tidbitmap.c, and maybe other applications will be limited by work_mem.
> >
> > But now that I'm looking at the guc tables, I am reminded that work_mem's minimum is 64kB, so this highlights a design problem: There is obviously no requirement that the minimum work_mem has to be >= a single DSA segment, even though operations like parallel hash and parallel bitmap heap scan are limited by work_mem.
>
> Right.
>
> >  It would be nice to find out what happens with these parallel features when work_mem is tiny (maybe parallelism is not even considered?).
>
> IIUC both don't care about the allocated DSA segment size. Parallel
> hash accounts for the actual tuple (+ header) size as used memory but
> doesn't consider how much DSA segment space is allocated behind it.
> Both parallel hash and parallel bitmap scan can work even with
> work_mem = 64kB, but when checking the total DSA segment size
> allocated during these operations, it was 1MB.
>
> I realized that there is a similar memory limit design issue also in
> the non-shared tidstore cases. We deduct 70kB from max_bytes but it
> won't work with work_mem = 64kB. Probably we need to reconsider it.
> FYI 70kB comes from the maximum slab block size for node256.

Currently, we calculate the slab block size so that it's big enough
to allocate 32 chunks. For node256, the leaf node is 2,088 bytes and
the slab block size is 66,816 bytes. One idea to fix this issue is to
decrease it. For example, with 16 chunks the slab block size is 33,408
bytes and with 8 chunks it's 16,704 bytes. I ran a brief benchmark
test with a 70kB block size and a 16kB block size:

* 70kB slab blocks:
select * from bench_search_random_nodes(20 * 1000 * 1000, '0xFFFFFF');
height = 2, n3 = 0, n15 = 0, n32 = 0, n125 = 0, n256 = 65793
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
     143085184 |    1216 |       750
(1 row)

* 16kB slab blocks:
select * from bench_search_random_nodes(20 * 1000 * 1000, '0xFFFFFF');
height = 2, n3 = 0, n15 = 0, n32 = 0, n125 = 0, n256 = 65793
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
     157601248 |    1220 |       786
(1 row)

There is a bit of a performance difference, but a smaller slab block
size seems acceptable if there is no better way.
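
For reference, the arithmetic behind those block sizes, at 2,088 bytes per node256 leaf:

  2,088 * 32 chunks = 66,816 bytes (~65kB, the "70kB" deducted above)
  2,088 * 16 chunks = 33,408 bytes
  2,088 *  8 chunks = 16,704 bytes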

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Tue, Feb 28, 2023 at 10:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Feb 28, 2023 at 10:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Feb 28, 2023 at 3:42 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > >
> > > On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Wed, Feb 22, 2023 at 6:55 PM John Naylor
> > > > <john.naylor@enterprisedb.com> wrote:
> > > > >
> > > > > That doesn't seem useful to me. If we've done enough testing to reassure us the new way always gives the same answer, the old way is not needed at commit time. If there is any doubt it will always give the same answer, then the whole patchset won't be committed.
> > >
> > > > My idea is to make the bug investigation easier but on
> > > > reflection, it seems not the best idea given this purpose.
> > >
> > > My concern with TIDSTORE_DEBUG is that it adds new code that mimics the old tid array. As I've said, that doesn't seem like a good thing to carry forward forevermore, in any form. Plus, comparing new code with new code is not the same thing as comparing existing code with new code. That was my idea upthread.
> > >
> > > Maybe the effort my idea requires is too much vs. the likelihood of finding a problem. In any case, it's clear that if I want that level of paranoia, I'm going to have to do it myself.
> > >
> > > > What do you think
> > > > about the attached patch? Please note that it also includes the
> > > > changes for minimum memory requirement.
> > >
> > > Most of the asserts look logical, or at least harmless.
> > >
> > > - int max_off; /* the maximum offset number */
> > > + OffsetNumber max_off; /* the maximum offset number */
> > >
> > > I agree with using the specific type for offsets here, but I'm not sure why this change belongs in this patch. If we decided against the new asserts, this would be easy to lose.
> >
> > Right. I'll separate this change as a separate patch.
> >
> > >
> > > This change, however, defies common sense:
> > >
> > > +/*
> > > + * The minimum amount of memory required by TidStore is 2MB, the current minimum
> > > + * valid value for the maintenance_work_mem GUC. This is required to allocate the
> > > + * DSA initial segment, 1MB, and some meta data. This number is applied also to
> > > + * the local TidStore cases for simplicity.
> > > + */
> > > +#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */
> > >
> > > + /* Sanity check for the max_bytes */
> > > + if (max_bytes < TIDSTORE_MIN_MEMORY)
> > > + elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided",
> > > + TIDSTORE_MIN_MEMORY, max_bytes);
> > >
> > > Aside from the fact that this elog's something that would never get past development, the #define just adds a hard-coded copy of something that is already hard-coded somewhere else, whose size depends on an implementation detail in a third place.
> > >
> > > This also assumes that all users of tid store are limited by maintenance_work_mem. Andres thought of an example of some day unifying with tidbitmap.c, and maybe other applications will be limited by work_mem.
> > >
> > > But now that I'm looking at the guc tables, I am reminded that work_mem's minimum is 64kB, so this highlights a design problem: There is obviously no requirement that the minimum work_mem has to be >= a single DSA segment, even though operations like parallel hash and parallel bitmap heap scan are limited by work_mem.
> >
> > Right.
> >
> > >  It would be nice to find out what happens with these parallel features when work_mem is tiny (maybe parallelism is not even considered?).
> >
> > IIUC both don't care about the allocated DSA segment size. Parallel
> > hash accounts for the actual tuple (+ header) size as used memory but
> > doesn't consider how much DSA segment space is allocated behind it.
> > Both parallel hash and parallel bitmap scan can work even with
> > work_mem = 64kB, but when checking the total DSA segment size
> > allocated during these operations, it was 1MB.
> >
> > I realized that there is a similar memory limit design issue also in
> > the non-shared tidstore cases. We deduct 70kB from max_bytes but it
> > won't work with work_mem = 64kB. Probably we need to reconsider it.
> > FYI 70kB comes from the maximum slab block size for node256.
>
> Currently, we calculate the slab block size so that it's big enough
> to allocate 32 chunks. For node256, the leaf node is 2,088 bytes and
> the slab block size is 66,816 bytes. One idea to fix this issue is to
> decrease it.

I think we're trying to solve the wrong problem here. I need to study this more, but it seems that code that needs to stay within a memory limit only needs to track what's been allocated in chunks within a block, since writing there is what invokes a page fault. If we're not keeping track of each and every chunk space, for speed, it doesn't follow that we need to keep every block allocation within the configured limit. I'm guessing we can just ask the context if the block space has gone *over* the limit, and we can assume that the last allocation we perform will only fault one additional page. We need to have a clear answer on this before doing anything else.

If that's correct, and I'm not positive yet, we can get rid of all the fragile assumptions about things the tid store has no business knowing about, as well as the guc change. I'm not sure how this affects progress reporting, because it would be nice if it didn't report dead_tuple_bytes bigger than max_dead_tuple_bytes.
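
A minimal sketch of the over-the-limit check I have in mind, reusing the existing block-level accessor MemoryContextMemAllocated() (the context field name here is hypothetical; max_bytes is from the v29 code):

    /* Ask the context for block-level usage; going slightly over is fine. */
    if (MemoryContextMemAllocated(ts->rt_context, true) > ts->control->max_bytes)
    {
        /* over the limit: suspend the heap scan and do an index vacuum cycle */
    }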

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Mar 1, 2023 at 3:37 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Tue, Feb 28, 2023 at 10:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Feb 28, 2023 at 10:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Tue, Feb 28, 2023 at 3:42 PM John Naylor
> > > <john.naylor@enterprisedb.com> wrote:
> > > >
> > > >
> > > > On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > >
> > > > > On Wed, Feb 22, 2023 at 6:55 PM John Naylor
> > > > > <john.naylor@enterprisedb.com> wrote:
> > > > > >
> > > > > > That doesn't seem useful to me. If we've done enough testing to reassure us the new way always gives the same answer, the old way is not needed at commit time. If there is any doubt it will always give the same answer, then the whole patchset won't be committed.
> > > >
> > > > > My idea is to make the bug investigation easier but on
> > > > > reflection, it seems not the best idea given this purpose.
> > > >
> > > > My concern with TIDSTORE_DEBUG is that it adds new code that mimics the old tid array. As I've said, that doesn't seem like a good thing to carry forward forevermore, in any form. Plus, comparing new code with new code is not the same thing as comparing existing code with new code. That was my idea upthread.
> > > >
> > > > Maybe the effort my idea requires is too much vs. the likelihood of finding a problem. In any case, it's clear that if I want that level of paranoia, I'm going to have to do it myself.
> > > >
> > > > > What do you think
> > > > > about the attached patch? Please note that it also includes the
> > > > > changes for minimum memory requirement.
> > > >
> > > > Most of the asserts look logical, or at least harmless.
> > > >
> > > > - int max_off; /* the maximum offset number */
> > > > + OffsetNumber max_off; /* the maximum offset number */
> > > >
> > > > I agree with using the specific type for offsets here, but I'm not sure why this change belongs in this patch. If we decided against the new asserts, this would be easy to lose.
> > >
> > > Right. I'll separate this change as a separate patch.
> > >
> > > >
> > > > This change, however, defies common sense:
> > > >
> > > > +/*
> > > > + * The minimum amount of memory required by TidStore is 2MB, the current minimum
> > > > + * valid value for the maintenance_work_mem GUC. This is required to allocate the
> > > > + * DSA initial segment, 1MB, and some meta data. This number is applied also to
> > > > + * the local TidStore cases for simplicity.
> > > > + */
> > > > +#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */
> > > >
> > > > + /* Sanity check for the max_bytes */
> > > > + if (max_bytes < TIDSTORE_MIN_MEMORY)
> > > > + elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided",
> > > > + TIDSTORE_MIN_MEMORY, max_bytes);
> > > >
> > > > Aside from the fact that this elog's something that would never get past development, the #define just adds a hard-coded copy of something that is already hard-coded somewhere else, whose size depends on an implementation detail in a third place.
> > > >
> > > > This also assumes that all users of tid store are limited by maintenance_work_mem. Andres thought of an example of some day unifying with tidbitmap.c, and maybe other applications will be limited by work_mem.
> > > >
> > > > But now that I'm looking at the guc tables, I am reminded that work_mem's minimum is 64kB, so this highlights a design problem: There is obviously no requirement that the minimum work_mem has to be >= a single DSA segment, even though operations like parallel hash and parallel bitmap heap scan are limited by work_mem.
> > >
> > > Right.
> > >
> > > >  It would be nice to find out what happens with these parallel features when work_mem is tiny (maybe parallelism is not even considered?).
> > >
> > > IIUC both don't care about the allocated DSA segment size. Parallel
> > > hash accounts for the actual tuple (+ header) size as used memory but
> > > doesn't consider how much DSA segment space is allocated behind it.
> > > Both parallel hash and parallel bitmap scan can work even with
> > > work_mem = 64kB, but when checking the total DSA segment size
> > > allocated during these operations, it was 1MB.
> > >
> > > I realized that there is a similar memory limit design issue also in
> > > the non-shared tidstore cases. We deduct 70kB from max_bytes but it
> > > won't work with work_mem = 64kB. Probably we need to reconsider it.
> > > FYI 70kB comes from the maximum slab block size for node256.
> >
> > Currently, we calculate the slab block size so that it's big enough
> > to allocate 32 chunks. For node256, the leaf node is 2,088 bytes and
> > the slab block size is 66,816 bytes. One idea to fix this issue is to
> > decrease it.
>
> I think we're trying to solve the wrong problem here. I need to study this more, but it seems that code that needs to stay within a memory limit only needs to track what's been allocated in chunks within a block, since writing there is what invokes a page fault.

Right. I guess we've discussed what we use for calculating the *used*
memory amount but I don't remember.

I think I was confused by the fact that we use some different
approaches to calculate the amount of used memory. Parallel hash and
tidbitmap use the allocated chunk size whereas hash_agg_check_limits()
in nodeAgg.c uses MemoryContextMemAllocated(), which uses the
allocated block size.

> If we're not keeping track of each and every chunk space, for speed, it doesn't follow that we need to keep every block allocation within the configured limit. I'm guessing we can just ask the context if the block space has gone *over* the limit, and we can assume that the last allocation we perform will only fault one additional page. We need to have a clear answer on this before doing anything else.
>
> If that's correct, and I'm not positive yet, we can get rid of all the fragile assumptions about things the tid store has no business knowing about, as well as the guc change.

True.

> I'm not sure how this affects progress reporting, because it would be nice if it didn't report dead_tuple_bytes bigger than max_dead_tuple_bytes.

Yes, the progress reporting could be confusing. Particularly, in
shared tidstore cases, the dead_tuple_bytes could be much bigger than
max_dead_tuple_bytes. Probably what we need are functions for
MemoryContext and dsa_area that get the amount of memory that has been
allocated, without tracking every chunk space. For example, the
functions would work like SlabStats() does: iterate over every
block and calculate the total/free memory usage.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Wed, Mar 1, 2023 at 6:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Mar 1, 2023 at 3:37 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> >
> > I think we're trying to solve the wrong problem here. I need to study this more, but it seems that code that needs to stay within a memory limit only needs to track what's been allocated in chunks within a block, since writing there is what invokes a page fault.
>
> Right. I guess we've discussed what we use for calculating the *used*
> memory amount but I don't remember.
>
> I think I was confused by the fact that we use some different
> approaches to calculate the amount of used memory. Parallel hash and
> tidbitmap use the allocated chunk size whereas hash_agg_check_limits()
> in nodeAgg.c uses MemoryContextMemAllocated(), which uses the
> allocated block size.

That's good to know. The latter says:

 * After adding a new group to the hash table, check whether we need to enter
 * spill mode. Allocations may happen without adding new groups (for instance,
 * if the transition state size grows), so this check is imperfect.

I'm willing to claim that vacuum can be imperfect also, given the tid store's properties: 1) on average much more efficient in used space, and 2) no longer bound by the 1GB limit. 

> > I'm not sure how this affects progress reporting, because it would be nice if it didn't report dead_tuple_bytes bigger than max_dead_tuple_bytes.
>
> Yes, the progress reporting could be confusing. Particularly, in
> shared tidstore cases, the dead_tuple_bytes could be much bigger than
> max_dead_tuple_bytes. Probably what we need are functions for
> MemoryContext and dsa_area that get the amount of memory that has been
> allocated, without tracking every chunk space. For example, the
> functions would work like SlabStats() does: iterate over every
> block and calculate the total/free memory usage.

I'm not sure we need to invent new infrastructure for this. Looking at v29 in vacuumlazy.c, the order of operations for memory accounting is:

First, get the block-level space -- stop and vacuum indexes if we exceed the limit:

/*
 * Consider if we definitely have enough space to process TIDs on page
 * already.  If we are close to overrunning the available space for
 * dead_items TIDs, pause and do a cycle of vacuuming before we tackle
 * this page.
 */
if (TidStoreIsFull(vacrel->dead_items)) --> which is basically "if (TidStoreMemoryUsage(ts) > ts->control->max_bytes)"

Then, after pruning the current page, store the tids and then get the block-level space again:

else if (prunestate.num_offsets > 0)
{
  /* Save details of the LP_DEAD items from the page in dead_items */
  TidStoreSetBlockOffsets(...);

  pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
                               TidStoreMemoryUsage(dead_items));
}

Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
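
In code form, the reordering would look roughly like this (argument lists elided as in the excerpt above):

else if (prunestate.num_offsets > 0)
{
    /* Report memory usage as of the end of the *previous* page first... */
    pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
                                 TidStoreMemoryUsage(dead_items));

    /* ...then save details of the LP_DEAD items from this page. */
    TidStoreSetBlockOffsets(...);
}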

Thoughts?

But now that I'm looking more closely at the details of memory accounting, I don't like that TidStoreMemoryUsage() is called twice per page pruned (see above). Maybe it wouldn't noticeably slow things down, but it's a bit sloppy. It seems like we should call it once per loop and save the result somewhere. If that's the right way to go, that possibly indicates that TidStoreIsFull() is not a useful interface, at least in this form.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Mar 3, 2023 at 8:04 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Wed, Mar 1, 2023 at 6:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Mar 1, 2023 at 3:37 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > >
> > > I think we're trying to solve the wrong problem here. I need to study this more, but it seems that code that needs to stay within a memory limit only needs to track what's been allocated in chunks within a block, since writing there is what invokes a page fault.
> >
> > Right. I guess we've discussed what we use for calculating the *used*
> > memory amount but I don't remember.
> >
> > I think I was confused by the fact that we use some different
> > approaches to calculate the amount of used memory. Parallel hash and
> > tidbitmap use the allocated chunk size whereas hash_agg_check_limits()
> > in nodeAgg.c uses MemoryContextMemAllocated(), which uses the
> > allocated block size.
>
> That's good to know. The latter says:
>
>  * After adding a new group to the hash table, check whether we need to enter
>  * spill mode. Allocations may happen without adding new groups (for instance,
>  * if the transition state size grows), so this check is imperfect.
>
> I'm willing to claim that vacuum can be imperfect also, given the tid store's properties: 1) on average much more efficient in used space, and 2) no longer bound by the 1GB limit.
>
> > > I'm not sure how this affects progress reporting, because it would be nice if it didn't report dead_tuple_bytes bigger than max_dead_tuple_bytes.
> >
> > Yes, the progress reporting could be confusing. Particularly, in
> > shared tidstore cases, the dead_tuple_bytes could be much bigger than
> > max_dead_tuple_bytes. Probably what we need are functions for
> > MemoryContext and dsa_area that get the amount of memory that has been
> > allocated, without tracking every chunk space. For example, the
> > functions would work like SlabStats() does: iterate over every
> > block and calculate the total/free memory usage.
>
> I'm not sure we need to invent new infrastructure for this. Looking at v29 in vacuumlazy.c, the order of operations for memory accounting is:
>
> First, get the block-level space -- stop and vacuum indexes if we exceed the limit:
>
> /*
>  * Consider if we definitely have enough space to process TIDs on page
>  * already.  If we are close to overrunning the available space for
>  * dead_items TIDs, pause and do a cycle of vacuuming before we tackle
>  * this page.
>  */
> if (TidStoreIsFull(vacrel->dead_items)) --> which is basically "if (TidStoreMemoryUsage(ts) > ts->control->max_bytes)"
>
> Then, after pruning the current page, store the tids and then get the block-level space again:
>
> else if (prunestate.num_offsets > 0)
> {
>   /* Save details of the LP_DEAD items from the page in dead_items */
>   TidStoreSetBlockOffsets(...);
>
>   pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
>                                TidStoreMemoryUsage(dead_items));
> }
>
> Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
>
> Thoughts?

It looks like it works, but it still doesn't work in the case where a
shared tidstore is created with a 64kB memory limit, right?
TidStoreMemoryUsage() returns 1MB and TidStoreIsFull() returns true
from the beginning.

BTW I realized that since the caller can pass a dsa_area to tidstore
(and the radix tree), if other data are allocated in the same DSA
area, TidStoreMemoryUsage() (and RT_MEMORY_USAGE()) returns memory
usage that includes not only the tidstore itself but also that other
data. Probably it's better to comment that the passed dsa_area should
be dedicated to a tidstore (or a radix tree).
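
Something like the following, wherever the caller-supplied dsa_area is accepted (the wording is only a sketch):

/*
 * Note: the dsa_area passed in should be dedicated to this tidstore (or
 * radix tree).  If other data are allocated in the same area,
 * TidStoreMemoryUsage() / RT_MEMORY_USAGE() will report their memory
 * as well.
 */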

>
> But now that I'm looking more closely at the details of memory accounting, I don't like that TidStoreMemoryUsage() is
calledtwice per page pruned (see above). Maybe it wouldn't noticeably slow things down, but it's a bit sloppy. It seems
likewe should call it once per loop and save the result somewhere. If that's the right way to go, that possibly
indicatesthat TidStoreIsFull() is not a useful interface, at least in this form. 

Agreed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Mar 6, 2023 at 1:28 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> > Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
> >
> > Thoughts?
>
> It looks like it works, but it still doesn't work in the case where a
> shared tidstore is created with a 64kB memory limit, right?
> TidStoreMemoryUsage() returns 1MB and TidStoreIsFull() returns true
> from the beginning.

I have two ideas:

1. Make it optional to track chunk memory space by a template parameter. It might be tiny compared to everything else that vacuum does. That would allow other users to avoid that overhead.
2. When context block usage exceeds the limit (rare), make the additional effort to get the precise usage -- I'm not sure such a top-down facility exists, and I'm not feeling well enough today to study this further.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Mar 7, 2023 at 1:01 AM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Mon, Mar 6, 2023 at 1:28 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > > Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
> > >
> > > Thoughts?
> >
> > It looks like it works, but it still doesn't work in the case where a
> > shared tidstore is created with a 64kB memory limit, right?
> > TidStoreMemoryUsage() returns 1MB and TidStoreIsFull() returns true
> > from the beginning.
>
> I have two ideas:
>
> 1. Make it optional to track chunk memory space by a template parameter. It might be tiny compared to everything else that vacuum does. That would allow other users to avoid that overhead.
> 2. When context block usage exceeds the limit (rare), make the additional effort to get the precise usage -- I'm not sure such a top-down facility exists, and I'm not feeling well enough today to study this further.

I prefer option (1) as it's straightforward. I mentioned a similar
idea before[1]. RT_MEMORY_USAGE() is defined only when the macro is
defined. It might be worth checking if there is visible overhead from
tracking chunk memory space. IIRC we've not evaluated that yet.
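
For instance, a template user that wants the accounting would opt in at include time, in the style of the existing RT_* template switches (RT_MEASURE_MEMORY_USAGE is the proposed switch; the other names here are illustrative):

#define RT_PREFIX shared_rt
#define RT_SHMEM
#define RT_MEASURE_MEMORY_USAGE    /* opt in to memory usage tracking */
#include "lib/radixtree.h"

/* shared_rt_memory_usage() would exist only because of the opt-in above */
uint64  used = shared_rt_memory_usage(tree);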

[1] https://www.postgresql.org/message-id/CAD21AoDK3gbX-jVxT6Pfso1Na0Krzr8Q15498Aj6tmXgzMFksA%40mail.gmail.com

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Tue, Mar 7, 2023 at 8:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> > 1. Make it optional to track chunk memory space by a template parameter. It might be tiny compared to everything else that vacuum does. That would allow other users to avoid that overhead.
> > 2. When context block usage exceeds the limit (rare), make the additional effort to get the precise usage -- I'm not sure such a top-down facility exists, and I'm not feeling well enough today to study this further.
>
> I prefer option (1) as it's straightforward. I mentioned a similar
> idea before[1]. RT_MEMORY_USAGE() is defined only when the macro is
> defined. It might be worth checking if there is visible overhead from
> tracking chunk memory space. IIRC we've not evaluated that yet.

Ok, let's try this -- I can test and profile later this week.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Mar 8, 2023 at 1:40 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
>

> On Tue, Mar 7, 2023 at 8:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > > 1. Make it optional to track chunk memory space by a template parameter. It might be tiny compared to everything else that vacuum does. That would allow other users to avoid that overhead.
> > > 2. When context block usage exceeds the limit (rare), make the additional effort to get the precise usage -- I'm not sure such a top-down facility exists, and I'm not feeling well enough today to study this further.
> >
> > I prefer option (1) as it's straightforward. I mentioned a similar
> > idea before[1]. RT_MEMORY_USAGE() is defined only when the macro is
> > defined. It might be worth checking if there is visible overhead from
> > tracking chunk memory space. IIRC we've not evaluated that yet.
>
> Ok, let's try this -- I can test and profile later this week.

Thanks!

I've attached the new version patches. I merged improvements and fixes
I did in the v29 patch. 0007 through 0010 are updates from v29. The
main change made in v30 is to make the memory measurement and
RT_MEMORY_USAGE() optional, which is done in the 0007 patch. The 0008
and 0009 patches are the updates for the tidstore and vacuum
integration patches. Here are the results of quick tests (an average of
3 executions):

query: select * from bench_load_random_int(10 * 1000 * 1000)

* w/ RT_MEASURE_MEMORY_USAGE:
 mem_allocated | load_ms
---------------+---------
    1996512000 |    3305
(1 row)

* w/o RT_MEASURE_MEMORY_USAGE:
 mem_allocated | load_ms
---------------+---------
             0 |    3258
(1 row)

It seems to be within noise level, but I agree with making it optional.

Apart from the memory measurement stuff, I've done another todo item
on my list: adding min/max size classes for node3 and node125. I've
done that in the 0010 patch, and here is a quick test result:

query: select * from bench_load_random_int(10 * 1000 * 1000)

* w/ 0010 patch
 mem_allocated | load_ms
---------------+---------
    1268630080 |    3275
(1 row)

* w/o 0010 patch
 mem_allocated | load_ms
---------------+---------
    1996512000 |    3214
(1 row)

That's a good improvement in memory usage, without a noticeable
performance overhead. FYI CLASS_3_MIN has a fanout of 1 and is 24 bytes
in size, and CLASS_125_MIN has a fanout of 61 and is 768 bytes in size.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Mar 9, 2023 at 1:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> I've attached the new version patches. I merged improvements and fixes
> I did in the v29 patch.

I haven't yet had a chance to look at those closely, since I've had to devote time to other commitments. I remember I wasn't particularly impressed that v29-0008 mixed my requested name-casing changes with a bunch of other random things. Separating those out would be an obvious way to make it easier for me to look at, whenever I can get back to this. I need to look at the iteration changes as well, in addition to testing memory measurement (thanks for the new results, they look encouraging).

> Apart from the memory measurement stuff, I've done another todo item
> on my list: adding min/max size classes for node3 and node125. I've done

This didn't help move us closer to something committable the first time you coded this without making sure it was a good idea. It's still not helping and arguably makes it worse. To be fair, I did speak positively about _considering_ additional size classes some months ago, but that has a very obvious maintenance cost, something we can least afford right now.

I'm frankly baffled you thought this was important enough to work on again, yet thought it was a waste of time to try to prove to ourselves that autovacuum in a realistic, non-deterministic workload gave the same answer as the current tid lookup. Even if we had gone that far, it doesn't seem like a good idea to add non-essential code to critical paths right now.

We're rapidly running out of time, and we're at the point in the cycle where it's impossible to get meaningful review from anyone not already intimately familiar with the patch series. I only want to see progress on addressing possible (especially architectural) objections from the community, because if they don't notice them now, they surely will after commit. I have my own list of possible objections as well as bikeshedding points, which I'll clean up and share next week. I plan to invite Andres to look at that list and give his impressions, because it's a lot quicker than reading the patches. Based on that, I'll hopefully be able to decide whether we have enough time to address any feedback and do remaining polishing in time for feature freeze.

I'd suggest sharing your todo list in the meanwhile, it'd be good to discuss what's worth doing and what is not.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Mar 10, 2023 at 3:42 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Thu, Mar 9, 2023 at 1:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > I've attached the new version patches. I merged improvements and fixes
> > I did in the v29 patch.
>
> I haven't yet had a chance to look at those closely, since I've had to devote time to other commitments. I remember I wasn't particularly impressed that v29-0008 mixed my requested name-casing changes with a bunch of other random things. Separating those out would be an obvious way to make it easier for me to look at, whenever I can get back to this. I need to look at the iteration changes as well, in addition to testing memory measurement (thanks for the new results, they look encouraging).

Okay, I'll separate them again.

>
> > Apart from the memory measurement stuff, I've done another todo item
> > on my list; adding min max classes for node3 and node125. I've done
>
> This didn't help move us closer to something committable the first time you coded it without making sure it was a good idea. It's still not helping and arguably makes it worse. To be fair, I did speak positively about _considering_ additional size classes some months ago, but that has a very obvious maintenance cost, something we can least afford right now.
>
> I'm frankly baffled you thought this was important enough to work on again, yet thought it was a waste of time to try to prove to ourselves that autovacuum in a realistic, non-deterministic workload gave the same answer as the current tid lookup. Even if we had gone that far, it doesn't seem like a good idea to add non-essential code to critical paths right now.

I didn't think that proving that tidstore and the current tid lookup
return the same result was a waste of time. I've shared a patch to do
that in tidstore before. I agreed not to add it to the tree, but we
can test that using this patch. In fact, I've run a test with a
pgbench workload for a few days.

IIUC it's still important to consider whether to have node1, since it
could be a good alternative to path compression. The prototype also
implemented it. Of course, we can leave it for future improvement.
But considering this item together with the performance tests helps
us prove that our decoupling approach is promising.

> We're rapidly running out of time, and we're at the point in the cycle where it's impossible to get meaningful review from anyone not already intimately familiar with the patch series. I only want to see progress on addressing possible (especially architectural) objections from the community, because if they don't notice them now, they surely will after commit.

Right, we've been making many design decisions. Some of them were
agreed just between you and me, and some were agreed with other
hackers. Some design decisions will be irreversible given the
remaining time.

>  I have my own list of possible objections as well as bikeshedding points, which I'll clean up and share next week.

Thanks.

>  I plan to invite Andres to look at that list and give his impressions, because it's a lot quicker than reading the patches. Based on that, I'll hopefully be able to decide whether we have enough time to address any feedback and do remaining polishing in time for feature freeze.
>
> I'd suggest sharing your todo list in the meanwhile, it'd be good to discuss what's worth doing and what is not.

Apart from more rounds of reviews and tests, my todo items that need
discussion and possibly implementation are:

* The memory measurement in radix trees and the memory limit in
tidstores. I've implemented it in v30-0007 through 0009 but we need to
review it. This is the highest priority for me.

* Additional size classes. It's important as an alternative to path
compression as well as for supporting our decoupling approach. Middle
priority.

* Node shrinking support. Low priority.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Mar 10, 2023 at 11:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Mar 10, 2023 at 3:42 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > On Thu, Mar 9, 2023 at 1:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > > I've attached the new version patches. I merged improvements and fixes
> > > I did in the v29 patch.
> >
> > I haven't yet had a chance to look at those closely, since I've had to devote time to other commitments. I remember I wasn't particularly impressed that v29-0008 mixed my requested name-casing changes with a bunch of other random things. Separating those out would be an obvious way to make it easier for me to look at, whenever I can get back to this. I need to look at the iteration changes as well, in addition to testing memory measurement (thanks for the new results, they look encouraging).
>
> Okay, I'll separate them again.

Attached new patch series. In addition to separate them again, I've
fixed a conflict with HEAD.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Fri, Mar 10, 2023 at 9:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Mar 10, 2023 at 3:42 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:

> > I'd suggest sharing your todo list in the meanwhile, it'd be good to discuss what's worth doing and what is not.
>
> Apart from more rounds of reviews and tests, my todo items that need
> discussion and possibly implementation are:

Quick thoughts on these:

> * The memory measurement in radix trees and the memory limit in
> tidstores. I've implemented it in v30-0007 through 0009 but we need to
> review it. This is the highest priority for me.

Agreed.

> * Additional size classes. It's important for an alternative of path
> compression as well as supporting our decoupling approach. Middle
> priority.

I'm going to push back a bit and claim this doesn't bring much gain, while it does have a complexity cost. The node1 from Andres's prototype is 32 bytes in size, same as our node3, so it's roughly equivalent as a way to ameliorate the lack of path compression. I say "roughly" because the loop in node3 is probably noticeably slower. A new size class will by definition still use that loop.

About a smaller node125-type class: I'm actually not even sure we need to have any sub-max node bigger than about 64 (node size 768 bytes). I'd just let 65+ go to the max node -- there won't be many of them, at least in synthetic workloads we've seen so far.

> * Node shrinking support. Low priority.

This is an architectural wart that's been neglected since the tid store doesn't perform deletion. We'll need it sometime. If we're not going to make this work, why ship a deletion API at all?

I took a look at this a couple weeks ago, and fixing it wouldn't be that hard. I even had an idea of how to detect when to shrink size class within a node kind, while keeping the header at 5 bytes. I'd be willing to put effort into that, but to have a chance of succeeding, I'm unwilling to make it more difficult by adding more size classes at this point.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Sun, Mar 12, 2023 at 12:54 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Fri, Mar 10, 2023 at 9:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Mar 10, 2023 at 3:42 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
>
> > > I'd suggest sharing your todo list in the meanwhile, it'd be good to discuss what's worth doing and what is not.
> >
> > Apart from more rounds of reviews and tests, my todo items that need
> > discussion and possibly implementation are:
>
> Quick thoughts on these:
>
> > * The memory measurement in radix trees and the memory limit in
> > tidstores. I've implemented it in v30-0007 through 0009 but we need to
> > review it. This is the highest priority for me.
>
> Agreed.
>
> > * Additional size classes. It's important for an alternative of path
> > compression as well as supporting our decoupling approach. Middle
> > priority.
>
> I'm going to push back a bit and claim this doesn't bring much gain, while it does have a complexity cost. The node1 from Andres's prototype is 32 bytes in size, same as our node3, so it's roughly equivalent as a way to ameliorate the lack of path compression.

But does it mean that our node1 would help reduce the memory further,
since our base node type (i.e., RT_NODE) is smaller than the base
node type of Andres's prototype? The result I shared before showed
1.2GB vs. 1.9GB.

> I say "roughly" because the loop in node3 is probably noticeably slower. A new size class will by definition still
usethat loop. 

I've evaluated the performance of node1 but the result seems to show
the opposite. I used the test query:

select * from bench_search_random_nodes(100 * 1000 * 1000,
'0xFF000000000000FF');

This makes the radix tree that has node1 look like:

max_val = 18446744073709551615
num_keys = 65536
height = 7, n1 = 1536, n3 = 0, n15 = 0, n32 = 0, n61 = 0, n256 = 257

All internal nodes except for the root node are node1. The radix tree
that doesn't have node1 is:

max_val = 18446744073709551615
num_keys = 65536
height = 7, n3 = 1536, n15 = 0, n32 = 0, n125 = 0, n256 = 257

Here is the result:

* w/ node1
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
        573448 |    1848 |      1707
(1 row)

* w/o node1
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
        598024 |    2014 |      1825
(1 row)

Am I missing something?

>
> About a smaller node125-type class: I'm actually not even sure we need to have any sub-max node bigger than about 64 (node size 768 bytes). I'd just let 65+ go to the max node -- there won't be many of them, at least in synthetic workloads we've seen so far.

Makes sense to me.

>
> > * Node shrinking support. Low priority.
>
> This is an architectural wart that's been neglected since the tid store doesn't perform deletion. We'll need it sometime. If we're not going to make this work, why ship a deletion API at all?
>
> I took a look at this a couple weeks ago, and fixing it wouldn't be that hard. I even had an idea of how to detect when to shrink size class within a node kind, while keeping the header at 5 bytes. I'd be willing to put effort into that, but to have a chance of succeeding, I'm unwilling to make it more difficult by adding more size classes at this point.

I think that the deletion (and locking support) doesn't have use cases
in the core (i.e., tidstore) but is implemented so that external
extensions can use it. There might not be such extensions. Given the
lack of use cases in the core (and the remaining time), I think it's
okay even if the implementation of such an API is minimal and not
fully optimized. For instance, the implementation of dshash.c is
minimalist and doesn't have resizing. We can improve it in the future
if extensions or other core features want it.

Personally, I think we should focus on addressing the feedback we get
and improving the existing use cases in the remaining time. That's
why considering min/max size classes has a higher priority than node
shrinking support in my todo list.

FYI, I've run TPC-C workload over the weekend, and didn't get any
failures of the assertion proving tidstore and the current tid lookup
return the same result.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Mar 13, 2023 at 8:41 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sun, Mar 12, 2023 at 12:54 AM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > On Fri, Mar 10, 2023 at 9:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> > > * Additional size classes. It's important for an alternative of path
> > > compression as well as supporting our decoupling approach. Middle
> > > priority.
> >
> > I'm going to push back a bit and claim this doesn't bring much gain, while it does have a complexity cost. The node1 from Andres's prototype is 32 bytes in size, same as our node3, so it's roughly equivalent as a way to ameliorate the lack of path compression.
>
> But does it mean that our node1 would help reduce the memory further
> since since our base node type (i.e. RT_NODE) is smaller than the base
> node type of Andres's prototype? The result I shared before showed
> 1.2GB vs. 1.9GB.

The benefit is found in a synthetic benchmark with random integers. I highly doubt that anyone would be willing to force us to keep binary-searching the 1GB array for one more cycle on account of not adding a size class here. I'll repeat myself and say that there are also maintenance costs.

In contrast, I'm fairly certain that our attempts thus far at memory accounting/limiting are not quite up to par, and lacking enough to jeopardize the feature. We're already discussing that, so I'll say no more.

> > I say "roughly" because the loop in node3 is probably noticeably slower. A new size class will by definition still use that loop.
>
> I've evaluated the performance of node1 but the result seems to show
> the opposite.

As an aside, I meant the loop in our node3 might make your node1 slower than the prototype's node1, which was coded for 1 member only. 

> > > * Node shrinking support. Low priority.
> >
> > This is an architectural wart that's been neglected since the tid store doesn't perform deletion. We'll need it sometime. If we're not going to make this work, why ship a deletion API at all?
> >
> > I took a look at this a couple weeks ago, and fixing it wouldn't be that hard. I even had an idea of how to detect when to shrink size class within a node kind, while keeping the header at 5 bytes. I'd be willing to put effort into that, but to have a chance of succeeding, I'm unwilling to make it more difficult by adding more size classes at this point.
>
> I think that the deletion (and locking support) doesn't have use cases
> in the core (i.e. tidstore) but is implemented so that external
> extensions can use it.

I think these cases are a bit different: Doing anything with a data structure stored in shared memory without a synchronization scheme is completely unthinkable and insane. I'm not yet sure if deleting-without-shrinking is a showstopper, or if it's preferable in v16 to no deletion at all.

Anything we don't implement now is a limit on future use cases, and thus a cause for objection. On the other hand, anything we implement also represents more stuff that will have to be rewritten for high-concurrency.

> FYI, I've run TPC-C workload over the weekend, and didn't get any
> failures of the assertion proving tidstore and the current tid lookup
> return the same result.

Great!

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Mar 13, 2023 at 10:28 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Mon, Mar 13, 2023 at 8:41 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Sun, Mar 12, 2023 at 12:54 AM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > > On Fri, Mar 10, 2023 at 9:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > > > * Additional size classes. It's important for an alternative of path
> > > > compression as well as supporting our decoupling approach. Middle
> > > > priority.
> > >
> > > I'm going to push back a bit and claim this doesn't bring much gain, while it does have a complexity cost. The node1 from Andres's prototype is 32 bytes in size, same as our node3, so it's roughly equivalent as a way to ameliorate the lack of path compression.
> >
> > But does it mean that our node1 would help reduce the memory further
> > since since our base node type (i.e. RT_NODE) is smaller than the base
> > node type of Andres's prototype? The result I shared before showed
> > 1.2GB vs. 1.9GB.
>
> The benefit is found in a synthetic benchmark with random integers. I highly doubt that anyone would be willing to force us to keep binary-searching the 1GB array for one more cycle on account of not adding a size class here. I'll repeat myself and say that there are also maintenance costs.
>
> In contrast, I'm fairly certain that our attempts thus far at memory accounting/limiting are not quite up to par, and lacking enough to jeopardize the feature. We're already discussing that, so I'll say no more.

I agree that memory accounting/limiting stuff is the highest priority.
So what kinds of size classes do you think we need? node3, 15, 32, 61
and 256?

>
> > > I say "roughly" because the loop in node3 is probably noticeably slower. A new size class will by definition
stilluse that loop. 
> >
> > I've evaluated the performance of node1 but the result seems to show
> > the opposite.
>
> As an aside, I meant the loop in our node3 might make your node1 slower than the prototype's node1, which was coded for 1 member only.

Agreed.

>
> > > > * Node shrinking support. Low priority.
> > >
> > > This is an architectural wart that's been neglected since the tid store doesn't perform deletion. We'll need it sometime. If we're not going to make this work, why ship a deletion API at all?
> > >
> > > I took a look at this a couple weeks ago, and fixing it wouldn't be that hard. I even had an idea of how to detect when to shrink size class within a node kind, while keeping the header at 5 bytes. I'd be willing to put effort into that, but to have a chance of succeeding, I'm unwilling to make it more difficult by adding more size classes at this point.
> >
> > I think that the deletion (and locking support) doesn't have use cases
> > in the core (i.e. tidstore) but is implemented so that external
> > extensions can use it.
>
> I think these cases are a bit different: Doing anything with a data structure stored in shared memory without a synchronization scheme is completely unthinkable and insane.

Right.

> I'm not yet sure if deleting-without-shrinking is a showstopper, or if it's preferable in v16 to no deletion at all.
>
> Anything we don't implement now is a limit on future use cases, and thus a cause for objection. On the other hand, anything we implement also represents more stuff that will have to be rewritten for high-concurrency.

Okay. Given that adding shrinking support also incurs maintenance
costs (and probably new test cases?) and there are no use cases in the
core, I'm not sure it's worth supporting at this stage. So I'd prefer
either shipping the deletion API as it is or removing it. I think
that's a discussion point on which we'd like to hear feedback from
other hackers.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
I wrote:

> > > Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
> > >
> > > Thoughts?
> >
> > It looks to work but it still doesn't work in a case where a shared
> > tidstore is created with a 64kB memory limit, right?
> > TidStoreMemoryUsage() returns 1MB and TidStoreIsFull() returns true
> > from the beginning.
>
> I have two ideas:
>
> 1. Make it optional to track chunk memory space by a template parameter. It might be tiny compared to everything else that vacuum does. That would allow other users to avoid that overhead.
> 2. When context block usage exceeds the limit (rare), make the additional effort to get the precise usage -- I'm not sure such a top-down facility exists, and I'm not feeling well enough today to study this further.

Since then, Masahiko incorporated #1 into v31, and that's what I'm looking at now. Unfortunately, if I had spent five minutes reminding myself what the original objections were to this approach, I could have saved us some effort. Back in July (!), Andres raised two points: GetMemoryChunkSpace() is slow [1], and fragmentation [2] (leading to underestimation).

In v31, in the local case at least, the underestimation is actually worse than tracking chunk space, since it ignores chunk header and alignment.  I'm not sure about the DSA case. This doesn't seem great.

It shouldn't be a surprise why a simple increment of raw allocation size is comparable in speed -- GetMemoryChunkSpace() calls the right function through a pointer, which is slower. If we were willing to underestimate for the sake of speed, that takes away the reason for making memory tracking optional.
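
To make the comparison concrete (GetMemoryChunkSpace() is the existing mcxt.c function; the mem_used counter stands in for the v31-style bookkeeping):

p = MemoryContextAlloc(ctx, size);

/* v31 style: bump a counter by the raw request size -- cheap, but it
 * ignores the chunk header and alignment padding (underestimates) */
tree->mem_used += size;

/* chunk-space style: precise per-chunk accounting, but it reaches the
 * context's method through a function pointer (slower) */
tree->mem_used += GetMemoryChunkSpace(p);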

Further, if the option is not specified, in v31 there is no way to get the memory use at all, which seems odd. Surely the caller should be able to ask the context/area, if it wants to.

I still like my idea at the top of the page -- at least for vacuum and m_w_m. It's still not completely clear if it's right but I've got nothing better. It also ignores the work_mem issue, but I've given up anticipating all future cases at the moment.
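
In loop form, the re-ordering amounts to something like this (a sketch only: TidStoreIsFull() and TidStoreMemoryUsage() are from the patch, while the other helper names and the loop shape are invented for illustration):

for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
{
    /* top of the loop: the limit check sees only memory in use as of
     * the previous page */
    if (TidStoreIsFull(dead_items))
        break;                  /* suspend scan; do index/heap vacuum */

    /* report progress for the *last* page's memory usage ... */
    update_progress(TidStoreMemoryUsage(dead_items));

    /* ... and only then add this page's dead tids: a freshly
     * allocated, mostly unwritten block isn't counted against the
     * limit until the next iteration */
    add_page_dead_tids(dead_items, blkno, offsets, noffsets);
}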

I'll put this item and a couple other things together in a separate email tomorrow.

[1] https://www.postgresql.org/message-id/20220704211822.kfxtzpcdmslzm2dy%40awork3.anarazel.de
[2] https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de

--

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Mar 14, 2023 at 8:27 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> I wrote:
>
> > > > Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
> > > >
> > > > Thoughts?
> > >
> > > It looks to work but it still doesn't work in a case where a shared
> > > tidstore is created with a 64kB memory limit, right?
> > > TidStoreMemoryUsage() returns 1MB and TidStoreIsFull() returns true
> > > from the beginning.
> >
> > I have two ideas:
> >
> > 1. Make it optional to track chunk memory space by a template parameter. It might be tiny compared to everything else that vacuum does. That would allow other users to avoid that overhead.
> > 2. When context block usage exceeds the limit (rare), make the additional effort to get the precise usage -- I'm not sure such a top-down facility exists, and I'm not feeling well enough today to study this further.
>
> Since then, Masahiko incorporated #1 into v31, and that's what I'm looking at now. Unfortunately, if I had spent five minutes reminding myself what the original objections were to this approach, I could have saved us some effort. Back in July (!), Andres raised two points: GetMemoryChunkSpace() is slow [1], and fragmentation [2] (leading to underestimation).
>
> In v31, in the local case at least, the underestimation is actually worse than tracking chunk space, since it ignores chunk header and alignment. I'm not sure about the DSA case. This doesn't seem great.

Right.

>
> It shouldn't be a surprise why a simple increment of raw allocation size is comparable in speed -- GetMemoryChunkSpace() calls the right function through a pointer, which is slower. If we were willing to underestimate for the sake of speed, that takes away the reason for making memory tracking optional.
>
> Further, if the option is not specified, in v31 there is no way to get the memory use at all, which seems odd. Surely the caller should be able to ask the context/area, if it wants to.

There are precedents that don't provide a way to return memory usage,
such as simplehash.h and dshash.c.

>
> I still like my idea at the top of the page -- at least for vacuum and m_w_m. It's still not completely clear if it's right but I've got nothing better. It also ignores the work_mem issue, but I've given up anticipating all future cases at the moment.
>

What do you mean by "the precise usage" in your idea? Quoting from
the email you referred to, Andres said:

---
One thing I was wondering about is trying to choose node types in
roughly-power-of-two struct sizes. It's pretty easy to end up with significant
fragmentation in the slabs right now when inserting as you go, because some of
the smaller node types will be freed but not enough to actually free blocks of
memory. If we instead have ~power-of-two sizes we could just use a single slab
of the max size, and carve out the smaller node types out of that largest
allocation.

Btw, that fragmentation is another reason why I think it's better to track
memory usage via memory contexts, rather than doing so based on
GetMemoryChunkSpace().
---

IIUC he suggested measuring memory usage at the block level in order
to count blocks that are not actually freed even though some of their
chunks are freed. That's why we used MemoryContextMemAllocated(). On
the other hand, recently you pointed out[1]:

---
I think we're trying to solve the wrong problem here. I need to study
this more, but it seems that code that needs to stay within a memory
limit only needs to track what's been allocated in chunks within a
block, since writing there is what invokes a page fault.
---

IIUC you suggested measuring memory usage by tracking how much memory
is allocated in chunks within a block. If your idea at the top of the
page follows this method, it still doesn't deal with the point Andres
mentioned.

> I'll put this item and a couple other things together in a separate email tomorrow.

Thanks!

Regards,

[1] https://www.postgresql.org/message-id/CAFBsxsEnzivaJ13iCGdDoUMsXJVGOaahuBe_y%3Dq6ow%3DLTzyDvA%40mail.gmail.com


--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Wed, Mar 15, 2023 at 9:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Mar 14, 2023 at 8:27 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > I wrote:
> >
> > > > > Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.

> > I still like my idea at the top of the page -- at least for vacuum and m_w_m. It's still not completely clear if it's right but I've got nothing better. It also ignores the work_mem issue, but I've given up anticipating all future cases at the moment.

> IIUC you suggested measuring memory usage by tracking how much memory
> chunks are allocated within a block. If your idea at the top of the
> page follows this method, it still doesn't deal with the point Andres
> mentioned.

Right, but that idea was orthogonal to how we measure memory use, and in fact mentions blocks specifically. The re-ordering was just to make sure that progress reporting didn't show current-use > max-use.

However, the big question remains DSA, since a new segment can be as large as the entire previous set of allocations. It seems it just wasn't designed for things where memory growth is unpredictable.

I'm starting to wonder if we need to give DSA a bit more info at the start. Imagine a "soft" limit given to the DSA area when it is initialized. If the total segment usage exceeds this, it stops doubling and instead new segments get smaller. Modifying an example we used for the fudge-factor idea some time ago:

m_w_m = 1GB, so calculate the soft limit to be 512MB and pass it to the DSA area.

2*(1+2+4+8+16+32+64+128) + 256 = 766MB (74.8% of 1GB) -> hit soft limit, so "stairstep down" the new segment sizes:

766 + 2*(128) + 64 = 1086MB -> stop

That's just an undeveloped idea, however, so likely v17 development, even assuming it's not a bad idea (could be).
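
A toy model of the stairstep, just to sanity-check the arithmetic (the policy is hypothetical, and real DSA allocates each size roughly twice before doubling, which this simplification ignores):

/* Toy: segment sizes double until the running total passes the soft
 * limit, then halve for each later segment. All sizes in MB. */
#include <stdio.h>

int
main(void)
{
    int         soft_limit = 512;      /* e.g. half of m_w_m = 1GB */
    int         hard_limit = 1024;     /* m_w_m itself */
    int         seg = 1;
    int         total = 0;

    while (total < hard_limit)
    {
        total += seg;
        printf("segment %4dMB, total %4dMB\n", seg, total);

        if (total < soft_limit)
            seg *= 2;                  /* normal doubling */
        else if (seg > 1)
            seg /= 2;                  /* past soft limit: step down */
    }
    return 0;
}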

And sadly, unless we find some other, simpler answer soon for tracking and limiting shared memory, the tid store is looking like v17 material.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Mar 17, 2023 at 4:03 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Wed, Mar 15, 2023 at 9:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Mar 14, 2023 at 8:27 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > > I wrote:
> > >
> > > > > > Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
>
> > > I still like my idea at the top of the page -- at least for vacuum and m_w_m. It's still not completely clear if it's right but I've got nothing better. It also ignores the work_mem issue, but I've given up anticipating all future cases at the moment.
>
> > IIUC you suggested measuring memory usage by tracking how much memory
> > chunks are allocated within a block. If your idea at the top of the
> > page follows this method, it still doesn't deal with the point Andres
> > mentioned.
>
> Right, but that idea was orthogonal to how we measure memory use, and in fact mentions blocks specifically. The re-ordering was just to make sure that progress reporting didn't show current-use > max-use.

Right. I still like your re-ordering idea. It's true that most of the
area of the last block allocated before heap scanning stops is not
actually used yet. I'm guessing we can just check if the context
memory has gone over the limit. But I'm concerned it might not work
well on systems where memory overcommit is disabled.

>
> However, the big question remains DSA, since a new segment can be as large as the entire previous set of allocations. It seems it just wasn't designed for things where memory growth is unpredictable.
>
> I'm starting to wonder if we need to give DSA a bit more info at the start. Imagine a "soft" limit given to the DSA area when it is initialized. If the total segment usage exceeds this, it stops doubling and instead new segments get smaller. Modifying an example we used for the fudge-factor idea some time ago:
>
> m_w_m = 1GB, so calculate the soft limit to be 512MB and pass it to the DSA area.
>
> 2*(1+2+4+8+16+32+64+128) + 256 = 766MB (74.8% of 1GB) -> hit soft limit, so "stairstep down" the new segment sizes:
>
> 766 + 2*(128) + 64 = 1086MB -> stop
>
> That's just an undeveloped idea, however, so likely v17 development, even assuming it's not a bad idea (could be).

This is an interesting idea. But I'm concerned we don't have enough
time to become confident about adding this new concept to DSA.

>
> And sadly, unless we find some other, simpler answer soon for tracking and limiting shared memory, the tid store is looking like v17 material.

Another problem we need to deal with is the minimum memory supported
in shared tidstore cases. Since the initial DSA segment size is 1MB,
the memory usage of a shared tidstore will start from 1MB+. This is
higher than the minimum values of both work_mem and
maintenance_work_mem, 64kB and 1MB respectively. Increasing the
minimum m_w_m to 2MB seems to be acceptable in the community, but not
for work_mem. One idea is to reject memory limits of less than 2MB,
so it won't work with small m_w_m settings. While that might be an
acceptable restriction at this stage (where there is no use case for
using tidstore with work_mem in the core), it will be a blocker for
future adoptions such as unifying with tidbitmap.c. Another idea is
that the process can specify the initial segment size at dsa_create()
so that DSA can start with a smaller segment, say 32kB. That way, a
tidstore with a 32kB limit gets full once it allocates the next DSA
segment, 32kB. But a downside of this idea is that it increases the
number of segments behind the DSA. Assuming it's a relatively rare
case where we use such a low work_mem, it might be acceptable. FYI,
the total number of DSM segments available on the system is
calculated by:

#define PG_DYNSHMEM_FIXED_SLOTS         64
#define PG_DYNSHMEM_SLOTS_PER_BACKEND   5

maxitems = PG_DYNSHMEM_FIXED_SLOTS
    + PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Mar 17, 2023 at 4:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Mar 17, 2023 at 4:03 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > On Wed, Mar 15, 2023 at 9:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Tue, Mar 14, 2023 at 8:27 PM John Naylor
> > > <john.naylor@enterprisedb.com> wrote:
> > > >
> > > > I wrote:
> > > >
> > > > > > > Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
> >
> > > > I still like my idea at the top of the page -- at least for vacuum and m_w_m. It's still not completely clear if it's right but I've got nothing better. It also ignores the work_mem issue, but I've given up anticipating all future cases at the moment.
> >
> > > IIUC you suggested measuring memory usage by tracking how much memory
> > > chunks are allocated within a block. If your idea at the top of the
> > > page follows this method, it still doesn't deal with the point Andres
> > > mentioned.
> >
> > Right, but that idea was orthogonal to how we measure memory use, and in fact mentions blocks specifically. The re-ordering was just to make sure that progress reporting didn't show current-use > max-use.
>
> Right. I still like your re-ordering idea. It's true that the most
> area of the last allocated block before heap scanning stops is not
> actually used yet. I'm guessing we can just check if the context
> memory has gone over the limit. But I'm concerned it might not work
> well in systems where overcommit memory is disabled.
>
> >
> > > However, the big question remains DSA, since a new segment can be as large as the entire previous set of allocations. It seems it just wasn't designed for things where memory growth is unpredictable.

aset.c also has a similar characteristic; it allocates an 8K block
upon the first allocation in a context and doubles that size for each
successive block request. But we can specify the initial block size
and the max block size. This made me think of another idea: specify
both to DSA, with both values calculated based on m_w_m. For example,
we can create a DSA in parallel_vacuum_init() as follows:

initial block size = min(m_w_m / 4, 1MB)
max block size = max(m_w_m / 8, 8MB)

In most cases, we can start with a 1MB initial segment, the same as
before. For small memory cases, say 1MB, we start with a 256kB
initial segment, and heap scanning stops after DSA has allocated
1.5MB (= 256kB + 256kB + 512kB + 512kB). For larger memory, we can
have the heap scan stop after DSA allocates 1.25 times more memory
than m_w_m. For example, if m_w_m = 1GB, the initial and maximum
segment sizes are 1MB and 128MB respectively, and DSA then allocates
segments as follows until heap scanning stops:

2 * (1 + 2 + 4 + 8 + 16 + 32 + 64 + 128) + (128 * 5) = 1150MB

dsa_create() will be extended to accept the initial and maximum block
sizes, like AllocSetContextCreate().
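
For what it's worth, here is a quick standalone check of those
formulas (plain C outside the tree, so min/max are written out; sizes
printed in kB):

/* Compute the proposed initial/max DSA block sizes from m_w_m. */
#include <stdio.h>

#define KB(n)       ((size_t) (n) * 1024)
#define MIN(a, b)   ((a) < (b) ? (a) : (b))
#define MAX(a, b)   ((a) > (b) ? (a) : (b))

int
main(void)
{
    size_t      m_w_m[] = {KB(1024), KB(64 * 1024), KB(1024 * 1024)};

    for (int i = 0; i < 3; i++)
    {
        size_t      init = MIN(m_w_m[i] / 4, KB(1024));     /* cap 1MB */
        size_t      max = MAX(m_w_m[i] / 8, KB(8 * 1024));  /* floor 8MB */

        printf("m_w_m = %7zukB -> init = %5zukB, max = %7zukB\n",
               m_w_m[i] / 1024, init / 1024, max / 1024);
    }
    return 0;
}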

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Mon, Mar 20, 2023 at 12:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Mar 17, 2023 at 4:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Mar 17, 2023 at 4:03 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > > On Wed, Mar 15, 2023 at 9:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Tue, Mar 14, 2023 at 8:27 PM John Naylor
> > > > <john.naylor@enterprisedb.com> wrote:
> > > > >
> > > > > I wrote:
> > > > >
> > > > > > > > Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
> > >
> > > > > I still like my idea at the top of the page -- at least for vacuum and m_w_m. It's still not completely clear if it's right but I've got nothing better. It also ignores the work_mem issue, but I've given up anticipating all future cases at the moment.
> > >
> > > > IIUC you suggested measuring memory usage by tracking how much memory
> > > > chunks are allocated within a block. If your idea at the top of the
> > > > page follows this method, it still doesn't deal with the point Andres
> > > > mentioned.
> > >
> > > Right, but that idea was orthogonal to how we measure memory use, and in fact mentions blocks specifically. The re-ordering was just to make sure that progress reporting didn't show current-use > max-use.
> >
> > Right. I still like your re-ordering idea. It's true that the most
> > area of the last allocated block before heap scanning stops is not
> > actually used yet. I'm guessing we can just check if the context
> > memory has gone over the limit. But I'm concerned it might not work
> > well in systems where overcommit memory is disabled.
> >
> > >
> > > However, the big question remains DSA, since a new segment can be as large as the entire previous set of allocations. It seems it just wasn't designed for things where memory growth is unpredictable.
>
> aset.c also has a similar characteristic; allocates an 8K block upon
> the first allocation in a context, and doubles that size for each
> successive block request. But we can specify the initial block size
> and max blocksize. This made me think of another idea to specify both
> to DSA and both values are calculated based on m_w_m. For example, we

That's an interesting idea, and the analogous behavior to aset could be a good thing for readability and maintainability. Worth seeing if it's workable.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Mar 20, 2023 at 9:34 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
>
> On Mon, Mar 20, 2023 at 12:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Mar 17, 2023 at 4:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Fri, Mar 17, 2023 at 4:03 PM John Naylor
> > > <john.naylor@enterprisedb.com> wrote:
> > > >
> > > > On Wed, Mar 15, 2023 at 9:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > >
> > > > > On Tue, Mar 14, 2023 at 8:27 PM John Naylor
> > > > > <john.naylor@enterprisedb.com> wrote:
> > > > > >
> > > > > > I wrote:
> > > > > >
> > > > > > > > > Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
> > > >
> > > > > > I still like my idea at the top of the page -- at least for vacuum and m_w_m. It's still not completely clear if it's right but I've got nothing better. It also ignores the work_mem issue, but I've given up anticipating all future cases at the moment.
> > > >
> > > > > IIUC you suggested measuring memory usage by tracking how much memory
> > > > > chunks are allocated within a block. If your idea at the top of the
> > > > > page follows this method, it still doesn't deal with the point Andres
> > > > > mentioned.
> > > >
> > > > Right, but that idea was orthogonal to how we measure memory use, and in fact mentions blocks specifically. The re-ordering was just to make sure that progress reporting didn't show current-use > max-use.
> > >
> > > Right. I still like your re-ordering idea. It's true that the most
> > > area of the last allocated block before heap scanning stops is not
> > > actually used yet. I'm guessing we can just check if the context
> > > memory has gone over the limit. But I'm concerned it might not work
> > > well in systems where overcommit memory is disabled.
> > >
> > > >
> > > > However, the big question remains DSA, since a new segment can be as large as the entire previous set of allocations. It seems it just wasn't designed for things where memory growth is unpredictable.
> >
> > aset.c also has a similar characteristic; allocates an 8K block upon
> > the first allocation in a context, and doubles that size for each
> > successive block request. But we can specify the initial block size
> > and max blocksize. This made me think of another idea to specify both
> > to DSA and both values are calculated based on m_w_m. For example, we
>
> That's an interesting idea, and the analogous behavior to aset could be a good thing for readability and maintainability. Worth seeing if it's workable.

I've attached a quick hack patch. It can be applied on top of the v32
patches. The changes to dsa.c are straightforward since they just
make the initial and max block sizes configurable. The patch includes
a test function, test_memory_usage(), to simulate how DSA segments
grow behind the shared radix tree. If we set the first argument to
true, it calculates both the initial and maximum block sizes based on
work_mem (I used work_mem here just because its value range is larger
than m_w_m's):

postgres(1:833654)=# select test_memory_usage(true);
NOTICE:  memory limit 134217728
NOTICE:  init 1048576 max 16777216
NOTICE:  initial: 1048576
NOTICE:  rt_create: 1048576
NOTICE:  allocate new DSM [1] 1048576
NOTICE:  allocate new DSM [2] 2097152
NOTICE:  allocate new DSM [3] 2097152
NOTICE:  allocate new DSM [4] 4194304
NOTICE:  allocate new DSM [5] 4194304
NOTICE:  allocate new DSM [6] 8388608
NOTICE:  allocate new DSM [7] 8388608
NOTICE:  allocate new DSM [8] 16777216
NOTICE:  allocate new DSM [9] 16777216
NOTICE:  allocate new DSM [10] 16777216
NOTICE:  allocate new DSM [11] 16777216
NOTICE:  allocate new DSM [12] 16777216
NOTICE:  allocate new DSM [13] 16777216
NOTICE:  allocate new DSM [14] 16777216
NOTICE:  reached: 148897792 (+14680064)
NOTICE:  12718205 keys inserted: 148897792
 test_memory_usage
-------------------

(1 row)

Time: 7195.664 ms (00:07.196)

Setting the first argument to false, we can specify both manually in
the second and third arguments:

postgres(1:833654)=# select test_memory_usage(false, 1024 * 1024, 1024
* 1024 * 1024 * 10::bigint);
NOTICE:  memory limit 134217728
NOTICE:  init 1048576 max 10737418240
NOTICE:  initial: 1048576
NOTICE:  rt_create: 1048576
NOTICE:  allocate new DSM [1] 1048576
NOTICE:  allocate new DSM [2] 2097152
NOTICE:  allocate new DSM [3] 2097152
NOTICE:  allocate new DSM [4] 4194304
NOTICE:  allocate new DSM [5] 4194304
NOTICE:  allocate new DSM [6] 8388608
NOTICE:  allocate new DSM [7] 8388608
NOTICE:  allocate new DSM [8] 16777216
NOTICE:  allocate new DSM [9] 16777216
NOTICE:  allocate new DSM [10] 33554432
NOTICE:  allocate new DSM [11] 33554432
NOTICE:  allocate new DSM [12] 67108864
NOTICE:  reached: 199229440 (+65011712)
NOTICE:  12718205 keys inserted: 199229440
 test_memory_usage
-------------------

(1 row)

Time: 7187.571 ms (00:07.188)

It seems to work fine. The difference between the above two cases is
the maximum block size (16MB vs. 10GB). We allocated two more DSA
segments in the first case, but there was no big difference in
performance in my test environment.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Mon, Mar 20, 2023 at 9:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Mar 20, 2023 at 9:34 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > That's an interesting idea, and the analogous behavior to aset could be a good thing for readability and maintainability. Worth seeing if it's workable.
>
> I've attached a quick hack patch. It can be applied on top of v32
> patches. The changes to dsa.c are straightforward since it makes the
> initial and max block sizes configurable.

Good to hear -- this should probably be proposed in a separate thread for wider visibility.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Mar 21, 2023 at 2:41 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
>
> On Mon, Mar 20, 2023 at 9:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Mar 20, 2023 at 9:34 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > > That's an interesting idea, and the analogous behavior to aset could be a good thing for readability and maintainability. Worth seeing if it's workable.
> >
> > I've attached a quick hack patch. It can be applied on top of v32
> > patches. The changes to dsa.c are straightforward since it makes the
> > initial and max block sizes configurable.
>
> Good to hear -- this should probably be proposed in a separate thread for wider visibility.

Agreed. I'll start a new thread for that.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Feb 16, 2023 at 11:44 PM Andres Freund <andres@anarazel.de> wrote:
>
> We really ought to replace the tid bitmap used for bitmap heap scans. The
> hashtable we use is a pretty awful data structure for it. And that's not
> filled in-order, for example.

I spent some time studying tidbitmap.c, and not only does it make sense to use a radix tree there, but since it has more complex behavior and stricter runtime requirements, it should really be the thing driving the design and tradeoffs, not vacuum:

- With lazy expansion and single-value leaves, the root of a radix tree can point to a single leaf. That might get rid of the need to track TBMStatus, since setting a single-leaf tree should be cheap.

- Fixed-size PagetableEntry's are pretty large, but the tid compression scheme used in this thread (in addition to being complex) is not a great fit for tidbitmap because it makes it more difficult to track per-block metadata (see also next point). With the "combined pointer-value slots" technique, if a page's max tid offset is 63 or less, the offsets can be stored directly in the pointer for the exact case. The lowest bit can serve as a tag indicating a pointer to a single-value leaf (see the sketch after this list). That would complicate operations like union/intersection and tracking "needs recheck", but it would reduce memory use and node-traversal in common cases.

- Managing lossy storage. With pure blocknumber keys, replacing exact storage for a range of 256 pages amounts to replacing a last-level node with a single leaf containing one lossy PagetableEntry. The leader could iterate over the nodes, and rank the last-level nodes by how much storage they (possibly with leaf children) are using, and come up with an optimal lossy-conversion plan.
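
To make the combined-slot technique in the second point concrete, here is a standalone sketch of the tagging scheme (bit assignments and names are illustrative only, and it assumes 64-bit slots):

/* Sketch: a slot either holds a pointer to a PagetableEntry (tag bit
 * clear) or an inline bitmap of offsets 1..63 (tag bit set). Offset 0
 * is never a valid line pointer, so bit 0 is free to use as the tag. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t pt_slot;

#define SLOT_IS_INLINE(slot)    (((slot) & UINT64_C(1)) != 0)

static pt_slot
slot_add_offset(pt_slot slot, int offset)   /* requires 1 <= offset <= 63 */
{
    return slot | (UINT64_C(1) << offset) | UINT64_C(1);
}

static bool
slot_test_offset(pt_slot slot, int offset)
{
    if (SLOT_IS_INLINE(slot))
        return (slot >> offset) & UINT64_C(1);
    /* otherwise: untag and consult the out-of-line PagetableEntry
     * (elided in this sketch) */
    return false;
}

int
main(void)
{
    pt_slot     slot = 0;

    slot = slot_add_offset(slot, 5);
    slot = slot_add_offset(slot, 63);
    printf("5: %d, 6: %d, 63: %d\n",
           (int) slot_test_offset(slot, 5),
           (int) slot_test_offset(slot, 6),
           (int) slot_test_offset(slot, 63)); /* prints "5: 1, 6: 0, 63: 1" */
    return 0;
}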

The above would address the points (not including better iteration and parallel bitmap index scans) raised in

https://www.postgresql.org/message-id/CAPsAnrn5yWsoWs8GhqwbwAJx1SeLxLntV54Biq0Z-J_E86Fnng@mail.gmail.com

Ironically, by targeting a more difficult use case, it's easier since there is less freedom. There are many ways to beat a binary search, but fewer good ways to improve bitmap heap scan. I'd like to put aside vacuum for some time and try killing two birds with one stone, building upon our work thus far. 

Note: I've moved the CF entry to the next CF, and set to waiting on author for now. Since no action is currently required from Masahiko, I've added myself as author as well. If tackling bitmap heap scan shows promise, we could RWF and resurrect at a later time.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Apr 7, 2023 at 6:55 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Thu, Feb 16, 2023 at 11:44 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > We really ought to replace the tid bitmap used for bitmap heap scans. The
> > hashtable we use is a pretty awful data structure for it. And that's not
> > filled in-order, for example.
>
> I spent some time studying tidbitmap.c, and not only does it make sense to use a radix tree there, but since it has more complex behavior and stricter runtime requirements, it should really be the thing driving the design and tradeoffs, not vacuum:
>
> - With lazy expansion and single-value leaves, the root of a radix tree can point to a single leaf. That might get rid of the need to track TBMStatus, since setting a single-leaf tree should be cheap.
>

Instead of introducing single-value leaves to the radix tree as
another structure, can we store pointers to PagetableEntry as values?

> - Fixed-size PagetableEntry's are pretty large, but the tid compression scheme used in this thread (in addition to being complex) is not a great fit for tidbitmap because it makes it more difficult to track per-block metadata (see also next point). With the "combined pointer-value slots" technique, if a page's max tid offset is 63 or less, the offsets can be stored directly in the pointer for the exact case. The lowest bit can serve as a tag indicating a pointer to a single-value leaf. That would complicate operations like union/intersection and tracking "needs recheck", but it would reduce memory use and node-traversal in common cases.
>
> - Managing lossy storage. With pure blocknumber keys, replacing exact storage for a range of 256 pages amounts to replacing a last-level node with a single leaf containing one lossy PagetableEntry. The leader could iterate over the nodes, and rank the last-level nodes by how much storage they (possibly with leaf children) are using, and come up with an optimal lossy-conversion plan.
>
> The above would address the points (not including better iteration and parallel bitmap index scans) raised in
>
> https://www.postgresql.org/message-id/CAPsAnrn5yWsoWs8GhqwbwAJx1SeLxLntV54Biq0Z-J_E86Fnng@mail.gmail.com
>
> Ironically, by targeting a more difficult use case, it's easier since there is less freedom. There are many ways to beat a binary search, but fewer good ways to improve bitmap heap scan. I'd like to put aside vacuum for some time and try killing two birds with one stone, building upon our work thus far.
>
> Note: I've moved the CF entry to the next CF, and set to waiting on author for now. Since no action is currently required from Masahiko, I've added myself as author as well. If tackling bitmap heap scan shows promise, we could RWF and resurrect at a later time.

Thanks. I'm going to continue researching the memory limitation and
try lazy path expansion until PG17 development begins.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Sat, Mar 11, 2023 at 12:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Mar 10, 2023 at 11:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Mar 10, 2023 at 3:42 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > > On Thu, Mar 9, 2023 at 1:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > > I've attached the new version patches. I merged improvements and fixes
> > > > I did in the v29 patch.
> > >
> > > I haven't yet had a chance to look at those closely, since I've had to devote time to other commitments. I remember I wasn't particularly impressed that v29-0008 mixed my requested name-casing changes with a bunch of other random things. Separating those out would be an obvious way to make it easier for me to look at, whenever I can get back to this. I need to look at the iteration changes as well, in addition to testing memory measurement (thanks for the new results, they look encouraging).
> >
> > Okay, I'll separate them again.
>
> Attached new patch series. In addition to separate them again, I've
> fixed a conflict with HEAD.
>

I've attached updated version patches to make cfbot happy. Also, I've
split the fixup patches further (from 0007, except for 0016 and 0018)
to make review easier. These patches have the prefixes radix tree,
tidstore, and vacuum, indicating the part each one changes. The 0016
patch changes DSA so that we can specify both the initial and max
segment sizes, and 0017 makes use of it in vacuumparallel.c. I'm
still researching a better solution for the memory limitation, but
this is the best solution for me for now.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Apr 17, 2023 at 8:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> > - With lazy expansion and single-value leaves, the root of a radix tree can point to a single leaf. That might get rid of the need to track TBMStatus, since setting a single-leaf tree should be cheap.
> >
>
> Instead of introducing single-value leaves to the radix tree as
> another structure, can we store pointers to PagetableEntry as values?

Well, that's pretty much what a single-value leaf is. Now that I've had time to pause and regroup, I've looked into some aspects we previously put off for future work, and this is one of them.

The concept is really quite trivial, and it's the simplest and most flexible way to implement ART. Our, or at least my, documented reason not to go that route was due to "an extra pointer traversal", but that's partially mitigated by "lazy expansion", which is actually fairly easy to do with single-value leaves. The two techniques complement each other in a natural way. (Path compression, on the other hand, is much more complex.)

> > Note: I've moved the CF entry to the next CF, and set to waiting on author for now. Since no action is currently required from Masahiko, I've added myself as author as well. If tackling bitmap heap scan shows promise, we could RWF and resurrect at a later time.
>
> Thanks. I'm going to continue researching the memory limitation and

Sounds like the best thing to nail down at this point.

> try lazy path expansion until PG17 development begins.

This doesn't seem like a useful thing to try and attach into the current patch (if that's what you mean), as the current insert/delete paths are quite complex. Using bitmap heap scan as a motivating use case, I hope to refocus complexity to where it's most needed, and aggressively simplify where possible.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Apr 19, 2023 at 4:02 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Mon, Apr 17, 2023 at 8:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > > - With lazy expansion and single-value leaves, the root of a radix tree can point to a single leaf. That might get rid of the need to track TBMStatus, since setting a single-leaf tree should be cheap.
> > >
> >
> > Instead of introducing single-value leaves to the radix tree as
> > another structure, can we store pointers to PagetableEntry as values?
>
> Well, that's pretty much what a single-value leaf is. Now that I've had time to pause and regroup, I've looked into some aspects we previously put off for future work, and this is one of them.
>
> The concept is really quite trivial, and it's the simplest and most flexible way to implement ART. Our, or at least my, documented reason not to go that route was due to "an extra pointer traversal", but that's partially mitigated by "lazy expansion", which is actually fairly easy to do with single-value leaves. The two techniques complement each other in a natural way. (Path compression, on the other hand, is much more complex.)
>
> > > Note: I've moved the CF entry to the next CF, and set to waiting on author for now. Since no action is currently required from Masahiko, I've added myself as author as well. If tackling bitmap heap scan shows promise, we could RWF and resurrect at a later time.
> >
> > Thanks. I'm going to continue researching the memory limitation and
>
> Sounds like the best thing to nail down at this point.
>
> > try lazy path expansion until PG17 development begins.
>
> This doesn't seem like a useful thing to try and attach into the current patch (if that's what you mean), as the current insert/delete paths are quite complex. Using bitmap heap scan as a motivating use case, I hope to refocus complexity to where it's most needed, and aggressively simplify where possible.
>

I agree that we don't want to make the current patch any more complex.

Thinking about the memory limitation more, I think that the
combination of specifying the initial and max DSA segment sizes and
dsa_set_size_limit() works well. There are two points in terms of
memory limitation: when the memory usage reaches the limit, we want
(1) to minimize the last allocated memory block that is allocated but
not yet used, and (2) to minimize the amount of memory that exceeds
the memory limit. Since we can specify the maximum DSA segment size,
the last block allocated before reaching the memory limit is small.
Also, thanks to dsa_set_size_limit(), the total DSA size will stop at
the limit, so (memory_usage >= memory_limit) returns true without any
memory exceeding the limit.
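
In C, the flow I have in mind is roughly the following sketch (the
create function taking segment sizes is what the patch adds, so its
name is invented here, as is the accounting accessor;
dsa_set_size_limit() is the existing API):

dsa_area   *area;

/* hypothetical create call -- 0016 is what adds the segment-size knobs */
area = dsa_create_with_segment_sizes(tranche_id,
                                     initial_segment_size,
                                     max_segment_size);

/* existing API: the area will refuse to grow beyond max_bytes */
dsa_set_size_limit(area, max_bytes);

/*
 * Accounting check (accessor name assumed): because the area cannot
 * grow past max_bytes, this becomes true without overshooting.
 */
if (dsa_get_total_size(area) >= max_bytes)
{
    /* suspend the heap scan and do a round of index/heap vacuuming */
}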

Given that we need to configure the initial and maximum DSA segment
sizes and set the DSA limit for TidStore memory accounting and
limiting, it would be better to create the DSA for TidStore in the
TidStoreCreate() API, rather than creating the DSA in the caller and
passing it to TidStoreCreate().

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Fri, Apr 7, 2023 at 4:55 PM John Naylor <john.naylor@enterprisedb.com> wrote:

> - Fixed-size PagetableEntry's are pretty large, but the tid compression scheme used in this thread (in addition to being complex) is not a great fit for tidbitmap because it makes it more difficult to track per-block metadata (see also next point). With the "combined pointer-value slots" technique, if a page's max tid offset is 63 or less, the offsets can be stored directly in the pointer for the exact case. The lowest bit can tag to indicate a pointer to a single-value leaf. That would complicate operations like union/intersection and tracking "needs recheck", but it would reduce memory use and node-traversal in common cases.

[just getting some thoughts out there before I have something concrete]

Thinking some more, this needn't be complicated at all. We'd just need to reserve some bits of a bitmapword for the tag, as well as flags for "ischunk" and "recheck". The other bits can be used for offsets. Getting/storing the offsets basically amounts to adjusting the shift by a constant. That way, this "embeddable PTE" could serve as both "PTE embedded in a node pointer" and also the first member of a full PTE. A full PTE is now just an array of embedded PTEs, except only the first one has the flags we need. That reduces the number of places that have to be different. Storing any set of offsets all less than ~60 would save allocation/traversal in a large number of real cases. Furthermore, that would reduce a full PTE to 40 bytes because there would be no padding.
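
As a rough sketch, with every name invented here, the packing could
look like:

/* low bits of the bitmapword hold the tag and flags */
#define EPTE_TAG_EMBEDDED   ((bitmapword) 1 << 0)   /* unset: pointer to full PTE */
#define EPTE_ISCHUNK        ((bitmapword) 1 << 1)
#define EPTE_RECHECK        ((bitmapword) 1 << 2)
#define EPTE_FLAG_BITS      3

/*
 * Offsets are stored shifted up past the flag bits, so setting one is
 * the usual tidbitmap arithmetic adjusted by a constant. On a 64-bit
 * bitmapword this leaves 61 offset bits, hence "less than ~60" above.
 */
static inline bitmapword
epte_add_offset(bitmapword epte, OffsetNumber off)
{
    return epte | ((bitmapword) 1 << ((off - 1) + EPTE_FLAG_BITS));
}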

This all assumes the key (block number) is no longer stored in the PTE, whether embedded or not. That would mean this technique:

> - With lazy expansion and single-value leaves, the root of a radix tree can point to a single leaf. That might get rid of the need to track TBMStatus, since setting a single-leaf tree should be cheap.

...is not a good trade off because it requires each leaf to have the key, and would thus reduce the utility of embedded leaves. We just need to make sure storing a single value is not costly, and I suspect it's not. (Currently the overhead avoided is allocating and zeroing a few kilobytes for a hash table). If it is not, then we don't need a special case in tidbitmap, which would be a great simplification. If it is, there are other ways to mitigate.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
I wrote:
> the current insert/delete paths are quite complex. Using bitmap heap scan as a motivating use case, I hope to refocus complexity to where it's most needed, and aggressively simplify where possible.

Sometime in the not-too-distant future, I will start a new thread focusing on bitmap heap scan, but for now, I just want to share some progress on making the radix tree usable not only for that, but hopefully a wider range of applications, while making the code simpler and the binary smaller. The attached patches are incomplete (e.g. no iteration) and quite a bit messy, so tar'd and gzip'd for the curious (should apply on top of v32 0001-03 + 0007-09).

0001

This combines a few concepts that I didn't bother separating out after the fact:
- Split insert_impl.h into multiple functions for improved readability and maintainability.
- Use single-value leaves as the basis for storing values, with the goal to get to "combined pointer-value slots" for efficiency and flexibility.
- With the latter in mind, searching the child within a node now returns the address of the slot. This allows the same interface whether the slot contains a child pointer or a value.
- Starting with RT_SET, start turning some iterative algorithms into recursive ones. This is a more natural way to traverse a tree structure, and we already see an advantage: Previously when growing a node, we searched within the parent to update its reference to the new node, because we didn't know the slot we descended from. Now we can simply update a single variable.
- Since we recursively pass the "shift" down the stack, it doesn't have to be stored in any node -- only the "top-level" start shift is stored in the tree control struct. This was easy to code since the node's shift value was hardly ever accessed anyway! The node header shrinks from 5 bytes to 4.
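
To make the recursive shape concrete, here is a hypothetical sketch --
every name is invented and details (such as extracting the chunk from
the key) are elided, but it shows how growing a node only has to
update *slot rather than re-search the parent:

static void
rt_set_recurse(RT_RADIX_TREE *tree, RT_PTR_ALLOC *slot, int shift,
               uint64 key, uint64 value)
{
    /* "slot" is the address of the parent's pointer to this node */
    RT_NODE    *node = rt_ptr_to_local(tree, *slot);

    if (rt_node_needs_to_grow(node))
        node = rt_grow_node(tree, slot, node);  /* just updates *slot */

    if (shift == 0)
    {
        /* last level: in the combined scheme the slot holds the value */
        *rt_node_find_insert_slot(tree, node, key) = value;
        return;
    }

    /* descend into the child's slot, one level (shift step) down */
    rt_set_recurse(tree, rt_node_find_insert_slot(tree, node, key),
                   shift - RT_SPAN, key, value);
}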

0002

Back in v15, we tried keeping DSA/local pointers as members of a struct. I did not like the result, but still thought it was a good idea. RT_DELETE is a complex function and I didn't want to try rewriting it without a pointer abstraction, so I've resurrected this idea, but in a simpler, less intrusive way. A key difference from v15 is using a union type for the non-shmem case.

0004

Rewrite RT_DELETE using recursion. I find this simpler than the previous open-coded stack.

0005-06

Deletion has an inefficiency: One function searches for the child to see if it's there, then another function searches for it again to delete it. Since 0001, a successful child search returns the address of the slot, so we can save it. For the two smaller "linear search" node kinds we can then use a single subtraction to compute the chunk/slot index for deletion. Also, split RT_NODE_DELETE_INNER into separate functions, for a similar reason as the insert case in 0001.

0007

Anticipate node shrinking: If only one node-kind needs to be freed, we can move a branch to that one code path, rather than every place where RT_FREE is inlined.

0009

Teach node256 how to shrink *. Since we know the number of children in a node256 can't possibly be zero, we can use uint8 to store the count and interpret an overflow to zero as 256 for this node. The node header shrinks from 4 bytes to 3.

* Other nodes will follow in due time, but only after I figure out how to do it nicely (ideas welcome!) -- currently node32's two size classes work fine for growing, but the code should be simplified before extending to other cases.
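
For reference, reading the wrapped count back is a one-liner (names
invented here):

static inline int
rt_node_256_count(RT_NODE_256 *node)
{
    /* an empty node256 cannot exist, so a stored 0 unambiguously means 256 */
    return node->base.count == 0 ? 256 : node->base.count;
}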

0010

Limited support for "combined pointer-value slots". At compile-time, choose either that or "single-value leaves" based on the size of the value type template parameter. Values that are pointer-sized or less can fit in the last-level child slots of nominal "inner nodes" without duplicated leaf-node code. Node256 now must act like the previous 'node256 leaf', since zero is a valid value. Aside from that, this was a small change.

What I've shared here could work (in principle, since it uses uint64 values) for tidstore, possibly faster (untested) because of better code density, but as mentioned I want to shoot for higher. For tidbitmap.c, I want to extend this idea and branch at run-time on a per-value basis, so that a page-table entry that fits in a pointer can go there, and if not, it'll be a full leaf. (This technique enables more flexibility in lossifying pages as well.) Run-time info will require e.g. an additional bit per slot. Since the node header is now 3 bytes, we can spare one more byte in the node3 case. In addition, we can and should also bump it back up to node4, still keeping the metadata within 8 bytes (no struct padding).

I've started in this patchset to refer to the node kinds as "4/16/48/256", regardless of their actual fanout. This is for readability (by matching the language in the paper) and maintainability (should *not* ever change again). The size classes (including multiple classes per kind) could be determined by macros and #ifdef's. For example, in non-SIMD architectures, it's likely slow to search an array of 32 key chunks, so in that case the compiler should choose size classes similar to these four nominal kinds.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
Hi,

On Tue, May 23, 2023 at 7:17 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> I wrote:
> > the current insert/delete paths are quite complex. Using bitmap heap scan as a motivating use case, I hope to refocus complexity to where it's most needed, and aggressively simplify where possible.
>
> Sometime in the not-too-distant future, I will start a new thread focusing on bitmap heap scan, but for now, I just want to share some progress on making the radix tree usable not only for that, but hopefully a wider range of applications, while making the code simpler and the binary smaller. The attached patches are incomplete (e.g. no iteration) and quite a bit messy, so tar'd and gzip'd for the curious (should apply on top of v32 0001-03 + 0007-09).
>

Thank you for making progress on this. I agree with these directions
overall. I have some comments and questions:

> - With the latter in mind, searching the child within a node now returns the address of the slot. This allows the same interface whether the slot contains a child pointer or a value.

Probably we can apply similar changes to the iteration as well.

> * Other nodes will follow in due time, but only after I figure out how to do it nicely (ideas welcome!) -- currently node32's two size classes work fine for growing, but the code should be simplified before extending to other cases.

Within the size class, we just allocate a new node of the lower size
class and do memcpy(). I guess it will be almost the same as what we
do for growing. It might be a good idea to support node shrinking
within the size class for node32 (and node125 if we support it). I
don't think shrinking class-3 to class-1 makes sense.

>
> Limited support for "combined pointer-value slots". At compile-time, choose either that or "single-value leaves" based on the size of the value type template parameter. Values that are pointer-sized or less can fit in the last-level child slots of nominal "inner nodes" without duplicated leaf-node code. Node256 now must act like the previous 'node256 leaf', since zero is a valid value. Aside from that, this was a small change.

Yes, but it also means that we use a pointer-sized value anyway even
if the value size is less than that, which wastes memory, no?

>
> What I've shared here could work (in principle, since it uses uint64 values) for tidstore, possibly faster (untested) because of better code density, but as mentioned I want to shoot for higher. For tidbitmap.c, I want to extend this idea and branch at run-time on a per-value basis, so that a page-table entry that fits in a pointer can go there, and if not, it'll be a full leaf. (This technique enables more flexibility in lossifying pages as well.) Run-time info will require e.g. an additional bit per slot. Since the node header is now 3 bytes, we can spare one more byte in the node3 case. In addition, we can and should also bump it back up to node4, still keeping the metadata within 8 bytes (no struct padding).

Sounds good.

> I've started in this patchset to refer to the node kinds as "4/16/48/256", regardless of their actual fanout. This is for readability (by matching the language in the paper) and maintainability (should *not* ever change again). The size classes (including multiple classes per kind) could be determined by macros and #ifdef's. For example, in non-SIMD architectures, it's likely slow to search an array of 32 key chunks, so in that case the compiler should choose size classes similar to these four nominal kinds.

If we want to use the node kinds used in the paper, I think we should
change the number in RT_NODE_KIND_X too. Otherwise, it would be
confusing when reading the code without referring to the paper.
Particularly, this part is very confusing:

        case RT_NODE_KIND_3:
            RT_ADD_CHILD_4(tree, ref, node, chunk, child);
            break;
        case RT_NODE_KIND_32:
            RT_ADD_CHILD_16(tree, ref, node, chunk, child);
            break;
        case RT_NODE_KIND_125:
            RT_ADD_CHILD_48(tree, ref, node, chunk, child);
            break;
        case RT_NODE_KIND_256:
            RT_ADD_CHILD_256(tree, ref, node, chunk, child);
            break;

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Jun 5, 2023 at 5:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > > Sometime in the not-too-distant future, I will start a new thread focusing on bitmap heap scan, but for now, I just want to share some progress on making the radix tree usable not only for that, but hopefully a wider range of applications, while making the code simpler and the binary smaller. The attached patches are incomplete (e.g. no iteration) and quite a bit messy, so tar'd and gzip'd for the curious (should apply on top of v32 0001-03 + 0007-09).
> >
>
> Thank you for making progress on this. I agree with these directions
> overall. I have some comments and questions:

Glad to hear it and thanks for looking!

> > * Other nodes will follow in due time, but only after I figure out how to do it nicely (ideas welcome!) -- currently node32's two size classes work fine for growing, but the code should be simplified before extending to other cases.
>
> Within the size class, we just allocate a new node of the lower size
> class and do memcpy(). I guess it will be almost the same as what we
> do for growing.

Oh, the memcpy part is great, very simple. I mean the (compile-time) "class info" table lookups are a bit awkward. I'm thinking the hard-coded numbers like this:

.fanout = 3,
.inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),

...may be better with a #defined symbol that can also be used elsewhere.
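
For instance, something like this (the symbol and array-entry names
here are invented):

#define RT_FANOUT_3 3

[RT_CLASS_3] = {
    .fanout = RT_FANOUT_3,
    .inner_size = sizeof(RT_NODE_INNER_3) + RT_FANOUT_3 * sizeof(RT_PTR_ALLOC),
},

/* ...and the same symbol can size the node struct and appear in tests */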

> I don't think
> shrinking class-3 to class-1 makes sense.

Agreed. The smallest kind should just be freed when empty.

> > Limited support for "combined pointer-value slots". At compile-time, choose either that or "single-value leaves" based on the size of the value type template parameter. Values that are pointer-sized or less can fit in the last-level child slots of nominal "inner nodes" without duplicated leaf-node code. Node256 now must act like the previous 'node256 leaf', since zero is a valid value. Aside from that, this was a small change.
>
> > Yes, but it also means that we use a pointer-sized value anyway even
> > if the value size is less than that, which wastes memory, no?

At a low level, that makes sense, but I've found an interesting global effect showing the opposite: _less_ memory, which may compensate:

psql -c "select * from bench_search_random_nodes(1*1000*1000)"
num_keys = 992660

(using a low enough number that the experimental change n125->n63 doesn't affect anything)
height = 4, n3 = 375258, n15 = 137490, n32 = 0, n63 = 0, n256 = 1025

v31:
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
      47800768 |     253 |       134

(unreleased code "similar" to v33, but among other things restores the separate "extend down" function)
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
      42926048 |     221 |       127

I'd need to make sure, but apparently just going from 6 non-empty memory contexts to 3 (remember all values are embedded here) reduces memory fragmentation significantly in this test. (That should also serve as a demonstration that additional size classes have both runtime costs as well as benefits. We need to have a balance.)

So, I'm inclined to think the only reason to prefer "multi-value leaves" is if 1) the value type is _bigger_ than a pointer, 2) there is no convenient abbreviation (like tid bitmaps have), and 3) the use case really needs to avoid another memory access. Under those circumstances, though, the new code plus lazy expansion etc. might suit and be easier to maintain. That said, I've mostly left the "leaf" types and functions alone, and added some detritus like "const bool = false;". It would look a *lot* nicer if we gave up on multi-value leaves entirely, but there's no rush and I don't want to close that door just yet.

> > What I've shared here could work (in principal, since it uses uint64 values) for tidstore, possibly faster (untested) because of better code density, but as mentioned I want to shoot for higher. For tidbitmap.c, I want to extend this idea and branch at run-time on a per-value basis, so that a page-table entry that fits in a pointer can go there, and if not, it'll be a full leaf. (This technique enables more flexibility in lossifying pages as well.) Run-time info will require e.g. an additional bit per slot. Since the node header is now 3 bytes, we can spare one more byte in the node3 case. In addition, we can and should also bump it back up to node4, still keeping the metadata within 8 bytes (no struct padding).
>
> Sounds good.

The additional bit per slot would require per-node logic and additional branches, which is not great. I'm now thinking a much easier way to get there is to give up (at least for now) on promising that "run-time embeddable values" can use the full pointer-size (unlike value types found embeddable at compile-time). Reserving the lowest pointer bit for a tag "value or pointer-to-leaf" would have a much smaller code footprint. That also has a curious side-effect for TID offsets: They are one-based so reserving the zero bit would actually simplify things: getting rid of the +1/-1 logic when converting bits to/from offsets.
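
A sketch of the tag check, with invented names:

static inline bool
rt_slot_is_embedded_value(uintptr_t slot)
{
    return (slot & 1) != 0;     /* lowest bit set: embedded value */
}

/*
 * For tid offsets the tag is free: offsets are one-based, so bit zero
 * of an offset bitmap is otherwise unused -- offset N maps directly to
 * bit N, with bit 0 reserved for the tag.
 */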

In addition, without a new bitmap, the smallest node can actually be up to a node5 with no struct padding, with a node2 as a subclass. (Those numbers coincidentally were also one scenario in the paper, when calculating worst-case memory usage). That's worth considering.

> > I've started in this patchset to refer to the node kinds as "4/16/48/256", regardless of their actual fanout.

> If we want to use the node kinds used in the paper, I think we should
> change the number in RT_NODE_KIND_X too.

Oh absolutely, this is nowhere near ready for cosmetic review :-)

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Jun 6, 2023 at 2:13 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Mon, Jun 5, 2023 at 5:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > > Sometime in the not-too-distant future, I will start a new thread focusing on bitmap heap scan, but for now, I just want to share some progress on making the radix tree usable not only for that, but hopefully a wider range of applications, while making the code simpler and the binary smaller. The attached patches are incomplete (e.g. no iteration) and quite a bit messy, so tar'd and gzip'd for the curious (should apply on top of v32 0001-03 + 0007-09).
> > >
> >
> > Thank you for making progress on this. I agree with these directions
> > overall. I have some comments and questions:
>
> Glad to hear it and thanks for looking!
>
> > > * Other nodes will follow in due time, but only after I figure out how to do it nicely (ideas welcome!) -- currently node32's two size classes work fine for growing, but the code should be simplified before extending to other cases.
> >
> > Within the size class, we just allocate a new node of the lower size
> > class and do memcpy(). I guess it will be almost the same as what we
> > do for growing.
>
> Oh, the memcpy part is great, very simple. I mean the (compile-time) "class info" table lookups are a bit awkward. I'm thinking the hard-coded numbers like this:
>
> .fanout = 3,
> .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
>
> ...may be better with a #defined symbol that can also be used elsewhere.

FWIW, exposing these definitions would be good in terms of testing too
since we can use them in regression tests.

>
> > I don't think
> > shrinking class-3 to class-1 makes sense.
>
> Agreed. The smallest kind should just be freed when empty.
>
> > > Limited support for "combined pointer-value slots". At compile-time, choose either that or "single-value leaves" based on the size of the value type template parameter. Values that are pointer-sized or less can fit in the last-level child slots of nominal "inner nodes" without duplicated leaf-node code. Node256 now must act like the previous 'node256 leaf', since zero is a valid value. Aside from that, this was a small change.
> >
> > > Yes, but it also means that we use a pointer-sized value anyway even
> > > if the value size is less than that, which wastes memory, no?
>
> At a low level, that makes sense, but I've found an interesting global effect showing the opposite: _less_ memory, which may compensate:
>
> psql -c "select * from bench_search_random_nodes(1*1000*1000)"
> num_keys = 992660
>
> (using a low enough number that the experimental change n125->n63 doesn't affect anything)
> height = 4, n3 = 375258, n15 = 137490, n32 = 0, n63 = 0, n256 = 1025
>
> v31:
>  mem_allocated | load_ms | search_ms
> ---------------+---------+-----------
>       47800768 |     253 |       134
>
> (unreleased code "similar" to v33, but among other things restores the separate "extend down" function)
>  mem_allocated | load_ms | search_ms
> ---------------+---------+-----------
>       42926048 |     221 |       127
>
> I'd need to make sure, but apparently just going from 6 non-empty memory contexts to 3 (remember all values are embedded here) reduces memory fragmentation significantly in this test. (That should also serve as a demonstration that additional size classes have both runtime costs as well as benefits. We need to have a balance.)

Interesting. The result would probably vary if we change the slab
block sizes. I'd like to experiment once the code is available.

>
> So, I'm inclined to think the only reason to prefer "multi-value leaves" is if 1) the value type is _bigger_ than a pointer, 2) there is no convenient abbreviation (like tid bitmaps have), and 3) the use case really needs to avoid another memory access. Under those circumstances, though, the new code plus lazy expansion etc. might suit and be easier to maintain.

Indeed.

>
> > > What I've shared here could work (in principle, since it uses uint64 values) for tidstore, possibly faster (untested) because of better code density, but as mentioned I want to shoot for higher. For tidbitmap.c, I want to extend this idea and branch at run-time on a per-value basis, so that a page-table entry that fits in a pointer can go there, and if not, it'll be a full leaf. (This technique enables more flexibility in lossifying pages as well.) Run-time info will require e.g. an additional bit per slot. Since the node header is now 3 bytes, we can spare one more byte in the node3 case. In addition, we can and should also bump it back up to node4, still keeping the metadata within 8 bytes (no struct padding).
> >
> > Sounds good.
>
> The additional bit per slot would require per-node logic and additional branches, which is not great. I'm now thinking a much easier way to get there is to give up (at least for now) on promising that "run-time embeddable values" can use the full pointer-size (unlike value types found embeddable at compile-time). Reserving the lowest pointer bit for a tag "value or pointer-to-leaf" would have a much smaller code footprint.

Do you mean we can make sure that the value doesn't set the lowest
bit? Or is it an optimization for TIDStore?

> In addition, without a new bitmap, the smallest node can actually be up to a node5 with no struct padding, with a node2 as a subclass. (Those numbers coincidentally were also one scenario in the paper, when calculating worst-case memory usage). That's worth considering.

Agreed.

FWIW please let me know if there are some experiments I can help with.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Tue, Jun 13, 2023 at 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Jun 6, 2023 at 2:13 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> >

> > I'd need to make sure, but apparently just going from 6 non-empty memory contexts to 3 (remember all values are embedded here) reduces memory fragmentation significantly in this test. (That should also serve as a demonstration that additional size classes have both runtime costs as well as benefits. We need to have a balance.)
>
> Interesting. The result would probably vary if we change the slab
> block sizes. I'd like to experiment once the code is available.

I cleaned up a few things and attached v34 so you can do that if you like. (Note: what I said about node63/n125 not making a difference in that one test is not quite true since slab keeps a few empty blocks around. I did some rough mental math and I think it doesn't change the conclusion any.)

0001-0007 is basically v33, but can apply on master.

0008 just adds back RT_EXTEND_DOWN. I left it out to simplify moving to recursion.

> > Oh, the memcpy part is great, very simple. I mean the (compile-time) "class info" table lookups are a bit awkward. I'm thinking the hard-coded numbers like this:
> >
> > .fanout = 3,
> > .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
> >
> > ...may be better with a #defined symbol that can also be used elsewhere.
>
> FWIW, exposing these definitions would be good in terms of testing too
> since we can use them in regression tests.

I added some definitions in 0012. It kind of doesn't matter now what sizes the test uses unless it can also test that nodes stay within the expected size, if that makes sense. It is helpful during debugging to force growth to stop at a certain size.

> > > Within the size class, we just allocate a new node of the lower
> > > size class and do memcpy().

Not anymore. ;-) To be technical, it didn't "just" memcpy(), since it then fell through to find the insert position and memmove(). In some parts of Andres' prototype, no memmove() is necessary, because it memcpy()'s around the insert position, and puts the new child in the right place. I've done this in 0009.
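
The pattern is something like this (field names assumed):

/* copy around the gap at insertpos, then fill it -- no memmove() */
memcpy(newnode->chunks, oldnode->chunks, insertpos * sizeof(uint8));
memcpy(&newnode->chunks[insertpos + 1], &oldnode->chunks[insertpos],
       (count - insertpos) * sizeof(uint8));
newnode->chunks[insertpos] = chunk;
/* ...and the same three steps for the children array */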

The memcpy you mention was done for 1) simplicity 2) to avoid memset'ing. Well, it was never necessary to memset the whole node in the first place. Only the header, slot index array, and isset arrays need to be zeroed, so in 0011 we always do only that. That combines alloc and init functionality, and it's simple everywhere.

In 0010 I restored iteration functionality -- it can no longer get the shift from the node, because it's not there as of v33. I was not particularly impressed that there were no basic iteration tests, and in fact the test_pattern test relied on functioning iteration. I added some basic tests. I'm not entirely pleased with testing overall, but I think it's at least sufficient for the job. I had the idea to replace "shift" everywhere and use "level" as a fundamental concept. This is clearer. I do want to make sure the compiler can compute the shift efficiently where necessary. I think that can wait until much later.

0013 standardizes (mostly) on 4/16/48/256 for naming convention, regardless of actual size, as I started to do earlier.

0014 is part cleanup of shrinking, and part making grow-node-48 more consistent with the rest.

> > The additional bit per slot would require per-node logic and additional branches, which is not great. I'm now thinking a much easier way to get there is to give up (at least for now) on promising that "run-time embeddable values" can use the full pointer-size (unlike value types found embeddable at compile-time). Reserving the lowest pointer bit for a tag "value or pointer-to-leaf" would have a much smaller code footprint.
>
> Do you mean we can make sure that the value doesn't set the lowest
> bit? Or is it an optimization for TIDStore?

It will be up to the caller (the user of the template) -- if an abbreviation is possible that fits in the upper 63 bits (with something to guard for 32-bit platforms), the developer will be able to specify a conversion function so that the caller only sees the full value when searching and setting. Without such a function, the template will fall back to the size of the value type to determine how the value is stored.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

I wrote:
> I cleaned up a few things and attached v34 so you can do that if you like. 

Of course, "clean" is a relative term. While making a small bit of progress working in tidbitmap.c earlier this week, I thought it useful to prototype some things in the tidstore, at which point I was reminded it no longer compiles because of my recent work. I put in the necessary incantations so that the v32 tidstore compiles and passes tests, so here's a patchset for that (but no vacuum changes). I thought it was a good time to also condense it down to look more similar to previous patches, as a basis for future work.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Jun 23, 2023 at 6:54 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
>
> I wrote:
> > I cleaned up a few things and attached v34 so you can do that if you like.
>
> Of course, "clean" is a relative term. While making a small bit of progress working in tidbitmap.c earlier this week, I thought it useful to prototype some things in the tidstore, at which point I was reminded it no longer compiles because of my recent work. I put in the necessary incantations so that the v32 tidstore compiles and passes tests, so here's a patchset for that (but no vacuum changes). I thought it was a good time to also condense it down to look more similar to previous patches, as a basis for future work.
>

Thank you for updating the patch set. I'll look at updates closely
early next week.


Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Jun 27, 2023 at 5:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Jun 23, 2023 at 6:54 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> >
> > I wrote:
> > > I cleaned up a few things and attached v34 so you can do that if you like.
> >
> > Of course, "clean" is a relative term. While making a small bit of progress working in tidbitmap.c earlier this week, I thought it useful to prototype some things in the tidstore, at which point I was reminded it no longer compiles because of my recent work. I put in the necessary incantations so that the v32 tidstore compiles and passes tests, so here's a patchset for that (but no vacuum changes). I thought it was a good time to also condense it down to look more similar to previous patches, as a basis for future work.
> >
>
> Thank you for updating the patch set. I'll look at updates closely
> early next week.
>

I've run several benchmarks comparing v32, from before your recent
changes started, with the v35 patch. Overall the numbers are better
than the previous version. Here is the test result where I used a
1-byte value:

"select * from bench_load_random(10_000_000)"

* v35
  radix tree leaves: 192 total in 0 blocks; 0 empty blocks; 0 free (0
chunks); 192 used
  radix tree node 256: 13697472 total in 205 blocks; 0 empty blocks;
52400 free (25 chunks); 13645072 used
  radix tree node 125: 86630592 total in 2115 blocks; 0 empty blocks;
7859376 free (6102 chunks); 78771216 used
  radix tree node 32: 94912 total in 0 blocks; 10 empty blocks; 0 free
(0 chunks); 94912 used
  radix tree node 15: 9269952 total in 1136 blocks; 0 empty blocks;
168 free (1 chunks); 9269784 used
  radix tree node 3: 1915502784 total in 233826 blocks; 0 empty
blocks; 6560 free (164 chunks); 1915496224 used
 mem_allocated | load_ms
---------------+---------
    2025194752 |    3011
(1 row)

* v32
  radix tree node 256: 192 total in 0 blocks; 0 empty blocks; 0 free
(0 chunks); 192 used
  radix tree node 256: 13487552 total in 205 blocks; 0 empty blocks;
51600 free (25 chunks); 13435952 used
  radix tree node 125: 192 total in 0 blocks; 0 empty blocks; 0 free
(0 chunks); 192 used
  radix tree node 125: 86630592 total in 2115 blocks; 0 empty blocks;
7859376 free (6102 chunks); 78771216 used
  radix tree node 32: 192 total in 0 blocks; 0 empty blocks; 0 free (0
chunks); 192 used
  radix tree node 32: 94912 total in 0 blocks; 10 empty blocks; 0 free
(0 chunks); 94912 used
  radix tree node 15: 192 total in 0 blocks; 0 empty blocks; 0 free (0
chunks); 192 used
  radix tree node 15: 9269952 total in 1136 blocks; 0 empty blocks;
168 free (1 chunks); 9269784 used
  radix tree node 3: 241597002 total in 29499 blocks; 0 empty blocks;
3864 free (161 chunks); 241593138 used
  radix tree node 3: 1809039552 total in 221696 blocks; 0 empty
blocks; 5280 free (110 chunks); 1809034272 used
 mem_allocated | load_ms
---------------+---------
    2160118410 |    3069
(1 row)

As you mentioned, the 1-byte value is embedded into 8 bytes so 7 bytes
are unused, but we use less memory since we use fewer slab contexts and
save on fragmentation.

I've also tested some large value cases (e.g. the value is 80-bytes)
and got a similar result.

Regarding the code, there are many TODO and FIXME comments, so it
seems to me that your recent work is still in progress. What is the
current status? Can I start reviewing the code, or should I wait for a
while until your recent work is complete?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Tue, Jul 4, 2023 at 12:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> As you mentioned, the 1-byte value is embedded into 8 bytes so 7 bytes
> are unused, but we use less memory since we use fewer slab contexts and
> save on fragmentation.

Thanks for testing. This tree is sparse enough that most of the space is taken up by small inner nodes, and not by leaves. So, it's encouraging to see a small space savings even here.

> I've also tested some large value cases (e.g. the value is 80-bytes)
> and got a similar result.

Interesting. With a separate allocation per value the overhead would be 8 bytes, or 10% here. It's plausible that savings elsewhere can hide that, globally.

> Regarding the code, there are many TODO and FIXME comments, so it
> seems to me that your recent work is still in progress. What is the
> current status? Can I start reviewing the code, or should I wait for a
> while until your recent work is complete?

Well, it's going to be a bit of a mess until I can demonstrate it working (and working well) with bitmap heap scan. Fixing that now is just going to create conflicts. I do have a couple small older patches lying around that were quick experiments -- I think at least some of them should give a performance boost in loading speed, but haven't had time to test. Would you like to take a look?

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Jul 5, 2023 at 8:21 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Tue, Jul 4, 2023 at 12:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > As you mentioned, the 1-byte value is embedded into 8 bytes so 7 bytes
> > are unused, but we use less memory since we use fewer slab contexts and
> > save on fragmentation.
>
> Thanks for testing. This tree is sparse enough that most of the space is taken up by small inner nodes, and not by leaves. So, it's encouraging to see a small space savings even here.
>
> > I've also tested some large value cases (e.g. the value is 80-bytes)
> > and got a similar result.
>
> Interesting. With a separate allocation per value the overhead would be 8 bytes, or 10% here. It's plausible that savings elsewhere can hide that, globally.
>
> > Regarding the code, there are many TODO and FIXME comments, so it
> > seems to me that your recent work is still in progress. What is the
> > current status? Can I start reviewing the code, or should I wait for a
> > while until your recent work is complete?
>
> Well, it's going to be a bit of a mess until I can demonstrate it working (and working well) with bitmap heap scan. Fixing that now is just going to create conflicts. I do have a couple small older patches lying around that were quick experiments -- I think at least some of them should give a performance boost in loading speed, but haven't had time to test. Would you like to take a look?

Yes, I can experiment with these patches in the meantime.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Fri, Jul 7, 2023 at 2:19 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Jul 5, 2023 at 8:21 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > Well, it's going to be a bit of a mess until I can demonstrate it working (and working well) with bitmap heap scan. Fixing that now is just going to create conflicts. I do have a couple small older patches lying around that were quick experiments -- I think at least some of them should give a performance boost in loading speed, but haven't had time to test. Would you like to take a look?
>
> Yes, I can experiment with these patches in the meantime.

Okay, here it is in v36. 0001-06 are the same as v35.

0007 removes a wasted extra computation newly introduced by refactoring growing nodes. 0008 just makes 0011 nicer. Not worth testing by themselves, but better to be tidy.
0009 is an experiment to get rid of slow memmoves in node4, addressing a long-standing inefficiency. It looks a bit tricky, but I think it's actually straightforward after drawing out the cases with pen and paper. It works if the fanout is either 4 or 5, so we have some wiggle room. This may give a noticeable boost if the input is reversed or random.
0010 allows RT_EXTEND_DOWN to reduce function calls, so should help with sparse trees.
0011 reduces function calls when growing the smaller nodes. Not sure about this one -- possibly worth it for node4 only?

If these help, it'll show up more easily in smaller inputs. Large inputs tend to be more dominated by RAM latency.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Sat, Jul 8, 2023 at 11:54 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
>
> On Fri, Jul 7, 2023 at 2:19 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Jul 5, 2023 at 8:21 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > > Well, it's going to be a bit of a mess until I can demonstrate it working (and working well) with bitmap heap scan. Fixing that now is just going to create conflicts. I do have a couple small older patches lying around that were quick experiments -- I think at least some of them should give a performance boost in loading speed, but haven't had time to test. Would you like to take a look?
> >
> > Yes, I can experiment with these patches in the meantime.
>
> Okay, here it is in v36. 0001-06 are the same as v35.
>
> 0007 removes a wasted extra computation newly introduced by refactoring growing nodes. 0008 just makes 0011 nicer. Not worth testing by themselves, but better to be tidy.
> 0009 is an experiment to get rid of slow memmoves in node4, addressing a long-standing inefficiency. It looks a bit tricky, but I think it's actually straightforward after drawing out the cases with pen and paper. It works if the fanout is either 4 or 5, so we have some wiggle room. This may give a noticeable boost if the input is reversed or random.
> 0010 allows RT_EXTEND_DOWN to reduce function calls, so should help with sparse trees.
> 0011 reduces function calls when growing the smaller nodes. Not sure about this one -- possibly worth it for node4
only?
>
> If these help, it'll show up more easily in smaller inputs. Large inputs tend to be more dominated by RAM latency.

Thanks for sharing the patches!

0007, 0008, 0010, and 0011 are straightforward, and I agree to merge them.

I have some questions on 0009 patch:

+       /* shift chunks and children
+
+               Unfortunately, gcc has gotten too aggressive in turning simple loops
+               into slow memmove's, so we have to be a bit more clever.
+               See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101481
+
+               We take advantage of the fact that a good
+               compiler can turn a memmove of a small constant power-of-two
+               number of bytes into a single load/store.
+       */

According to the comment, is this optimization only for gcc? And is
there no negative impact when building with other compilers such as
clang?

I'm not sure that it's a good approach to hand-optimize the code this
much to generate better instructions on gcc. I think this change
reduces readability and maintainability. According to the bugzilla
ticket referred to in the comment, it's recognized as a bug in the
community, so once the gcc bug is fixed, we might no longer need this
trick, no?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
Hi,

On Thu, Jul 13, 2023 at 5:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sat, Jul 8, 2023 at 11:54 AM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> >
> > On Fri, Jul 7, 2023 at 2:19 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Wed, Jul 5, 2023 at 8:21 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > > > Well, it's going to be a bit of a mess until I can demonstrate it working (and working well) with bitmap heap scan. Fixing that now is just going to create conflicts. I do have a couple small older patches lying around that were quick experiments -- I think at least some of them should give a performance boost in loading speed, but haven't had time to test. Would you like to take a look?
> > >
> > > Yes, I can experiment with these patches in the meantime.
> >
> > Okay, here it is in v36. 0001-06 are the same as v35.
> >
> > 0007 removes a wasted extra computation newly introduced by refactoring growing nodes. 0008 just makes 0011 nicer. Not worth testing by themselves, but better to be tidy.
> > 0009 is an experiment to get rid of slow memmoves in node4, addressing a long-standing inefficiency. It looks a bit tricky, but I think it's actually straightforward after drawing out the cases with pen and paper. It works if the fanout is either 4 or 5, so we have some wiggle room. This may give a noticeable boost if the input is reversed or random.
> > 0010 allows RT_EXTEND_DOWN to reduce function calls, so should help with sparse trees.
> > 0011 reduces function calls when growing the smaller nodes. Not sure about this one -- possibly worth it for node4
only?
> >
> > If these help, it'll show up more easily in smaller inputs. Large inputs tend to be more dominated by RAM latency.

cfbot reported some failures[1], and the v36 patch cannot be applied
cleanly to the current HEAD. I've attached updated patches to make
cfbot happy.

Regards,

[1] http://cfbot.cputube.org/highlights/all.html#3687

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Jul 13, 2023 at 3:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> 0007, 0008, 0010, and 0011 are straightforward, and I agree to merge them.

[Part 1 - clear the deck of earlier performance work etc]

Thanks for taking a look! I've merged 0007 and 0008. The others need a performance test to justify them -- an eyeball check is not enough. I've now made the time to do that.

==== sparse loads

v38 0001-0006 (still using node3 for this test only):

select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from bench_load_random_int(100 * 1000 * (1+x-x))) a;
         avg        
---------------------
 27.1000000000000000

select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from bench_load_random_int(500 * 1000 * (1+x-x))) a;
         avg          
----------------------
 165.6333333333333333

v38-0007-Optimize-RT_EXTEND_DOWN.patch

select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from bench_load_random_int(100 * 1000 * (1+x-x))) a;
         avg        
---------------------
 25.0900000000000000

select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from bench_load_random_int(500 * 1000 * (1+x-x))) a;
         avg          
----------------------
 157.3666666666666667

That seems worth doing.

v38-0008-Use-4-children-for-node-4-also-attempt-portable-.patch

This combines two things because I messed up a rebase: Use fanout of 4, and try some macros for shmem sizes, both 32- and 64-bit. Looking at this much, I no longer have a goal to have a separate set of size-classes for non-SIMD platforms, because that would cause global maintenance problems -- it's probably better to reduce worst-case search time where necessary. That would be much more localized.

> I have some questions on 0009 patch:

> According to the comment, is this optimization only for gcc?

No, not at all. That tells me the comment is misleading.

> I think this change reduces
> readability and maintainability.

Well, that much is obvious. What is not obvious is how much it gains us over the alternatives. I do have a simpler idea, though...

==== load mostly node4

select * from bench_search_random_nodes(250*1000, '0xFFFFFF');
n4 = 42626, n16 = 21492, n32 = 0, n64 = 0, n256 = 257
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
       7352384 |      25 |         0

v38-0009-TEMP-take-out-search-time-from-bench.patch

This is just to allow LATERAL queries for better measurements.

select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from bench_search_random_nodes(250*1000 * (1+x-x), '0xFFFFFF')) a;

         avg        
---------------------
 24.8333333333333333

v38-0010-Try-a-simpler-way-to-avoid-memmove.patch

This slightly rewrites the standard loop so that gcc doesn't turn it into a memmove(). Unlike the patch you didn't like, this *is* gcc-specific. (needs a comment, which I forgot)

         avg        
---------------------
 21.9600000000000000

So, that's not a trivial difference. I wasn't a big fan of Andres' __asm("") workaround, but that may be just my ignorance about it. We need something like either of the two.
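
For illustration, that style of workaround looks roughly like this (a
sketch, not necessarily Andres' exact code):

for (int i = count; i > insertpos; i--)
{
    node->chunks[i] = node->chunks[i - 1];
    node->children[i] = node->children[i - 1];

    /*
     * An empty asm statement is opaque to the optimizer, preventing
     * gcc's loop-idiom recognition from rewriting this as memmove().
     */
    __asm__("");
}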

v38-0011-Optimize-add_child_4-take-2.patch
         avg        
---------------------
 21.3500000000000000

This is possibly faster than v38-0010, but it looks like it's not worth the complexity, assuming the other way avoids the bug going forward.

> According to the bugzilla ticket
> referred to in the comment, it's realized as a bug in the community,
> so once the gcc bug fixes, we might no longer need this trick, no?

No comment in two years...

v38-0013-Use-constant-for-initial-copy-of-chunks-and-chil.patch

This is the same as v37-0011. I wasn't quite satisfied with it since it still has two memcpy() calls, but it actually seems to regress:

         avg        
---------------------
 22.0900000000000000

v38-0012-Use-branch-free-coding-to-skip-new-element-index.patch

This patch uses a single loop for the copy.

         avg        
---------------------
 21.0300000000000000

Within noise level of v38-0011, but it's small and simple, so I like it, at least for small arrays.
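
The idea, roughly (a sketch with invented names, not necessarily the
patch's exact code):

/* one pass, no branch: elements at or past insertpos land one slot up */
for (int i = 0; i < count; i++)
{
    newnode->chunks[i + (i >= insertpos)] = oldnode->chunks[i];
    newnode->children[i + (i >= insertpos)] = oldnode->children[i];
}
newnode->chunks[insertpos] = chunk;
newnode->children[insertpos] = child;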

v38-0014-node48-Remove-need-for-RIGHTMOST_ONE-in-radix-tr.patch
v38-0015-node48-Remove-dead-code-by-using-loop-local-var.patch

Just small cleanups.

v38-0016-Use-memcpy-for-children-when-growing-into-node48.patch

Makes sense, but untested.

===============
[Part 2]

Per off-list discussion with Masahiko, it makes sense to take some of the ideas I've used locally on tidbitmap, and start incorporating them into earlier vacuum work to get that out the door faster. With that in mind...

v38-0017-Make-tidstore-more-similar-to-tidbitmap.patch

This uses a simplified PagetableEntry (unimaginatively called BlocktableEntry just to avoid confusion), to be replaced with the real thing at a later date. This is still fixed size, to be replaced with a varlen type. 
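
A sketch of what such a fixed-size entry could look like (details
assumed):

typedef struct BlocktableEntry
{
    /* one bit for each possible line pointer offset on a heap page */
    bitmapword  words[(MaxOffsetNumber / BITS_PER_BITMAPWORD) + 1];
} BlocktableEntry;

Note there is no per-entry block number or recheck/lossy-chunk flag --
the radix tree key is the block number, which is what makes this
simpler than tidbitmap's PagetableEntry.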

Looking at the tidstore tests again after some months, I'm not particularly pleased with the amount of code required for how little it seems to be testing, nor the output when something fails. (I wonder how hard it would be to have SQL functions that add blocks/offsets to the tid store, and emit tuples of tids found in the store.)

I'm also concerned about the number of places that have to know if the store is using shared memory or not. Something to think about later.

v38-0018-Consolidate-inserting-updating-values.patch

This is something I coded up to get to an API more similar to one in simplehash, as used in tidbitmap.c. It seems worth doing on its own to reduce code duplication, and it also simplifies coding of varlen types and "runtime-embeddable values".

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
Hi,

On Mon, Aug 14, 2023 at 8:05 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Thu, Jul 13, 2023 at 3:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > 0007, 0008, 0010, and 0011 are straightforward, and I agree to merge them.

Thank you for updating the patch!

>
> [Part 1 - clear the deck of earlier performance work etc]
>
> Thanks for taking a look! I've merged 0007 and 0008. The others need a performance test to justify them -- an eyeball check is not enough. I've now made the time to do that.
>
> ==== sparse loads
>
> v38 0001-0006 (still using node3 for this test only):
>
> select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from bench_load_random_int(100 * 1000 * (1+x-x))) a;
>          avg
> ---------------------
>  27.1000000000000000
>
> select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from bench_load_random_int(500 * 1000 * (1+x-x))) a;
>          avg
> ----------------------
>  165.6333333333333333
>
> v38-0007-Optimize-RT_EXTEND_DOWN.patch
>
> select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from bench_load_random_int(100 * 1000 * (1+x-x))) a;
>          avg
> ---------------------
>  25.0900000000000000
>
> select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from bench_load_random_int(500 * 1000 * (1+x-x))) a;
>          avg
> ----------------------
>  157.3666666666666667
>
> That seems worth doing.
>
> v38-0008-Use-4-children-for-node-4-also-attempt-portable-.patch
>
> This combines two things because I messed up a rebase: Use fanout of 4, and try some macros for shmem sizes, both 32- and 64-bit. Looking at this much, I no longer have a goal to have a separate set of size-classes for non-SIMD platforms, because that would cause global maintenance problems -- it's probably better to reduce worst-case search time where necessary. That would be much more localized.
>
> > I have some questions on 0009 patch:
>
> > According to the comment, is this optimization only for gcc?
>
> No, not at all. That tells me the comment is misleading.
>
> > I think this change reduces
> > readability and maintainability.
>
> Well, that much is obvious. What is not obvious is how much it gains us over the alternatives. I do have a simpler idea, though...
>
> ==== load mostly node4
>
> select * from bench_search_random_nodes(250*1000, '0xFFFFFF');
> n4 = 42626, n16 = 21492, n32 = 0, n64 = 0, n256 = 257
>  mem_allocated | load_ms | search_ms
> ---------------+---------+-----------
>        7352384 |      25 |         0
>
> v38-0009-TEMP-take-out-search-time-from-bench.patch
>
> This is just to allow LATERAL queries for better measurements.
>
> select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from bench_search_random_nodes(250*1000 * (1+x-x), '0xFFFFFF')) a;
>
>          avg
> ---------------------
>  24.8333333333333333

0007, 0008, and 0009 look good to me.

>
> v38-0010-Try-a-simpler-way-to-avoid-memmove.patch
>
> This slightly rewrites the standard loop so that gcc doesn't turn it into a memmove(). Unlike the patch you didn't like, this *is* gcc-specific. (needs a comment, which I forgot)
>
>          avg
> ---------------------
>  21.9600000000000000
>
> So, that's not a trivial difference. I wasn't a big fan of Andres' __asm("") workaround, but that may be just my ignorance about it. We need something like either of the two.
>
> v38-0011-Optimize-add_child_4-take-2.patch
>          avg
> ---------------------
>  21.3500000000000000
>
> This is possibly faster than v38-0010, but looking like not worth the complexity, assuming the other way avoids the bug going forward.

I prefer 0010 but is it worth testing with other compilers such as clang?

>
> > According to the bugzilla ticket
> > referred to in the comment, it's realized as a bug in the community,
> > so once the gcc bug fixes, we might no longer need this trick, no?
>
> No comment in two years...
>
> v38-0013-Use-constant-for-initial-copy-of-chunks-and-chil.patch
>
> This is the same as v37-0011. I wasn't quite satisfied with it since it still has two memcpy() calls, but it actually seems to regress:
>
>          avg
> ---------------------
>  22.0900000000000000
>
> v38-0012-Use-branch-free-coding-to-skip-new-element-index.patch
>
> This patch uses a single loop for the copy.
>
>          avg
> ---------------------
>  21.0300000000000000
>
> Within noise level of v38-0011, but it's small and simple, so I like it, at least for small arrays.

Agreed.

>
> v38-0014-node48-Remove-need-for-RIGHTMOST_ONE-in-radix-tr.patch
> v38-0015-node48-Remove-dead-code-by-using-loop-local-var.patch
>
> Just small cleanups.
>
> v38-0016-Use-memcpy-for-children-when-growing-into-node48.patch
>
> Makes sense, but untested.

Agreed.

BTW cfbot reported that some regression tests failed due to OOM. I've
attached the patch to fix it.

>
> ===============
> [Part 2]
>
> Per off-list discussion with Masahiko, it makes sense to take some of the ideas I've used locally on tidbitmap, and start incorporating them into earlier vacuum work to get that out the door faster. With that in mind...
>
> v38-0017-Make-tidstore-more-similar-to-tidbitmap.patch
>
> This uses a simplified PagetableEntry (unimaginatively called BlocktableEntry just to avoid confusion), to be replaced with the real thing at a later date. This is still fixed size, to be replaced with a varlen type.

That's more readable.

>
> Looking at the tidstore tests again after some months, I'm not particularly pleased with the amount of code required for how little it seems to be testing, nor the output when something fails. (I wonder how hard it would be to have SQL functions that add blocks/offsets to the tid store, and emit tuples of tids found in the store.)

It would not be hard to have such SQL functions. I'll try it.

>
> I'm also concerned about the number of places that have to know if the store is using shared memory or not. Something to think about later.
>
> v38-0018-Consolidate-inserting-updating-values.patch
>
> This is something I coded up to get to an API more similar to one in simplehash, as used in tidbitmap.c. It seems worth doing on its own to reduce code duplication, and also simplifies coding of varlen types and "runtime-embeddable values".

Agreed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Tue, Aug 15, 2023 at 9:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> BTW cfbot reported that some regression tests failed due to OOM. I've
> attached the patch to fix it.

Seems worth doing now rather than later, so added this and squashed most of the rest together. I wonder if that test uses too much memory in general. Maybe using the full uint64 is too much.

> On Mon, Aug 14, 2023 at 8:05 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:

> > This is possibly faster than v38-0010, but looking like not worth the complexity, assuming the other way avoids the bug going forward.
>
> I prefer 0010 but is it worth testing with other compilers such as clang?

Okay, keeping 0010 with a comment, and leaving out 0011 for now. Clang is aggressive about unrolling loops, so it may be worth looking at this globally at some point.

> > v38-0012-Use-branch-free-coding-to-skip-new-element-index.patch

> > Within noise level of v38-0011, but it's small and simple, so I like it, at least for small arrays.
>
> Agreed.

Keeping 0012 and not 0013.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Tue, Aug 15, 2023 at 6:53 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Tue, Aug 15, 2023 at 9:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > BTW cfbot reported that some regression tests failed due to OOM. I've
> > attached the patch to fix it.
>
> Seems worth doing now rather than later, so added this and squashed most of the rest together.

This segfaults because of a mistake fixing a rebase conflict, so v40 attached.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Aug 16, 2023 at 8:04 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
>
> On Tue, Aug 15, 2023 at 6:53 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> >
> > On Tue, Aug 15, 2023 at 9:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > > BTW cfbot reported that some regression tests failed due to OOM. I've
> > > attached the patch to fix it.
> >
> > Seems worth doing now rather than later, so added this and squashed most of the rest together.
>
> This segfaults because of a mistake fixing a rebase conflict, so v40 attached.
>

Thank you for updating the patch set.

On Tue, Aug 15, 2023 at 11:33 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Mon, Aug 14, 2023 at 8:05 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > Looking at the tidstore tests again after some months, I'm not particularly pleased with the amount of code required for how little it seems to be testing, nor the output when something fails. (I wonder how hard it would be to have SQL functions that add blocks/offsets to the tid store, and emit tuples of tids found in the store.)
>
> It would not be hard to have such SQL functions. I'll try it.

I've updated the regression tests for tidstore so that it uses SQL
functions to add blocks/offsets and dump its contents. The new test
provides the same coverage but is executed using SQL functions
instead of running all tests in one SQL function.

The 0008 patch fixes a bug in tidstore that I found during this work:
we didn't recreate the radix tree in the same memory context in
TidStoreReset().

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Sun, Aug 27, 2023 at 7:53 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I've updated the regression tests for tidstore so that it uses SQL
> functions to add blocks/offsets and dump its contents. The new test
> provides the same coverage but is executed using SQL functions
> instead of running all tests in one SQL function.

This is much nicer and more flexible, thanks! A few questions/comments:

tidstore_dump_tids() returns a string -- is it difficult to turn this into a SRF, or is it just a bit more work?

The lookup test seems fine for now. The output would look nicer with an "order by tid".

I think we could have the SQL function tidstore_create() take a boolean for shared memory. That would allow ad-hoc testing without a recompile, if I'm not mistaken.

+SELECT tidstore_set_block_offsets(blk, array_agg(offsets.off)::int2[])
+  FROM blocks, offsets
+  GROUP BY blk;
+ tidstore_set_block_offsets
+----------------------------
+
+
+
+
+
+(5 rows)

Calling a void function multiple times leads to vertical whitespace, which looks a bit strange and may look better with some output, even if irrelevant:

-SELECT tidstore_set_block_offsets(blk, array_agg(offsets.off)::int2[])
+SELECT row_number() over(order by blk), tidstore_set_block_offsets(blk, array_agg(offsets.off)::int2[])

 row_number | tidstore_set_block_offsets
------------+----------------------------
          1 |
          2 |
          3 |
          4 |
          5 |
(5 rows)

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Aug 28, 2023 at 4:20 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Sun, Aug 27, 2023 at 7:53 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I've updated the regression tests for tidstore so that it uses SQL
> > functions to add blocks/offsets and dump its contents. The new test
> > provides the same coverage but is executed using SQL functions
> > instead of running all tests in one SQL function.
>
> This is much nicer and more flexible, thanks! A few questions/comments:
>
> tidstore_dump_tids() returns a string -- is it difficult to turn this into a SRF, or is it just a bit more work?

It's not difficult. I've changed it in v42 patch.

>
> The lookup test seems fine for now. The output would look nicer with an "order by tid".

Agreed.

>
> I think we could have the SQL function tidstore_create() take a boolean for shared memory. That would allow ad-hoc testing without a recompile, if I'm not mistaken.

Agreed.

>
> +SELECT tidstore_set_block_offsets(blk, array_agg(offsets.off)::int2[])
> +  FROM blocks, offsets
> +  GROUP BY blk;
> + tidstore_set_block_offsets
> +----------------------------
> +
> +
> +
> +
> +
> +(5 rows)
>
> Calling a void function multiple times leads to vertical whitespace, which looks a bit strange and may look better with some output, even if irrelevant:
>
> -SELECT tidstore_set_block_offsets(blk, array_agg(offsets.off)::int2[])
> +SELECT row_number() over(order by blk), tidstore_set_block_offsets(blk, array_agg(offsets.off)::int2[])
>
>  row_number | tidstore_set_block_offsets
> ------------+----------------------------
>           1 |
>           2 |
>           3 |
>           4 |
>           5 |
> (5 rows)

Yes, it looks better.

I've attached the v42 patch set. I improved the tidstore regression test
code in addition to incorporating the above comments.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:

On Mon, Aug 28, 2023 at 9:44 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I've attached the v42 patch set. I improved the tidstore regression test
> code in addition to incorporating the above comments.

Seems fine at a glance, thanks. I will build on this to implement variable-length values. I have already finished one prerequisite which is: public APIs passing pointers to values.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Sep 6, 2023 at 3:23 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
>
> On Mon, Aug 28, 2023 at 9:44 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I've attached the v42 patch set. I improved the tidstore regression test
> > code in addition to incorporating the above comments.
>
> Seems fine at a glance, thanks. I will build on this to implement variable-length values.

Thanks.

> I have already finished one prerequisite which is: public APIs passing pointers to values.

Great!

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
Hi,

On 2023-08-28 23:43:22 +0900, Masahiko Sawada wrote:
> I've attached the v42 patch set. I improved the tidstore regression test
> code in addition to incorporating the above comments.

Why did you need to disable the benchmark module for CI?

Greetings,

Andres Freund



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Sat, Sep 16, 2023 at 9:03 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-08-28 23:43:22 +0900, Masahiko Sawada wrote:
> > I've attached the v42 patch set. I improved the tidstore regression test
> > code in addition to incorporating the above comments.
>
> Why did you need to disable the benchmark module for CI?

I didn't want to unnecessarily make cfbot unhappy, since the benchmark
module is not going to be committed to core and is sometimes not
up-to-date.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
I wrote:

> Seems fine at a glance, thanks. I will build on this to implement variable-length values. I have already finished one prerequisite which is: public APIs passing pointers to values.

Since my publishing schedule has not kept up, I'm just going to share
something similar to what I mentioned earlier, just to get things
moving again.

0001-0009 are from earlier versions, except for 0007 which makes a
bunch of superficial naming updates, similar to those done in a recent
other version. Somewhere along the way I fixed long-standing git
whitespace warnings, but I don't remember if that's new here. In any
case, let's try to preserve that.

0010 is some minor refactoring to reduce duplication

0011-0014 add public functions that give the caller more control over
the input and responsibility for locking. They are not named well, but
I plan these to be temporary: They are currently used for the tidstore
only, since that has much simpler tests than the standard radix tree
tests. One thing to note: since the tidstore has always done its own
locking within a larger structure, these patches don't bother to do
locking at the radix tree level. Locking twice seems...not great.
These patches are the main prerequisite for variable-length values.
Once that is working well, we can switch the standard tests to the new
APIs.

Next steps include (some of these were briefly discussed off-list with
Sawada-san):

- template parameter for varlen values
- some callers to pass length in bytes
- block entries to have num_elems for # of bitmap words
- a way for updates to re-alloc values when needed
- aset allocation for values when appropriate

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Sat, Oct 28, 2023 at 5:56 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> I wrote:
>
> > Seems fine at a glance, thanks. I will build on this to implement variable-length values. I have already finished one prerequisite which is: public APIs passing pointers to values.
>
> Since my publishing schedule has not kept up, I'm just going to share
> something similar to what I mentioned earlier, just to get things
> moving again.

Thanks for sharing the updates. I've returned to work today and will
resume working on this feature.

>
> 0001-0009 are from earlier versions, except for 0007 which makes a
> bunch of superficial naming updates, similar to those done in a recent
> other version. Somewhere along the way I fixed long-standing git
> whitespace warnings, but I don't remember if that's new here. In any
> case, let's try to preserve that.
>
> 0010 is some minor refactoring to reduce duplication
>
> 0011-0014 add public functions that give the caller more control over
> the input and responsibility for locking. They are not named well, but
> I plan these to be temporary: They are currently used for the tidstore
> only, since that has much simpler tests than the standard radix tree
> tests. One thing to note: since the tidstore has always done its own
> locking within a larger structure, these patches don't bother to do
> locking at the radix tree level. Locking twice seems...not great.
> These patches are the main prerequisite for variable-length values.
> Once that is working well, we can switch the standard tests to the new
> APIs.

Since variable-length value support is a big deal and is closely
related to the API design, I'd like to discuss the API design first.
Currently, we have the following APIs:

---
RT_VALUE_TYPE
RT_GET(RT_RADIX_TREE *tree, uint64 key, bool *found);
or for variable-length value support,
RT_GET(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);

If an entry already exists, return its pointer and set "found" to
true. Otherwise, insert an empty value with sz bytes, return its
pointer, and set "found" to false.

---
RT_VALUE_TYPE
RT_FIND(RT_RADIX_TREE *tree, uint64 key);

If an entry exists, return the pointer to the value, otherwise return NULL.

(I omitted RT_SEARCH() as it's essentially the same as RT_FIND() and
will probably get removed.)

---
bool
RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
or for variable-length value support,
RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, size_t sz);

If an entry already exists, update its value to 'value_p' and return
true. Otherwise set the value and return false.

Given variable-length value support, RT_GET() would have to do
repalloc() if the existing value size is not big enough for the new
value, but it cannot as the radix tree doesn't know the size of each
stored value. Another idea is that the radix tree returns the pointer
to the slot and the caller updates the value accordingly. But it means
that the caller has to update the slot properly while considering the
value size (embedded vs. single-leaf value), which doesn't seem like a
good idea.

To deal with this problem, I think we can somewhat change RT_GET() API
as follow:

RT_VALUE_TYPE
RT_INSERT(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);

If the entry already exists, replace the value with a new empty value
with sz bytes and set "found" to true. Otherwise, insert an empty
value, return its pointer, and set "found" to false.

We probably will find a better name but I use RT_INSERT() for
discussion. RT_INSERT() returns an empty slot regardless of existing
values. It can be used to insert a new value or to replace the value
with a larger value.
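
To illustrate the intended semantics, here is a rough caller-side
sketch (the variable names here are placeholders for discussion, not
code from any patch):

bool    found;
RT_VALUE_TYPE *value;

/* always returns a pointer to a fresh, empty value of sz bytes */
value = RT_INSERT(tree, key, sz, &found);

/* fill in the new contents; any previous value has been discarded */
memcpy(value, new_value, sz);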

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Nov 27, 2023 at 1:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> Since variable-length value support is a big deal and is closely
> related to the API design, I'd like to discuss the API design first.

Thanks for the fine summary of the issues here.

[Swapping this back in my head]

> RT_VALUE_TYPE
> RT_GET(RT_RADIX_TREE *tree, uint64 key, bool *found);
> or for variable-length value support,
> RT_GET(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);
>
> If an entry already exists, return its pointer and set "found" to
> true. Otherwise, insert an empty value with sz bytes, return its
> pointer, and set "found" to false.

> ---
> bool
> RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
> or for variable-length value support,
> RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, size_t sz);
>
> If an entry already exists, update its value to 'value_p' and return
> true. Otherwise set the value and return false.

I'd have to double-check, but I think RT_SET is vestigial and I'm not
sure it has any advantage over RT_GET as I've sketched it out. I'm
pretty sure it's only there now because changing the radix tree
regression tests is much harder than changing TID store.

> Given variable-length value support, RT_GET() would have to do
> repalloc() if the existing value size is not big enough for the new
> value, but it cannot as the radix tree doesn't know the size of each
> stored value.

I think we have two choices:

- the value stores the "length". The caller would need to specify a
function to compute size from the "length" member. Note this assumes
there is an array. I think both aspects are not great.
- the value stores the "size". Callers that store an array (as
PageTableEntry's do) would compute length when they need to. This
sounds easier.
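
To illustrate the second choice, a minimal sketch (an assumption for
discussion, not code from any patch):

typedef struct BlocktableEntry
{
    uint32      size;   /* allocated size in bytes, set by the caller */
    bitmapword  words[FLEXIBLE_ARRAY_MEMBER];
} BlocktableEntry;

/* the caller computes the length in bitmapwords when it needs it */
#define ENTRY_NUM_WORDS(e) \
    (((e)->size - offsetof(BlocktableEntry, words)) / sizeof(bitmapword))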

> Another idea is that the radix tree returns the pointer
> to the slot and the caller updates the value accordingly.

I did exactly this in v43 TidStore if I understood you correctly. If I
misunderstood you, can you clarify?

> But it means
> that the caller has to update the slot properly while considering the
> value size (embedded vs. single-leaf value), which doesn't seem like a
> good idea.

For this optimization, callers will have to know about pointer-sized
values and treat them differently, but they don't need to know the
details about how or where they are stored.

While we want to keep embedded values in the back of our minds, I
really think the details should be postponed to a follow-up commit.

> To deal with this problem, I think we can somewhat change RT_GET() API
> as follow:
>
> RT_VALUE_TYPE
> RT_INSERT(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);
>
> If the entry already exists, replace the value with a new empty value
> with sz bytes and set "found" to true. Otherwise, insert an empty
> value, return its pointer, and set "found" to false.
>
> We probably will find a better name but I use RT_INSERT() for
> discussion. RT_INSERT() returns an empty slot regardless of existing
> values. It can be used to insert a new value or to replace the value
> with a larger value.

For the case we are discussing, bitmaps, updating an existing value is
a bit tricky. We need the existing value to properly update it with
set or unset bits. This can't work in general without a lot of work
for the caller.

However, for vacuum, we have all values that we need up front. That
gives me an idea: Something like this insert API could be optimized
for "insert-only": If we only free values when we free the whole tree
at the end, that's a clear use case for David Rowley's proposed "bump
context", which would save 8 bytes per allocation and be a bit faster.
[1] (RT_GET for varlen values would use an aset context, to allow
repalloc, and nodes would continue to use slab).

[1] https://www.postgresql.org/message-id/flat/CAApHDvqGSpCU95TmM=Bp=6xjL_nLys4zdZOpfNyWBk97Xrdj2w@mail.gmail.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Dec 4, 2023 at 5:21 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Mon, Nov 27, 2023 at 1:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > Since variable-length value support is a big deal and is closely
> > related to the API design, I'd like to discuss the API design first.
>
> Thanks for the fine summary of the issues here.
>
> [Swapping this back in my head]
>
> > RT_VALUE_TYPE
> > RT_GET(RT_RADIX_TREE *tree, uint64 key, bool *found);
> > or for variable-length value support,
> > RT_GET(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);
> >
> > If an entry already exists, return its pointer and set "found" to
> > true. Otherwise, insert an empty value with sz bytes, return its
> > pointer, and set "found" to false.
>
> > ---
> > bool
> > RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
> > or for variable-length value support,
> > RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, size_t sz);
> >
> > If an entry already exists, update its value to 'value_p' and return
> > true. Otherwise set the value and return false.
>
> I'd have to double-check, but I think RT_SET is vestigial and I'm not
> sure it has any advantage over RT_GET as I've sketched it out. I'm
> pretty sure it's only there now because changing the radix tree
> regression tests is much harder than changing TID store.

Agreed.

>
> > Given variable-length value support, RT_GET() would have to do
> > repalloc() if the existing value size is not big enough for the new
> > value, but it cannot as the radix tree doesn't know the size of each
> > stored value.
>
> I think we have two choices:
>
> - the value stores the "length". The caller would need to specify a
> function to compute size from the "length" member. Note this assumes
> there is an array. I think both aspects are not great.
> - the value stores the "size". Callers that store an array (as
> PageTableEntry's do) would compute length when they need to. This
> sounds easier.

As for the second idea, do we always need to require the value to have
the "size" (e.g. int32) in the first field of its struct? If so, the
caller will be able to use only 4 bytes in embedded value cases (or
won't be able to use it at all if the pointer size is 4 bytes).

>
> > Another idea is that the radix tree returns the pointer
> > to the slot and the caller updates the value accordingly.
>
> I did exactly this in v43 TidStore if I understood you correctly. If I
> misunderstood you, can you clarify?

I meant to expose RT_GET_SLOT_RECURSIVE() so that the caller updates
the value as they want.

>
> > But it means
> > that the caller has to update the slot properly while considering the
> > value size (embedded vs. single-leaf value), which doesn't seem like a
> > good idea.
>
> For this optimization, callers will have to know about pointer-sized
> values and treat them differently, but they don't need to know the
> details about how or where they are stored.
>
> While we want to keep embedded values in the back of our minds, I
> really think the details should be postponed to a follow-up commit.

Agreed.

>
> > To deal with this problem, I think we can somewhat change RT_GET() API
> > as follow:
> >
> > RT_VALUE_TYPE
> > RT_INSERT(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);
> >
> > If the entry already exists, replace the value with a new empty value
> > with sz bytes and set "found" to true. Otherwise, insert an empty
> > value, return its pointer, and set "found" to false.
> >
> > We probably will find a better name but I use RT_INSERT() for
> > discussion. RT_INSERT() returns an empty slot regardless of existing
> > values. It can be used to insert a new value or to replace the value
> > with a larger value.
>
> For the case we are discussing, bitmaps, updating an existing value is
> a bit tricky. We need the existing value to properly update it with
> set or unset bits. This can't work in general without a lot of work
> for the caller.

True.

>
> However, for vacuum, we have all values that we need up front. That
> gives me an idea: Something like this insert API could be optimized
> for "insert-only": If we only free values when we free the whole tree
> at the end, that's a clear use case for David Rowley's proposed "bump
> context", which would save 8 bytes per allocation and be a bit faster.
> [1] (RT_GET for varlen values would use an aset context, to allow
> repalloc, and nodes would continue to use slab).

Interesting idea and worth trying. Do we need to protect the whole
tree as insert-only for safety? It's problematic if the user mixes
RT_INSERT() and RT_GET().

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Wed, Dec 6, 2023 at 4:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Dec 4, 2023 at 5:21 PM John Naylor <johncnaylorls@gmail.com> wrote:

> > > Given variable-length value support, RT_GET() would have to do
> > > repalloc() if the existing value size is not big enough for the new
> > > value, but it cannot as the radix tree doesn't know the size of each
> > > stored value.
> >
> > I think we have two choices:
> >
> > - the value stores the "length". The caller would need to specify a
> > function to compute size from the "length" member. Note this assumes
> > there is an array. I think both aspects are not great.
> > - the value stores the "size". Callers that store an array (as
> > PageTableEntry's do) would compute length when they need to. This
> > sounds easier.
>
> As for the second idea, do we always need to require the value to have
> the "size" (e.g. int32) in the first field of its struct? If so, the
> caller will be able to use only 4 bytes in embedded value cases (or
> won't be able to use it at all if the pointer size is 4 bytes).

We could have an RT_SIZE_TYPE for varlen value types. That's easy.
There is another way, though: (This is a digression into embedded
values, but it does illuminate some issues even aside from that)

My thinking a while ago was that an embedded value had no explicit
length/size, but could be "expanded" into a conventional value for the
caller. For bitmaps, the smallest full value would have length 1 and
whatever size (For tid store maybe 16 bytes). This would happen
automatically via a template function.

Now I think that could be too complicated (especially for page table
entries, which have more bookkeeping than vacuum needs) and slow.
Imagine this as an embedded value:

typedef struct BlocktableEntry
{
  uint16 size;

  /* later: uint8 flags; for bitmap scan */

  /* 64-bit: 3 elements, 32-bit: 1 element */
  OffsetNumber offsets[(sizeof(Pointer) - sizeof(int16)) / sizeof(OffsetNumber)];

  /* end of embeddable value */

  bitmapword words[FLEXIBLE_ARRAY_MEMBER];
} BlocktableEntry;

Here we can use a slot to store up to 3 offsets, no matter how big
they are. That's great because a bitmap could be mostly wasted space.
But now the caller can't know up front how many bytes it needs until
it retrieves the value and sees what's already there. If there are
already three values, the caller needs to tell the tree "alloc this
much, update this slot you just gave me with the alloc (maybe DSA)
pointer, and return the local pointer". Then copy the 3 offsets into
set bits, and set whatever else it needs to. With normal values, same
thing, but with realloc.

This is a bit complex, but I see an advantage: The tree doesn't need to
care so much about the size, so the value doesn't need to contain the
size. For our case, we can use length (number of bitmapwords) without
the disadvantages I mentioned above, with length zero (or maybe -1)
meaning "no bitmapword array, the offsets are all in this small
array".

> > > Another idea is that the radix tree returns the pointer
> > > to the slot and the caller updates the value accordingly.
> >
> > I did exactly this in v43 TidStore if I understood you correctly. If I
> > misunderstood you, can you clarify?
>
> I meant to expose RT_GET_SLOT_RECURSIVE() so that the caller updates
> the value as they want.

Did my sketch above get closer to that? Side note: I don't think we
can expose that directly (e.g. need to check for create or extend
upwards), but some functionality can be a thin wrapper around it.

> > However, for vacuum, we have all values that we need up front. That
> > gives me an idea: Something like this insert API could be optimized
> > for "insert-only": If we only free values when we free the whole tree
> > at the end, that's a clear use case for David Rowley's proposed "bump
> > context", which would save 8 bytes per allocation and be a bit faster.
> > [1] (RT_GET for varlen values would use an aset context, to allow
> > repalloc, and nodes would continue to use slab).
>
> Interesting idea and worth trying. Do we need to protect the whole
> tree as insert-only for safety? It's problematic if the user mixes
> RT_INSERT() and RT_GET().

You're right, but I'm not sure what the policy should be.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Nov 27, 2023 at 1:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sat, Oct 28, 2023 at 5:56 PM John Naylor <johncnaylorls@gmail.com> wrote:

> bool
> RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
> or for variable-length value support,
> RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, size_t sz);
>
> If an entry already exists, update its value to 'value_p' and return
> true. Otherwise set the value and return false.

> RT_VALUE_TYPE
> RT_INSERT(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);
>
> If the entry already exists, replace the value with a new empty value
> with sz bytes and set "found" to true. Otherwise, insert an empty
> value, return its pointer, and set "found" to false.
>
> We probably will find a better name but I use RT_INSERT() for
> discussion. RT_INSERT() returns an empty slot regardless of existing
> values. It can be used to insert a new value or to replace the value
> with a larger value.

Looking at TidStoreSetBlockOffsets again (in particular how it works
with RT_GET), and thinking about issues we've discussed, I think
RT_SET is sufficient for vacuum. Here's how it could work:

TidStoreSetBlockOffsets could have a stack variable that's "almost
always" large enough. When not, it can allocate in its own context. It
sets the necessary bits there. Then, it passes the pointer to RT_SET
with the number of bytes to copy. That seems very simple.
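
A rough sketch of that, reusing names from the WIP patches, and eliding
alignment concerns and the fallback allocation for when the stack buffer
is too small (this assumes sorted, non-empty offsets, as vacuum
provides):

static void
TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno,
                        OffsetNumber *offsets, int num_offsets)
{
    char    buf[MaxBlocktableEntrySize];
    BlocktableEntry *page = (BlocktableEntry *) buf;
    int     num_words = offsets[num_offsets - 1] / BITS_PER_BITMAPWORD + 1;
    size_t  page_len = offsetof(BlocktableEntry, words) +
                       num_words * sizeof(bitmapword);

    memset(page, 0, page_len);
    for (int i = 0; i < num_offsets; i++)
        page->words[offsets[i] / BITS_PER_BITMAPWORD] |=
            (bitmapword) 1 << (offsets[i] % BITS_PER_BITMAPWORD);

    if (TidStoreIsShared(ts))
        shared_rt_set(ts->tree.shared, blkno, page, page_len);
    else
        local_rt_set(ts->tree.local, blkno, page, page_len);
}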

At some future time, we can add a new function with the complex
business about getting the current value to modify it, with the
re-alloc'ing that it might require.

In other words, from both an API perspective and a performance
perspective, it makes sense for tid store to have a simple "set"
interface for vacuum that can be optimized for its characteristics
(insert only, ordered offsets). And also a more complex one for bitmap
scan (setting/unsetting bits of existing values, in any order). They
can share the same iteration interface, key types, and value types.

What do you think, Masahiko?



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Dec 6, 2023 at 3:39 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Wed, Dec 6, 2023 at 4:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Dec 4, 2023 at 5:21 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> > > > Given variable-length value support, RT_GET() would have to do
> > > > repalloc() if the existing value size is not big enough for the new
> > > > value, but it cannot as the radix tree doesn't know the size of each
> > > > stored value.
> > >
> > > I think we have two choices:
> > >
> > > - the value stores the "length". The caller would need to specify a
> > > function to compute size from the "length" member. Note this assumes
> > > there is an array. I think both aspects are not great.
> > > - the value stores the "size". Callers that store an array (as
> > > PageTableEntry's do) would compute length when they need to. This
> > > sounds easier.
> >
> > As for the second idea, do we always need to require the value to have
> > the "size" (e.g. int32) in the first field of its struct? If so, the
> > caller will be able to use only 4 bytes in embedded value cases (or
> > won't be able to use it at all if the pointer size is 4 bytes).
>
> We could have an RT_SIZE_TYPE for varlen value types. That's easy.
> There is another way, though: (This is a digression into embedded
> values, but it does illuminate some issues even aside from that)
>
> My thinking a while ago was that an embedded value had no explicit
> length/size, but could be "expanded" into a conventional value for the
> caller. For bitmaps, the smallest full value would have length 1 and
> whatever size (For tid store maybe 16 bytes). This would happen
> automatically via a template function.
>
> Now I think that could be too complicated (especially for page table
> entries, which have more bookkeeping than vacuum needs) and slow.
> Imagine this as an embedded value:
>
> typedef struct BlocktableEntry
> {
>   uint16 size;
>
>   /* later: uint8 flags; for bitmap scan */
>
>   /* 64-bit: 3 elements, 32-bit: 1 element */
>   OffsetNumber offsets[(sizeof(Pointer) - sizeof(int16)) / sizeof(OffsetNumber)];
>
>   /* end of embeddable value */
>
>   bitmapword words[FLEXIBLE_ARRAY_MEMBER];
> } BlocktableEntry;
>
> Here we can use a slot to store up to 3 offsets, no matter how big
> they are. That's great because a bitmap could be mostly wasted space.

Interesting idea.

> But now the caller can't know up front how many bytes it needs until
> it retrieves the value and sees what's already there. If there are
> already three values, the caller needs to tell the tree "alloc this
> much, update this slot you just gave me with the alloc (maybe DSA)
> pointer, and return the local pointer". Then copy the 3 offsets into
> set bits, and set whatever else it needs to.  With normal values, same
> thing, but with realloc.
>
> This is a bit complex, but I see an advantage: The tree doesn't need to
> care so much about the size, so the value doesn't need to contain the
> size. For our case, we can use length (number of bitmapwords) without
> the disadvantages I mentioned above, with length zero (or maybe -1)
> meaning "no bitmapword array, the offsets are all in this small
> array".

It's still unclear to me why the value doesn't need to contain the size.

If I understand you correctly, in RT_GET(), the tree allocates new
memory and updates the slot where the value is embedded with the
pointer to the allocated memory, and returns the pointer to the
caller. Since the returned value, newly allocated memory, is still
empty, the caller needs to copy the contents of the old value to the
new value and do whatever else it needs to.

If the value is already a single-leaf value and RT_GET() is called
with a larger size, the slot is always replaced with the newly
allocated area and the caller needs to copy the contents? If the tree
does realloc the value with a new size, how does the tree know the new
value is larger than the existing value? It seems like the caller
needs to provide a function to calculate the size of the value based
on the length.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Dec 7, 2023 at 12:27 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Mon, Nov 27, 2023 at 1:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Sat, Oct 28, 2023 at 5:56 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> > bool
> > RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
> > or for variable-length value support,
> > RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, size_t sz);
> >
> > If an entry already exists, update its value to 'value_p' and return
> > true. Otherwise set the value and return false.
>
> > RT_VALUE_TYPE
> > RT_INSERT(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);
> >
> > If the entry already exists, replace the value with a new empty value
> > with sz bytes and set "found" to true. Otherwise, insert an empty
> > value, return its pointer, and set "found" to false.
> >
> > We probably will find a better name but I use RT_INSERT() for
> > discussion. RT_INSERT() returns an empty slot regardless of existing
> > values. It can be used to insert a new value or to replace the value
> > with a larger value.
>
> Looking at TidStoreSetBlockOffsets again (in particular how it works
> with RT_GET), and thinking about issues we've discussed, I think
> RT_SET is sufficient for vacuum. Here's how it could work:
>
> TidStoreSetBlockOffsets could have a stack variable that's "almost
> always" large enough. When not, it can allocate in its own context. It
> sets the necessary bits there. Then, it passes the pointer to RT_SET
> with the number of bytes to copy. That seems very simple.

Right.

>
> At some future time, we can add a new function with the complex
> business about getting the current value to modify it, with the
> re-alloc'ing that it might require.
>
> In other words, from both an API perspective and a performance
> perspective, it makes sense for tid store to have a simple "set"
> interface for vacuum that can be optimized for its characteristics
> (insert only, ordered offsets). And also a more complex one for bitmap
> scan (setting/unsetting bits of existing values, in any order). They
> can share the same iteration interface, key types, and value types.
>
> What do you think, Masahiko?

Good point. RT_SET() would be faster than doing RT_GET() and then
updating the value, because RT_SET() would not need to take care of the
existing value (its size, embedded or not, realloc, etc.).

I think that we can separate the radix tree patch into two parts: the
main implementation with RT_SET(), and more complex APIs such as
RT_GET() etc. That way, it would probably make it easy to complete the
radix tree and tidstore first.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Fri, Dec 8, 2023 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> It's still unclear to me why the value doesn't need to contain the size.
>
> If I understand you correctly, in RT_GET(), the tree allocates new
> memory and updates the slot where the value is embedded with the
> pointer to the allocated memory, and returns the pointer to the
> caller. Since the returned value, newly allocated memory, is still
> empty, the caller needs to copy the contents of the old value to the
> new value and do whatever else it needs to.
>
> If the value is already a single-leaf value and RT_GET() is called
> with a larger size, the slot is always replaced with the newly
> allocated area and the caller needs to copy the contents? If the tree
> does realloc the value with a new size, how does the tree know the new
> value is larger than the existing value? It seems like the caller
> needs to provide a function to calculate the size of the value based
> on the length.

Right. My brief description mentioned one thing without details: The
caller would need to control whether to re-alloc. RT_GET would pass
the size. If nothing is found, the tree would allocate. If there is a
value already, just return it. That means both the address of the
slot, and the local pointer to the value (with embedded, would be the
same address). The caller checks if the array is long enough. If not,
call a new function that takes the new size, the address of the slot,
and the pointer to the old value. The tree would re-alloc, put the
alloc pointer in the slot and return the new local pointer. But as we
agreed, that is all follow-up work.
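
In code form, the flow might look like this (RT_GROW, the extra slot
argument to RT_GET, and the caller-side size check are invented names
for illustration only):

bool    found;
RT_PTR_ALLOC *slot;
RT_VALUE_TYPE *value;

/* allocates sz bytes if no entry exists; else returns the existing value */
value = RT_GET(tree, key, sz, &slot, &found);

if (found && caller_needs_more_space(value, sz))
{
    /*
     * The tree re-allocs, stores the new (possibly DSA) pointer in
     * *slot, and returns the new local pointer.
     */
    value = RT_GROW(tree, slot, value, sz);
}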



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Dec 8, 2023 at 1:37 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Fri, Dec 8, 2023 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > It's still unclear to me why the value doesn't need to contain the size.
> >
> > If I understand you correctly, in RT_GET(), the tree allocates new
> > memory and updates the slot where the value is embedded with the
> > pointer to the allocated memory, and returns the pointer to the
> > caller. Since the returned value, newly allocated memory, is still
> > empty, the caller needs to copy the contents of the old value to the
> > new value and do whatever else it needs to.
> >
> > If the value is already a single-leaf value and RT_GET() is called
> > with a larger size, the slot is always replaced with the newly
> > allocated area and the caller needs to copy the contents? If the tree
> > does realloc the value with a new size, how does the tree know the new
> > value is larger than the existing value? It seems like the caller
> > needs to provide a function to calculate the size of the value based
> > on the length.
>
> Right. My brief description mentioned one thing without details: The
> caller would need to control whether to re-alloc. RT_GET would pass
> the size. If nothing is found, the tree would allocate. If there is a
> value already, just return it. That means both the address of the
> slot, and the local pointer to the value (with embedded, would be the
> same address). The caller checks if the array is long enough. If not,
> call a new function that takes the new size, the address of the slot,
> and the pointer to the old value. The tree would re-alloc, put the
> alloc pointer in the slot and return the new local pointer. But as we
> agreed, that is all follow-up work.

Thank you for the detailed explanation. That makes sense to me. We
will address it as a follow-up work.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Dec 8, 2023 at 3:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Dec 8, 2023 at 1:37 PM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > On Fri, Dec 8, 2023 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > > It's still unclear to me why the value doesn't need to contain the size.
> > >
> > > If I understand you correctly, in RT_GET(), the tree allocates new
> > > memory and updates the slot where the value is embedded with the
> > > pointer to the allocated memory, and returns the pointer to the
> > > caller. Since the returned value, newly allocated memory, is still
> > > empty, the caller needs to copy the contents of the old value to the
> > > new value and do whatever else it needs to.
> > >
> > > If the value is already a single-leaf value and RT_GET() is called
> > > with a larger size, the slot is always replaced with the newly
> > > allocated area and the caller needs to copy the contents? If the tree
> > > does realloc the value with a new size, how does the tree know the new
> > > value is larger than the existing value? It seems like the caller
> > > needs to provide a function to calculate the size of the value based
> > > on the length.
> >
> > Right. My brief description mentioned one thing without details: The
> > caller would need to control whether to re-alloc. RT_GET would pass
> > the size. If nothing is found, the tree would allocate. If there is a
> > value already, just return it. That means both the address of the
> > slot, and the local pointer to the value (with embedded, would be the
> > same address).

BTW Given that the actual value size can be calculated only by the
caller, how does the tree know if the value is embedded or not? It's
probably related to how to store combined pointer/value slots. If leaf
nodes have a bitmap array that indicates the corresponding slot is an
embedded value or a pointer to a value, it would be easy. But since
the bitmap array is needed only in the leaf nodes, internal nodes and
leaf nodes will no longer be identical structure, which is not a bad
thing to me, though.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Fri, Dec 8, 2023 at 3:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> BTW Given that the actual value size can be calculated only by the
> caller, how does the tree know if the value is embedded or not? It's
> probably related to how to store combined pointer/value slots.

Right, this is future work. At first, variable-length types will have
to be single-value leaves. In fact, the idea for storing up to 3
offsets in the bitmap header could be done this way -- it would just
be a (small) single-value leaf.

(Reminder: Currently, fixed-length values are compile-time embeddable
if the platform pointer size is big enough.)

> If leaf
> nodes have a bitmap array that indicates the corresponding slot is an
> embedded value or a pointer to a value, it would be easy.

That's the most general way to do it. We could do it much more easily
with a pointer tag, although for the above idea it may require some
endian-aware coding. Both were mentioned in the paper, I recall.
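
The pointer tag could be as simple as this sketch (an assumption,
relying on allocated values being at least 4-byte aligned so the low
bits of a real pointer are always zero):

#define SLOT_VALUE_IS_EMBEDDED  ((uintptr_t) 0x01)

static inline bool
slot_is_embedded_value(uintptr_t slot)
{
    return (slot & SLOT_VALUE_IS_EMBEDDED) != 0;
}

static inline void *
slot_get_pointer(uintptr_t slot)
{
    Assert(!slot_is_embedded_value(slot));
    return (void *) slot;
}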

> But since
> the bitmap array is needed only in the leaf nodes, internal nodes and
> leaf nodes will no longer have identical structures, which is not a bad
> thing to me, though.

Absolutely no way we are going back to double everything: double
types, double functions, double memory contexts. Plus, that bitmap in
inner nodes could indicate a pointer to a leaf that got there by "lazy
expansion".



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Dec 8, 2023 at 7:46 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Fri, Dec 8, 2023 at 3:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > BTW Given that the actual value size can be calculated only by the
> > caller, how does the tree know if the value is embedded or not? It's
> > probably related to how to store combined pointer/value slots.
>
> Right, this is future work. At first, variable-length types will have
> to be single-value leaves. In fact, the idea for storing up to 3
> offsets in the bitmap header could be done this way -- it would just
> be a (small) single-value leaf.

Agreed.

>
> (Reminder: Currently, fixed-length values are compile-time embeddable
> if the platform pointer size is big enough.)
>
> > If leaf
> > nodes have a bitmap array that indicates the corresponding slot is an
> > embedded value or a pointer to a value, it would be easy.
>
> That's the most general way to do it. We could do it much more easily
> with a pointer tag, although for the above idea it may require some
> endian-aware coding. Both were mentioned in the paper, I recall.

True. Probably we can use the combined pointer/value slots approach
only if the tree is able to use the pointer tagging. That is, if the
caller allows the tree to use one bit of the value.

I'm going to update the patch based on the recent discussion (RT_SET()
and variable-length values) etc., and post the patch set early next
week.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Dec 8, 2023 at 9:44 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Dec 8, 2023 at 7:46 PM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > On Fri, Dec 8, 2023 at 3:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > BTW Given that the actual value size can be calculated only by the
> > > caller, how does the tree know if the value is embedded or not? It's
> > > probably related to how to store combined pointer/value slots.
> >
> > Right, this is future work. At first, variable-length types will have
> > to be single-value leaves. In fact, the idea for storing up to 3
> > offsets in the bitmap header could be done this way -- it would just
> > be a (small) single-value leaf.
>
> Agreed.
>
> >
> > (Reminder: Currently, fixed-length values are compile-time embeddable
> > if the platform pointer size is big enough.)
> >
> > > If leaf
> > > nodes have a bitmap array that indicates the corresponding slot is an
> > > embedded value or a pointer to a value, it would be easy.
> >
> > That's the most general way to do it. We could do it much more easily
> > with a pointer tag, although for the above idea it may require some
> > endian-aware coding. Both were mentioned in the paper, I recall.
>
> True. Probably we can use the combined pointer/value slots approach
> only if the tree is able to use the pointer tagging. That is, if the
> caller allows the tree to use one bit of the value.
>
> I'm going to update the patch based on the recent discussion (RT_SET()
> and variable-length values) etc., and post the patch set early next
> week.

I've attached the updated patch set. From the previous patch set, I've
merged patches 0007 to 0010. The other changes, such as adding RT_GET(),
are still unmerged for now, for discussion. We can probably make them
follow-up patches as we discussed. Patches 0011 to 0015 are new changes
in the v44 patch set, which remove RT_SEARCH() and RT_SET() and add
support for variable-length values.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Dec 11, 2023 at 1:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> I've attached the updated patch set. From the previous patch set, I've
> merged patches 0007 to 0010. The other changes, such as adding RT_GET(),
> are still unmerged for now, for discussion. We can probably make them
> follow-up patches as we discussed. Patches 0011 to 0015 are new changes
> in the v44 patch set, which remove RT_SEARCH() and RT_SET() and add
> support for variable-length values.

This looks like the right direction, and I'm pleased it's not much
additional code on top of my last patch.

v44-0014:

+#ifdef RT_VARLEN_VALUE
+ /* XXX: need to choose block sizes? */
+ tree->leaf_ctx = AllocSetContextCreate(ctx,
+    "radix tree leaves",
+    ALLOCSET_DEFAULT_SIZES);
+#else
+ tree->leaf_ctx = SlabContextCreate(ctx,
+    "radix tree leaves",
+    RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
+    sizeof(RT_VALUE_TYPE));
+#endif /* RT_VARLEN_VALUE */

Choosing block size: Similar to what we've discussed previously around
DSA segments, we might model this on CreateWorkExprContext() in
src/backend/executor/execUtils.c. Maybe tid store can pass maint_w_m /
autovac_w_m (later work_mem for bitmap scan). RT_CREATE could set the
max block size to 1/16 of that, or less.
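
For example, loosely following CreateWorkExprContext() (the divisor and
the clamping below are assumptions to illustrate the idea;
mem_limit_bytes would be whatever limit tid store passes down):

Size    max_block_size = ALLOCSET_DEFAULT_MAXSIZE;

/* choose a max block size no larger than 1/16 of the memory limit */
while (16 * max_block_size > mem_limit_bytes)
    max_block_size >>= 1;

if (max_block_size < ALLOCSET_DEFAULT_INITSIZE)
    max_block_size = ALLOCSET_DEFAULT_INITSIZE;

tree->leaf_ctx = AllocSetContextCreate(ctx, "radix tree leaves",
                                       ALLOCSET_DEFAULT_MINSIZE,
                                       ALLOCSET_DEFAULT_INITSIZE,
                                       max_block_size);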

Also, it occurred to me that compile-time embeddable values don't need
a leaf context. I'm not sure how many places assume that there is
always a leaf context. If not many, it may be worth not creating one
here, just to be tidy.

+ size_t copysize;

- memcpy(leaf.local, value_p, sizeof(RT_VALUE_TYPE));
+ copysize = sizeof(RT_VALUE_TYPE);
+#endif
+
+ memcpy(leaf.local, value_p, copysize);

I'm not sure this indirection adds clarity. I guess the intent was to
keep from saying "memcpy" twice, but now the code has to say "copysize
= foo" twice.

For varlen case, we need to watch out for slowness because of memcpy.
Let's put that off for later testing, though. We may someday want to
avoid a memcpy call for the varlen case, so let's keep it flexible
here.

v44-0015:

+#define SizeOfBlocktableEntry (offsetof(

Unused.

+ char buf[MaxBlocktableEntrySize] = {0};

Zeroing this buffer is probably going to be expensive. Also see this
pre-existing comment:
/* WIP: slow, since it writes to memory for every bit */
page->words[wordnum] |= ((bitmapword) 1 << bitnum);

For this function (which will be vacuum-only, so we can assume
ordering), in the loop we can:
* declare the local bitmapword variable to be zero
* set the bits on it
* write it out to the right location when done.

Let's fix both of these at once.
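
Something like this sketch, which assumes sorted offsets (as vacuum
provides) and at least one offset:

bitmapword  word = 0;
int     wordnum = 0;

for (int i = 0; i < num_offsets; i++)
{
    int     this_wordnum = offsets[i] / BITS_PER_BITMAPWORD;

    if (this_wordnum != wordnum)
    {
        /* write out the finished word with a single store */
        page->words[wordnum] = word;
        /* the buffer is no longer pre-zeroed, so clear any skipped words */
        for (int w = wordnum + 1; w < this_wordnum; w++)
            page->words[w] = 0;
        word = 0;
        wordnum = this_wordnum;
    }
    word |= (bitmapword) 1 << (offsets[i] % BITS_PER_BITMAPWORD);
}
page->words[wordnum] = word;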

+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, blkno, (void *) page, page_len);
+ else
+ local_rt_set(ts->tree.local, blkno, (void *) page, page_len);

Is there a reason for "void *"? The declared parameter is
"RT_VALUE_TYPE *value_p" in 0014.
Also, since this function is for vacuum (and other uses will need a
new function), let's assert the returned bool is false.

Does iteration still work? If so, it's not too early to re-wire this
up with vacuum and see how it behaves.

Lastly, my compiler has a warning that CI doesn't have:

In file included from ../src/test/modules/test_radixtree/test_radixtree.c:121:
../src/include/lib/radixtree.h: In function ‘rt_find.isra’:
../src/include/lib/radixtree.h:2142:24: warning: ‘slot’ may be used
uninitialized [-Wmaybe-uninitialized]
 2142 |                 return (RT_VALUE_TYPE*) slot;
      |                        ^~~~~~~~~~~~~~~~~~~~~
../src/include/lib/radixtree.h:2112:23: note: ‘slot’ was declared here
 2112 |         RT_PTR_ALLOC *slot;
      |                       ^~~~



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Dec 12, 2023 at 11:53 AM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Mon, Dec 11, 2023 at 1:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > I've attached the updated patch set. From the previous patch set, I've
> > merged patches 0007 to 0010. The other changes, such as adding RT_GET(),
> > are still unmerged for now, for discussion. We can probably make them
> > follow-up patches as we discussed. Patches 0011 to 0015 are new changes
> > in the v44 patch set, which remove RT_SEARCH() and RT_SET() and add
> > support for variable-length values.
>
> This looks like the right direction, and I'm pleased it's not much
> additional code on top of my last patch.
>
> v44-0014:
>
> +#ifdef RT_VARLEN_VALUE
> + /* XXX: need to choose block sizes? */
> + tree->leaf_ctx = AllocSetContextCreate(ctx,
> +    "radix tree leaves",
> +    ALLOCSET_DEFAULT_SIZES);
> +#else
> + tree->leaf_ctx = SlabContextCreate(ctx,
> +    "radix tree leaves",
> +    RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
> +    sizeof(RT_VALUE_TYPE));
> +#endif /* RT_VARLEN_VALUE */
>
> Choosing block size: Similar to what we've discussed previously around
> DSA segments, we might model this on CreateWorkExprContext() in
> src/backend/executor/execUtils.c. Maybe tid store can pass maint_w_m /
> autovac_w_m (later work_mem for bitmap scan). RT_CREATE could set the
> max block size to 1/16 of that, or less.
>
> Also, it occurred to me that compile-time embeddable values don't need
> a leaf context. I'm not sure how many places assume that there is
> always a leaf context. If not many, it may be worth not creating one
> here, just to be tidy.
>
> + size_t copysize;
>
> - memcpy(leaf.local, value_p, sizeof(RT_VALUE_TYPE));
> + copysize = sizeof(RT_VALUE_TYPE);
> +#endif
> +
> + memcpy(leaf.local, value_p, copysize);
>
> I'm not sure this indirection adds clarity. I guess the intent was to
> keep from saying "memcpy" twice, but now the code has to say "copysize
> = foo" twice.
>
> For the varlen case, we need to watch out for slowness because of memcpy.
> Let's put that off for later testing, though. We may someday want to
> avoid a memcpy call for the varlen case, so let's keep it flexible
> here.
>
> v44-0015:
>
> +#define SizeOfBlocktableEntry (offsetof(
>
> Unused.
>
> + char buf[MaxBlocktableEntrySize] = {0};
>
> Zeroing this buffer is probably going to be expensive. Also see this
> pre-existing comment:
> /* WIP: slow, since it writes to memory for every bit */
> page->words[wordnum] |= ((bitmapword) 1 << bitnum);
>
> For this function (which will be vacuum-only, so we can assume
> ordering), in the loop we can:
> * declare the local bitmapword variable to be zero
> * set the bits on it
> * write it out to the right location when done.
>
> Let's fix both of these at once.
>
> + if (TidStoreIsShared(ts))
> + shared_rt_set(ts->tree.shared, blkno, (void *) page, page_len);
> + else
> + local_rt_set(ts->tree.local, blkno, (void *) page, page_len);
>
> Is there a reason for "void *"? The declared parameter is
> "RT_VALUE_TYPE *value_p" in 0014.
> Also, since this function is for vacuum (and other uses will need a
> new function), let's assert the returned bool is false.
>
> Does iteration still work? If so, it's not too early to re-wire this
> up with vacuum and see how it behaves.
>
> Lastly, my compiler has a warning that CI doesn't have:
>
> In file included from ../src/test/modules/test_radixtree/test_radixtree.c:121:
> ../src/include/lib/radixtree.h: In function ‘rt_find.isra’:
> ../src/include/lib/radixtree.h:2142:24: warning: ‘slot’ may be used
> uninitialized [-Wmaybe-uninitialized]
>  2142 |                 return (RT_VALUE_TYPE*) slot;
>       |                        ^~~~~~~~~~~~~~~~~~~~~
> ../src/include/lib/radixtree.h:2112:23: note: ‘slot’ was declared here
>  2112 |         RT_PTR_ALLOC *slot;
>       |                       ^~~~

Thank you for the comments! I agreed with all of them and incorporated
them into the attached latest patch set, v45.

In v45, 0001 - 0006 are from earlier versions but I've merged previous
updates. So the radix tree now has RT_SET() and RT_FIND() but not
RT_GET() and RT_SEARCH(). 0007 and 0008 are the updates from previous
versions that incorporated the above comments. 0009 patch integrates
tidstore with lazy vacuum. Note that DSA segment problem is not
resolved yet in this patch. 0010 and 0011 make the DSA initial/max
segment sizes configurable and have parallel vacuum specify both in
proportion to maintenance_work_mem. 0012 is a development-purpose
patch to make it easy to investigate bugs in tidstore. I'd like to
keep it in the patch set at least during development.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Dec 14, 2023 at 7:22 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> In v45, 0001 - 0006 are from earlier versions but I've merged previous
> updates. So the radix tree now has RT_SET() and RT_FIND() but not
> RT_GET() and RT_SEARCH(). 0007 and 0008 are the updates from previous
> versions that incorporated the above comments. 0009 patch integrates
> tidstore with lazy vacuum.

Excellent! I repeated a quick run of the small "test 1" with very low m_w_m from

https://www.postgresql.org/message-id/CAFBsxsHrvTPUK%3DC1%3DxweJjGujja4Xjfgva3C8jnW3Shz6RBnFg%40mail.gmail.com

...and got similar results, so we still have good space-efficiency on this test:

master:
INFO:  finished vacuuming "john.public.test": index scans: 9
system usage: CPU: user: 56.83 s, system: 9.36 s, elapsed: 119.62 s

v45:
INFO:  finished vacuuming "john.public.test": index scans: 1
system usage: CPU: user: 6.82 s, system: 2.05 s, elapsed: 10.89 s

More sparse TID distributions won't be as favorable, but we have ideas
to improve that in the future.

For my next steps, I will finish the node-shrinking behavior and save
it for a later patchset. It's not needed for tid store, but needs to happen
because of assumptions in the code. Also, some time ago, I think I
commented out RT_FREE_RECURSE to get something working, so I'll fix
it, and look at other fixmes and todos.

> Note that DSA segment problem is not
> resolved yet in this patch.

I remember you started a separate thread about this, but I don't think
it got any attention. Maybe reply with a "TLDR;" and share a patch to
allow controlling max segment size.

Some more comments:

v45-0003:

Since RT_ITERATE_NEXT_PTR works for tid store, do we even need
RT_ITERATE_NEXT anymore? The former should handle fixed-length values
just fine? If so, we should rename it to match the latter.

+ * The caller is responsible for locking/unlocking the tree in shared mode.

This is not new to v45, but this will come up again below. This needs
more explanation: Since we're returning a pointer (to support
variable-length values), the caller needs to maintain control until
it's finished with the value.

v45-0005:

+ * Regarding the concurrency support, we use a single LWLock for the TidStore.
+ * The TidStore is exclusively locked when inserting encoded tids to the
+ * radix tree or when resetting itself. When searching on the TidStore or
+ * doing the iteration, it is not locked but the underlying radix tree is
+ * locked in shared mode.

This is just stating facts without giving any reasons. Readers are
going to wonder why it's inconsistent. The "why" is much more
important than the "what". Even with that, this comment is also far
from the relevant parts, and so will get out of date. Maybe we can
just make sure each relevant function is explained individually.

v45-0007:

-RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, Size work_mem);

Tid store calls this max_bytes -- can we use that name here, too?
"work_mem" is highly specific.

- RT_PTR_ALLOC *slot;
+ RT_PTR_ALLOC *slot = NULL;

We have a macro for invalid pointer because of DSA.

v45-0008:

- if (off < 1 || off > MAX_TUPLES_PER_PAGE)
+ if (unlikely(off < 1 || off > MAX_TUPLES_PER_PAGE))
  elog(ERROR, "tuple offset out of range: %u", off);

This is a superfluous distraction, since the error path is located way
off in the cold segment of the binary.

v45-0009:

(just a few small things for now)

- * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
- *   vacrel->dead_items array.
+ * lazy_vacuum_heap_page() -- free page's LP_DEAD items.

I think we can keep as "listed in the TID store".

- * Allocate dead_items (either using palloc, or in dynamic shared memory).
- * Sets dead_items in vacrel for caller.
+ * Allocate a (local or shared) TidStore for storing dead TIDs. Sets dead_items
+ * in vacrel for caller.

I think we want to keep "in dynamic shared memory". It's still true.
I'm not sure anything needs to change here, actually.

 parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int max_offset, int elevel,
+ BufferAccessStrategy bstrategy)

It seems very strange to me that this function has to pass the
max_offset. In general, it's been simpler to assume we have a constant
max_offset, but in this case that fact is not helping. Something to
think about for later.

- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",

This should be signed int64.

v45-0010:

Thinking about this some more, I'm not sure we need to do anything
different for the *starting* segment size. (Controlling *max* size
does seem important, however.) For the corner case of m_w_m = 1MB,
it's fine if vacuum quits pruning immediately after (in effect) it
finds the DSA has gone to 2MB. It's not worth bothering with, IMO. If
the memory accounting starts >1MB because we're adding the trivial
size of some struct, let's just stop doing that. The segment
allocations are what we care about.

v45-0011:

+ /*
+ * max_bytes is forced to be at least 64kB, the current minimum valid
+ * value for the work_mem GUC.
+ */
+ max_bytes = Max(64 * 1024L, max_bytes);

Why? I believe I mentioned months ago that copying a hard-coded value
that can get out of sync is not maintainable, but I don't even see the
point of this part.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Dec 15, 2023 at 10:30 AM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Dec 14, 2023 at 7:22 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > In v45, 0001 - 0006 are from earlier versions but I've merged previous
> > updates. So the radix tree now has RT_SET() and RT_FIND() but not
> > RT_GET() and RT_SEARCH(). 0007 and 0008 are the updates from previous
> > versions that incorporated the above comments. 0009 patch integrates
> > tidstore with lazy vacuum.
>
> Excellent! I repeated a quick run of the small "test 1" with very low m_w_m from
>
> https://www.postgresql.org/message-id/CAFBsxsHrvTPUK%3DC1%3DxweJjGujja4Xjfgva3C8jnW3Shz6RBnFg%40mail.gmail.com
>
> ...and got similar results, so we still have good space-efficiency on this test:
>
> master:
> INFO:  finished vacuuming "john.public.test": index scans: 9
> system usage: CPU: user: 56.83 s, system: 9.36 s, elapsed: 119.62 s
>
> v45:
> INFO:  finished vacuuming "john.public.test": index scans: 1
> system usage: CPU: user: 6.82 s, system: 2.05 s, elapsed: 10.89 s

Thank you for testing it again. That's a very good result.

> For my next steps, I will finish the node-shrinking behavior and save
> it for a later patchset. It's not needed for tid store, but needs to happen
> because of assumptions in the code. Also, some time ago, I think I
> commented out RT_FREE_RECURSE to get something working, so I'll fix
> it, and look at other fixmes and todos.

Great!

>
> > Note that DSA segment problem is not
> > resolved yet in this patch.
>
> I remember you started a separate thread about this, but I don't think
> it got any attention. Maybe reply with a "TLDR;" and share a patch to
> allow controlling max segment size.

Yeah, I recalled that thread. Will send a reply.

>
> Some more comments:
>
> v45-0003:
>
> Since RT_ITERATE_NEXT_PTR works for tid store, do we even need
> RT_ITERATE_NEXT anymore? The former should handle fixed-length values
> just fine? If so, we should rename it to match the latter.

Agreed to rename it.

>
> + * The caller is responsible for locking/unlocking the tree in shared mode.
>
> This is not new to v45, but this will come up again below. This needs
> more explanation: Since we're returning a pointer (to support
> variable-length values), the caller needs to maintain control until
> it's finished with the value.

Will fix.

>
> v45-0005:
>
> + * Regarding the concurrency support, we use a single LWLock for the TidStore.
> + * The TidStore is exclusively locked when inserting encoded tids to the
> + * radix tree or when resetting itself. When searching on the TidStore or
> + * doing the iteration, it is not locked but the underlying radix tree is
> + * locked in shared mode.
>
> This is just stating facts without giving any reasons. Readers are
> going to wonder why it's inconsistent. The "why" is much more
> important than the "what". Even with that, this comment is also far
> from the relevant parts, and so will get out of date. Maybe we can
> just make sure each relevant function is explained individually.

Right, I'll fix it.

>
> v45-0007:
>
> -RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
> +RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, Size work_mem);
>
> Tid store calls this max_bytes -- can we use that name here, too?
> "work_mem" is highly specific.

While I agree that "work_mem" is highly specific, I avoided using
"max_bytes" in the radix tree because "max_bytes" sounds to me like
there is a memory limitation, but the radix tree doesn't actually
enforce one. It might be sufficient to mention that in a comment,
though.

>
> - RT_PTR_ALLOC *slot;
> + RT_PTR_ALLOC *slot = NULL;
>
> We have a macro for invalid pointer because of DSA.

Will fix.

>
> v45-0008:
>
> - if (off < 1 || off > MAX_TUPLES_PER_PAGE)
> + if (unlikely(off < 1 || off > MAX_TUPLES_PER_PAGE))
>   elog(ERROR, "tuple offset out of range: %u", off);
>
> This is a superfluous distraction, since the error path is located way
> off in the cold segment of the binary.

Okay, will remove it.

>
> v45-0009:
>
> (just a few small things for now)
>
> - * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
> - *   vacrel->dead_items array.
> + * lazy_vacuum_heap_page() -- free page's LP_DEAD items.
>
> I think we can keep as "listed in the TID store".
>
> - * Allocate dead_items (either using palloc, or in dynamic shared memory).
> - * Sets dead_items in vacrel for caller.
> + * Allocate a (local or shared) TidStore for storing dead TIDs. Sets dead_items
> + * in vacrel for caller.
>
> I think we want to keep "in dynamic shared memory". It's still true.
> I'm not sure anything needs to change here, actually.

Agreed with above comments. Will fix them.

>
>  parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
> - int nrequested_workers, int max_items,
> - int elevel, BufferAccessStrategy bstrategy)
> + int nrequested_workers, int vac_work_mem,
> + int max_offset, int elevel,
> + BufferAccessStrategy bstrategy)
>
> It seems very strange to me that this function has to pass the
> max_offset. In general, it's been simpler to assume we have a constant
> max_offset, but in this case that fact is not helping. Something to
> think about for later.

max_offset was previously used in the old TID encoding in tidstore.
Since tidstore now has an entry per block, I think we no longer need it.

>
> - (errmsg("scanned index \"%s\" to remove %d row versions",
> + (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
>
> This should be signed int64.

Will fix.

>
> v45-0010:
>
> Thinking about this some more, I'm not sure we need to do anything
> different for the *starting* segment size. (Controlling *max* size
> does seem important, however.) For the corner case of m_w_m = 1MB,
> it's fine if vacuum quits pruning immediately after (in effect) it
> finds the DSA has gone to 2MB. It's not worth bothering with, IMO. If
> the memory accounting starts >1MB because we're adding the trivial
> size of some struct, let's just stop doing that. The segment
> allocations are what we care about.

IIUC it's for work_mem, whose minimum value is 64kB.

>
> v45-0011:
>
> + /*
> + * max_bytes is forced to be at least 64kB, the current minimum valid
> + * value for the work_mem GUC.
> + */
> + max_bytes = Max(64 * 1024L, max_bytes);
>
> Why?

This is to avoid creating a radix tree with very little memory. The
minimum work_mem value is a reasonable lower bound that PostgreSQL
uses internally. It's actually copied from tuplesort.c.

>I believe I mentioned months ago that copying a hard-coded value
> that can get out of sync is not maintainable, but I don't even see the
> point of this part.

True.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Fri, Dec 15, 2023 at 3:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Dec 15, 2023 at 10:30 AM John Naylor <johncnaylorls@gmail.com> wrote:

> >  parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
> > - int nrequested_workers, int max_items,
> > - int elevel, BufferAccessStrategy bstrategy)
> > + int nrequested_workers, int vac_work_mem,
> > + int max_offset, int elevel,
> > + BufferAccessStrategy bstrategy)
> >
> > It seems very strange to me that this function has to pass the
> > max_offset. In general, it's been simpler to assume we have a constant
> > max_offset, but in this case that fact is not helping. Something to
> > think about for later.
>
> max_offset was previously used in the old TID encoding in tidstore.
> Since tidstore now has an entry per block, I think we no longer need it.

It's needed now to properly size the allocation of TidStoreIter which
contains...

+/* Result struct for TidStoreIterateNext */
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ int num_offsets;
+ OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
+} TidStoreIterResult;

Maybe we can palloc the offset array to be "almost always" big
enough, with logic to resize if needed? If it's not too hard, it
seems worth it to avoid churn in the parameter list.

> > v45-0010:
> >
> > Thinking about this some more, I'm not sure we need to do anything
> > different for the *starting* segment size. (Controlling *max* size
> > does seem important, however.) For the corner case of m_w_m = 1MB,
> > it's fine if vacuum quits pruning immediately after (in effect) it
> > finds the DSA has gone to 2MB. It's not worth bothering with, IMO. If
> > the memory accounting starts >1MB because we're adding the trivial
> > size of some struct, let's just stop doing that. The segment
> > allocations are what we care about.
>
> IIUC it's for work_mem, whose minimum value is 64kB.
>
> >
> > v45-0011:
> >
> > + /*
> > + * max_bytes is forced to be at least 64kB, the current minimum valid
> > + * value for the work_mem GUC.
> > + */
> > + max_bytes = Max(64 * 1024L, max_bytes);
> >
> > Why?
>
> This is to avoid creating a radix tree with very little memory. The
> minimum work_mem value is a reasonable lower bound that PostgreSQL
> uses internally. It's actually copied from tuplesort.c.

There is no explanation for why it should be done like tuplesort.c. Also...

- tree->leaf_ctx = SlabContextCreate(ctx,
-    "radix tree leaves",
-    RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
-    sizeof(RT_VALUE_TYPE));
+ tree->leaf_ctx = SlabContextCreate(ctx,
+    "radix tree leaves",
+    Min(RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
+    work_mem),
+    sizeof(RT_VALUE_TYPE));

At first, my eyes skipped over this apparent re-indent, but hidden
inside here is another (undocumented) attempt to clamp the size of
something. There are too many of these sprinkled in various places,
and they're already a maintenance hazard -- a different one was left
behind in v45-0011:

@@ -201,6 +183,7 @@ TidStoreCreate(size_t max_bytes, int max_off,
dsa_area *area)
    ts->control->max_bytes = max_bytes - (70 * 1024);
  }

Let's do it in just one place. In TidStoreCreate(), do

/* clamp max_bytes to at least the size of the empty tree with
allocated blocks, so it doesn't immediately appear full */
ts->control->max_bytes = Max(max_bytes, {rt, shared_rt}_memory_usage);

Then we can get rid of all the worry about 1MB/2MB, 64kB, 70kB -- all that.
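
Spelled out for both cases, that might look like the following sketch,
assuming the generated local/shared function names used elsewhere in
the patch:

if (TidStoreIsShared(ts))
    ts->control->max_bytes = Max(max_bytes,
                                 shared_rt_memory_usage(ts->tree.shared));
else
    ts->control->max_bytes = Max(max_bytes,
                                 local_rt_memory_usage(ts->tree.local));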

I may not recall everything while writing this, but it seems the only
other thing we should be clamping is the max aset block size (solved)
/ max DSM segment size (in progress).



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Dec 18, 2023 at 3:41 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Fri, Dec 15, 2023 at 3:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Dec 15, 2023 at 10:30 AM John Naylor <johncnaylorls@gmail.com> wrote:
>
> > >  parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
> > > - int nrequested_workers, int max_items,
> > > - int elevel, BufferAccessStrategy bstrategy)
> > > + int nrequested_workers, int vac_work_mem,
> > > + int max_offset, int elevel,
> > > + BufferAccessStrategy bstrategy)
> > >
> > > It seems very strange to me that this function has to pass the
> > > max_offset. In general, it's been simpler to assume we have a constant
> > > max_offset, but in this case that fact is not helping. Something to
> > > think about for later.
> >
> > max_offset was previously used in the old TID encoding in tidstore.
> > Since tidstore now has an entry per block, I think we no longer need it.
>
> It's needed now to properly size the allocation of TidStoreIter which
> contains...
>
> +/* Result struct for TidStoreIterateNext */
> +typedef struct TidStoreIterResult
> +{
> + BlockNumber blkno;
> + int num_offsets;
> + OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
> +} TidStoreIterResult;
>
> Maybe we can palloc the offset array to be "almost always" big
> enough, with logic to resize if needed? If it's not too hard, it
> seems worth it to avoid churn in the parameter list.

Yes, I was thinking of that.

>
> > > v45-0010:
> > >
> > > Thinking about this some more, I'm not sure we need to do anything
> > > different for the *starting* segment size. (Controlling *max* size
> > > does seem important, however.) For the corner case of m_w_m = 1MB,
> > > it's fine if vacuum quits pruning immediately after (in effect) it
> > > finds the DSA has gone to 2MB. It's not worth bothering with, IMO. If
> > > the memory accounting starts >1MB because we're adding the trivial
> > > size of some struct, let's just stop doing that. The segment
> > > allocations are what we care about.
> >
> > IIUC it's for work_mem, whose minimum value is 64kB.
> >
> > >
> > > v45-0011:
> > >
> > > + /*
> > > + * max_bytes is forced to be at least 64kB, the current minimum valid
> > > + * value for the work_mem GUC.
> > > + */
> > > + max_bytes = Max(64 * 1024L, max_bytes);
> > >
> > > Why?
> >
> > This is to avoid creating a radix tree with very little memory. The
> > minimum work_mem value is a reasonable lower bound that PostgreSQL
> > uses internally. It's actually copied from tuplesort.c.
>
> There is no explanation for why it should be done like tuplesort.c. Also...
>
> - tree->leaf_ctx = SlabContextCreate(ctx,
> -    "radix tree leaves",
> -    RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
> -    sizeof(RT_VALUE_TYPE));
> + tree->leaf_ctx = SlabContextCreate(ctx,
> +    "radix tree leaves",
> +    Min(RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
> +    work_mem),
> +    sizeof(RT_VALUE_TYPE));
>
> At first, my eyes skipped over this apparent re-indent, but hidden
> inside here is another (undocumented) attempt to clamp the size of
> something. There are too many of these sprinkled in various places,
> and they're already a maintenance hazard -- a different one was left
> behind in v45-0011:
>
> @@ -201,6 +183,7 @@ TidStoreCreate(size_t max_bytes, int max_off,
> dsa_area *area)
>     ts->control->max_bytes = max_bytes - (70 * 1024);
>   }
>
> Let's do it in just one place. In TidStoreCreate(), do
>
> /* clamp max_bytes to at least the size of the empty tree with
> allocated blocks, so it doesn't immediately appear full */
> ts->control->max_bytes = Max(max_bytes, {rt, shared_rt}_memory_usage);
>
> Then we can get rid of all the worry about 1MB/2MB, 64kB, 70kB -- all that.

But doesn't it mean that even if we create a shared tidstore with
small memory, say 64kB, it actually uses 1MB?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Tue, Dec 19, 2023 at 12:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Dec 18, 2023 at 3:41 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > Let's do it in just one place. In TidStoreCreate(), do
> >
> > /* clamp max_bytes to at least the size of the empty tree with
> > allocated blocks, so it doesn't immediately appear full */
> > ts->control->max_bytes = Max(max_bytes, {rt, shared_rt}_memory_usage);
> >
> > Then we can get rid of all the worry about 1MB/2MB, 64kB, 70kB -- all that.
>
> But doesn't it mean that even if we create a shared tidstore with
> small memory, say 64kB, it actually uses 1MB?

This sounds like an argument for controlling the minimum DSA segment
size. (I'm not really in favor of that, but I'm open to others' opinions.)

I wasn't talking about that above -- I was saying we should have only
one place where we clamp max_bytes so that the tree doesn't
immediately appear full.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Dec 19, 2023 at 4:37 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Tue, Dec 19, 2023 at 12:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Dec 18, 2023 at 3:41 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > > Let's do it in just one place. In TidStoreCreate(), do
> > >
> > > /* clamp max_bytes to at least the size of the empty tree with
> > > allocated blocks, so it doesn't immediately appear full */
> > > ts->control->max_bytes = Max(max_bytes, {rt, shared_rt}_memory_usage);
> > >
> > > Then we can get rid of all the worry about 1MB/2MB, 64kB, 70kB -- all that.
> >
> > But doesn't it mean that even if we create a shared tidstore with
> > small memory, say 64kB, it actually uses 1MB?
>
> This sounds like an argument for controlling the minimum DSA segment
> size. (I'm not really in favor of that, but open to others' opinion)
>
> I wasn't talking about that above -- I was saying we should have only
> one place where we clamp max_bytes so that the tree doesn't
> immediately appear full.

Thank you for your clarification. Understood.

I've updated the patch set, incorporating the comments I got so far.
Patches 0007, 0008, and 0012 are updates from the v45 patch set.
In addition to the review comments, I made some changes in tidstore to
make it independent from heap. Specifically, it uses MaxOffsetNumber
instead of MaxHeapTuplesPerPage. Now we don't need to include
htup_details.h. It enlarged MaxBlocktableEntrySize but it's still 272
bytes.

BTW regarding the previous comment I got before:

> - RT_PTR_ALLOC *slot;
> + RT_PTR_ALLOC *slot = NULL;
>
> We have a macro for invalid pointer because of DSA.

I think that since *slot is a pointer to a RT_PTR_ALLOC it's okay to set NULL.

As for the initial and maximum DSA segment sizes, I've sent a summary
on that thread:

https://www.postgresql.org/message-id/CAD21AoCVMw6DSmgZY9h%2BxfzKtzJeqWiwxaUD2T-FztVcV-XibQ%40mail.gmail.com

I'm going to update RT_DUMP() and RT_DUMP_NODE() codes for the next step.


Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Wed, Dec 20, 2023 at 6:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I've updated the patch set, incorporating the comments I got so far.
> Patches 0007, 0008, and 0012 are updates from the v45 patch set.
> In addition to the review comments, I made some changes in tidstore to
> make it independent from heap. Specifically, it uses MaxOffsetNumber
> instead of MaxHeapTuplesPerPage. Now we don't need to include
> htup_details.h. It enlarged MaxBlocktableEntrySize but it's still 272
> bytes.

That's a good idea.

> BTW regarding the previous comment I got before:
>
> > - RT_PTR_ALLOC *slot;
> > + RT_PTR_ALLOC *slot = NULL;
> >
> > We have a macro for invalid pointer because of DSA.
>
> I think that since *slot is a pointer to a RT_PTR_ALLOC it's okay to set NULL.

Ah right, it's the address of the slot.

> I'm going to update RT_DUMP() and RT_DUMP_NODE() codes for the next step.

That could probably use some discussion. A few months ago, I found the
debugging functions only worked when everything else worked. When
things weren't working, I had to rip one of these functions apart so
it only looked at one node. If something is broken, we can't count on
recursion or iteration working, because we won't get that far. I don't
remember how things are in the current patch.

I've finished the node shrinking and addressed some fixme/todo areas
-- can I share these and squash your v46 changes first?



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Dec 21, 2023 at 10:19 AM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Wed, Dec 20, 2023 at 6:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I've updated the patch set, incorporating the comments I got so far.
> > Patches 0007, 0008, and 0012 are updates from the v45 patch set.
> > In addition to the review comments, I made some changes in tidstore to
> > make it independent from heap. Specifically, it uses MaxOffsetNumber
> > instead of MaxHeapTuplesPerPage. Now we don't need to include
> > htup_details.h. It enlarged MaxBlocktableEntrySize but it's still 272
> > bytes.
>
> That's a good idea.
>
> > BTW regarding the previous comment I got before:
> >
> > > - RT_PTR_ALLOC *slot;
> > > + RT_PTR_ALLOC *slot = NULL;
> > >
> > > We have a macro for invalid pointer because of DSA.
> >
> > I think that since *slot is a pointer to a RT_PTR_ALLOC it's okay to set NULL.
>
> Ah right, it's the address of the slot.
>
> > I'm going to update RT_DUMP() and RT_DUMP_NODE() codes for the next step.
>
> That could probably use some discussion. A few months ago, I found the
> debugging functions only worked when everything else worked. When
> things weren't working, I had to rip one of these functions apart so
> it only looked at one node. If something is broken, we can't count on
> recursion or iteration working, because we won't get that far. I don't
> remember how things are in the current patch.

Agreed.

I found the following comment and wanted to discuss:

// this might be better as "iterate over nodes", plus a callback to
RT_DUMP_NODE,
// which should really only concern itself with single nodes
RT_SCOPE void
RT_DUMP(RT_RADIX_TREE *tree)

If it means we need to somehow use the iteration functions also for
dumping the whole tree, it would probably require refactoring the
iteration code so that RT_DUMP() can use it while dumping visited
nodes. But we need to be careful not to add overhead to iteration
performance.

>
> I've finished the node shrinking and addressed some fixme/todo areas
> -- can I share these and squash your v46 changes first?

Cool! Yes, please do so.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Dec 21, 2023 at 8:33 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I found the following comment and wanted to discuss:
>
> // this might be better as "iterate over nodes", plus a callback to
> RT_DUMP_NODE,
> // which should really only concern itself with single nodes
> RT_SCOPE void
> RT_DUMP(RT_RADIX_TREE *tree)
>
> If it means we need to somehow use the iteration functions also for
> dumping the whole tree, it would probably require refactoring the
> iteration code so that RT_DUMP() can use it while dumping visited
> nodes. But we need to be careful not to add overhead to iteration
> performance.

Yeah, some months ago I thought a callback interface would make some
things easier. I don't think we need that at the moment (possibly
never), so that comment can be just removed. As far as these debug
functions, I only found useful the stats and dumping a single node,
FWIW.

I've attached v47, which is v46 plus some fixes for radix tree.

0004 - moves everything for "delete" to the end -- gradually other
things will be grouped together in a sensible order

0005 - trivial

0006 - shrink nodes -- still needs testing, but nothing crashes yet.
This shows some renaming might be good: Previously we had
RT_CHUNK_CHILDREN_ARRAY_COPY for growing nodes, but for shrinking I've
added RT_COPY_ARRAYS_AND_DELETE, since the deletion happens by simply
not copying the slot to be deleted. This means when growing it would
be more clear to call the former RT_COPY_ARRAYS_FOR_INSERT, since that
reserves a new slot for the caller in the new node, but the caller
must do the insert itself. Note that there are some practical
restrictions/best-practices on whether shrinking should happen after
deletion or vice versa. Hopefully it's clear, but let me know if the
description can be improved. Also, it doesn't yet shrink from size
class 32 to 16, but it could with a bit of work.
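
To illustrate the naming rationale, here is a simplified sketch of the
shrink-time copy, where "deletion" is just skipping one slot; the
types and signature are approximations, not the patch's exact code:

static inline void
RT_COPY_ARRAYS_AND_DELETE(uint8 *dst_chunks, RT_PTR_ALLOC *dst_children,
                          uint8 *src_chunks, RT_PTR_ALLOC *src_children,
                          int count, int deletepos)
{
    for (int i = 0; i < count - 1; i++)
    {
        /* skip over the slot being deleted */
        int     srcidx = (i < deletepos) ? i : i + 1;

        dst_chunks[i] = src_chunks[srcidx];
        dst_children[i] = src_children[srcidx];
    }
}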

0007 - trivial, but could use a better comment. I also need to make
sure stats reporting works (may also need some cleanup work).

0008 - fixes RT_FREE_RECURSE -- I believe you wondered some months ago
if DSA could just free all our allocated segments without throwing
away the DSA, and that's still a good question.

0009 - fixes the assert in RT_ITER_SET_NODE_FROM (btw, I don't think
this name is better than RT_UPDATE_ITER_STACK, so maybe we should go
back to that). The assert doesn't fire, so I guess it does what it's
supposed to? For me, the iteration logic is still the most confusing
piece out of the whole radix tree. Maybe that could be helped with
some better variable names, but I wonder if it needs more invasive
work. I confess I don't have better ideas for how it would work
differently.

0010 - some fixes for number of children accounting in node256

0011 - Long overdue pgindent of radixtree.h, without trying to fix up
afterwards. Feel free to throw out and redo if this interferes with
ongoing work.

The rest are from your v46. The bench doesn't work for tid store
anymore, so I squashed "disable bench for CI" until we get back to
that. Some more review comments (note: patch numbers are for v47, but
I changed nothing from v46 in this area):

0013:

+ * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value,
+ * and stored in the radix tree.

Recently outdated. The variable length values seems to work, so let's
make everything match.

+#define MAX_TUPLES_PER_PAGE  MaxOffsetNumber

Maybe we don't need this macro anymore? The name no longer fits, in any case.

+TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ char buf[MaxBlocktableEntrySize];
+ BlocktableEntry *page = (BlocktableEntry *) buf;

I'm not sure this is safe with alignment. Maybe rather than plain
"char", it needs to be a union with BlocktableEntry, or something.

+static inline BlocktableEntry *
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key);
+
+ return local_rt_iterate_next(iter->tree_iter.local, key);
+}

In the old encoding scheme, this function did something important, but
now it's a useless wrapper with one caller.

+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+
+ return sizeof(TidStore) + sizeof(TidStore) +
local_rt_memory_usage(ts->tree.local);

I don't see the point in including these tiny structs, since we will
always blow past the limit by a number of kilobytes (at least, often
megabytes or more) at the time it happens.

+ iter->output.max_offset = 64;

Maybe needs a comment that this is just some starting size and not
anything particular.

+ iter->output.offsets = palloc(sizeof(OffsetNumber) * iter->output.max_offset);

+ /* Make sure there is enough space to add offsets */
+ if (result->num_offsets + bmw_popcount(w) > result->max_offset)
+ {
+ result->max_offset *= 2;
+ result->offsets = repalloc(result->offsets,
+    sizeof(OffsetNumber) * result->max_offset);
+ }

popcount()-ing for every array element in every value is expensive --
let's just add sizeof(bitmapword). It's not that wasteful, but then
the initial max will need to be 128.
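
In other words, something like this sketch, where BITS_PER_BITMAPWORD
bounds how many offsets a single word can add:

/* Reserve room for a full bitmapword instead of popcount()-ing it. */
if (result->num_offsets + BITS_PER_BITMAPWORD > result->max_offset)
{
    result->max_offset *= 2;
    result->offsets = repalloc(result->offsets,
                               sizeof(OffsetNumber) * result->max_offset);
}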

About separation of responsibilities for locking: The only thing
currently where the tid store is not locked is tree iteration. That's
a strange exception. Also, we've recently made RT_FIND return a
pointer, so the caller must somehow hold a share lock, but I think we
haven't given callers the ability to do that, and we rely on the tid
store lock for that. We have a mix of tree locking and tid store
locking. We will need to consider carefully how to make this more
clear, maintainable, and understandable.

0015:

"XXX: some regression test fails since this commit changes the minimum
m_w_m to 2048 from 1024. This was necessary for the previous memory"

This shouldn't fail anymore if the "one-place" clamp was in a patch
before this. If so, let's take out that GUC change and worry about
min/max size separately. If it still fails, I'd like to know why.

- *     lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
- *                                               vacrel->dead_items array.
+ *     lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in
the TID store.

What I was getting at earlier is that the first line here doesn't
really need to change, we can just s/array/store/ ?

-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
-                                         int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+                                         OffsetNumber *deadoffsets,
int num_offsets, Buffer buffer,
+                                         Buffer vmbuffer)

"buffer" should still come after "blkno", so that line doesn't need to change.

$ git diff master -- src/backend/access/heap/ | grep has_lpdead_items
- bool has_lpdead_items; /* includes existing LP_DEAD items */
- * pruning and freezing.  all_visible implies !has_lpdead_items, but don't
- Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
- if (prunestate.has_lpdead_items)
- else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
- if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
- prunestate->has_lpdead_items = false;
- prunestate->has_lpdead_items = true;

In a green field, it'd be fine to replace these with an expression of
"num_offsets", but it adds a bit of noise for reviewers and the git
log. Is it really necessary?

-                       deadoffsets[lpdead_items++] = offnum;
+
prunestate->deadoffsets[prunestate->num_offsets++] = offnum;

 I'm also not quite sure why "deadoffsets" and "lpdead_items" got
moved to the PruneState. The latter was renamed in a way that makes
more sense, but I don't see why the churn is necessary.

@@ -1875,28 +1882,9 @@ lazy_scan_prune(LVRelState *vacrel,
        }
 #endif

-       /*
-        * Now save details of the LP_DEAD items from the page in vacrel
-        */
-       if (lpdead_items > 0)
+       if (prunestate->num_offsets > 0)
        {
-               VacDeadItems *dead_items = vacrel->dead_items;
-               ItemPointerData tmp;
-
                vacrel->lpdead_item_pages++;
-               prunestate->has_lpdead_items = true;
-
-               ItemPointerSetBlockNumber(&tmp, blkno);
-
-               for (int i = 0; i < lpdead_items; i++)
-               {
-                       ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
-                       dead_items->items[dead_items->num_items++] = tmp;
-               }
-
-               Assert(dead_items->num_items <= dead_items->max_items);
-               pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
-
  dead_items->num_items);

I don't understand why this block got removed and nothing new is
adding anything to the tid store.

@@ -1087,7 +1088,16 @@ lazy_scan_heap(LVRelState *vacrel)
                         * with prunestate-driven visibility map and
FSM steps (just like
                         * the two-pass strategy).
                         */
-                       Assert(dead_items->num_items == 0);
+                       Assert(TidStoreNumTids(dead_items) == 0);
+               }
+               else if (prunestate.num_offsets > 0)
+               {
+                       /* Save details of the LP_DEAD items from the
page in dead_items */
+                       TidStoreSetBlockOffsets(dead_items, blkno,
prunestate.deadoffsets,
+
 prunestate.num_offsets);
+
+
pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+
          TidStoreMemoryUsage(dead_items));

I guess it was added here, 800 lines away? If so, why?

About progress reporting: I want to make sure no one is going to miss
counting "num_dead_tuples". It's no longer relevant for the number of
index scans we need to do, but do admins still have a use for it?
Something to think about later.

0017

+ /*
+ * max_bytes is forced to be at least 64kB, the current minimum valid
+ * value for the work_mem GUC.
+ */
+ max_bytes = Max(64 * 1024L, max_bytes);

If this still needs to be here, I still don't understand why.

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
Hi,

On 2023-12-21 14:41:37 +0700, John Naylor wrote:
> I've attached v47, which is v46 plus some fixes for radix tree.

Could either of you summarize what the design changes you've made in the last
months are and why you've done them? Unfortunately this thread is very long,
and the comments in the file just say "FIXME" in places that apparently are
affected by design changes.  This makes it hard to catch up here.

Greetings,

Andres Freund



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Dec 21, 2023 at 6:27 PM Andres Freund <andres@anarazel.de> wrote:
>
> Could either of you summarize what the design changes you've made in the last
> months are and why you've done them? Unfortunately this thread is very long,
> and the comments in the file just say "FIXME" in places that apparently are
> affected by design changes.  This makes it hard to catch up here.

I'd be happy to try, since we are about due for a summary. I was also
hoping to reach a coherent-enough state sometime in early January to
request your feedback, so good timing. Not sure how much detail to go
into, but here goes:

Back in May [1], the method of value storage shifted towards "combined
pointer-value slots", which was described and recommended in the
paper. There were some other changes for simplicity and efficiency,
but none as far-reaching as this.

This is enabled by using the template architecture that we adopted
long ago for different reasons. Fixed length values are either stored
in the slot of the last-level node (if the value fits into the
platform's pointer), or are a "single-value" leaf (otherwise).
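
As a rough illustration of a combined pointer-value slot (the names
and tagging scheme here are invented, not necessarily the patch's):

/*
 * A slot either holds a pointer to a separately allocated leaf, or
 * the value itself when the value type fits in a pointer; a tag bit
 * distinguishes the two cases.
 */
static inline bool
slot_is_embedded_value(uintptr_t slot)
{
    return (slot & 1) != 0;     /* low bit set: value stored in-line */
}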

For tid store, we want to eventually support bitmap heap scans (in
addition to vacuum), and in doing so make it independent of heap AM.
That means value types similar to PageTableEntry in tidbitmap.c, but
with a variable number of bitmapwords.

That required the radix tree to support variable-length values. That has
been the main focus in the last several months, and it basically works
now.

To my mind, the biggest architectural issues in the patch today are:

- Variable-length values mean that pointers are passed around in
places. This will require some shifting responsibility for locking to
the caller, or longer-term maybe a callback interface. (This is new,
the below are pre-existing issues.)
- The tid store has its own "control object" (when shared memory is
needed) with its own lock, in addition to the same for the associated
radix tree. This leads to unnecessary double-locking. This area needs
some attention.
- Memory accounting is still unsettled. The current thinking is to cap
max block/segment size, scaled to a fraction of m_w_m, but there are
still open questions.

There has been some recent effort toward finishing work started
earlier, like shrinking nodes. There are a couple of places that could
still use either simplification or optimization, but they otherwise
work fine.
Most of the remaining fixmes/todos/wips are trivial; a few are
actually outdated now that I look again, and will be removed shortly.
The regression tests could use some tidying up.

-John

[1] https://www.postgresql.org/message-id/CAFBsxsFyWLxweHVDtKb7otOCR4XdQGYR4b%2B9svxpVFnJs08BmQ%40mail.gmail.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Dec 21, 2023 at 4:41 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Dec 21, 2023 at 8:33 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I found the following comment and wanted to discuss:
> >
> > // this might be better as "iterate over nodes", plus a callback to
> > RT_DUMP_NODE,
> > // which should really only concern itself with single nodes
> > RT_SCOPE void
> > RT_DUMP(RT_RADIX_TREE *tree)
> >
> > If it means we need to somehow use the iteration functions also for
> > dumping the whole tree, it would probably require refactoring the
> > iteration code so that RT_DUMP() can use it while dumping visited
> > nodes. But we need to be careful not to add overhead to iteration
> > performance.
>
> Yeah, some months ago I thought a callback interface would make some
> things easier. I don't think we need that at the moment (possibly
> never), so that comment can be just removed. As far as these debug
> functions, I only found useful the stats and dumping a single node,
> FWIW.
>
> I've attached v47, which is v46 plus some fixes for radix tree.
>
> 0004 - moves everything for "delete" to the end -- gradually other
> things will be grouped together in a sensible order
>
> 0005 - trivial

LGTM.

>
> 0006 - shrink nodes -- still needs testing, but nothing crashes yet.

Cool. The coverage test results showed that the shrink code is also covered.

> This shows some renaming might be good: Previously we had
> RT_CHUNK_CHILDREN_ARRAY_COPY for growing nodes, but for shrinking I've
> added RT_COPY_ARRAYS_AND_DELETE, since the deletion happens by simply
> not copying the slot to be deleted. This means when growing it would
> be more clear to call the former RT_COPY_ARRAYS_FOR_INSERT, since that
> reserves a new slot for the caller in the new node, but the caller
> must do the insert itself.

Agreed.

> Note that there are some practical
> restrictions/best-practices on whether shrinking should happen after
> deletion or vice versa. Hopefully it's clear, but let me know if the
> description can be improved. Also, it doesn't yet shrink from size
> class 32 to 16, but it could with a bit of work.

Sounds reasonable.

>
> 0007 - trivial, but could use a better comment. I also need to make
> sure stats reporting works (may also need some cleanup work).
>
> 0008 - fixes RT_FREE_RECURSE -- I believe you wondered some months ago
> if DSA could just free all our allocated segments without throwing
> away the DSA, and that's still a good question.

LGTM.

>
> 0009 - fixes the assert in RT_ITER_SET_NODE_FROM (btw, I don't think
> this name is better than RT_UPDATE_ITER_STACK, so maybe we should go
> back to that).

Will rename it.

> The assert doesn't fire, so I guess it does what it's
> supposed to?

Yes.

> For me, the iteration logic is still the most confusing
> piece out of the whole radix tree. Maybe that could be helped with
> some better variable names, but I wonder if it needs more invasive
> work.

True. Maybe more comments would also help.

>
> 0010 - some fixes for number of children accounting in node256
>
> 0011 - Long overdue pgindent of radixtree.h, without trying to fix up
> afterwards. Feel free to throw out and redo if this interferes with
> ongoing work.
>

LGTM.

I'm working on the review comments below, and most of them are
already incorporated in my local branch:

> The rest are from your v46. The bench doesn't work for tid store
> anymore, so I squashed "disable bench for CI" until we get back to
> that. Some more review comments (note: patch numbers are for v47, but
> I changed nothing from v46 in this area):
>
> 0013:
>
> + * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value,
> + * and stored in the radix tree.
>
> Recently outdated. The variable length values seems to work, so let's
> make everything match.
>
> +#define MAX_TUPLES_PER_PAGE  MaxOffsetNumber
>
> Maybe we don't need this macro anymore? The name no longer fits, in any case.

Removed.

>
> +TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
> + int num_offsets)
> +{
> + char buf[MaxBlocktableEntrySize];
> + BlocktableEntry *page = (BlocktableEntry *) buf;
>
> I'm not sure this is safe with alignment. Maybe rather than plain
> "char", it needs to be a union with BlocktableEntry, or something.

I tried it in the new patch set, but could you explain why it might
not be safe with alignment?

>
> +static inline BlocktableEntry *
> +tidstore_iter_kv(TidStoreIter *iter, uint64 *key)
> +{
> + if (TidStoreIsShared(iter->ts))
> + return shared_rt_iterate_next(iter->tree_iter.shared, key);
> +
> + return local_rt_iterate_next(iter->tree_iter.local, key);
> +}
>
> In the old encoding scheme, this function did something important, but
> now it's a useless wrapper with one caller.

Removed.

>
> + /*
> + * In the shared case, TidStoreControl and radix_tree are backed by the
> + * same DSA area and rt_memory_usage() returns the value including both.
> + * So we don't need to add the size of TidStoreControl separately.
> + */
> + if (TidStoreIsShared(ts))
> + return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
> +
> + return sizeof(TidStore) + sizeof(TidStore) +
> local_rt_memory_usage(ts->tree.local);
>
> I don't see the point in including these tiny structs, since we will
> always blow past the limit by a number of kilobytes (at least, often
> megabytes or more) at the time it happens.

Agreed, removed.

>
> + iter->output.max_offset = 64;
>
> Maybe needs a comment that this is just some starting size and not
> anything particular.
>
> + iter->output.offsets = palloc(sizeof(OffsetNumber) * iter->output.max_offset);
>
> + /* Make sure there is enough space to add offsets */
> + if (result->num_offsets + bmw_popcount(w) > result->max_offset)
> + {
> + result->max_offset *= 2;
> + result->offsets = repalloc(result->offsets,
> +    sizeof(OffsetNumber) * result->max_offset);
> + }
>
> popcount()-ing for every array element in every value is expensive --
> let's just add sizeof(bitmapword). It's not that wasteful, but then
> the initial max will need to be 128.

Good idea.

>
> About separation of responsibilities for locking: The only thing
> currently where the tid store is not locked is tree iteration. That's
> a strange exception. Also, we've recently made RT_FIND return a
> pointer, so the caller must somehow hold a share lock, but I think we
> haven't given callers the ability to do that, and we rely on the tid
> store lock for that. We have a mix of tree locking and tid store
> locking. We will need to consider carefully how to make this more
> clear, maintainable, and understandable.

Yes, tidstore should be locked during the iteration.

One simple direction about locking is that the radix tree has the lock
but no APIs hold/release it. It's the caller's responsibility. If a
data structure using a radix tree for its storage has its own lock
(like tidstore), it can use it instead of the radix tree's one. A
downside would be that it's probably hard to support a better locking
algorithm such as ROWEX in the radix tree. Another variant of the
APIs that does the locking/unlocking internally might also help.
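
For illustration, caller-managed locking might look like this at a
call site (the wrapper names are hypothetical):

BlocktableEntry *page;

TidStoreLockShare(ts);          /* thin wrapper over the tree's lock */
page = shared_rt_find(ts->tree.shared, (uint64) blkno);
/* ... read *page while the share lock is held ... */
TidStoreUnlock(ts);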

>
> 0015:
>
> "XXX: some regression test fails since this commit changes the minimum
> m_w_m to 2048 from 1024. This was necessary for the previous memory"
>
> This shouldn't fail anymore if the "one-place" clamp was in a patch
> before this. If so, let's take out that GUC change and worry about
> min/max size separately. If it still fails, I'd like to know why.

Agreed.

>
> - *     lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
> - *                                               vacrel->dead_items array.
> + *     lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in
> the TID store.
>
> What I was getting at earlier is that the first line here doesn't
> really need to change, we can just s/array/store/ ?

Fixed.

>
> -static int
> -lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
> -                                         int index, Buffer vmbuffer)
> +static void
> +lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
> +                                         OffsetNumber *deadoffsets,
> int num_offsets, Buffer buffer,
> +                                         Buffer vmbuffer)
>
> "buffer" should still come after "blkno", so that line doesn't need to change.

Fixed.

>
> $ git diff master -- src/backend/access/heap/ | grep has_lpdead_items
> - bool has_lpdead_items; /* includes existing LP_DEAD items */
> - * pruning and freezing.  all_visible implies !has_lpdead_items, but don't
> - Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
> - if (prunestate.has_lpdead_items)
> - else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
> - if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
> - prunestate->has_lpdead_items = false;
> - prunestate->has_lpdead_items = true;
>
> In a green field, it'd be fine to replace these with an expression of
> "num_offsets", but it adds a bit of noise for reviewers and the git
> log. Is it really necessary?

I see your point. I think we can live with having both
has_lpdead_items and num_offsets, but we would have to keep these
values consistent, which could be less maintainable.

>
> -                       deadoffsets[lpdead_items++] = offnum;
> +
> prunestate->deadoffsets[prunestate->num_offsets++] = offnum;
>
>  I'm also not quite sure why "deadoffsets" and "lpdead_items" got
> moved to the PruneState. The latter was renamed in a way that makes
> more sense, but I don't see why the churn is necessary.
>
> @@ -1875,28 +1882,9 @@ lazy_scan_prune(LVRelState *vacrel,
>         }
>  #endif
>
> -       /*
> -        * Now save details of the LP_DEAD items from the page in vacrel
> -        */
> -       if (lpdead_items > 0)
> +       if (prunestate->num_offsets > 0)
>         {
> -               VacDeadItems *dead_items = vacrel->dead_items;
> -               ItemPointerData tmp;
> -
>                 vacrel->lpdead_item_pages++;
> -               prunestate->has_lpdead_items = true;
> -
> -               ItemPointerSetBlockNumber(&tmp, blkno);
> -
> -               for (int i = 0; i < lpdead_items; i++)
> -               {
> -                       ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
> -                       dead_items->items[dead_items->num_items++] = tmp;
> -               }
> -
> -               Assert(dead_items->num_items <= dead_items->max_items);
> -               pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
> -
>   dead_items->num_items);
>
> I don't understand why this block got removed and nothing new is
> adding anything to the tid store.
>
> @@ -1087,7 +1088,16 @@ lazy_scan_heap(LVRelState *vacrel)
>                          * with prunestate-driven visibility map and
> FSM steps (just like
>                          * the two-pass strategy).
>                          */
> -                       Assert(dead_items->num_items == 0);
> +                       Assert(TidStoreNumTids(dead_items) == 0);
> +               }
> +               else if (prunestate.num_offsets > 0)
> +               {
> +                       /* Save details of the LP_DEAD items from the page in dead_items */
> +                       TidStoreSetBlockOffsets(dead_items, blkno, prunestate.deadoffsets,
> +                                               prunestate.num_offsets);
> +
> +                       pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
> +                                                    TidStoreMemoryUsage(dead_items));
>
> I guess it was added here, 800 lines away? If so, why?

The above changes are related. The idea is not to use tidstore in a
one-pass strategy. If the table doesn't have any indexes, in
lazy_scan_prune() we collect offset numbers of dead tuples on the page
and vacuum the page using them. In this case, we don't need to use
tidstore so we pass the offsets array to lazy_vacuum_heap_page(). The
LVPagePruneState is a convenient place to store collected offset
numbers.
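
To illustrate, here is a minimal sketch of the intended flow (names
follow the patch where possible, but this is illustrative, not the
exact hunk):

    if (prunestate.num_offsets > 0)
    {
        if (vacrel->nindexes == 0)
        {
            /* one-pass: reap this page's LP_DEAD items immediately,
             * using the offsets collected during pruning */
            lazy_vacuum_heap_page(vacrel, blkno, buf,
                                  prunestate.deadoffsets,
                                  prunestate.num_offsets, vmbuffer);
        }
        else
        {
            /* two-pass: remember the offsets for index vacuuming */
            TidStoreSetBlockOffsets(dead_items, blkno,
                                    prunestate.deadoffsets,
                                    prunestate.num_offsets);
        }
    }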

>
> About progress reporting: I want to make sure no one is going to miss
> counting "num_dead_tuples". It's no longer relevant for the number of
> index scans we need to do, but do admins still have a use for it?
> Something to think about later.

I'm not sure whether users will still need num_dead_tuples in the
progress reporting view. The total number of dead tuples might be
useful, but the verbose log already shows it.

>
> 0017
>
> + /*
> + * max_bytes is forced to be at least 64kB, the current minimum valid
> + * value for the work_mem GUC.
> + */
> + max_bytes = Max(64 * 1024L, max_bytes);
>
> If this still needs to be here, I still don't understand why.

Removed.


Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Tue, Dec 26, 2023 at 12:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Dec 21, 2023 at 4:41 PM John Naylor <johncnaylorls@gmail.com> wrote:

> > +TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
> > + int num_offsets)
> > +{
> > + char buf[MaxBlocktableEntrySize];
> > + BlocktableEntry *page = (BlocktableEntry *) buf;
> >
> > I'm not sure this is safe with alignment. Maybe rather than plain
> > "char", it needs to be a union with BlocktableEntry, or something.
>
> I tried it in the new patch set but could you explain why it could not
> be safe with alignment?

I was thinking because "buf" is just an array of bytes. But, since the
next declaration is a cast to a pointer to the actual type, maybe we
can rely on the compiler to do the right thing. (It seems to on my
machine in any case)
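
For what it's worth, here's a self-contained sketch of the union idea
(the struct layout below is a stand-in, not the patch's actual
definition). The union forces the stack buffer to have the alignment of
BlocktableEntry, so the cast is safe even on strict-alignment platforms:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct BlocktableEntry
{
    uint16_t    nwords;
    uint64_t    words[1];       /* variable-length in the real code */
} BlocktableEntry;

#define MaxBlocktableEntrySize \
    (offsetof(BlocktableEntry, words) + sizeof(uint64_t) * 5)

static void
build_entry_on_stack(void)
{
    union
    {
        char            buf[MaxBlocktableEntrySize];
        BlocktableEntry entry;  /* never used directly; fixes alignment */
    }           u;
    BlocktableEntry *page = (BlocktableEntry *) u.buf;

    memset(u.buf, 0, sizeof(u.buf));
    page->nwords = 1;
    page->words[0] = UINT64_C(1) << 3;  /* e.g., offset 3 is set */
}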

> > About separation of responsibilities for locking: The only thing
> > currently where the tid store is not locked is tree iteration. That's
> > a strange exception. Also, we've recently made RT_FIND return a
> > pointer, so the caller must somehow hold a share lock, but I think we
> > haven't exposed callers the ability to do that, and we rely on the tid
> > store lock for that. We have a mix of tree locking and tid store
> > locking. We will need to consider carefully how to make this more
> > clear, maintainable, and understandable.
>
> Yes, tidstore should be locked during the iteration.
>
> One simple direction about locking is that the radix tree has the lock
> but no APIs hold/release it. It's the caller's responsibility. If a
> data structure using a radix tree for its storage has its own lock
> (like tidstore), it can use it instead of the radix tree's one. A

It looks like the only reason tidstore has its own lock is because it
has no way to delegate locking to the tree's lock. Instead of working
around the limitations of the thing we've designed, let's make it work
for the one use case we have. I think we need to expose RT_LOCK_*
functions to the outside, and have tid store use them. That would
allow us to simplify all those "if (TidStoreIsShared(ts)
LWLockAcquire(..., ...)" calls, which are complex and often redundant.

At some point, we'll probably want to keep locking inside, at least to
smooth the way for fine-grained locking you mentioned.

> > In a green field, it'd be fine to replace these with an expression of
> > "num_offsets", but it adds a bit of noise for reviewers and the git
> > log. Is it really necessary?
>
> I see your point. I think we can live with having both
> has_lpdead_items and num_offsets. But we will have to check if these
> values are consistent, which could be less maintainable.

It would be clearer if that removal was split out into a separate patch.

> >  I'm also not quite sure why "deadoffsets" and "lpdead_items" got
> > moved to the PruneState. The latter was renamed in a way that makes
> > more sense, but I don't see why the churn is necessary.
...
> > I guess it was added here, 800 lines away? If so, why?
>
> The above changes are related. The idea is not to use tidstore in a
> one-pass strategy. If the table doesn't have any indexes, in
> lazy_scan_prune() we collect offset numbers of dead tuples on the page
> and vacuum the page using them. In this case, we don't need to use
> tidstore so we pass the offsets array to lazy_vacuum_heap_page(). The
> LVPagePruneState is a convenient place to store collected offset
> numbers.

Okay, that makes sense, but if it was ever explained, I don't
remember, and there is nothing in the commit message either.

I'm not sure this can be split up easily, but if so it might help reviewing.

This change also leads to a weird-looking control flow:

if (vacrel->nindexes == 0)
{
  if (prunestate.num_offsets > 0)
  {
    ...
  }
}
else if (prunestate.num_offsets > 0)
{
  ...
}



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Dec 27, 2023 at 12:08 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Tue, Dec 26, 2023 at 12:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Dec 21, 2023 at 4:41 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> > > +TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
> > > + int num_offsets)
> > > +{
> > > + char buf[MaxBlocktableEntrySize];
> > > + BlocktableEntry *page = (BlocktableEntry *) buf;
> > >
> > > I'm not sure this is safe with alignment. Maybe rather than plain
> > > "char", it needs to be a union with BlocktableEntry, or something.
> >
> > I tried it in the new patch set but could you explain why it could not
> > be safe with alignment?
>
> I was thinking because "buf" is just an array of bytes. But, since the
> next declaration is a cast to a pointer to the actual type, maybe we
> can rely on the compiler to do the right thing. (It seems to on my
> machine in any case)

Okay, I kept it.

>
> > > About separation of responsibilities for locking: The only thing
> > > currently where the tid store is not locked is tree iteration. That's
> > > a strange exception. Also, we've recently made RT_FIND return a
> > > pointer, so the caller must somehow hold a share lock, but I think we
> > > haven't exposed callers the ability to do that, and we rely on the tid
> > > store lock for that. We have a mix of tree locking and tid store
> > > locking. We will need to consider carefully how to make this more
> > > clear, maintainable, and understandable.
> >
> > Yes, tidstore should be locked during the iteration.
> >
> > One simple direction about locking is that the radix tree has the lock
> > but no APIs hold/release it. It's the caller's responsibility. If a
> > data structure using a radix tree for its storage has its own lock
> > (like tidstore), it can use it instead of the radix tree's one. A
>
> It looks like the only reason tidstore has its own lock is because it
> has no way to delegate locking to the tree's lock. Instead of working
> around the limitations of the thing we've designed, let's make it work
> for the one use case we have. I think we need to expose RT_LOCK_*
> functions to the outside, and have tid store use them. That would
> allow us to simplify all those "if (TidStoreIsShared(ts)
> LWLockAcquire(..., ...)" calls, which are complex and often redundant.

I agree that we should expose RT_LOCK_* functions and have tidstore use
them, but I'm not sure about the "if (TidStoreIsShared(ts))
LWLockAcquire(..., ...)" calls part. I think that even if we expose
them, we will still need to do something like "if (TidStoreIsShared(ts))
shared_rt_lock_share(ts->tree.shared)", no?

>
> At some point, we'll probably want to keep locking inside, at least to
> smooth the way for fine-grained locking you mentioned.
>
> > > In a green field, it'd be fine to replace these with an expression of
> > > "num_offsets", but it adds a bit of noise for reviewers and the git
> > > log. Is it really necessary?
> >
> > I see your point. I think we can live with having both
> > has_lpdead_items and num_offsets. But we will have to check if these
> > values are consistent, which could be less maintainable.
>
> It would be clearer if that removal was split out into a separate patch.

Agreed.

>
> > >  I'm also not quite sure why "deadoffsets" and "lpdead_items" got
> > > moved to the PruneState. The latter was renamed in a way that makes
> > > more sense, but I don't see why the churn is necessary.
> ...
> > > I guess it was added here, 800 lines away? If so, why?
> >
> > The above changes are related. The idea is not to use tidstore in a
> > one-pass strategy. If the table doesn't have any indexes, in
> > lazy_scan_prune() we collect offset numbers of dead tuples on the page
> > and vacuum the page using them. In this case, we don't need to use
> > tidstore so we pass the offsets array to lazy_vacuum_heap_page(). The
> > LVPagePruneState is a convenient place to store collected offset
> > numbers.
>
> Okay, that makes sense, but if it was ever explained, I don't
> remember, and there is nothing in the commit message either.
>
> I'm not sure this can be split up easily, but if so it might help reviewing.

Agreed.

>
> This change also leads to a weird-looking control flow:
>
> if (vacrel->nindexes == 0)
> {
>   if (prunestate.num_offsets > 0)
>   {
>     ...
>   }
> }
> else if (prunestate.num_offsets > 0)
> {
>   ...
> }

Fixed.

I've attached a new patch set. On top of the v47 patch, I've merged your
changes for the radix tree and split the vacuum integration patch into 3
patches: one that simply replaces VacDeadItems with TidStore (0007
patch), one that uses a simple TID array for the one-pass strategy (0008
patch), and one that replaces has_lpdead_items with "num_offsets > 0"
(0009 patch), while incorporating your review comments on the vacuum
integration patch (sorry for making it difficult to see the changes from
the v47 patch). The 0013 to 0015 patches are also updates from the v47 patch.

I'm thinking that we should change the order of the patches so that
tidstore patch requires the patch for changing DSA segment sizes. That
way, we can remove the complex max memory calculation part that we no
longer use from the tidstore patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Tue, Jan 2, 2024 at 8:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> I agree that we expose RT_LOCK_* functions and have tidstore use them,
> but am not sure the if (TidStoreIsShared(ts) LWLockAcquire(..., ...)"
> calls part. I think that even if we expose them, we will still need to
> do something like "if (TidStoreIsShared(ts))
> shared_rt_lock_share(ts->tree.shared)", no?

I'll come back to this topic separately.

> I've attached a new patch set. From v47 patch, I've merged your
> changes for radix tree, and split the vacuum integration patch into 3
> patches: simply replaces VacDeadItems with TidStore (0007 patch), and
> use a simple TID array for one-pass strategy (0008 patch), and replace
> has_lpdead_items with "num_offsets > 0" (0009 patch), while
> incorporating your review comments on the vacuum integration patch

Nice!

> (sorry for making it difficult to see the changes from v47 patch).

It's actually pretty clear. I just have a couple comments before
sharing my latest cleanups:

(diff'ing between v47 and v48):

--       /*
-        * In the shared case, TidStoreControl and radix_tree are backed by the
-        * same DSA area and rt_memory_usage() returns the value including both.
-        * So we don't need to add the size of TidStoreControl separately.
-        */
        if (TidStoreIsShared(ts))
-               return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+               rt_mem = shared_rt_memory_usage(ts->tree.shared);
+       else
+               rt_mem = local_rt_memory_usage(ts->tree.local);

-       return sizeof(TidStore) + sizeof(TidStore) + local_rt_memory_usage(ts->tree.local);
+       return sizeof(TidStore) + sizeof(TidStoreControl) + rt_mem;

Upthread, I meant that I don't see the need to include the size of
these structs *at all*. They're tiny, and the blocks/segments will
almost certainly have some empty space counted in the total anyway.
The returned size is already overestimated, so this extra code is just
a distraction.
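
In other words, it could reduce to something like this (sketch only,
using the patch's function names; the exact signature may differ):

size_t
TidStoreMemoryUsage(TidStore *ts)
{
    /* report only the radix tree's usage; the fixed-size structs are
     * noise compared to the blocks/segments counted here */
    if (TidStoreIsShared(ts))
        return shared_rt_memory_usage(ts->tree.shared);
    else
        return local_rt_memory_usage(ts->tree.local);
}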

- if (result->num_offsets + bmw_popcount(w) > result->max_offset)
+ if (result->num_offsets + (sizeof(bitmapword) * BITS_PER_BITMAPWORD) >= result->max_offset)

I believe this math is wrong. We care about "result->num_offsets +
BITS_PER_BITMAPWORD", right?
Also, it seems if the condition evaluates to equal, we still have
enough space, in which case ">" the max is the right condition.
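
To spell out the corrected check (a standalone sketch; the harness is
invented, only the names mirror the patch):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t bitmapword;
#define BITS_PER_BITMAPWORD 64

/*
 * Before decoding one bitmapword's set bits into the offsets array,
 * require that the worst case (all bits set) fits.  If the sum equals
 * the maximum we are exactly full, which is still enough space, hence
 * ">" rather than ">=".
 */
static bool
offsets_would_overflow(int num_offsets, int max_offsets)
{
    return num_offsets + BITS_PER_BITMAPWORD > max_offsets;
}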

- if (off < 1 || off > MAX_TUPLES_PER_PAGE)
+ if (off < 1 || off > MaxOffsetNumber)

This can now use OffsetNumberIsValid().

> 0013 to 0015 patches are also updates from v47 patch.

> I'm thinking that we should change the order of the patches so that
> tidstore patch requires the patch for changing DSA segment sizes. That
> way, we can remove the complex max memory calculation part that we no
> longer use from the tidstore patch.

I don't think there is any reason to have those calculations at all at
this point. Every patch in every version should at least *work
correctly*, without kludging m_w_m and without constraining max
segment size. I'm fine with the latter remaining in its own thread,
and I hope we can consider it an enhancement that respects the admin's
configured limits more effectively, and not a pre-requisite for not
breaking. I *think* we're there now, but it's hard to tell since 0015
was at the very end. As I said recently, if something still fails, I'd
like to know why. So for v49, I took the liberty of removing the DSA
max segment patches for now, and squashing v48-0015.

In addition for v49, I have quite a few cleanups:

0001 - This hasn't been touched in a very long time, but I ran
pgindent and clarified a comment
0002 - We no longer need to isolate the rightmost bit anywhere, so
removed that part and revised the commit message accordingly.

radix tree:
0003 - v48 plus squashed v48-0013
0004 - Removed or adjusted WIP, FIXME, TODO items. Some were outdated,
and I fixed most of the rest.
0005 - Remove the RT_PTR_LOCAL macro, since it's not really useful anymore.
0006 - RT_FREE_LEAF only needs the allocated pointer, so pass that. A
bit simpler.
0007 - Uses the same idea from a previous cleanup of RT_SET, for RT_DELETE.
0008 - Removes a holdover from the multi-value leaves era.
0009 - It occurred to me that we need to have unique names for memory
contexts for different instantiations of the template. This is one way
to do it, by using the configured RT_PREFIX in the context name. I
also took an extra step to make the size class fanout show up
correctly on different platforms, but that's probably overkill and
undesirable, and I'll probably use only the class name next time.
0010/11 - Make the array functions less surprising and with more
informative names.
0012 - Restore a useful technique from Andres's prototype. This part
has been slow for a long time, so much that it showed up in a profile
where this path wasn't even taken much.

tid store / vacuum:
0013/14 - Same as v48 TID store, with review squashed
0015 - Rationalize comment and starting value.
0016 - I applied the removal of the old clamps from v48-0011 (init/max
DSA), and left out the rest for now.
0017-20 - Vacuum and debug tidstore as in v48, with v48-0015 squashed

I'll bring up locking again shortly.

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Wed, Jan 3, 2024 at 9:10 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Tue, Jan 2, 2024 at 8:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > I agree that we expose RT_LOCK_* functions and have tidstore use them,
> > but am not sure the if (TidStoreIsShared(ts) LWLockAcquire(..., ...)"
> > calls part. I think that even if we expose them, we will still need to
> > do something like "if (TidStoreIsShared(ts))
> > shared_rt_lock_share(ts->tree.shared)", no?
>
> I'll come back to this topic separately.

To answer your question, sure, but that "if (TidStoreIsShared(ts))"
part would be pushed down into a function so that only one place has
to care about it.
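
Something like this hypothetical helper is what I have in mind (the
types and shared_rt_lock_share() are from the patch; the wrapper itself
is invented for illustration):

static inline void
tid_store_lock_share(TidStore *ts)
{
    /* the shared-vs-local test lives in exactly one place */
    if (TidStoreIsShared(ts))
        shared_rt_lock_share(ts->tree.shared);
    /* a backend-local tree is single-process; nothing to do */
}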

However, I'm starting to question whether we even need that. Meaning,
lock the tidstore separately. To "lock the tidstore" means to take a
lock, _separate_ from the radix tree's internal lock, to control
access to two fields in a separate "control object":

+typedef struct TidStoreControl
+{
+ /* the number of tids in the store */
+ int64 num_tids;
+
+ /* the maximum bytes a TidStore can use */
+ size_t max_bytes;

I'm pretty sure max_bytes does not need to be in shared memory, and
certainly not under a lock: Thinking of a hypothetical
parallel-prune-phase scenario, one way would be for a leader process
to pass out ranges of blocks to workers, and when the limit is
exceeded, stop passing out blocks and wait for all the workers to
finish.

As for num_tids, vacuum previously put the similar count in

@@ -176,7 +179,8 @@ struct ParallelVacuumState
  PVIndStats *indstats;

  /* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;

VacDeadItems contained "num_items". What was the reason to have new
infrastructure for that count? And it doesn't seem like access to it
was controlled by a lock -- can you confirm? If we did get parallel
pruning, maybe the count would belong inside PVShared?

The number of tids is not that tightly bound to the tidstore's job. I
believe tidbitmap.c (a possible future client) doesn't care about the
global number of tids -- not only that, but AND/OR operations can
change the number in a non-obvious way, so it would not be convenient
to keep an accurate number anyway. But the lock would still be
mandatory with this patch.

If we can make vacuum work a bit closer to how it does now, it'd be a
big step up in readability, I think. Namely, getting rid of all the
locking logic inside tidstore.c and let the radix tree's locking do
the right thing. We'd need to make that work correctly when receiving
pointers to values upon lookup, and I already shared ideas for that.
But I want to see if there is any obstacle in the way of removing the
tidstore control object and its separate lock.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Jan 3, 2024 at 11:10 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Tue, Jan 2, 2024 at 8:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > I agree that we expose RT_LOCK_* functions and have tidstore use them,
> > but am not sure the if (TidStoreIsShared(ts) LWLockAcquire(..., ...)"
> > calls part. I think that even if we expose them, we will still need to
> > do something like "if (TidStoreIsShared(ts))
> > shared_rt_lock_share(ts->tree.shared)", no?
>
> I'll come back to this topic separately.
>
> > I've attached a new patch set. From v47 patch, I've merged your
> > changes for radix tree, and split the vacuum integration patch into 3
> > patches: simply replaces VacDeadItems with TidStore (0007 patch), and
> > use a simple TID array for one-pass strategy (0008 patch), and replace
> > has_lpdead_items with "num_offsets > 0" (0009 patch), while
> > incorporating your review comments on the vacuum integration patch
>
> Nice!
>
> > (sorry for making it difficult to see the changes from v47 patch).
>
> It's actually pretty clear. I just have a couple comments before
> sharing my latest cleanups:
>
> (diff'ing between v47 and v48):
>
> --       /*
> -        * In the shared case, TidStoreControl and radix_tree are backed by the
> -        * same DSA area and rt_memory_usage() returns the value including both.
> -        * So we don't need to add the size of TidStoreControl separately.
> -        */
>         if (TidStoreIsShared(ts))
> -               return sizeof(TidStore) +
> shared_rt_memory_usage(ts->tree.shared);
> +               rt_mem = shared_rt_memory_usage(ts->tree.shared);
> +       else
> +               rt_mem = local_rt_memory_usage(ts->tree.local);
>
> -       return sizeof(TidStore) + sizeof(TidStore) +
> local_rt_memory_usage(ts->tree.local);
> +       return sizeof(TidStore) + sizeof(TidStoreControl) + rt_mem;
>
> Upthread, I meant that I don't see the need to include the size of
> these structs *at all*. They're tiny, and the blocks/segments will
> almost certainly have some empty space counted in the total anyway.
> The returned size is already overestimated, so this extra code is just
> a distraction.

Agreed.

>
> - if (result->num_offsets + bmw_popcount(w) > result->max_offset)
> + if (result->num_offsets + (sizeof(bitmapword) * BITS_PER_BITMAPWORD)
> >= result->max_offset)
>
> I believe this math is wrong. We care about "result->num_offsets +
> BITS_PER_BITMAPWORD", right?
> Also, it seems if the condition evaluates to equal, we still have
> enough space, in which case ">" the max is the right condition.

Oops, you're right. Fixed.

>
> - if (off < 1 || off > MAX_TUPLES_PER_PAGE)
> + if (off < 1 || off > MaxOffsetNumber)
>
> This can now use OffsetNumberIsValid().

Fixed.

>
> > 0013 to 0015 patches are also updates from v47 patch.
>
> > I'm thinking that we should change the order of the patches so that
> > tidstore patch requires the patch for changing DSA segment sizes. That
> > way, we can remove the complex max memory calculation part that we no
> > longer use from the tidstore patch.
>
> I don't think there is any reason to have those calculations at all at
> this point. Every patch in every version should at least *work
> correctly*, without kludging m_w_m and without constraining max
> segment size. I'm fine with the latter remaining in its own thread,
> and I hope we can consider it an enhancement that respects the admin's
> configured limits more effectively, and not a pre-requisite for not
> breaking. I *think* we're there now, but it's hard to tell since 0015
> was at the very end. As I said recently, if something still fails, I'd
> like to know why. So for v49, I took the liberty of removing the DSA
> max segment patches for now, and squashing v48-0015.

Fair enough.

>
> In addition for v49, I have quite a few cleanups:
>
> 0001 - This hasn't been touched in a very long time, but I ran
> pgindent and clarified a comment
> 0002 - We no longer need to isolate the rightmost bit anywhere, so
> removed that part and revised the commit message accordingly.

Thanks.

>
> radix tree:
> 0003 - v48 plus squashed v48-0013
> 0004 - Removed or adjusted WIP, FIXME, TODO items. Some were outdated,
> and I fixed most of the rest.
> 0005 - Remove the RT_PTR_LOCAL macro, since it's not really useful anymore.
> 0006 - RT_FREE_LEAF only needs the allocated pointer, so pass that. A
> bit simpler.
> 0007 - Uses the same idea from a previous cleanup of RT_SET, for RT_DELETE.
> 0008 - Removes a holdover from the multi-value leaves era.
> 0009 - It occurred to me that we need to have unique names for memory
> contexts for different instantiations of the template. This is one way
> to do it, by using the configured RT_PREFIX in the context name. I
> also took an extra step to make the size class fanout show up
> correctly on different platforms, but that's probably overkill and
> undesirable, and I'll probably use only the class name next time.
> 0010/11 - Make the array functions less surprising and with more
> informative names.
> 0012 - Restore a useful technique from Andres's prototype. This part
> has been slow for a long time, so much that it showed up in a profile
> where this path wasn't even taken much.

These changes look good to me. I've squashed them.

In addition, I've made some changes and cleanups:

0010 - address the above review comments.
0011 - simplify the radix tree iteration code. I hope it makes the
code clear and readable. Also I removed RT_UPDATE_ITER_STACK().
0012 - fix a typo
0013 - In the RT_SHMEM case, we use SIZEOF_VOID_P for the
RT_VALUE_IS_EMBEDDABLE check, but I think it's not correct, because DSA
has its own pointer size, SIZEOF_DSA_POINTER, which can be 4 bytes even
if SIZEOF_VOID_P is 8 bytes, for example in a case where
!defined(PG_HAVE_ATOMIC_U64_SUPPORT). Please refer to dsa.h for
details.
0014 - cleanup RT_VERIFY code.
0015 - change and cleanup RT_DUMP_NODE(). Now it dumps only one node
and no longer supports dumping nodes recursively.
0016 - remove RT_DUMP_SEARCH() and RT_DUMP(). These seem no longer necessary.
0017 - Move RT_DUMP_NODE to the debug function section, close to RT_STATS.
0018 - Fix a printf format in RT_STATS().

BTW, now that the inner and leaf nodes use the same structure, do we
still need RT_NODE_BASE_XXX types? Most places where we use
RT_NODE_BASE_XXX types can be replaced with RT_NODE_XXX types.
Exceptions are RT_FANOUT_XX calculations:

#if SIZEOF_VOID_P < 8
#define RT_FANOUT_16_LO ((96 - sizeof(RT_NODE_BASE_16)) / sizeof(RT_PTR_ALLOC))
#define RT_FANOUT_48    ((512 - sizeof(RT_NODE_BASE_48)) / sizeof(RT_PTR_ALLOC))
#else
#define RT_FANOUT_16_LO ((160 - sizeof(RT_NODE_BASE_16)) / sizeof(RT_PTR_ALLOC))
#define RT_FANOUT_48    ((768 - sizeof(RT_NODE_BASE_48)) / sizeof(RT_PTR_ALLOC))
#endif                          /* SIZEOF_VOID_P < 8 */

But I think we can replace them with offsetof(RT_NODE_16, children) etc.
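
Concretely, something like this (a sketch; the byte budgets are the
existing ones, only the sizeof expressions change):

#if SIZEOF_VOID_P < 8
#define RT_FANOUT_16_LO ((96 - offsetof(RT_NODE_16, children)) / sizeof(RT_PTR_ALLOC))
#define RT_FANOUT_48    ((512 - offsetof(RT_NODE_48, children)) / sizeof(RT_PTR_ALLOC))
#else
#define RT_FANOUT_16_LO ((160 - offsetof(RT_NODE_16, children)) / sizeof(RT_PTR_ALLOC))
#define RT_FANOUT_48    ((768 - offsetof(RT_NODE_48, children)) / sizeof(RT_PTR_ALLOC))
#endif                          /* SIZEOF_VOID_P < 8 */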

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Tue, Jan 9, 2024 at 9:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> In addition, I've made some changes and cleanups:

These look good to me, although I have not tried dumping a node in a while.

> 0011 - simplify the radix tree iteration code. I hope it makes the
> code clear and readable. Also I removed RT_UPDATE_ITER_STACK().

I'm very pleased with how much simpler it is now!

> 0013 - In RT_SHMEM case, we use SIZEOF_VOID_P for
> RT_VALUE_IS_EMBEDDABLE check, but I think it's not correct. Because
> DSA has its own pointer size, SIZEOF_DSA_POINTER, it could be 4 bytes
> even if SIZEOF_VOID_P is 8 bytes, for example in a case where
> !defined(PG_HAVE_ATOMIC_U64_SUPPORT). Please refer to dsa.h for
> details.

Thanks for the pointer. ;-)

> BTW, now that the inner and leaf nodes use the same structure, do we
> still need RT_NODE_BASE_XXX types? Most places where we use
> RT_NODE_BASE_XXX types can be replaced with RT_NODE_XXX types.

That's been in the back of my mind as well. Maybe the common header
should be the new "base" member? At least, something other than "n".

> Exceptions are RT_FANOUT_XX calculations:
>
> #if SIZEOF_VOID_P < 8
> #define RT_FANOUT_16_LO ((96 - sizeof(RT_NODE_BASE_16)) / sizeof(RT_PTR_ALLOC))
> #define RT_FANOUT_48    ((512 - sizeof(RT_NODE_BASE_48)) / sizeof(RT_PTR_ALLOC))
> #else
> #define RT_FANOUT_16_LO ((160 - sizeof(RT_NODE_BASE_16)) / sizeof(RT_PTR_ALLOC))
> #define RT_FANOUT_48    ((768 - sizeof(RT_NODE_BASE_48)) / sizeof(RT_PTR_ALLOC))
> #endif                          /* SIZEOF_VOID_P < 8 */
>
> But I think we can replace them with offsetof(RT_NODE_16, children) etc.

That makes sense. Do you want to have a go at it, or shall I?

I think after that, the only big cleanup needed is putting things in a
more readable order. I can do that at a later date, and other
opportunities for beautification are pretty minor and localized.

Rationalizing locking is the only thing left that requires a bit of thought.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Jan 9, 2024 at 8:19 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Tue, Jan 9, 2024 at 9:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > In addition, I've made some changes and cleanups:
>
> These look good to me, although I have not tried dumping a node in a while.
>
> > 0011 - simplify the radix tree iteration code. I hope it makes the
> > code clear and readable. Also I removed RT_UPDATE_ITER_STACK().
>
> I'm very pleased with how much simpler it is now!
>
> > 0013 - In RT_SHMEM case, we use SIZEOF_VOID_P for
> > RT_VALUE_IS_EMBEDDABLE check, but I think it's not correct. Because
> > DSA has its own pointer size, SIZEOF_DSA_POINTER, it could be 4 bytes
> > even if SIZEOF_VOID_P is 8 bytes, for example in a case where
> > !defined(PG_HAVE_ATOMIC_U64_SUPPORT). Please refer to dsa.h for
> > details.
>
> Thanks for the pointer. ;-)
>
> > BTW, now that the inner and leaf nodes use the same structure, do we
> > still need RT_NODE_BASE_XXX types? Most places where we use
> > RT_NODE_BASE_XXX types can be replaced with RT_NODE_XXX types.
>
> That's been in the back of my mind as well. Maybe the common header
> should be the new "base" member? At least, something other than "n".

Agreed.

>
> > Exceptions are RT_FANOUT_XX calculations:
> >
> > #if SIZEOF_VOID_P < 8
> > #define RT_FANOUT_16_LO ((96 - sizeof(RT_NODE_BASE_16)) / sizeof(RT_PTR_ALLOC))
> > #define RT_FANOUT_48    ((512 - sizeof(RT_NODE_BASE_48)) / sizeof(RT_PTR_ALLOC))
> > #else
> > #define RT_FANOUT_16_LO ((160 - sizeof(RT_NODE_BASE_16)) / sizeof(RT_PTR_ALLOC))
> > #define RT_FANOUT_48    ((768 - sizeof(RT_NODE_BASE_48)) / sizeof(RT_PTR_ALLOC))
> > #endif                          /* SIZEOF_VOID_P < 8 */
> >
> > But I think we can replace them with offsetof(RT_NODE_16, children) etc.
>
> That makes sense. Do you want to have a go at it, or shall I?

I've done that in the 0010 patch of the v51 patch set. Whereas the
RT_NODE_4 and RT_NODE_16 struct declarations need RT_FANOUT_4_HI and
RT_FANOUT_16_HI respectively, RT_FANOUT_16_LO and RT_FANOUT_48 need the
RT_NODE_16 and RT_NODE_48 struct declarations. So the fanout definitions
are now spread before and after the RT_NODE_XXX struct declarations.
It's a bit less readable, but I'm not sure of a better way.

The previous updates are merged into the main radix tree patch and
tidstore patch. Nothing changes in other patches from v50.

>
> I think after that, the only big cleanup needed is putting things in a
> more readable order. I can do that at a later date, and other
> opportunities for beautification are pretty minor and localized.

Agreed.

>
> Rationalizing locking is the only thing left that requires a bit of thought.

Right, I'll send a reply soon.


Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Wed, Jan 10, 2024 at 9:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I've done in 0010 patch in v51 patch set.  Whereas RT_NODE_4 and
> RT_NODE_16 structs declaration needs RT_FANOUT_4_HI and
> RT_FANOUT_16_HI respectively, RT_FANOUT_16_LO and RT_FANOUT_48 need
> RT_NODE_16 and RT_NODE_48 structs declaration. So fanout declarations
> are now spread before and after RT_NODE_XXX struct declaration. It's a
> bit less readable, but I'm not sure of a better way.

They were before and after the *_BASE types, so it's not really worse,
I think. I did notice that RT_SLOT_IDX_LIMIT has been considered
special for a very long time, before we even had size classes, so it's
the same thing but even farther away. I have an idea to introduce
*_MAX macros, allowing us to turn RT_SLOT_IDX_LIMIT into
RT_FANOUT_48_MAX, so that everything is in the same spot, and to make
this area more consistent. I also noticed that I'd been assuming that
RT_FANOUT_16_HI fits easily into a DSA size class, but that's only
true on 64-bit, and in any case we don't want to assume it. I've
attached an addendum .txt to demo this idea.

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Jan 8, 2024 at 8:35 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Wed, Jan 3, 2024 at 9:10 PM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > On Tue, Jan 2, 2024 at 8:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > > I agree that we expose RT_LOCK_* functions and have tidstore use them,
> > > but am not sure the if (TidStoreIsShared(ts) LWLockAcquire(..., ...)"
> > > calls part. I think that even if we expose them, we will still need to
> > > do something like "if (TidStoreIsShared(ts))
> > > shared_rt_lock_share(ts->tree.shared)", no?
> >
> > I'll come back to this topic separately.
>
> To answer your question, sure, but that "if (TidStoreIsShared(ts))"
> part would be pushed down into a function so that only one place has
> to care about it.
>
> However, I'm starting to question whether we even need that. Meaning,
> lock the tidstore separately. To "lock the tidstore" means to take a
> lock, _separate_ from the radix tree's internal lock, to control
> access to two fields in a separate "control object":
>
> +typedef struct TidStoreControl
> +{
> + /* the number of tids in the store */
> + int64 num_tids;
> +
> + /* the maximum bytes a TidStore can use */
> + size_t max_bytes;
>
> I'm pretty sure max_bytes does not need to be in shared memory, and
> certainly not under a lock: Thinking of a hypothetical
> parallel-prune-phase scenario, one way would be for a leader process
> to pass out ranges of blocks to workers, and when the limit is
> exceeded, stop passing out blocks and wait for all the workers to
> finish.

True. I agreed that it doesn't need to be under a lock anyway, as it's
read-only.

>
> As for num_tids, vacuum previously put the similar count in
>
> @@ -176,7 +179,8 @@ struct ParallelVacuumState
>   PVIndStats *indstats;
>
>   /* Shared dead items space among parallel vacuum workers */
> - VacDeadItems *dead_items;
> + TidStore *dead_items;
>
> VacDeadItems contained "num_items". What was the reason to have new
> infrastructure for that count? And it doesn't seem like access to it
> was controlled by a lock -- can you confirm? If we did get parallel
> pruning, maybe the count would belong inside PVShared?

I thought that since the tidstore is a general-purpose data structure,
the shared counter should be protected by a lock. One thing I'm
concerned about is that we might need to update both the radix tree
and the counter atomically in some cases. But it's true that we don't
need that for lazy vacuum, at least for now. Even given a parallel scan
phase, we probably won't need workers to check the total number
of stored tuples during a parallel scan.

>
> The number of tids is not that tightly bound to the tidstore's job. I
> believe tidbitmap.c (a possible future client) doesn't care about the
> global number of tids -- not only that, but AND/OR operations can
> change the number in a non-obvious way, so it would not be convenient
> to keep an accurate number anyway. But the lock would still be
> mandatory with this patch.

Very good point.

>
> If we can make vacuum work a bit closer to how it does now, it'd be a
> big step up in readability, I think. Namely, getting rid of all the
> locking logic inside tidstore.c and let the radix tree's locking do
> the right thing. We'd need to make that work correctly when receiving
> pointers to values upon lookup, and I already shared ideas for that.
> But I want to see if there is any obstacle in the way of removing the
> tidstore control object and its separate lock.

So I agree to remove both max_bytes and num_items from the control
object. Also, as you mentioned, we can remove the tidstore control
object itself. TidStoreGetHandle() returns a radix tree handle, and we
can pass it to TidStoreAttach(). I'll try it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Jan 11, 2024 at 9:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jan 8, 2024 at 8:35 PM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > On Wed, Jan 3, 2024 at 9:10 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > >
> > > On Tue, Jan 2, 2024 at 8:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > > I agree that we expose RT_LOCK_* functions and have tidstore use them,
> > > > but am not sure the if (TidStoreIsShared(ts) LWLockAcquire(..., ...)"
> > > > calls part. I think that even if we expose them, we will still need to
> > > > do something like "if (TidStoreIsShared(ts))
> > > > shared_rt_lock_share(ts->tree.shared)", no?
> > >
> > > I'll come back to this topic separately.
> >
> > To answer your question, sure, but that "if (TidStoreIsShared(ts))"
> > part would be pushed down into a function so that only one place has
> > to care about it.
> >
> > However, I'm starting to question whether we even need that. Meaning,
> > lock the tidstore separately. To "lock the tidstore" means to take a
> > lock, _separate_ from the radix tree's internal lock, to control
> > access to two fields in a separate "control object":
> >
> > +typedef struct TidStoreControl
> > +{
> > + /* the number of tids in the store */
> > + int64 num_tids;
> > +
> > + /* the maximum bytes a TidStore can use */
> > + size_t max_bytes;
> >
> > I'm pretty sure max_bytes does not need to be in shared memory, and
> > certainly not under a lock: Thinking of a hypothetical
> > parallel-prune-phase scenario, one way would be for a leader process
> > to pass out ranges of blocks to workers, and when the limit is
> > exceeded, stop passing out blocks and wait for all the workers to
> > finish.
>
> True. I agreed that it doesn't need to be under a lock anyway, as it's
> read-only.
>
> >
> > As for num_tids, vacuum previously put the similar count in
> >
> > @@ -176,7 +179,8 @@ struct ParallelVacuumState
> >   PVIndStats *indstats;
> >
> >   /* Shared dead items space among parallel vacuum workers */
> > - VacDeadItems *dead_items;
> > + TidStore *dead_items;
> >
> > VacDeadItems contained "num_items". What was the reason to have new
> > infrastructure for that count? And it doesn't seem like access to it
> > was controlled by a lock -- can you confirm? If we did get parallel
> > pruning, maybe the count would belong inside PVShared?
>
> I thought that since the tidstore is a general-purpose data structure
> the shared counter should be protected by a lock. One thing I'm
> concerned about is that we might need to update both the radix tree
> and the counter atomically in some cases. But that's true we don't
> need it for lazy vacuum at least for now. Even given the parallel scan
> phase, probably we won't need to have workers check the total number
> of stored tuples during a parallel scan.
>
> >
> > The number of tids is not that tightly bound to the tidstore's job. I
> > believe tidbitmap.c (a possible future client) doesn't care about the
> > global number of tids -- not only that, but AND/OR operations can
> > change the number in a non-obvious way, so it would not be convenient
> > to keep an accurate number anyway. But the lock would still be
> > mandatory with this patch.
>
> Very good point.
>
> >
> > If we can make vacuum work a bit closer to how it does now, it'd be a
> > big step up in readability, I think. Namely, getting rid of all the
> > locking logic inside tidstore.c and let the radix tree's locking do
> > the right thing. We'd need to make that work correctly when receiving
> > pointers to values upon lookup, and I already shared ideas for that.
> > But I want to see if there is any obstacle in the way of removing the
> > tidstore control object and its separate lock.
>
> So I agree to remove both max_bytes and num_items from the control
> object. Also, as you mentioned, we can remove the tidstore control
> object itself. TidStoreGetHandle() returns a radix tree handle, and we
> can pass it to TidStoreAttach().  I'll try it.
>

I realized that if we remove the whole tidstore control object,
including max_bytes, processes that attached to the shared tidstore
cannot actually use TidStoreIsFull(), as it would always return true.
They cannot use TidStoreReset() either, since it needs to pass max_bytes
to RT_CREATE(). It might not be a problem for lazy vacuum, but it
could be problematic for general use. If we remove it, we probably
need a safeguard to prevent processes that attached to the tidstore
from calling these functions. Or we can keep the control object but
remove the lock and num_tids.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Fri, Jan 12, 2024 at 3:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Jan 11, 2024 at 9:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > So I agree to remove both max_bytes and num_items from the control
> > object. Also, as you mentioned, we can remove the tidstore control
> > object itself. TidStoreGetHandle() returns a radix tree handle, and we
> > can pass it to TidStoreAttach().  I'll try it.

Thanks. It's worth looking closely here.

> I realized that if we remove the whole tidstore control object
> including max_bytes, processes who attached the shared tidstore cannot
> use TidStoreIsFull() actually as it always returns true.

I imagine that we'd replace that with a function (maybe an earlier
version had it?) to report the memory usage to the caller, which
should know where to find max_bytes.

> Also they
> cannot use TidStoreReset() as well since it needs to pass max_bytes to
> RT_CREATE(). It might not be a problem in terms of lazy vacuum, but it
> could be problematic for general use.

HEAD has no problem finding the necessary values, and I don't think
it'd be difficult to maintain that ability. I'm not actually sure what
"general use" needs to have, and I'm not sure anyone can guess.
There's the future possibility of parallel heap-scanning, but I'm
guessing a *lot* more needs to happen for that to work, so I'm not
sure how much it buys us to immediately start putting those two fields
in a special abstraction. The only other concrete use case mentioned
in this thread that I remember is bitmap heap scan, and I believe that
would never need to reset, only free the whole thing when finished.

I spent some more time studying parallel vacuum, and have some
thoughts. In HEAD, we have

-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;

...which has the tids, plus two fields that function _very similarly_
to the two extra fields in the tidstore control object. It's a bit
strange to me that the patch doesn't have this struct anymore.

I suspect if we keep it around (just change "items" to be the local
tidstore struct), the patch would have a bit less churn and look/work
more like the current code. I think it might be easier to read if the
v17 commits are suited to the current needs of vacuum, rather than try
to anticipate all uses. Richer abstractions can come later if needed.
Another stanza:

- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
-    est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;

With s/max_items/max_bytes/, I wonder if we can still use some of
this, and parallel workers would have no problem getting the necessary
info, as they do today. If not, I don't really understand why. I'm not
very familiar with working with shared memory, and I know the tree
itself needs some different setup, so it's quite possible I'm missing
something.

I find it difficult to keep these four things straight:

- radix tree
- radix tree control object
- tidstore
- tidstore control object

Even with the code in front of me, it's hard to reason about how these
concepts fit together. It'd be much more readable if this was
simplified.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Sun, Jan 14, 2024 at 10:43 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Fri, Jan 12, 2024 at 3:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Jan 11, 2024 at 9:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > So I agree to remove both max_bytes and num_items from the control
> > > object. Also, as you mentioned, we can remove the tidstore control
> > > object itself. TidStoreGetHandle() returns a radix tree handle, and we
> > > can pass it to TidStoreAttach().  I'll try it.
>
> Thanks. It's worth looking closely here.
>
> > I realized that if we remove the whole tidstore control object
> > including max_bytes, processes who attached the shared tidstore cannot
> > use TidStoreIsFull() actually as it always returns true.
>
> I imagine that we'd replace that with a function (maybe an earlier
> version had it?) to report the memory usage to the caller, which
> should know where to find max_bytes.
>
> > Also they
> > cannot use TidStoreReset() as well since it needs to pass max_bytes to
> > RT_CREATE(). It might not be a problem in terms of lazy vacuum, but it
> > could be problematic for general use.
>
> HEAD has no problem finding the necessary values, and I don't think
> it'd be difficult to maintain that ability. I'm not actually sure what
> "general use" needs to have, and I'm not sure anyone can guess.
> There's the future possibility of parallel heap-scanning, but I'm
> guessing a *lot* more needs to happen for that to work, so I'm not
> sure how much it buys us to immediately start putting those two fields
> in a special abstraction. The only other concrete use case mentioned
> in this thread that I remember is bitmap heap scan, and I believe that
> would never need to reset, only free the whole thing when finished.
>
> I spent some more time studying parallel vacuum, and have some
> thoughts. In HEAD, we have
>
> -/*
> - * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
> - */
> -typedef struct VacDeadItems
> -{
> - int max_items; /* # slots allocated in array */
> - int num_items; /* current # of entries */
> -
> - /* Sorted array of TIDs to delete from indexes */
> - ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
> -} VacDeadItems;
>
> ...which has the tids, plus two fields that function _very similarly_
> to the two extra fields in the tidstore control object. It's a bit
> strange to me that the patch doesn't have this struct anymore.
>
> I suspect if we keep it around (just change "items" to be the local
> tidstore struct), the patch would have a bit less churn and look/work
> more like the current code. I think it might be easier to read if the
> v17 commits are suited to the current needs of vacuum, rather than try
> to anticipate all uses. Richer abstractions can come later if needed.

Just changing "items" to be the local tidstore struct could make the
code tricky a bit, since max_bytes and num_items are on the shared
memory while "items" is a local pointer to the shared tidstore. This
is a reason why I abstract them behind TidStore. However, IIUC the
current parallel vacuum can work with such VacDeadItems fields,
fortunately. The leader process can use VacDeadItems allocated on DSM,
and worker processes can use a local VacDeadItems of which max_bytes
and num_items are copied from the shared one and "items" is a local
pointer.

If parallel heap scan requires both the leader and workers
to update the shared VacDeadItems concurrently, we may need such
richer abstractions.

I've implemented this idea in the v52 patch set. Here is the summary
of the updates:

0008: Remove the control object from tidstore. Also removed some
now-unsupported functions such as TidStoreNumTids().
0009: Adjust lazy vacuum integration patch with the control object removal.

I've not updated any locking code yet. Once we confirm this direction,
I'll update the locking code too.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Tue, Jan 16, 2024 at 1:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Just changing "items" to be the local tidstore struct could make the
> code tricky a bit, since max_bytes and num_items are on the shared
> memory while "items" is a local pointer to the shared tidstore.

Thanks for trying it this way! I like the overall simplification but
this aspect is not great.
Hmm, I wonder if that's a side-effect of the "create" functions doing
their own allocations and returning a pointer. Would it be less tricky
if the structs were declared where we need them and passed to "init"
functions?

That may be a good idea for other reasons. It's awkward that the
create function is declared like this:

#ifdef RT_SHMEM
RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes,
dsa_area *dsa,
int tranche_id);
#else
RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes);
#endif

An init function wouldn't need these parameters: it could look at the
passed struct to know what to do.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Jan 17, 2024 at 9:20 AM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Tue, Jan 16, 2024 at 1:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Just changing "items" to be the local tidstore struct could make the
> > code tricky a bit, since max_bytes and num_items are on the shared
> > memory while "items" is a local pointer to the shared tidstore.
>
> Thanks for trying it this way! I like the overall simplification but
> this aspect is not great.
> Hmm, I wonder if that's a side-effect of the "create" functions doing
> their own allocations and returning a pointer. Would it be less tricky
> if the structs were declared where we need them and passed to "init"
> functions?

Seems worth trying. The current RT_CREATE() API is also convenient, as
other data structures such as simplehash.h and dshash.c support a
similar API.

>
> That may be a good idea for other reasons. It's awkward that the
> create function is declared like this:
>
> #ifdef RT_SHMEM
> RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes,
> dsa_area *dsa,
> int tranche_id);
> #else
> RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes);
> #endif
>
> An init function wouldn't need these parameters: it could look at the
> passed struct to know what to do.

But the init function would initialize leaf_ctx etc., no? Initializing
leaf_ctx needs max_bytes, which is not stored in RT_RADIX_TREE. The same
is true for the dsa. I imagined that an init function would allocate DSA
memory for the control object. So I imagine we will still end up
requiring some of these parameters.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Wed, Jan 17, 2024 at 8:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Jan 17, 2024 at 9:20 AM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > On Tue, Jan 16, 2024 at 1:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > Just changing "items" to be the local tidstore struct could make the
> > > code tricky a bit, since max_bytes and num_items are on the shared
> > > memory while "items" is a local pointer to the shared tidstore.
> >
> > Thanks for trying it this way! I like the overall simplification but
> > this aspect is not great.
> > Hmm, I wonder if that's a side-effect of the "create" functions doing
> > their own allocations and returning a pointer. Would it be less tricky
> > if the structs were declared where we need them and passed to "init"
> > functions?
>
> Seems worth trying. The current RT_CREATE() API is also convenient as
> other data structure such as simplehash.h and dshash.c supports a
> similar

I don't happen to know if these paths had to solve similar trickiness
with some values being local, and some shared.

> > That may be a good idea for other reasons. It's awkward that the
> > create function is declared like this:
> >
> > #ifdef RT_SHMEM
> > RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes,
> > dsa_area *dsa,
> > int tranche_id);
> > #else
> > RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes);
> > #endif
> >
> > An init function wouldn't need these parameters: it could look at the
> > passed struct to know what to do.
>
> But the init function would initialize leaf_ctx etc,no? Initializing
> leaf_ctx needs max_bytes that is not stored in RT_RADIX_TREE.

I was more referring to the parameters that were different above
depending on shared memory. My first thought was that the tricky part
is because of the allocation in local memory, but it's certainly
possible I've misunderstood the problem.

> The same
> is true for dsa. I imagined that an init function would allocate a DSA
> memory for the control object.

Yes:

...
//  embedded in VacDeadItems
  TidStore items;
};

// NULL DSA in local case, etc
dead_items->items.area = dead_items_dsa;
dead_items->items.tranche_id = FOO_ID;

TidStoreInit(&dead_items->items, vac_work_mem);

That's how I imagined it would work (leaving out some details). I
haven't tried it, so not sure how much it helps. Maybe it has other
problems, but I'm hoping it's just a matter of programming.
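
Fleshed out slightly (still hypothetical; TidStoreInit() and these
exact fields don't exist yet), the shape might be:

typedef struct TidStore
{
    dsa_area   *area;           /* NULL for backend-local storage */
    int         tranche_id;     /* ignored when area == NULL */
    /* the radix tree would be embedded here */
} TidStore;

typedef struct VacDeadItems
{
    size_t      max_bytes;      /* same role as in HEAD */
    int64       num_items;
    TidStore    items;          /* embedded, not a pointer */
} VacDeadItems;

void
TidStoreInit(TidStore *ts, size_t max_bytes)
{
    if (ts->area != NULL)
    {
        /* create the shared radix tree in ts->area, using tranche_id */
    }
    else
    {
        /* create a backend-local radix tree, sized by max_bytes */
    }
}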

If we can't make this work nicely, I'd be okay with keeping the tid
store control object. My biggest concern is unnecessary
double-locking.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
I wrote:

> > Hmm, I wonder if that's a side-effect of the "create" functions doing
> > their own allocations and returning a pointer. Would it be less tricky
> > if the structs were declared where we need them and passed to "init"
> > functions?

If this is a possibility, I thought I'd first send the last (I hope)
large-ish set of radix tree cleanups to avoid rebasing issues. I'm not
including tidstore/vacuum here, because recent discussion has some
up-in-the-air work.

Should be self-explanatory, but some things are worth calling out:
0012 and 0013: Some time ago I started passing insertpos as a
parameter, but now see that is not ideal -- when growing from node16
to node48 we don't need it at all, so it's a wasted calculation. While
reverting that, I found that this also allows passing constants in
some cases.
0014 makes a cleaner separation between adding a child and growing a
node, resulting in more compact-looking functions.
0019 is a bit unpolished, but I realized that it's pointless to assign
a zero child when further up the call stack we overwrite it anyway
with the actual value. With this, that assignment is skipped. This
makes some comments and names strange, so needs a bit of polish, but
wanted to get it out there anyway.

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Jan 17, 2024 at 11:37 AM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Wed, Jan 17, 2024 at 8:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Jan 17, 2024 at 9:20 AM John Naylor <johncnaylorls@gmail.com> wrote:
> > >
> > > On Tue, Jan 16, 2024 at 1:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > Just changing "items" to be the local tidstore struct could make the
> > > > code tricky a bit, since max_bytes and num_items are on the shared
> > > > memory while "items" is a local pointer to the shared tidstore.
> > >
> > > Thanks for trying it this way! I like the overall simplification but
> > > this aspect is not great.
> > > Hmm, I wonder if that's a side-effect of the "create" functions doing
> > > their own allocations and returning a pointer. Would it be less tricky
> > > if the structs were declared where we need them and passed to "init"
> > > functions?
> >
> > Seems worth trying. The current RT_CREATE() API is also convenient as
> > other data structure such as simplehash.h and dshash.c supports a
> > similar
>
> I don't happen to know if these paths had to solve similar trickiness
> with some values being local, and some shared.
>
> > > That may be a good idea for other reasons. It's awkward that the
> > > create function is declared like this:
> > >
> > > #ifdef RT_SHMEM
> > > RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes,
> > > dsa_area *dsa,
> > > int tranche_id);
> > > #else
> > > RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes);
> > > #endif
> > >
> > > An init function wouldn't need these parameters: it could look at the
> > > passed struct to know what to do.
> >
> > But the init function would initialize leaf_ctx etc,no? Initializing
> > leaf_ctx needs max_bytes that is not stored in RT_RADIX_TREE.
>
> I was more referring to the parameters that were different above
> depending on shared memory. My first thought was that the tricky part
> is because of the allocation in local memory, but it's certainly
> possible I've misunderstood the problem.
>
> > The same
> > is true for dsa. I imagined that an init function would allocate DSA
> > memory for the control object.
>
> Yes:
>
> ...
> //  embedded in VacDeadItems
>   TidStore items;
> };
>
> // NULL DSA in local case, etc
> dead_items->items.area = dead_items_dsa;
> dead_items->items.tranche_id = FOO_ID;
>
> TidStoreInit(&dead_items->items, vac_work_mem);
>
> That's how I imagined it would work (leaving out some details). I
> haven't tried it, so not sure how much it helps. Maybe it has other
> problems, but I'm hoping it's just a matter of programming.

It seems we cannot make this work nicely. IIUC VacDeadItems is
allocated in DSM and TidStore is embedded there. However,
dead_items->items.area is a local pointer to dsa_area. So we cannot
include dsa_area in either TidStore or RT_RADIX_TREE. Instead, we
would need callers to pass the dsa_area to each interface.

>
> If we can't make this work nicely, I'd be okay with keeping the tid
> store control object. My biggest concern is unnecessary
> double-locking.

If we don't do any locking in the radix tree APIs and it's entirely the
user's responsibility, we probably don't need a lock for
tidstore? That is, we expose lock functions as you mentioned and the
user (like tidstore) acquires/releases the lock before/after accessing
the radix tree and num_items. Currently (as of v52 patch) RT_FIND is
doing so, but we would need to change RT_SET() and iteration functions
as well.

While trying this idea, I realized that there is a visibility problem
in the radix tree template especially if we want to embed the radix
tree in a struct. Considering a use case where we want to use a radix
tree in an exposed struct, we would declare only interfaces in a .h
file and define actual implementation in a .c file (FYI
TupleHashTableData does a similar thing with simplehash.h). The .c
file and .h file would be like:

in .h file:
#define RT_PREFIX local_rt
#define RT_SCOPE extern
#define RT_DECLARE
#define RT_VALUE_TYPE BlocktableEntry
#define RT_VARLEN_VALUE
#include "lib/radixtree.h"

typedef struct TidStore
{
:
    local_rt_radix_tree tree; /* embedded */
:
} TidStore;

in .c file:

#define RT_PREFIX local_rt
#define RT_SCOPE extern
#define RT_DEFINE
#define RT_VALUE_TYPE BlocktableEntry
#define RT_VARLEN_VALUE
#include "lib/radixtree.h"

But it doesn't work as the compiler doesn't know the actual definition
of local_rt_radix_tree. If 'tree' is a pointer (*local_rt_radix_tree), it
works. The reason is that with RT_DECLARE but without RT_DEFINE, the
radix tree template generates only forward declarations:

#ifdef RT_DECLARE

typedef struct RT_RADIX_TREE RT_RADIX_TREE;
typedef struct RT_ITER RT_ITER;

In order to make it work, we need to move the definitions required to
expose the RT_RADIX_TREE struct to the RT_DECLARE part, which actually
requires moving RT_NODE, RT_HANDLE, RT_NODE_PTR, RT_SIZE_CLASS_COUNT,
RT_RADIX_TREE_CONTROL, etc. However, RT_SIZE_CLASS_COUNT, used in
RT_RADIX_TREE, could be bothersome. Since it refers to
RT_SIZE_CLASS_INFO, which further refers to many #defines and structs,
we might end up moving many structs such as RT_NODE_4 etc. to the
RT_DECLARE part as well. Or we could use a fixed number instead of
"lengthof(RT_SIZE_CLASS_INFO)". Apart from that, macros required by
both RT_DECLARE and RT_DEFINE, such as RT_SPAN and RT_MAX_LEVEL, also
need to be moved to a common place where they are defined in both
cases.

Given these facts, I think that the current abstraction works nicely
and it would make sense not to support embedding the radix tree.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Jan 18, 2024 at 8:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> It seems we cannot make this work nicely. IIUC VacDeadItems is
> allocated in DSM and TidStore is embedded there. However,
> dead_items->items.area is a local pointer to dsa_area. So we cannot
> include dsa_area in either TidStore or RT_RADIX_TREE. Instead, we
> would need callers to pass the dsa_area to each interface.

Thanks again for exploring this line of thinking! Okay, it seems even
if there's a way to make this work, it would be too invasive to
justify when compared with the advantage I was hoping for.

> > If we can't make this work nicely, I'd be okay with keeping the tid
> > store control object. My biggest concern is unnecessary
> > double-locking.
>
> If we don't do any locking in the radix tree APIs and it's entirely the
> user's responsibility, we probably don't need a lock for
> tidstore? That is, we expose lock functions as you mentioned and the
> user (like tidstore) acquires/releases the lock before/after accessing
> the radix tree and num_items.

I'm not quite sure what the point of "num_items" is anymore, because
it was really tied to the array in VacDeadItems. dead_items->num_items
is essential to reading/writing the array correctly. If this number is
wrong, the array is corrupt. There is no such requirement for the
radix tree. We don't need to know the number of tids to add to it or
do a lookup, or anything.

There are a number of places where we assert "the running count of the
dead items" is the same as "the length of the dead items array", like
here:

@@ -2214,7 +2205,7 @@ lazy_vacuum(LVRelState *vacrel)
  BlockNumber threshold;

  Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == TidStoreNumTids(vacrel->dead_items));

As such, in HEAD I'm guessing it's arbitrary which one is used for
control flow. Correct me if I'm mistaken. If I am wrong for some part
of the code, it'd be good to understand when that invariant can't be
maintained.

@@ -1258,7 +1265,7 @@ lazy_scan_heap(LVRelState *vacrel)
  * Do index vacuuming (call each index's ambulkdelete routine), then do
  * related heap vacuuming
  */
- if (dead_items->num_items > 0)
+ if (TidStoreNumTids(dead_items) > 0)
  lazy_vacuum(vacrel);

Like here. In HEAD, could this have used vacrel->dead_items?

@@ -2479,14 +2473,14 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
  * We set all LP_DEAD items from the first heap pass to LP_UNUSED during
  * the second heap pass.  No more, no less.
  */
- Assert(index > 0);
  Assert(vacrel->num_index_scans > 1 ||
-    (index == vacrel->lpdead_items &&
+    (TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items &&
  vacuumed_pages == vacrel->lpdead_item_pages));

  ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers
in %u pages",
+ vacrel->relname, TidStoreNumTids(vacrel->dead_items),
+ vacuumed_pages)));

We assert that vacrel->lpdead_items has the expected value, and then
the ereport repeats the function call (with a lock) to read the value
we just consulted to pass the assert.

If we *really* want to compare counts, maybe we could invent a
debugging-only function that iterates over the tree and popcounts the
bitmaps. That seems too expensive for regular assert builds, though.
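
Roughly like this, using the tidstore iteration API from the patch
series (untested, so consider it a sketch):

/* debug-only: recount the tids by a full iteration; the iterator has
 * already decoded each block's bitmap words into offsets */
static int64
tidstore_debug_count_tids(TidStore *ts)
{
    TidStoreIter *iter = TidStoreBeginIterate(ts);
    TidStoreIterResult *result;
    int64       num_tids = 0;

    while ((result = TidStoreIterateNext(iter)) != NULL)
        num_tids += result->num_offsets;

    TidStoreEndIterate(iter);
    return num_tids;
}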

On the subject of debugging builds, I think it no longer makes sense
to have the array for debug checking in tid store, even during
development. A few months ago, we had an encoding scheme that looked
simple on paper, but its code was fiendishly difficult to follow (at
least for me). That's gone. In addition to the debugging count above,
we could also put a copy of the key in the BlocktableEntry's header,
in debug builds. We don't yet need to care about the key size, since
we don't (yet) have runtime-embeddable values.

> Currently (as of v52 patch) RT_FIND is
> doing so,

[meaning, there is no internal "automatic" locking here since after we
switched to variable-length types, an outstanding TODO]
Maybe it's okay to expose global locking for v17. I have one possible
alternative:

This week I tried an idea to use a callback there so that after
internal unlocking, the caller received the value (or whatever else
needs to happen, such as lookup an offset in the tid bitmap). I've
attached a draft for that that passes radix tree tests. It's a bit
awkward, but I'm guessing this would more closely match future
internal atomic locking. Let me know what you think of the concept,
and then do whichever way you think is best. (using v53 as the basis)
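
In case it helps to picture it, the concept is roughly the following
(placeholder names, not necessarily what the draft does):

/* called on the value while the tree's internal lock is still held */
typedef void (*RT_VALUE_CB) (RT_VALUE_TYPE *value, void *arg);

/* like RT_FIND, but instead of returning a pointer into the tree after
 * unlocking, run the callback under the lock; returns whether the key
 * was found */
RT_SCOPE bool
RT_FIND_CB(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_CB cb, void *arg);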

I believe this is the only open question remaining. The rest is just
polish and testing.

> While trying this idea, I realized that there is a visibility problem
> in the radix tree template

If it's broken even without the embedding I'll look into this (I don't
know if this configuration has ever been tested). I think a good test
is putting the shared tid tree in its own translation unit, to see if
anything needs to be fixed. I'll go try that.

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Jan 18, 2024 at 1:30 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Jan 18, 2024 at 8:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > It seems we cannot make this work nicely. IIUC VacDeadItems is
> > allocated in DSM and TidStore is embedded there. However,
> > dead_items->items.area is a local pointer to dsa_area. So we cannot
> > include dsa_area in either TidStore or RT_RADIX_TREE. Instead, we
> > would need callers to pass the dsa_area to each interface.
>
> Thanks again for exploring this line of thinking! Okay, it seems even
> if there's a way to make this work, it would be too invasive to
> justify when compared with the advantage I was hoping for.
>
> > > If we can't make this work nicely, I'd be okay with keeping the tid
> > > store control object. My biggest concern is unnecessary
> > > double-locking.
> >
> > If we don't do any locking in the radix tree APIs and it's entirely the
> > user's responsibility, we probably don't need a lock for
> > tidstore? That is, we expose lock functions as you mentioned and the
> > user (like tidstore) acquires/releases the lock before/after accessing
> > the radix tree and num_items.
>
> I'm not quite sure what the point of "num_items" is anymore, because
> it was really tied to the array in VacDeadItems. dead_items->num_items
> is essential to reading/writing the array correctly. If this number is
> wrong, the array is corrupt. There is no such requirement for the
> radix tree. We don't need to know the number of tids to add to it or
> do a lookup, or anything.

True. Sorry I wanted to say "num_tids" of TidStore. I'm still thinking
we need to have the number of TIDs in a tidstore, especially in the
tidstore's control object.

>
> There are a number of places where we assert "the running count of the
> dead items" is the same as "the length of the dead items array", like
> here:
>
> @@ -2214,7 +2205,7 @@ lazy_vacuum(LVRelState *vacrel)
>   BlockNumber threshold;
>
>   Assert(vacrel->num_index_scans == 0);
> - Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
> + Assert(vacrel->lpdead_items == TidStoreNumTids(vacrel->dead_items));
>
> As such, in HEAD I'm guessing it's arbitrary which one is used for
> control flow. Correct me if I'm mistaken. If I am wrong for some part
> of the code, it'd be good to understand when that invariant can't be
> maintained.
>
> @@ -1258,7 +1265,7 @@ lazy_scan_heap(LVRelState *vacrel)
>   * Do index vacuuming (call each index's ambulkdelete routine), then do
>   * related heap vacuuming
>   */
> - if (dead_items->num_items > 0)
> + if (TidStoreNumTids(dead_items) > 0)
>   lazy_vacuum(vacrel);
>
> Like here. In HEAD, could this have used vacrel->dead_items?
>
> @@ -2479,14 +2473,14 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
>   * We set all LP_DEAD items from the first heap pass to LP_UNUSED during
>   * the second heap pass.  No more, no less.
>   */
> - Assert(index > 0);
>   Assert(vacrel->num_index_scans > 1 ||
> -    (index == vacrel->lpdead_items &&
> +    (TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items &&
>   vacuumed_pages == vacrel->lpdead_item_pages));
>
>   ereport(DEBUG2,
> - (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
> - vacrel->relname, (long long) index, vacuumed_pages)));
> + (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers
> in %u pages",
> + vacrel->relname, TidStoreNumTids(vacrel->dead_items),
> + vacuumed_pages)));
>
> We assert that vacrel->lpdead_items has the expected value, and then
> the ereport repeats the function call (with a lock) to read the value
> we just consulted to pass the assert.
>
> If we *really* want to compare counts, maybe we could invent a
> debugging-only function that iterates over the tree and popcounts the
> bitmaps. That seems too expensive for regular assert builds, though.

IIUC lpdead_items is the total number of LP_DEAD items vacuumed during
the whole lazy vacuum operation whereas num_items is the number of
LP_DEAD items vacuumed within one index vacuum and heap vacuum cycle.
That is, after heap vacuum, the latter counter is reset while the
former counter is not.

The latter counter is used in lazyvacuum.c as well as the ereport in
vac_bulkdel_one_index().

>
> On the subject of debugging builds, I think it no longer makes sense
> to have the array for debug checking in tid store, even during
> development. A few months ago, we had an encoding scheme that looked
> simple on paper, but its code was fiendishly difficult to follow (at
> least for me). That's gone. In addition to the debugging count above,
> we could also put a copy of the key in the BlocktableEntry's header,
> in debug builds. We don't yet need to care about the key size, since
> we don't (yet) have runtime-embeddable values.

Putting a copy of the key in BlocktableEntry's header is an
interesting idea. But the current debug code in the tidstore also
makes sure that the tidstore returns TIDs in the correct order during
an iterate operation. I think it still has a value and you can disable
it by removing the "#define TIDSTORE_DEBUG" line.

>
> > Currently (as of v52 patch) RT_FIND is
> > doing so,
>
> [meaning, there is no internal "automatic" locking here since after we
> switched to variable-length types, an outstanding TODO]
> Maybe it's okay to expose global locking for v17. I have one possible
> alternative:
>
> This week I tried an idea to use a callback there so that after
> internal unlocking, the caller received the value (or whatever else
> needs to happen, such as lookup an offset in the tid bitmap). I've
> attached a draft for that that passes radix tree tests. It's a bit
> awkward, but I'm guessing this would more closely match future
> internal atomic locking. Let me know what you think of the concept,
> and then do whichever way you think is best. (using v53 as the basis)

Thank you for verifying this idea! Interesting. While it's promising
in terms of future atomic locking, I'm concerned it might not be easy
to use if the radix tree APIs support only such a callback style. I believe
the caller would like to pass one more piece of data along with val_data. For
example, considering a tidstore that has num_tids internally, it wants
to pass both a pointer to BlocktableEntry and a pointer to the TidStore
itself so that it can increment the counter while holding a lock.

Another API idea for future atomic locking is to separate
RT_SET()/RT_FIND() into begin and end. In the RT_SET_BEGIN() API, we find
the key, extend nodes if necessary, set the value, and return the
result while holding the lock. For example, if the radix tree supports
lock coupling, the leaf node and its parent remain locked. Then the
caller does its job and calls RT_SET_END(), which does cleanup stuff
such as releasing locks.
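
In other words, something like this (hypothetical signatures, only to
illustrate the shape):

/* find or create the entry for key and return a pointer to its value;
 * returns with the relevant node(s) still locked */
RT_SCOPE RT_VALUE_TYPE *RT_SET_BEGIN(RT_RADIX_TREE *tree, uint64 key,
                                     bool *found);

/* release whatever locks RT_SET_BEGIN left held */
RT_SCOPE void RT_SET_END(RT_RADIX_TREE *tree);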

I've not fully considered this approach, but even this idea seems
complex and not easy to use. I prefer the current simple approach, as
we support only a simple locking mechanism for now.

>
> I believe this is the only open question remaining. The rest is just
> polish and testing.

Right.

>
> > While trying this idea, I realized that there is a visibility problem
> > in the radix tree template
>
> If it's broken even without the embedding I'll look into this (I don't
> know if this configuration has ever been tested). I think a good test
> is putting the shared tid tree in its own translation unit, to see if
> anything needs to be fixed. I'll go try that.

Thanks.

BTW in radixtree.h pg_attribute_unused() is used for some functions,
but is it for debugging purposes? I don't see why it's used only for
some functions.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Fri, Jan 19, 2024 at 2:26 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Jan 18, 2024 at 1:30 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > I'm not quite sure what the point of "num_items" is anymore, because
> > it was really tied to the array in VacDeadItems. dead_items->num_items
> > is essential to reading/writing the array correctly. If this number is
> > wrong, the array is corrupt. There is no such requirement for the
> > radix tree. We don't need to know the number of tids to add to it or
> > do a lookup, or anything.
>
> True. Sorry I wanted to say "num_tids" of TidStore. I'm still thinking
> we need to have the number of TIDs in a tidstore, especially in the
> tidstore's control object.

Hmm, it would be kind of sad to require explicit locking in tidstore.c
if it's only for maintaining that one number at all times. Aside from the
two ereports after an index scan / second heap pass, the only
non-assert place where it's used is

@@ -1258,7 +1265,7 @@ lazy_scan_heap(LVRelState *vacrel)
  * Do index vacuuming (call each index's ambulkdelete routine), then do
  * related heap vacuuming
  */
- if (dead_items->num_items > 0)
+ if (TidStoreNumTids(dead_items) > 0)
  lazy_vacuum(vacrel);

...and that condition can be checked by doing a single step of
iteration to see if it shows anything. But for the ereport, my idea
for iteration + popcount is probably quite slow.

> IIUC lpdead_items is the total number of LP_DEAD items vacuumed during
> the whole lazy vacuum operation whereas num_items is the number of
> LP_DEAD items vacuumed within one index vacuum and heap vacuum cycle.
> That is, after heap vacuum, the latter counter is reset while the
> former counter is not.
>
> The latter counter is used in lazyvacuum.c as well as the ereport in
> vac_bulkdel_one_index().

Ah, of course.

> Putting a copy of the key in BlocktableEntry's header is an
> interesting idea. But the current debug code in the tidstore also
> makes sure that the tidstore returns TIDs in the correct order during
> an iterate operation. I think it still has a value and you can disable
> it by removing the "#define TIDSTORE_DEBUG" line.

Fair enough. I just thought it'd be less work to leave this out in
case we change how locking is called.

> > This week I tried an idea to use a callback there so that after
> > internal unlocking, the caller received the value (or whatever else
> > needs to happen, such as lookup an offset in the tid bitmap). I've
> > attached a draft for that that passes radix tree tests. It's a bit
> > awkward, but I'm guessing this would more closely match future
> > internal atomic locking. Let me know what you think of the concept,
> > and then do whichever way you think is best. (using v53 as the basis)
>
> Thank you for verifying this idea! Interesting. While it's promising
> in terms of future atomic locking, I'm concerned it might not be easy
> to use if the radix tree APIs support only such a callback style.

Yeah, it's quite awkward. It could be helped by only exposing it for
varlen types. For simply returning "present or not" (used a lot in the
regression tests), we could skip the callback if the data is null.
That is all also extra stuff.

> I believe
> the caller would like to pass one more piece of data along with val_data. For

That's trivial, however, if I understand you correctly. With "void *",
a callback can receive anything, including a struct containing
additional pointers to elsewhere.

> example, considering a tidstore that has num_tids internally, it wants
> to pass both a pointer to BlocktableEntry and a pointer to the TidStore
> itself so that it can increment the counter while holding a lock.

Hmm, so a callback to RT_SET also. That's interesting!

Anyway, I agree it needs to be simple, since the first use doesn't
even have multiple writers.

> BTW in radixtree.h pg_attribute_unused() is used for some functions,
> but is it for debugging purposes? I don't see why it's used only for
> some functions.

It was there to silence warnings about unused functions. I only see
one remaining, and it's already behind a debug symbol, so we might not
need this attribute anymore.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
I wrote:

> On Thu, Jan 18, 2024 at 8:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > While trying this idea, I realized that there is a visibility problem
> > in the radix tree template
>
> If it's broken even without the embedding I'll look into this (I don't
> know if this configuration has ever been tested). I think a good test
> > is putting the shared tid tree in its own translation unit, to see if
> anything needs to be fixed. I'll go try that.

Here's a quick test that this works. The only thing that really needed
fixing in the template was failure to un-define one symbol. The rest
was just moving some things around.

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Jan 19, 2024 at 6:48 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Fri, Jan 19, 2024 at 2:26 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Jan 18, 2024 at 1:30 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > > I'm not quite sure what the point of "num_items" is anymore, because
> > > it was really tied to the array in VacDeadItems. dead_items->num_items
> > > is essential to reading/writing the array correctly. If this number is
> > > wrong, the array is corrupt. There is no such requirement for the
> > > radix tree. We don't need to know the number of tids to add to it or
> > > do a lookup, or anything.
> >
> > True. Sorry I wanted to say "num_tids" of TidStore. I'm still thinking
> > we need to have the number of TIDs in a tidstore, especially in the
> > tidstore's control object.
>
> Hmm, it would be kind of sad to require explicit locking in tidstore.c
> if it's only for maintaining that one number at all times. Aside from the
> two ereports after an index scan / second heap pass, the only
> non-assert place where it's used is
>
> @@ -1258,7 +1265,7 @@ lazy_scan_heap(LVRelState *vacrel)
>   * Do index vacuuming (call each index's ambulkdelete routine), then do
>   * related heap vacuuming
>   */
> - if (dead_items->num_items > 0)
> + if (TidStoreNumTids(dead_items) > 0)
>   lazy_vacuum(vacrel);
>
> ...and that condition can be checked by doing a single step of
> iteration to see if it shows anything. But for the ereport, my idea
> for iteration + popcount is probably quite slow.

Right.

On further thought, as you pointed out before, "num_tids" should not
be in tidstore in terms of integration with tidbitmap.c, because
tidbitmap.c has "lossy pages". With lossy pages, "num_tids" is no
longer accurate and useful. Similarly, looking at tidbitmap.c, it has
npages and nchunks but they will not be necessary in the lazy vacuum use
case. Also, assuming that we support parallel heap pruning, we would probably
need to somehow lock the tidstore while tids are added to it
concurrently by parallel vacuum workers. But in the tidbitmap use case, we
don't need to lock the tidstore since it doesn't have multiple
writers. Given these facts, different statistics and different lock
strategies are required by different use cases. So I think there are 3
options:

1. expose lock functions for tidstore and have the caller manage the
statistics outside of tidstore. For example, in lazyvacuum.c we
would have a TidStore for tid storage as well as a VacDeadItemsInfo that
has num_tids and max_bytes. Both are in LVRelState. For parallel
vacuum, we pass both to the workers via DSM and pass both to the functions
where the statistics are required. As for the exposed lock functions,
when adding tids to the tidstore, the caller would need to call
something like TidStoreLockExclusive(ts) that further calls
LWLockAcquire(ts->tree.shared->ctl.lock, LW_EXCLUSIVE) internally (see
the sketch after this list).

2. add callback functions to tidstore so that the caller can do its
work while holding a lock on the tidstore. This is like the idea we
just discussed for radix tree. The caller passes a callback function
and user data to TidStoreSetBlockOffsets(), and the callback is called
after setting tids. Similar to option 1, the statistics need to be
stored in a different area.

3. keep tidstore.c and tidbitmap.c separate implementations but use
radix tree in tidbitmap.c. tidstore.c would have "num_tids" in its
control object and doesn't have any lossy page support. On the other
hand, in tidbitmap.c we replace simplehash with radix tree. This makes
tidstore.c simple but we would end up having different data structures
for similar usage.
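
To make option 1 concrete, the shape I have in mind is roughly the
following (all names tentative; the lock path is the one mentioned in
option 1):

/* caller-managed statistics, passed to workers via DSM alongside the
 * tidstore handle */
typedef struct VacDeadItemsInfo
{
    size_t      max_bytes;  /* the maximum bytes TidStore can use */
    int64       num_items;  /* current # of dead items, kept by caller */
} VacDeadItemsInfo;

/* thin wrapper exposed by tidstore.c; a no-op for a local tidstore */
void
TidStoreLockExclusive(TidStore *ts)
{
    if (TidStoreIsShared(ts))
        LWLockAcquire(&ts->tree.shared->ctl.lock, LW_EXCLUSIVE);
}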

I think it's worth trying option 1. What do you think, John?

>
> > IIUC lpdead_items is the total number of LP_DEAD items vacuumed during
> > the whole lazy vacuum operation whereas num_items is the number of
> > LP_DEAD items vacuumed within one index vacuum and heap vacuum cycle.
> > That is, after heap vacuum, the latter counter is reset while the
> > former counter is not.
> >
> > The latter counter is used in lazyvacuum.c as well as the ereport in
> > vac_bulkdel_one_index().
>
> Ah, of course.
>
> > Putting a copy of the key in BlocktableEntry's header is an
> > interesting idea. But the current debug code in the tidstore also
> > makes sure that the tidstore returns TIDs in the correct order during
> > an iterate operation. I think it still has a value and you can disable
> > it by removing the "#define TIDSTORE_DEBUG" line.
>
> Fair enough. I just thought it'd be less work to leave this out in
> case we change how locking is called.
>
> > > This week I tried an idea to use a callback there so that after
> > > internal unlocking, the caller received the value (or whatever else
> > > needs to happen, such as lookup an offset in the tid bitmap). I've
> > > attached a draft for that that passes radix tree tests. It's a bit
> > > awkward, but I'm guessing this would more closely match future
> > > internal atomic locking. Let me know what you think of the concept,
> > > and then do whichever way you think is best. (using v53 as the basis)
> >
> > Thank you for verifying this idea! Interesting. While it's promising
> > in terms of future atomic locking, I'm concerned it might not be easy
> > to use if the radix tree APIs support only such a callback style.
>
> Yeah, it's quite awkward. It could be helped by only exposing it for
> varlen types. For simply returning "present or not" (used a lot in the
> regression tests), we could skip the callback if the data is null.
> That is all also extra stuff.
>
> > I believe
> > the caller would like to pass one more piece of data along with val_data. For
>
> That's trivial, however, if I understand you correctly. With "void *",
> a callback can receive anything, including a struct containing
> additional pointers to elsewhere.
>
> > example, considering a tidstore that has num_tids internally, it wants
> > to pass both a pointer to BlocktableEntry and a pointer to the TidStore
> > itself so that it can increment the counter while holding a lock.
>
> Hmm, so a callback to RT_SET also. That's interesting!
>
> Anyway, I agree it needs to be simple, since the first use doesn't
> even have multiple writers.

Right.

>
> > BTW in radixtree.h pg_attribute_unused() is used for some functions,
> > but is it for debugging purposes? I don't see why it's used only for
> > some functions.
>
> It was there to silence warnings about unused functions. I only see
> one remaining, and it's already behind a debug symbol, so we might not
> need this attribute anymore.

Okay.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Jan 22, 2024 at 10:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On further thought, as you pointed out before, "num_tids" should not
> be in tidstore in terms of integration with tidbitmap.c, because
> tidbitmap.c has "lossy pages". With lossy pages, "num_tids" is no
> longer accurate and useful. Similarly, looking at tidbitmap.c, it has
> npages and nchunks but they will not be necessary in the lazy vacuum use
> case. Also, assuming that we support parallel heap pruning, we would probably
> need to somehow lock the tidstore while tids are added to it
> concurrently by parallel vacuum workers. But in the tidbitmap use case, we
> don't need to lock the tidstore since it doesn't have multiple
> writers.

Not currently, and it does seem bad to require locking where it's not required.

(That would be a prerequisite for parallel index scan. It's been tried
before with the hash table, but concurrency didn't scale well with the
hash table. I have no reason to think that the radix tree would scale
significantly better with the same global LW lock, but as you know
there are other locking schemes possible.)

> Given these facts, different statistics and different lock
> strategies are required by different use cases. So I think there are 3
> options:
>
> 1. expose lock functions for tidstore and have the caller manage the
> statistics outside of tidstore. For example, in lazyvacuum.c we
> would have a TidStore for tid storage as well as a VacDeadItemsInfo that
> has num_tids and max_bytes. Both are in LVRelState. For parallel
> vacuum, we pass both to the workers via DSM and pass both to the functions
> where the statistics are required. As for the exposed lock functions,
> when adding tids to the tidstore, the caller would need to call
> something like TidStoreLockExclusive(ts) that further calls
> LWLockAcquire(ts->tree.shared->ctl.lock, LW_EXCLUSIVE) internally.

The advantage here is that vacuum can avoid locking entirely while
using shared memory, just like it does now, and has the option to add
it later.
IIUC, the radix tree struct would have a lock member, but wouldn't
take any locks internally? Maybe we still need one for
RT_MEMORY_USAGE? For that, I see dsa_get_total_size() takes its own
DSA_AREA_LOCK -- maybe that's enough?
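
If so, RT_MEMORY_USAGE could be as simple as this (untested; field
names are from memory of the patch series):

RT_SCOPE size_t
RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
{
#ifdef RT_SHMEM
    /* dsa_get_total_size() takes the DSA_AREA_LOCK internally */
    return dsa_get_total_size(tree->dsa);
#else
    return MemoryContextMemAllocated(tree->context, true);
#endif
}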

That seems simplest, and is not very far from what we do now. If we do
this, then the lock functions should be where we branch for is_shared.

> 2. add callback functions to tidstore so that the caller can do its
> work while holding a lock on the tidstore. This is like the idea we
> just discussed for radix tree. The caller passes a callback function
> and user data to TidStoreSetBlockOffsets(), and the callback is called
> after setting tids. Similar to option 1, the statistics need to be
> stored in a different area.

I think we'll have to move to something like this eventually, but it
seems like overkill right now.

> 3. keep tidstore.c and tidbitmap.c separate implementations but use
> radix tree in tidbitmap.c. tidstore.c would have "num_tids" in its
> control object and doesn't have any lossy page support. On the other
> hand, in tidbitmap.c we replace simplehash with radix tree. This makes
> tidstore.c simple but we would end up having different data structures
> for similar usage.

They have so much in common that it's worth it to use the same
interface and (eventually) value type. They just need separate paths
for adding tids, as we've discussed.

> I think it's worth trying option 1. What do you think, John?

+1



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Jan 17, 2024 at 12:32 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> I wrote:
>
> > > Hmm, I wonder if that's a side-effect of the "create" functions doing
> > > their own allocations and returning a pointer. Would it be less tricky
> > > if the structs were declared where we need them and passed to "init"
> > > functions?
>
> If this is a possibility, I thought I'd first send the last (I hope)
> large-ish set of radix tree cleanups to avoid rebasing issues. I'm not
> including tidstore/vacuum here, because recent discussion has some
> up-in-the-air work.

Thank you for updating the patches! These updates look good to me.

>
> Should be self-explanatory, but some things are worth calling out:
> 0012 and 0013: Some time ago I started passing insertpos as a
> parameter, but now see that is not ideal -- when growing from node16
> to node48 we don't need it at all, so it's a wasted calculation. While
> reverting that, I found that this also allows passing constants in
> some cases.
> 0014 makes a cleaner separation between adding a child and growing a
> node, resulting in more compact-looking functions.
> 0019 is a bit unpolished, but I realized that it's pointless to assign
> a zero child when further up the call stack we overwrite it anyway
> with the actual value. With this, that assignment is skipped. This
> makes some comments and names strange, so needs a bit of polish, but
> wanted to get it out there anyway.

Cool.

I'll merge these patches in the next version v54 patch set.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Jan 22, 2024 at 2:36 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Mon, Jan 22, 2024 at 10:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On further thought, as you pointed out before, "num_tids" should not
> > be in tidstore in terms of integration with tidbitmap.c, because
> > tidbitmap.c has "lossy pages". With lossy pages, "num_tids" is no
> > longer accurate and useful. Similarly, looking at tidbitmap.c, it has
> > npages and nchunks but they will not be necessary in the lazy vacuum use
> > case. Also, assuming that we support parallel heap pruning, we would probably
> > need to somehow lock the tidstore while tids are added to it
> > concurrently by parallel vacuum workers. But in the tidbitmap use case, we
> > don't need to lock the tidstore since it doesn't have multiple
> > writers.
>
> Not currently, and it does seem bad to require locking where it's not required.
>
> (That would be a prerequisite for parallel index scan. It's been tried
> before with the hash table, but concurrency didn't scale well with the
> hash table. I have no reason to think that the radix tree would scale
> significantly better with the same global LW lock, but as you know
> there are other locking schemes possible.)
>
> > Given these facts, different statistics and different lock
> > strategies are required by different use cases. So I think there are 3
> > options:
> >
> > 1. expose lock functions for tidstore and have the caller manage the
> > statistics outside of tidstore. For example, in lazyvacuum.c we
> > would have a TidStore for tid storage as well as a VacDeadItemsInfo that
> > has num_tids and max_bytes. Both are in LVRelState. For parallel
> > vacuum, we pass both to the workers via DSM and pass both to the functions
> > where the statistics are required. As for the exposed lock functions,
> > when adding tids to the tidstore, the caller would need to call
> > something like TidStoreLockExclusive(ts) that further calls
> > LWLockAcquire(ts->tree.shared->ctl.lock, LW_EXCLUSIVE) internally.
>
> The advantage here is that vacuum can avoid locking entirely while
> using shared memory, just like it does now, and has the option to add
> it later.

True.

> IIUC, the radix tree struct would have a lock member, but wouldn't
> take any locks internally? Maybe we still need one for
> RT_MEMORY_USAGE? For that, I see dsa_get_total_size() takes its own
> DSA_AREA_LOCK -- maybe that's enough?

I think that's a good point. So there will be no place where the radix
tree takes any locks internally.

>
> That seems simplest, and is not very far from what we do now. If we do
> this, then the lock functions should be where we branch for is_shared.

Agreed.

>
> > 2. add callback functions to tidstore so that the caller can do its
> > work while holding a lock on the tidstore. This is like the idea we
> > just discussed for radix tree. The caller passes a callback function
> > and user data to TidStoreSetBlockOffsets(), and the callback is called
> > after setting tids. Similar to option 1, the statistics need to be
> > stored in a different area.
>
> I think we'll have to move to something like this eventually, but it
> seems like overkill right now.

Right.

>
> > 3. keep tidstore.c and tidbitmap.c separate implementations but use
> > radix tree in tidbitmap.c. tidstore.c would have "num_tids" in its
> > control object and doesn't have any lossy page support. On the other
> > hand, in tidbitmap.c we replace simplehash with radix tree. This makes
> > tidstore.c simple but we would end up having different data structures
> > for similar usage.
>
> They have so much in common that it's worth it to use the same
> interface and (eventually) value type. They just need separate paths
> for adding tids, as we've discussed.

Agreed.

>
> > I think it's worth trying option 1. What do you think, John?
>
> +1

Thanks!

Before working on this idea, since the latest patches conflict with
the current HEAD, I share the latest patch set (v54). Here is the
summary:

- As for the radix tree part, it's based on the v53 patch. I've squashed most
of the cleanups and changes in v53 except for "DRAFT: Stop using invalid
pointers as placeholders.", as I thought you might want to still work
on it. BTW it includes "#undef RT_SHMEM".
- As for tidstore, it's based on v51. That is, it still has the
control object and num_tids there.
- As for vacuum integration, it's also based on v51. But we no longer
need to change has_lpdead_items and LVPagePruneState thanks to the
recent commits c120550edb8 and e313a61137.

For the next version patch, I'll work on this idea and try to clean up
locking stuff both in tidstore and radix tree. Or if you're already
working on some of them, please let me know. I'll review it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Jan 22, 2024 at 2:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> For the next version patch, I'll work on this idea and try to clean up
> locking stuff both in tidstore and radix tree. Or if you're already
> working on some of them, please let me know. I'll review it.

Okay go ahead, sounds good. I plan to look at the tests since they
haven't been looked at in a while.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Jan 22, 2024 at 5:18 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Mon, Jan 22, 2024 at 2:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > For the next version patch, I'll work on this idea and try to clean up
> > locking stuff both in tidstore and radix tree. Or if you're already
> > working on some of them, please let me know. I'll review it.
>
> Okay go ahead, sounds good. I plan to look at the tests since they
> haven't been looked at in a while.

I've attached the latest patch set. Here are updates from v54 patch:

0005 - Expose radix tree lock functions and remove all locks taken
internally in radixtree.h.
0008 - Remove tidstore's control object.
0009 - Add tidstore lock functions.
0011 - Add VacDeadItemsInfo to store "max_bytes" and "num_items"
separate from TidStore. Also make lazy vacuum and parallel vacuum use
it.

The new patches probably need to be polished but the VacDeadItemsInfo
idea looks good to me.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Jan 23, 2024 at 12:58 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jan 22, 2024 at 5:18 PM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > On Mon, Jan 22, 2024 at 2:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > For the next version patch, I'll work on this idea and try to clean up
> > > locking stuff both in tidstore and radix tree. Or if you're already
> > > working on some of them, please let me know. I'll review it.
> >
> > Okay go ahead, sounds good. I plan to look at the tests since they
> > haven't been looked at in a while.
>
> I've attached the latest patch set. Here are updates from v54 patch:
>
> 0005 - Expose radix tree lock functions and remove all locks taken
> internally in radixtree.h.
> 0008 - Remove tidstore's control object.
> 0009 - Add tidstore lock functions.
> 0011 - Add VacDeadItemsInfo to store "max_bytes" and "num_items"
> separate from TidStore. Also make lazy vacuum and parallel vacuum use
> it.

John pointed out offlist the tarball includes only the patches up to
0009. I've attached the correct one.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Tue, Jan 23, 2024 at 10:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> The new patches probably need to be polished but the VacDeadItemsInfo
> idea looks good to me.

That idea looks good to me, too. Since you already likely know what
you'd like to polish, I don't have much to say except for a few
questions below. I also did a quick sweep through every patch, so some
of these comments are unrelated to recent changes:

v55-0003:

+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));

I looked and found dsa.c doesn't already use shared locks in HEAD,
even dsa_dump. Not sure why that is...

+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define RT_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)

The first parameter seems to be trying to make the block size exact,
but that's not right, because of the chunk header, and maybe
alignment. If the default block size is big enough to waste only a
tiny amount of space, let's just use that as-is. Also, I think all
block sizes in the code base have been a power of two, but I'm not
sure how much that matters.

+#ifdef RT_SHMEM
+ fprintf(stderr, "  [%d] chunk %x slot " DSA_POINTER_FORMAT "\n",
+ i, n4->chunks[i], n4->children[i]);
+#else
+ fprintf(stderr, "  [%d] chunk %x slot %p\n",
+ i, n4->chunks[i], n4->children[i]);
+#endif

Maybe we could invent a child pointer format, so we only #ifdef in one place.

--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/

Can you look into this?

test_radixtree.c:

+/*
+ * XXX: should we expose and use RT_SIZE_CLASS and RT_SIZE_CLASS_INFO?
+ */
+static int rt_node_class_fanouts[] = {
+ 4, /* RT_CLASS_3 */
+ 15, /* RT_CLASS_32_MIN */
+ 32, /* RT_CLASS_32_MAX */
+ 125, /* RT_CLASS_125 */
+ 256 /* RT_CLASS_256 */
+};

These numbers have been wrong a long time, too, but only matters for
figuring out where it went wrong when something is broken. And for the
XXX, instead of trying to use the largest number that should fit (it's
obviously not testing that the expected node can actually hold that
number anyway), it seems we can just use a "big enough" number to
cause growing into the desired size class.

As far as cleaning up the tests, I always wondered why these didn't
use EXPECT_TRUE, EXPECT_FALSE, etc. as in Andres's prototype where
convenient, and leave comments above the tests. That seemed like
a good idea to me -- was there a reason to have hand-written branches
and elog messages everywhere?
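
For reference, a macro in that style is as simple as this sketch:

#define EXPECT_TRUE(expr) \
    do { \
        if (!(expr)) \
            elog(ERROR, \
                 "%s was unexpectedly false in file \"%s\" line %u", \
                 #expr, __FILE__, __LINE__); \
    } while (0)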

--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
  test "$f" = src/include/nodes/nodetags.h && continue
  test "$f" = src/backend/nodes/nodetags.h && continue

+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue

Ha! I'd forgotten about these -- they're long outdated.

v55-0005:

- * The radix tree is locked in shared mode during the iteration, so
- * RT_END_ITERATE needs to be called when finished to release the lock.
+ * The caller needs to acquire a lock in shared mode during the iteration
+ * if necessary.

"need if necessary" is maybe better phrased as "is the caller's responsibility"

+ /*
+ * We can rely on DSA_AREA_LOCK to get the total amount of DSA memory.
+ */
  total = dsa_get_total_size(tree->dsa);

Maybe better to have a header comment for RT_MEMORY_USAGE that the
caller doesn't need to take a lock.

v55-0006:

"WIP: Not built, since some benchmarks have broken" -- I'll work on
this when I re-run some benchmarks.

v55-0007:

+ * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value,
+ * and stored in the radix tree.

This hasn't been true for a few months now, and I thought we fixed
this in some earlier version?

+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backend must
+ * explicitly call TidStoreDetach() to free up backend-local memory associated
+ * with the TidStore. The backend that calls TidStoreDestroy() must not call
+ * TidStoreDetach().

Do we need to do anything now?

v55-0008:

-TidStoreAttach(dsa_area *area, TidStoreHandle handle)
+TidStoreAttach(dsa_area *area, dsa_pointer rt_dp)

"handle" seemed like a fine name. Is that not the case anymore? The
new one is kind of cryptic. The commit message just says "remove
control object" -- does that imply that we need to think of this
parameter differently, or is it unrelated? (Same with
dead_items_handle in 0011)

v55-0011:

+ /*
+ * Recreate the tidstore with the same max_bytes limitation. We cannot
+ * use neither maintenance_work_mem nor autovacuum_work_mem as they could
+ * already be changed.
+ */

I don't understand this part.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Jan 24, 2024 at 3:42 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Tue, Jan 23, 2024 at 10:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > The new patches probably need to be polished but the VacDeadItemsInfo
> > idea looks good to me.
>
> That idea looks good to me, too. Since you already likely know what
> you'd like to polish, I don't have much to say except for a few
> questions below. I also did a quick sweep through every patch, so some
> of these comments are unrelated to recent changes:

Thank you!

>
> v55-0003:
>
> +size_t
> +dsa_get_total_size(dsa_area *area)
> +{
> + size_t size;
> +
> + LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
> + size = area->control->total_segment_size;
> + LWLockRelease(DSA_AREA_LOCK(area));
>
> I looked and found dsa.c doesn't already use shared locks in HEAD,
> even dsa_dump. Not sure why that is...

Oh, the dsa_dump part seems to be a bug. But I'll keep it consistent
with the others.

>
> +/*
> + * Calculate the slab blocksize so that we can allocate at least 32 chunks
> + * from the block.
> + */
> +#define RT_SLAB_BLOCK_SIZE(size) \
> + Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
>
> The first parameter seems to be trying to make the block size exact,
> but that's not right, because of the chunk header, and maybe
> alignment. If the default block size is big enough to waste only a
> tiny amount of space, let's just use that as-is.

Agreed.

> Also, I think all
> block sizes in the code base have been a power of two, but I'm not
> sure how much that matters.

Did you mean all slab block sizes we use in radixtree.h?

>
> +#ifdef RT_SHMEM
> + fprintf(stderr, "  [%d] chunk %x slot " DSA_POINTER_FORMAT "\n",
> + i, n4->chunks[i], n4->children[i]);
> +#else
> + fprintf(stderr, "  [%d] chunk %x slot %p\n",
> + i, n4->chunks[i], n4->children[i]);
> +#endif
>
> Maybe we could invent a child pointer format, so we only #ifdef in one place.

Will change.

>
> --- /dev/null
> +++ b/src/test/modules/test_radixtree/meson.build
> @@ -0,0 +1,35 @@
> +# FIXME: prevent install during main install, but not during test :/
>
> Can you look into this?

Okay, I'll look at it.

>
> test_radixtree.c:
>
> +/*
> + * XXX: should we expose and use RT_SIZE_CLASS and RT_SIZE_CLASS_INFO?
> + */
> +static int rt_node_class_fanouts[] = {
> + 4, /* RT_CLASS_3 */
> + 15, /* RT_CLASS_32_MIN */
> + 32, /* RT_CLASS_32_MAX */
> + 125, /* RT_CLASS_125 */
> + 256 /* RT_CLASS_256 */
> +};
>
> These numbers have been wrong a long time, too, but only matters for
> figuring out where it went wrong when something is broken. And for the
> XXX, instead of trying to use the largest number that should fit (it's
> obviously not testing that the expected node can actually hold that
> number anyway), it seems we can just use a "big enough" number to
> cause growing into the desired size class.
>
> As far as cleaning up the tests, I always wondered why these didn't
> use EXPECT_TRUE, EXPECT_FALSE, etc. as in Andres's prototype where
> convenient, and leave comments above the tests. That seemed like
> a good idea to me -- was there a reason to have hand-written branches
> and elog messages everywhere?

The current test is based on test_integerset. I agree that we can
improve it by using EXPECT_TRUE etc.

>
> --- a/src/tools/pginclude/cpluspluscheck
> +++ b/src/tools/pginclude/cpluspluscheck
> @@ -101,6 +101,12 @@ do
>   test "$f" = src/include/nodes/nodetags.h && continue
>   test "$f" = src/backend/nodes/nodetags.h && continue
>
> + # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
> + test "$f" = src/include/lib/radixtree_delete_impl.h && continue
> + test "$f" = src/include/lib/radixtree_insert_impl.h && continue
> + test "$f" = src/include/lib/radixtree_iter_impl.h && continue
> + test "$f" = src/include/lib/radixtree_search_impl.h && continue
>
> Ha! I'd forgotten about these -- they're long outdated.

Will remove.

>
> v55-0005:
>
> - * The radix tree is locked in shared mode during the iteration, so
> - * RT_END_ITERATE needs to be called when finished to release the lock.
> + * The caller needs to acquire a lock in shared mode during the iteration
> + * if necessary.
>
> "need if necessary" is maybe better phrased as "is the caller's responsibility"

Will fix.

>
> + /*
> + * We can rely on DSA_AREA_LOCK to get the total amount of DSA memory.
> + */
>   total = dsa_get_total_size(tree->dsa);
>
> Maybe better to have a header comment for RT_MEMORY_USAGE that the
> caller doesn't need to take a lock.

Will fix.

>
> v55-0006:
>
> "WIP: Not built, since some benchmarks have broken" -- I'll work on
> this when I re-run some benchmarks.

Thanks!

>
> v55-0007:
>
> + * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value,
> + * and stored in the radix tree.
>
> This hasn't been true for a few months now, and I thought we fixed
> this in some earlier version?

Yeah, I'll fix it.

>
> + * TODO: The caller must be certain that no other backend will attempt to
> + * access the TidStore before calling this function. Other backend must
> + * explicitly call TidStoreDetach() to free up backend-local memory associated
> + * with the TidStore. The backend that calls TidStoreDestroy() must not call
> + * TidStoreDetach().
>
> Do we need to do anything now?

No, will remove it.

>
> v55-0008:
>
> -TidStoreAttach(dsa_area *area, TidStoreHandle handle)
> +TidStoreAttach(dsa_area *area, dsa_pointer rt_dp)
>
> "handle" seemed like a fine name. Is that not the case anymore? The
> new one is kind of cryptic. The commit message just says "remove
> control object" -- does that imply that we need to think of this
> parameter differently, or is it unrelated? (Same with
> dead_items_handle in 0011)

Since it's actually just the radix tree's handle, it seemed kind of
unnatural to me to use the same dsa_pointer for different handles. But
rethinking it, I agree "handle" is a fine name.

>
> v55-0011:
>
> + /*
> + * Recreate the tidstore with the same max_bytes limitation. We cannot
> + * use neither maintenance_work_mem nor autovacuum_work_mem as they could
> + * already be changed.
> + */
>
> I don't understand this part.

I meant that if maintenance_work_mem is changed and the
config file is reloaded, its value could no longer be the same as the
one we used when initializing the parallel vacuum. That's why we
need to store max_bytes in the DSM. I'll rephrase it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Jan 26, 2024 at 11:05 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Jan 24, 2024 at 3:42 PM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > On Tue, Jan 23, 2024 at 10:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > The new patches probably need to be polished but the VacDeadItemsInfo
> > > idea looks good to me.
> >
> > That idea looks good to me, too. Since you already likely know what
> > you'd like to polish, I don't have much to say except for a few
> > questions below. I also did a quick sweep through every patch, so some
> > of these comments are unrelated to recent changes:
>
> Thank you!
>
> >
> > +/*
> > + * Calculate the slab blocksize so that we can allocate at least 32 chunks
> > + * from the block.
> > + */
> > +#define RT_SLAB_BLOCK_SIZE(size) \
> > + Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
> >
> > The first parameter seems to be trying to make the block size exact,
> > but that's not right, because of the chunk header, and maybe
> > alignment. If the default block size is big enough to waste only a
> > tiny amount of space, let's just use that as-is.
>
> Agreed.
>

As of v55 patch, the sizes of each node class are:

- node4: 40 bytes
- node16_lo: 168 bytes
- node16_hi: 296 bytes
- node48: 784 bytes
- node256: 2088 bytes

If we use SLAB_DEFAULT_BLOCK_SIZE (8kB) for each node class, we waste
(approximately):

- node4: 32 bytes
- node16_lo: 128 bytes
- node16_hi: 200 bytes
- node48: 352 bytes
- node256: 1928 bytes

We might want to calculate a better slab block size for node256 at
least: only three 2088-byte chunks fit in an 8kB block, so
8192 - 3 * 2088 = 1928 bytes of each block go unused.

> >
> > + * TODO: The caller must be certain that no other backend will attempt to
> > + * access the TidStore before calling this function. Other backend must
> > + * explicitly call TidStoreDetach() to free up backend-local memory associated
> > + * with the TidStore. The backend that calls TidStoreDestroy() must not call
> > + * TidStoreDetach().
> >
> > Do we need to do anything now?
>
> No, will remove it.
>

I misunderstood something. I think the above statement is still true
but we don't need to do anything at this stage. It's typical usage for
the leader to destroy the shared data after confirming all workers
are detached. It's not a TODO but probably a NOTE.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Jan 29, 2024 at 2:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> > > +/*
> > > + * Calculate the slab blocksize so that we can allocate at least 32 chunks
> > > + * from the block.
> > > + */
> > > +#define RT_SLAB_BLOCK_SIZE(size) \
> > > + Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
> > >
> > > The first parameter seems to be trying to make the block size exact,
> > > but that's not right, because of the chunk header, and maybe
> > > alignment. If the default block size is big enough to waste only a
> > > tiny amount of space, let's just use that as-is.

> If we use SLAB_DEFAULT_BLOCK_SIZE (8kB) for each node class, we waste
> [snip]
> We might want to calculate a better slab block size for node256 at least.

I meant the macro could probably be

Max(SLAB_DEFAULT_BLOCK_SIZE, (size) * N)

(Right now N=32). I also realize I didn't answer your question earlier
about block sizes being powers of two. I was talking about PG in
general -- I was thinking all block sizes were powers of two. If
that's true, I'm not sure if it's because programmers find the macro
calculations easy to reason about, or if there was an implementation
reason for it (e.g. libc behavior). 32*2088 bytes is about 65kB, or
just above a power of two, so if we did  round that up it would be
128kB.

> > > + * TODO: The caller must be certain that no other backend will attempt to
> > > + * access the TidStore before calling this function. Other backend must
> > > + * explicitly call TidStoreDetach() to free up backend-local memory associated
> > > + * with the TidStore. The backend that calls TidStoreDestroy() must not call
> > > + * TidStoreDetach().
> > >
> > > Do we need to do anything now?
> >
> > No, will remove it.
> >
>
> I misunderstood something. I think the above statement is still true,
> but we don't need to do anything at this stage. It's typical usage
> for the leader to destroy the shared data after confirming that all
> workers are detached. It's not a TODO but probably a NOTE.

Okay.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Jan 29, 2024 at 8:48 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Mon, Jan 29, 2024 at 2:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > > > +/*
> > > > + * Calculate the slab blocksize so that we can allocate at least 32 chunks
> > > > + * from the block.
> > > > + */
> > > > +#define RT_SLAB_BLOCK_SIZE(size) \
> > > > + Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
> > > >
> > > > The first parameter seems to be trying to make the block size exact,
> > > > but that's not right, because of the chunk header, and maybe
> > > > alignment. If the default block size is big enough to waste only a
> > > > tiny amount of space, let's just use that as-is.
>
> > If we use SLAB_DEFAULT_BLOCK_SIZE (8kB) for each node class, we waste
> > [snip]
> > We might want to calculate a better slab block size for node256 at least.
>
> I meant the macro could probably be
>
> Max(SLAB_DEFAULT_BLOCK_SIZE, (size) * N)
>
> (Right now N=32). I also realize I didn't answer your question earlier
> about block sizes being powers of two. I was talking about PG in
> general -- I was thinking all block sizes were powers of two. If
> that's true, I'm not sure if it's because programmers find the macro
> calculations easy to reason about, or if there was an implementation
> reason for it (e.g. libc behavior). 32*2088 bytes is about 65kB, or
> just above a power of two, so if we did  round that up it would be
> 128kB.

Thank you for your explanation. It might be better to follow the
existing code. Does the calculation below make sense to you?

RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
Size inner_blocksize = SLAB_DEFAULT_BLOCK_SIZE;
while (inner_blocksize < 32 * size_class.allocsize)
     inner_blocksize <<= 1;

As for the lock mode in dsa.c, I've posted a question[1].

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoALgrU2sGWzgq%2B6G9X0ynqyVOjMR5_k4HgsGRWae1j%3DwQ%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Tue, Jan 30, 2024 at 7:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jan 29, 2024 at 8:48 PM John Naylor <johncnaylorls@gmail.com> wrote:

> > I meant the macro could probably be
> >
> > Max(SLAB_DEFAULT_BLOCK_SIZE, (size) * N)
> >
> > (Right now N=32). I also realize I didn't answer your question earlier
> > about block sizes being powers of two. I was talking about PG in
> > general -- I was thinking all block sizes were powers of two. If
> > that's true, I'm not sure if it's because programmers find the macro
> > calculations easy to reason about, or if there was an implementation
> > reason for it (e.g. libc behavior). 32*2088 bytes is about 65kB, or
> > just above a power of two, so if we did  round that up it would be
> > 128kB.
>
> Thank you for your explanation. It might be better to follow the
> existing code. Does the calculation below make sense to you?
>
> RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
> Size inner_blocksize = SLAB_DEFAULT_BLOCK_SIZE;
> while (inner_blocksize < 32 * size_class.allocsize)
>      inner_blocksize <<= 1;

It does make sense, but we can do it more simply:

Max(SLAB_DEFAULT_BLOCK_SIZE, pg_nextpower2_32(size * 32))
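
For example, plugging in the v55 node sizes from upthread
(pg_nextpower2_32() is from port/pg_bitutils.h):

    Max(SLAB_DEFAULT_BLOCK_SIZE, pg_nextpower2_32(40 * 32));   /* node4:     8192 */
    Max(SLAB_DEFAULT_BLOCK_SIZE, pg_nextpower2_32(2088 * 32)); /* node256: 131072 */

so the small classes keep the 8kB default and node256 rounds up to the
128kB I mentioned.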



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Jan 30, 2024 at 7:20 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Tue, Jan 30, 2024 at 7:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Jan 29, 2024 at 8:48 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> > > I meant the macro could probably be
> > >
> > > Max(SLAB_DEFAULT_BLOCK_SIZE, (size) * N)
> > >
> > > (Right now N=32). I also realize I didn't answer your question earlier
> > > about block sizes being powers of two. I was talking about PG in
> > > general -- I was thinking all block sizes were powers of two. If
> > > that's true, I'm not sure if it's because programmers find the macro
> > > calculations easy to reason about, or if there was an implementation
> > > reason for it (e.g. libc behavior). 32*2088 bytes is about 65kB, or
> > > just above a power of two, so if we did  round that up it would be
> > > 128kB.
> >
> > Thank you for your explanation. It might be better to follow the
> > existing code. Does the calculation below make sense to you?
> >
> > RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
> > Size inner_blocksize = SLAB_DEFAULT_BLOCK_SIZE;
> > while (inner_blocksize < 32 * size_class.allocsize)
> >      inner_blocksize <<= 1;
>
> It does make sense, but we can do it more simply:
>
> Max(SLAB_DEFAULT_BLOCK_SIZE, pg_nextpower2_32(size * 32))

Thanks!

I've attached the new patch set (v56). I've squashed previous updates
and addressed review comments on v55 in separate patches. Here is a
summary of the updates:

0004: fix compiler warning caught by ci test.
0005-0008: address review comments on radix tree codes.
0009: cleanup #define and #undef
0010: use TEST_SHARED_RT macro for shared radix tree test. RT_SHMEM is
undefined after including radixtree.h so we should not use it in test
code.
0013-0015: address review comments on tidstore codes.
0017-0018: address review comments on vacuum integration codes.

Looking at overall changes, there are still XXX and TODO comments in
radixtree.h:

---
 * XXX There are 4 node kinds, and this should never be increased,
 * for several reasons:
 * 1. With 5 or more kinds, gcc tends to use a jump table for switch
 *    statements.
 * 2. The 4 kinds can be represented with 2 bits, so we have the option
 *    in the future to tag the node pointer with the kind, even on
 *    platforms with 32-bit pointers. This might speed up node traversal
 *    in trees with highly random node kinds.
 * 3. We can have multiple size classes per node kind.

Can we just remove "XXX"?

---
 * WIP: notes about traditional radix tree trading off span vs height...

Are you going to write it?

---
#ifdef RT_SHMEM
/*  WIP: do we really need this? */
typedef dsa_pointer RT_HANDLE;
#endif

I think it's worth having it.

---
 * WIP: The paper uses at most 64 for this node kind. "isset" happens to fit
 * inside a single bitmapword on most platforms, so it's a good starting
 * point. We can make it higher if we need to.
 */
#define RT_FANOUT_48_MAX (RT_NODE_MAX_SLOTS / 4)

Are you planning to do something about this?

---
    /* WIP: We could go first to the higher node16 size class */
    newnode = RT_ALLOC_NODE(tree, RT_NODE_KIND_16, RT_CLASS_16_LO);

Does it mean to go to RT_CLASS_16_HI and then further go to
RT_CLASS_16_LO upon further deletion?

---
 * TODO: The current locking mechanism is not optimized for high concurrency
 * with mixed read-write workloads. In the future it might be worthwhile
 * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
 * the paper "The ART of Practical Synchronization" by the same authors as
 * the ART paper, 2016.

I think it's not TODO for now, but a future improvement. We can remove it.

---
/* TODO: consider 5 with subclass 1 or 2. */
#define RT_FANOUT_4     4

Is there something we need to do here?

---
/*
 * Return index of the chunk and slot arrays for inserting into the node,
 * such that the chunk array remains ordered.
 * TODO: Improve performance for non-SIMD platforms.
 */

Are you going to work on this?

---
/* Delete the element at 'idx' */
/*  TODO: replace slow memmove's */

Are you going to work on this?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Wed, Jan 31, 2024 at 12:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I've attached the new patch set (v56). I've squashed previous updates
> and addressed review comments on v55 in separate patches. Here is a
> summary of the updates:
>
> 0004: fix compiler warning caught by ci test.
> 0005-0008: address review comments on radix tree codes.
> 0009: cleanup #define and #undef
> 0010: use TEST_SHARED_RT macro for shared radix tree test. RT_SHMEM is
> undefined after including radixtree.h so we should not use it in test
> code.

Great, thanks!

I have a few questions and comments on v56, then I'll address yours
below with the attached v57, which is mostly cosmetic adjustments.

v56-0003:

(Looking closer at tests)

+static const bool rt_test_stats = false;

I'm thinking we should just remove everything that depends on this,
and keep this module entirely about correctness.

+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);

I'm not sure what the test_node_types_* functions are testing that
test_basic doesn't. They have a different, and confusing, way to stop
at every size class and check the keys/values. It seems we can replace
all that with two more calls (asc/desc) to test_basic, with the
maximum level.

It's pretty hard to see what test_pattern() is doing, or why it's
useful. I wonder if instead the test could use something like the
benchmark where random integers are masked off. That seems simpler. I
can work on that, but I'd like to hear your side about test_pattern().
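
To sketch the idea (mirroring the shape of the bench module's key
generation; the filter value is only an example):

    pg_prng_state rng;
    uint64  filter = UINT64CONST(0x7F00FF);   /* masks bits to control density */

    pg_prng_seed(&rng, 0);
    for (int i = 0; i < num_keys; i++)
        keys[i] = pg_prng_uint64(&rng) & filter;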

v56-0007:

+ *
+ * Since we can rely on DSA_AREA_LOCK to get the total amount of DSA memory,
+ * the caller doesn't need to take a lock.

Maybe something like "Since dsa_get_total_size() does appropriate locking ..."?

v56-0008

Thanks, I like how the tests look now.

-NOTICE:  testing node   4 with height 0 and  ascending keys
...
+NOTICE:  testing node   1 with height 0 and  ascending keys

Now that the number is not intended to match a size class, "node X"
seems out of place. Maybe we could have a separate array with strings?

+ 1, /* RT_CLASS_4 */

This should be more than one, so that the basic test still exercises
paths that shift elements around.

+ 100, /* RT_CLASS_48 */

This node currently holds 64 for local memory.

+ 255 /* RT_CLASS_256 */

This is the only one where we know exactly how many it can take, so
may as well keep it at 256.

v56-0012:

The test module for tidstore could use a few more comments.

v56-0015:

+typedef dsa_pointer TidStoreHandle;
+

-TidStoreAttach(dsa_area *area, dsa_pointer rt_dp)
+TidStoreAttach(dsa_area *area, TidStoreHandle handle)
 {
  TidStore *ts;
+ dsa_pointer rt_dp = handle;

My earlier opinion was that "handle" was a nicer variable name, but
this brings back the typedef and also keeps the variable name I didn't
like, but pushes it down into the function. I'm a bit confused, so
I've kept these not-squashed for now.

-----------------------------------------------------------------------------------

Now, for v57:

> Looking at overall changes, there are still XXX and TODO comments in
> radixtree.h:

That's fine, as long as it's intentional as a message to readers. That
said, we can get rid of some:

> ---
>  * XXX There are 4 node kinds, and this should never be increased,
>  * for several reasons:
>  * 1. With 5 or more kinds, gcc tends to use a jump table for switch
>  *    statements.
>  * 2. The 4 kinds can be represented with 2 bits, so we have the option
>  *    in the future to tag the node pointer with the kind, even on
>  *    platforms with 32-bit pointers. This might speed up node traversal
>  *    in trees with highly random node kinds.
>  * 3. We can have multiple size classes per node kind.
>
> Can we just remove "XXX"?

How about "NOTE"?

> ---
>  * WIP: notes about traditional radix tree trading off span vs height...
>
> Are you going to write it?

Yes, when I draft a rough commit message, (for next time).

> ---
> #ifdef RT_SHMEM
> /*  WIP: do we really need this? */
> typedef dsa_pointer RT_HANDLE;
> #endif
>
> I think it's worth having it.

Okay, removed WIP in v57-0004.

> ---
>  * WIP: The paper uses at most 64 for this node kind. "isset" happens to fit
>  * inside a single bitmapword on most platforms, so it's a good starting
>  * point. We can make it higher if we need to.
>  */
> #define RT_FANOUT_48_MAX (RT_NODE_MAX_SLOTS / 4)
>
> Are you planning to do something about this?

Hard-coded 64 for readability, and changed this paragraph to explain
the current rationale more clearly:

"The paper uses at most 64 for this node kind, and one advantage for us
is that "isset" is a single bitmapword on most platforms, rather than
an array, allowing the compiler to get rid of loops."

> ---
>     /* WIP: We could go first to the higher node16 size class */
>     newnode = RT_ALLOC_NODE(tree, RT_NODE_KIND_16, RT_CLASS_16_LO);
>
> Does it mean to go to RT_CLASS_16_HI and then further go to
> RT_CLASS_16_LO upon further deletion?

Yes. It wouldn't be much work to make shrinking symmetrical with
growing (a good thing), but it's not essential so I haven't done it
yet.

> ---
>  * TODO: The current locking mechanism is not optimized for high concurrency
>  * with mixed read-write workloads. In the future it might be worthwhile
>  * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
>  * the paper "The ART of Practical Synchronization" by the same authors as
>  * the ART paper, 2016.
>
> I think it's not TODO for now, but a future improvement. We can remove it.

It _is_ a TODO, regardless of when it happens.

> ---
> /* TODO: consider 5 with subclass 1 or 2. */
> #define RT_FANOUT_4     4
>
> Is there something we need to do here?

Changed to:

"To save memory in trees with sparse keys, it would make sense to have two
size classes for the smallest kind (perhaps a high class of 5 and a low class
of 2), but it would be more effective to utilize lazy expansion and
path compression."

> ---
> /*
>  * Return index of the chunk and slot arrays for inserting into the node,
>  * such that the chunk array remains ordered.
>  * TODO: Improve performance for non-SIMD platforms.
>  */
>
> Are you going to work on this?

A small step in v57-0010. I've found a way to kill two birds with one
stone, by first checking for the case that the keys are inserted in
order. This also helps the SIMD case because it must branch anyway to
avoid bitscanning a zero bitfield. This moves the branch up and turns
a mask into an assert, looking a bit nicer. I've removed the TODO, but
maybe we should add it to the search_eq function.
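
For illustration, the shape of that fast path is roughly (a sketch;
the search function name is a stand-in for the existing SIMD/scalar
search):

static inline int
insert_position(const uint8 *chunks, int count, uint8 chunk)
{
    /* fast path: keys arriving in ascending order just append */
    if (count == 0 || chunk > chunks[count - 1])
        return count;

    /* otherwise search for the position that keeps the array ordered */
    return search_chunk_array_insert_pos(chunks, count, chunk);
}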

> ---
> /* Delete the element at 'idx' */
> /*  TODO: replace slow memmove's */
>
> Are you going to work on this?

Done in v57-0011.

The rest:
v57-0004 - 0008 should be self-explanatory, but questions/pushback welcome.
v57-0009 - I'm thinking leaves don't need to be memset at all. The
value written should be entirely the caller's responsibility, it
seems.
v57-0013 - the bench module can be built locally again
v57-0016 - minor comment edits in tid store

My todo:
- benchmark tid store / vacuum again, since we haven't done so since
adding varlen types and removing unnecessary locks. I'm pretty sure there's an
accidental memset call that crept in there, but I'm running out of
steam today.
- leftover comment etc work

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Feb 2, 2024 at 8:47 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Wed, Jan 31, 2024 at 12:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > I've attached the new patch set (v56). I've squashed previous updates
> > and addressed review comments on v55 in separate patches. Here is a
> > summary of the updates:
> >
> > 0004: fix compiler warning caught by ci test.
> > 0005-0008: address review comments on radix tree codes.
> > 0009: cleanup #define and #undef
> > 0010: use TEST_SHARED_RT macro for shared radix tree test. RT_SHMEM is
> > undefined after including radixtree.h so we should not use it in test
> > code.
>
> Great, thanks!
>
> I have a few questions and comments on v56, then I'll address yours
> below with the attached v57, which is mostly cosmetic adjustments.

Thank you for the comments! I've squashed previous updates and your changes.

>
> v56-0003:
>
> (Looking closer at tests)
>
> +static const bool rt_test_stats = false;
>
> I'm thinking we should just remove everything that depends on this,
> and keep this module entirely about correctness.

Agreed. Removed in 0006 patch.

>
> + for (int shift = 0; shift <= (64 - 8); shift += 8)
> + test_node_types(shift);
>
> I'm not sure what the test_node_types_* functions are testing that
> test_basic doesn't. They have a different, and confusing, way to stop
> at every size class and check the keys/values. It seems we can replace
> all that with two more calls (asc/desc) to test_basic, with the
> maximum level.

Agreed, addressed in 0007 patch.

>
> It's pretty hard to see what test_pattern() is doing, or why it's
> useful. I wonder if instead the test could use something like the
> benchmark where random integers are masked off. That seems simpler. I
> can work on that, but I'd like to hear your side about test_pattern().

Yeah, test_pattern() was originally created for the integerset, so it
doesn't necessarily fit the radix tree. I agree with using some tests
from the benchmarks.

>
> v56-0007:
>
> + *
> + * Since we can rely on DSA_AREA_LOCK to get the total amount of DSA memory,
> + * the caller doesn't need to take a lock.
>
> Maybe something like "Since dsa_get_total_size() does appropriate locking ..."?

Agreed. Fixed in 0005 patch.

>
> v56-0008
>
> Thanks, I like how the tests look now.
>
> -NOTICE:  testing node   4 with height 0 and  ascending keys
> ...
> +NOTICE:  testing node   1 with height 0 and  ascending keys
>
> Now that the number is not intended to match a size class, "node X"
> seems out of place. Maybe we could have a separate array with strings?
>
> + 1, /* RT_CLASS_4 */
>
> This should be more than one, so that the basic test still exercises
> paths that shift elements around.
>
> + 100, /* RT_CLASS_48 */
>
> This node currently holds 64 for local memory.
>
> + 255 /* RT_CLASS_256 */
>
> This is the only one where we know exactly how many it can take, so
> may as well keep it at 256.

Fixed in 0008 patch.

>
> v56-0012:
>
> The test module for tidstore could use a few more comments.

Addressed in 0012 patch.

>
> v56-0015:
>
> +typedef dsa_pointer TidStoreHandle;
> +
>
> -TidStoreAttach(dsa_area *area, dsa_pointer rt_dp)
> +TidStoreAttach(dsa_area *area, TidStoreHandle handle)
>  {
>   TidStore *ts;
> + dsa_pointer rt_dp = handle;
>
> My earlier opinion was that "handle" was a nicer variable name, but
> this brings back the typedef and also keeps the variable name I didn't
> like, but pushes it down into the function. I'm a bit confused, so
> I've kept these not-squashed for now.

I misunderstood your comment. I've changed it to use the variable name
rt_handle and removed the TidStoreHandle type (0013 patch).

>
> -----------------------------------------------------------------------------------
>
> Now, for v57:
>
> > Looking at overall changes, there are still XXX and TODO comments in
> > radixtree.h:
>
> That's fine, as long as it's intentional as a message to readers. That
> said, we can get rid of some:
>
> > ---
> >  * XXX There are 4 node kinds, and this should never be increased,
> >  * for several reasons:
> >  * 1. With 5 or more kinds, gcc tends to use a jump table for switch
> >  *    statements.
> >  * 2. The 4 kinds can be represented with 2 bits, so we have the option
> >  *    in the future to tag the node pointer with the kind, even on
> >  *    platforms with 32-bit pointers. This might speed up node traversal
> >  *    in trees with highly random node kinds.
> >  * 3. We can have multiple size classes per node kind.
> >
> > Can we just remove "XXX"?
>
> How about "NOTE"?

Agreed.

>
> > ---
> >  * WIP: notes about traditional radix tree trading off span vs height...
> >
> > Are you going to write it?
>
> Yes, when I draft a rough commit message, (for next time).

Thanks!

>
> > ---
> > #ifdef RT_SHMEM
> > /*  WIP: do we really need this? */
> > typedef dsa_pointer RT_HANDLE;
> > #endif
> >
> > I think it's worth having it.
>
> Okay, removed WIP in v57-0004.
>
> > ---
> >  * WIP: The paper uses at most 64 for this node kind. "isset" happens to fit
> >  * inside a single bitmapword on most platforms, so it's a good starting
> >  * point. We can make it higher if we need to.
> >  */
> > #define RT_FANOUT_48_MAX (RT_NODE_MAX_SLOTS / 4)
> >
> > Are you planning to do something about this?
>
> Hard-coded 64 for readability, and changed this paragraph to explain
> the current rationale more clearly:
>
> "The paper uses at most 64 for this node kind, and one advantage for us
> is that "isset" is a single bitmapword on most platforms, rather than
> an array, allowing the compiler to get rid of loops."

LGTM.

>
> > ---
> >     /* WIP: We could go first to the higher node16 size class */
> >     newnode = RT_ALLOC_NODE(tree, RT_NODE_KIND_16, RT_CLASS_16_LO);
> >
> > Does it mean to go to RT_CLASS_16_HI and then further go to
> > RT_CLASS_16_LO upon further deletion?
>
> Yes. It wouldn't be much work to make shrinking symmetrical with
> growing (a good thing), but it's not essential so I haven't done it
> yet.

Okay, let's keep it as WIP.

>
> > ---
> >  * TODO: The current locking mechanism is not optimized for high concurrency
> >  * with mixed read-write workloads. In the future it might be worthwhile
> >  * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
> >  * the paper "The ART of Practical Synchronization" by the same authors as
> >  * the ART paper, 2016.
> >
> > I think it's not TODO for now, but a future improvement. We can remove it.
>
> It _is_ a TODO, regardless of when it happens.

Understood.

>
> > ---
> > /* TODO: consider 5 with subclass 1 or 2. */
> > #define RT_FANOUT_4     4
> >
> > Is there something we need to do here?
>
> Changed to:
>
> "To save memory in trees with sparse keys, it would make sense to have two
> size classes for the smallest kind (perhaps a high class of 5 and a low class
> of 2), but it would be more effective to utilize lazy expansion and
> path compression."

LGTM. But there is an extra '*' in the last line:

+ /*
:
+ * of 2), but it would be more effective to utilize lazy expansion and
+ * path compression.
+ * */

Fixed in 0004 patch.

>
> > ---
> > /*
> >  * Return index of the chunk and slot arrays for inserting into the node,
> >  * such that the chunk array remains ordered.
> >  * TODO: Improve performance for non-SIMD platforms.
> >  */
> >
> > Are you going to work on this?
>
> A small step in v57-0010. I've found a way to kill two birds with one
> stone, by first checking for the case that the keys are inserted in
> order. This also helps the SIMD case because it must branch anyway to
> avoid bitscanning a zero bitfield. This moves the branch up and turns
> a mask into an assert, looking a bit nicer. I've removed the TODO, but
> maybe we should add it to the search_eq function.

Great!

>
> > ---
> > /* Delete the element at 'idx' */
> > /*  TODO: replace slow memmove's */
> >
> > Are you going to work on this?
>
> Done in v57-0011.

LGTM.

>
> The rest:
> v57-0004 - 0008 should be self-explanatory, but questions/pushback welcome.
> v57-0009 - I'm thinking leaves don't need to be memset at all. The
> value written should be entirely the caller's responsibility, it
> seems.
> v57-0013 - the bench module can be built locally again
> v57-0016 - minor comment edits in tid store

These fixes look good to me.

>
> My todo:
> - benchmark tid store / vacuum again, since we haven't done so since
> adding varlen types and removing unnecessary locks. I'm pretty sure there's an
> accidental memset call that crept in there, but I'm running out of
> steam today.
> - leftover comment etc work

Thanks. I'm also going to do some benchmarks and tests.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Tue, Feb 6, 2024 at 9:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Feb 2, 2024 at 8:47 PM John Naylor <johncnaylorls@gmail.com> wrote:

> > My todo:
> > - benchmark tid store / vacuum again, since we haven't done so since
> > adding varlen types and removing unnecessary locks.

I ran a vacuum benchmark similar to the one in [1] (unlogged tables
for reproducibility), but with smaller tables (100 million records),
deleting only the last 20% of the table, and including a parallel
vacuum test. Scripts attached.

monotonically ordered int column index:

master:
system usage: CPU: user: 4.27 s, system: 0.41 s, elapsed: 4.70 s
system usage: CPU: user: 4.23 s, system: 0.44 s, elapsed: 4.69 s
system usage: CPU: user: 4.26 s, system: 0.39 s, elapsed: 4.66 s

v-59:
system usage: CPU: user: 3.10 s, system: 0.44 s, elapsed: 3.56 s
system usage: CPU: user: 3.07 s, system: 0.35 s, elapsed: 3.43 s
system usage: CPU: user: 3.07 s, system: 0.36 s, elapsed: 3.44 s

uuid column index:

master:
system usage: CPU: user: 18.22 s, system: 1.70 s, elapsed: 20.01 s
system usage: CPU: user: 17.70 s, system: 1.70 s, elapsed: 19.48 s
system usage: CPU: user: 18.48 s, system: 1.59 s, elapsed: 20.43 s

v-59:
system usage: CPU: user: 5.18 s, system: 1.18 s, elapsed: 6.45 s
system usage: CPU: user: 6.56 s, system: 1.39 s, elapsed: 7.99 s
system usage: CPU: user: 6.51 s, system: 1.44 s, elapsed: 8.05 s

int & uuid indexes in parallel:

master:
system usage: CPU: user: 4.53 s, system: 1.22 s, elapsed: 20.43 s
system usage: CPU: user: 4.49 s, system: 1.29 s, elapsed: 20.98 s
system usage: CPU: user: 4.46 s, system: 1.33 s, elapsed: 20.50 s

v59:
system usage: CPU: user: 2.09 s, system: 0.32 s, elapsed: 4.86 s
system usage: CPU: user: 3.76 s, system: 0.51 s, elapsed: 8.92 s
system usage: CPU: user: 3.83 s, system: 0.54 s, elapsed: 9.09 s

Overall, I'm pleased with these results, although I'm confused about
why the first run with the patch sometimes reports being faster than
the others. I'm curious what others get. Traversing a tree that lives in
DSA has some overhead, as expected, but still comes out way ahead of
master.

There are still some micro-benchmarks we could do on tidstore, and
it'd be good to find out worst-case memory use (1 dead tuple each on
spread-out pages), but this is a decent demonstration.

> > I'm not sure what the test_node_types_* functions are testing that
> > test_basic doesn't. They have a different, and confusing, way to stop
> > at every size class and check the keys/values. It seems we can replace
> > all that with two more calls (asc/desc) to test_basic, with the
> > maximum level.

v58-0008:

+ /* borrowed from RT_MAX_SHIFT */
+ const int max_shift = (pg_leftmost_one_pos64(UINT64_MAX) /
BITS_PER_BYTE) * BITS_PER_BYTE;

This is harder to read than "64 - 8", and doesn't really help
maintainability either.
Maybe "(sizeof(uint64) - 1) * BITS_PER_BYTE" is a good compromise.

+ /* leaf nodes */
+ test_basic(test_info, 0);

+ /* internal nodes */
+ test_basic(test_info, 8);
+
+ /* max-level nodes */
+ test_basic(test_info, max_shift);

This three-way terminology is not very informative. How about:

+       /* a tree with one level, i.e. a single node under the root node. */
 ...
+       /* a tree with two levels */
 ...
+       /* a tree with the maximum number of levels */

+static void
+test_basic(rt_node_class_test_elem *test_info, int shift)
+{
+ elog(NOTICE, "testing node %s with shift %d", test_info->class_name, shift);
+
+ /* Test nodes while changing the key insertion order */
+ do_test_basic(test_info->nkeys, shift, false);
+ do_test_basic(test_info->nkeys, shift, true);

Adding a level of indirection makes this harder to read, and do we
still know whether a test failed in asc or desc keys?

> > My earlier opinion was that "handle" was a nicer variable name, but
> > this brings back the typedef and also keeps the variable name I didn't
> > like, but pushes it down into the function. I'm a bit confused, so
> > I've kept these not-squashed for now.
>
> I misunderstood your comment. I've changed it to use the variable name
> rt_handle and removed the TidStoreHandle type (0013 patch).

(diff against an earlier version)
-       pvs->shared->dead_items_handle = TidStoreGetHandle(dead_items);
+       pvs->shared->dead_items_dp = TidStoreGetHandle(dead_items);

Shall we use "handle" in vacuum_parallel.c as well?

> > I'm pretty sure there's an
> > accidental memset call that crept in there, but I'm running out of
> > steam today.

I have just a little bit of work to add for v59:

v59-0009 - set_offset_bitmap_at() will call memset if it needs to zero
any bitmapwords. That can only happen if e.g. there is an offset > 128
and there are none between 64 and 128, so not a huge deal but I think
it's a bit nicer in this patch.
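
Concretely, the case looks like this (a toy illustration using
bitmapset.h's bitmapword):

    int     wordnum = offset / BITS_PER_BITMAPWORD;   /* offset 130 -> word 2 */

    /* gap words that were never touched must be zeroed explicitly */
    for (int i = nwords_used; i < wordnum; i++)
        words[i] = 0;
    words[wordnum] |= ((bitmapword) 1) << (offset % BITS_PER_BITMAPWORD);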

> > >  * WIP: notes about traditional radix tree trading off span vs height...
> > >
> > > Are you going to write it?
> >
> > Yes, when I draft a rough commit message, (for next time).

I haven't gotten to the commit message, but:

v59-0004 - I did some rewriting of the top header comment to explain
ART concepts for new readers, made small comment changes, and tidied
up some indentation that pgindent won't touch
v59-0005 - re-pgindent'ed


[1] https://www.postgresql.org/message-id/CAFBsxsHUxmXYy0y4RrhMcNe-R11Bm099Xe-wUdb78pOu0%2BPT2Q%40mail.gmail.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Sat, Feb 10, 2024 at 9:29 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Tue, Feb 6, 2024 at 9:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Feb 2, 2024 at 8:47 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> > > My todo:
> > > - benchmark tid store / vacuum again, since we haven't done so since
> > > adding varlen types and removing unnecessary locks.
>
> I ran a vacuum benchmark similar to the one in [1] (unlogged tables
> for reproducibility), but with smaller tables (100 million records),
> deleting only the last 20% of the table, and including a parallel
> vacuum test. Scripts attached.
>
> monotonically ordered int column index:
>
> master:
> system usage: CPU: user: 4.27 s, system: 0.41 s, elapsed: 4.70 s
> system usage: CPU: user: 4.23 s, system: 0.44 s, elapsed: 4.69 s
> system usage: CPU: user: 4.26 s, system: 0.39 s, elapsed: 4.66 s
>
> v-59:
> system usage: CPU: user: 3.10 s, system: 0.44 s, elapsed: 3.56 s
> system usage: CPU: user: 3.07 s, system: 0.35 s, elapsed: 3.43 s
> system usage: CPU: user: 3.07 s, system: 0.36 s, elapsed: 3.44 s
>
> uuid column index:
>
> master:
> system usage: CPU: user: 18.22 s, system: 1.70 s, elapsed: 20.01 s
> system usage: CPU: user: 17.70 s, system: 1.70 s, elapsed: 19.48 s
> system usage: CPU: user: 18.48 s, system: 1.59 s, elapsed: 20.43 s
>
> v-59:
> system usage: CPU: user: 5.18 s, system: 1.18 s, elapsed: 6.45 s
> system usage: CPU: user: 6.56 s, system: 1.39 s, elapsed: 7.99 s
> system usage: CPU: user: 6.51 s, system: 1.44 s, elapsed: 8.05 s
>
> int & uuid indexes in parallel:
>
> master:
> system usage: CPU: user: 4.53 s, system: 1.22 s, elapsed: 20.43 s
> system usage: CPU: user: 4.49 s, system: 1.29 s, elapsed: 20.98 s
> system usage: CPU: user: 4.46 s, system: 1.33 s, elapsed: 20.50 s
>
> v59:
> system usage: CPU: user: 2.09 s, system: 0.32 s, elapsed: 4.86 s
> system usage: CPU: user: 3.76 s, system: 0.51 s, elapsed: 8.92 s
> system usage: CPU: user: 3.83 s, system: 0.54 s, elapsed: 9.09 s
>
> Overall, I'm pleased with these results, although I'm confused about
> why the first run with the patch sometimes reports being faster than
> the others. I'm curious what others get. Traversing a tree that lives in
> DSA has some overhead, as expected, but still comes out way ahead of
> master.

Thanks! That's a great improvement.

I've also run the same scripts in my environment just in case and got
similar results:

monotonically ordered int column index:

master:
system usage: CPU: user: 14.81 s, system: 0.90 s, elapsed: 15.74 s
system usage: CPU: user: 14.91 s, system: 0.80 s, elapsed: 15.73 s
system usage: CPU: user: 14.85 s, system: 0.70 s, elapsed: 15.57 s

v-59:
system usage: CPU: user: 9.47 s, system: 1.04 s, elapsed: 10.53 s
system usage: CPU: user: 9.67 s, system: 0.81 s, elapsed: 10.50 s
system usage: CPU: user: 9.59 s, system: 0.86 s, elapsed: 10.47 s

uuid column index:

master:
system usage: CPU: user: 28.37 s, system: 1.38 s, elapsed: 29.81 s
system usage: CPU: user: 28.05 s, system: 1.37 s, elapsed: 29.47 s
system usage: CPU: user: 28.46 s, system: 1.36 s, elapsed: 29.88 s

v-59:
system usage: CPU: user: 14.87 s, system: 1.13 s, elapsed: 16.02 s
system usage: CPU: user: 14.84 s, system: 1.31 s, elapsed: 16.18 s
system usage: CPU: user: 10.96 s, system: 1.24 s, elapsed: 12.22 s

int & uuid indexes in parallel:

master:
system usage: CPU: user: 15.81 s, system: 1.43 s, elapsed: 34.31 s
system usage: CPU: user: 15.84 s, system: 1.41 s, elapsed: 34.34 s
system usage: CPU: user: 15.92 s, system: 1.39 s, elapsed: 34.33 s

v-59:
system usage: CPU: user: 10.93 s, system: 0.92 s, elapsed: 17.59 s
system usage: CPU: user: 10.92 s, system: 1.20 s, elapsed: 17.58 s
system usage: CPU: user: 10.90 s, system: 1.01 s, elapsed: 17.45 s

>
> There are still some micro-benchmarks we could do on tidstore, and
> it'd be good to find out worst-case memory use (1 dead tuple each on
> spread-out pages), but this is a decent demonstration.

I've tested a simple case where vacuum removes 33k dead tuples spread
out, about one every 10 pages.

master:
198,000 bytes (=33000 * 6)
system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s

v-59:
2,834,432 bytes (reported by TidStoreMemoryUsage())
system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s

>
> > > I'm not sure what the test_node_types_* functions are testing that
> > > test_basic doesn't. They have a different, and confusing, way to stop
> > > at every size class and check the keys/values. It seems we can replace
> > > all that with two more calls (asc/desc) to test_basic, with the
> > > maximum level.
>
> v58-0008:
>
> + /* borrowed from RT_MAX_SHIFT */
> + const int max_shift = (pg_leftmost_one_pos64(UINT64_MAX) /
> BITS_PER_BYTE) * BITS_PER_BYTE;
>
> This is harder to read than "64 - 8", and doesn't really help
> maintainability either.
> Maybe "(sizeof(uint64) - 1) * BITS_PER_BYTE" is a good compromise.
>
> + /* leaf nodes */
> + test_basic(test_info, 0);
>
> + /* internal nodes */
> + test_basic(test_info, 8);
> +
> + /* max-level nodes */
> + test_basic(test_info, max_shift);
>
> This three-way terminology is not very informative. How about:
>
> +       /* a tree with one level, i.e. a single node under the root node. */
>  ...
> +       /* a tree with two levels */
>  ...
> +       /* a tree with the maximum number of levels */

Agreed.

>
> +static void
> +test_basic(rt_node_class_test_elem *test_info, int shift)
> +{
> + elog(NOTICE, "testing node %s with shift %d", test_info->class_name, shift);
> +
> + /* Test nodes while changing the key insertion order */
> + do_test_basic(test_info->nkeys, shift, false);
> + do_test_basic(test_info->nkeys, shift, true);
>
> Adding a level of indirection makes this harder to read, and do we
> still know whether a test failed in asc or desc keys?

Agreed, it seems better to keep the previous logging style.

>
> > > My earlier opinion was that "handle" was a nicer variable name, but
> > > this brings back the typedef and also keeps the variable name I didn't
> > > like, but pushes it down into the function. I'm a bit confused, so
> > > I've kept these not-squashed for now.
> >
> > I misunderstood your comment. I've changed it to use the variable name
> > rt_handle and removed the TidStoreHandle type (0013 patch).
>
> (diff against an earlier version)
> -       pvs->shared->dead_items_handle = TidStoreGetHandle(dead_items);
> +       pvs->shared->dead_items_dp = TidStoreGetHandle(dead_items);
>
> Shall we use "handle" in vacuum_parallel.c as well?

Agreed.

>
> > > I'm pretty sure there's an
> > > accidental memset call that crept in there, but I'm running out of
> > > steam today.
>
> I have just a little bit of work to add for v59:
>
> v59-0009 - set_offset_bitmap_at() will call memset if it needs to zero
> any bitmapwords. That can only happen if e.g. there is an offset > 128
> and there are none between 64 and 128, so not a huge deal but I think
> it's a bit nicer in this patch.

LGTM.

>
> > > >  * WIP: notes about traditional radix tree trading off span vs height...
> > > >
> > > > Are you going to write it?
> > >
> > > Yes, when I draft a rough commit message, (for next time).
>
> I haven't gotten to the commit message, but:

I've drafted the commit message.

>
> v59-0004 - I did some rewriting of the top header comment to explain
> ART concepts for new readers, made small comment changes, and tidied
> up some indentation that pgindent won't touch
> v59-0005 - re-pgindent'ed

LGTM, squashed all changes.

I've attached these updates from v59 in separate patches.

I've run regression tests with valgrind and run the coverity scan, and
I don't see critical issues.


Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Feb 15, 2024 at 10:21 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sat, Feb 10, 2024 at 9:29 PM John Naylor <johncnaylorls@gmail.com> wrote:

> I've also run the same scripts in my environment just in case and got
> similar results:

Thanks for testing, looks good as well.

> > There are still some micro-benchmarks we could do on tidstore, and
> > it'd be good to find out worst-case memory use (1 dead tuple each on
> > spread-out pages), but this is a decent demonstration.
>
> I've tested a simple case where vacuum removes 33k dead tuples spread
> out, about one every 10 pages.
>
> master:
> 198,000 bytes (=33000 * 6)
> system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s
>
> v-59:
> 2,834,432 bytes (reported by TidStoreMemoryUsage())
> system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s

The memory usage for the sparse case may be a concern, although it's
not bad -- a multiple of something small is probably not huge in
practice. See below for an option we have for this.

> > > > I'm pretty sure there's an
> > > > accidental memset call that crept in there, but I'm running out of
> > > > steam today.
> >
> > I have just a little bit of work to add for v59:
> >
> > v59-0009 - set_offset_bitmap_at() will call memset if it needs to zero
> > any bitmapwords. That can only happen if e.g. there is an offset > 128
> > and there are none between 64 and 128, so not a huge deal but I think
> > it's a bit nicer in this patch.
>
> LGTM.

Okay, I've squashed this.

> I've drafted the commit message.

Thanks, this is a good start.

> I've run regression tests with valgrind and run the coverity scan, and
> I don't see critical issues.

Great!

Now, I think we're in pretty good shape. There are a couple of things
that might be objectionable, so I want to try to improve them in the
little time we have:

1. Memory use for the sparse case. I shared an idea a few months ago
of how runtime-embeddable values (true combined pointer-value slots)
could work for tids. I don't think this is a must-have, but it's not a
lot of code, and I have this working:

v61-0006: Preparatory refactoring -- I think we should do this anyway,
since the intent seems more clear to me.
v61-0007: Runtime-embeddable tids -- Optional for v17, but should
reduce memory regressions, so should be considered. Up to 3 tids can
be stored in the last level child pointer. It's not polished, but I'll
only proceed with that if we think we need this. "flags" iis called
that because it could hold tidbitmap.c booleans (recheck, lossy) in
the future, in addition to reserving space for the pointer tag. Note:
I hacked the tests to only have 2 offsets per block to demo, but of
course both paths should be tested.
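
To give a flavor of the encoding (a simplified sketch, not the exact
layout in 0007):

/*
 * A 64-bit child slot: since node pointers are aligned, the lowest bit
 * is free to tag the slot as "embedded". Bit 0 = tag, bits 1-15 =
 * flags/count, bits 16-63 = up to three 16-bit OffsetNumbers.
 */
static inline uint64
embed_offsets(const OffsetNumber *offsets, int noffsets)
{
    uint64  slot = UINT64CONST(1) | ((uint64) noffsets << 1);

    Assert(noffsets <= 3);
    for (int i = 0; i < noffsets; i++)
        slot |= (uint64) offsets[i] << (16 * (i + 1));
    return slot;
}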

2. Management of memory contexts. It's pretty verbose and messy. I
think the abstraction could be better:
A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we
can't destroy or reset it. That means we have to do a lot of manual
work.
B: Passing "max_bytes" to the radix tree was my idea, I believe, but
it seems the wrong responsibility. Not all uses will have a
work_mem-type limit, I'm guessing. We only use it for limiting the max
block size, and aset's default 8MB is already plenty small for
vacuum's large limit anyway. tidbitmap.c's limit is work_mem, so
smaller, and there it makes sense to limit the max blocksize this way.
C: The context for values has complex #ifdefs based on the value
length/varlen, but it's both too much and not enough. If we get a bump
context, how would we shoehorn that in for values for vacuum but not
for tidbitmap?

Here's an idea: Have vacuum (or tidbitmap etc.) pass a context to
TidStoreCreate(), and then to RT_CREATE. That context will contain the
values (for local mem), and the node slabs will be children of the
value context. That way, measuring memory usage and free-ing can just
call with this parent context, and let recursion handle the rest.
Perhaps the passed context can also hold the radix-tree struct, but
I'm not sure since I haven't tried it. What do you think?

With this resolved, I think the radix tree is pretty close to
committable. The tid store will likely need some polish yet, but no
major issues I know of.

(And, finally, a small thing that I wanted to share just so I don't
forget, but maybe not worth the attention: In Andres's prototype,
there is a comment wondering if an update can skip checking if it
first need to create a root node. This is pretty easy, and done in
v61-0008.)

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
v61 had a brown-paper-bag bug in the embedded tids patch that didn't
show up in the tidstore test but caused vacuum to fail; fixed in v62.

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Feb 15, 2024 at 8:26 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Feb 15, 2024 at 10:21 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Sat, Feb 10, 2024 at 9:29 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> > I've also run the same scripts in my environment just in case and got
> > similar results:
>
> Thanks for testing, looks good as well.
>
> > > There are still some micro-benchmarks we could do on tidstore, and
> > > it'd be good to find out worst-case memory use (1 dead tuple each on
> > > spread-out pages), but this is a decent demonstration.
> >
> > I've tested a simple case where vacuum removes 33k dead tuples spread
> > out, about one every 10 pages.
> >
> > master:
> > 198,000 bytes (=33000 * 6)
> > system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s
> >
> > v-59:
> > 2,834,432 bytes (reported by TidStoreMemoryUsage())
> > system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s
>
> The memory usage for the sparse case may be a concern, although it's
> not bad -- a multiple of something small is probably not huge in
> practice. See below for an option we have for this.
>
> > > > > I'm pretty sure there's an
> > > > > accidental memset call that crept in there, but I'm running out of
> > > > > steam today.
> > >
> > > I have just a little bit of work to add for v59:
> > >
> > > v59-0009 - set_offset_bitmap_at() will call memset if it needs to zero
> > > any bitmapwords. That can only happen if e.g. there is an offset > 128
> > > and there are none between 64 and 128, so not a huge deal but I think
> > > it's a bit nicer in this patch.
> >
> > LGTM.
>
> Okay, I've squashed this.
>
> > I've drafted the commit message.
>
> Thanks, this is a good start.
>
> > I've run regression tests with valgrind and run the coverity scan, and
> > I don't see critical issues.
>
> Great!
>
> Now, I think we're in pretty good shape. There are a couple of things
> that might be objectionable, so I want to try to improve them in the
> little time we have:
>
> 1. Memory use for the sparse case. I shared an idea a few months ago
> of how runtime-embeddable values (true combined pointer-value slots)
> could work for tids. I don't think this is a must-have, but it's not a
> lot of code, and I have this working:
>
> v61-0006: Preparatory refactoring -- I think we should do this anyway,
> since the intent seems more clear to me.

The refactoring looks good to me.

> v61-0007: Runtime-embeddable tids -- Optional for v17, but should
> reduce memory regressions, so should be considered. Up to 3 tids can
> be stored in the last level child pointer. It's not polished, but I'll
> only proceed with that if we think we need this. "flags" is called
> that because it could hold tidbitmap.c booleans (recheck, lossy) in
> the future, in addition to reserving space for the pointer tag. Note:
> I hacked the tests to only have 2 offsets per block to demo, but of
> course both paths should be tested.

Interesting. I've run the same benchmark tests we did[1][2] (the
median of 3 runs):

monotonically ordered int column index:

master: system usage: CPU: user: 14.91 s, system: 0.80 s, elapsed: 15.73 s
v-59: system usage: CPU: user: 9.67 s, system: 0.81 s, elapsed: 10.50 s
v-62: system usage: CPU: user: 1.94 s, system: 0.69 s, elapsed: 2.64 s

uuid column index:

master: system usage: CPU: user: 28.37 s, system: 1.38 s, elapsed: 29.81 s
v-59: system usage: CPU: user: 14.84 s, system: 1.31 s, elapsed: 16.18 s
v-62: system usage: CPU: user: 4.06 s, system: 0.98 s, elapsed: 5.06 s

int & uuid indexes in parallel:

master: system usage: CPU: user: 15.92 s, system: 1.39 s, elapsed: 34.33 s
v-59: system usage: CPU: user: 10.92 s, system: 1.20 s, elapsed: 17.58 s
v-62: system usage: CPU: user: 2.54 s, system: 0.94 s, elapsed: 6.00 s

sparse case:

master:
198,000 bytes (=33000 * 6)
system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s

v-59:
2,834,432 bytes (reported by TidStoreMemoryUsage())
system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s

v-62:
729,088 bytes (reported by TidStoreMemoryUsage())
system usage: CPU: user: 4.63 s, system: 0.86 s, elapsed: 5.50 s

I'm happy to see a huge improvement. While it's really fascinating to
me, I'm concerned about the time left until the feature freeze. We
need to polish both the tidstore and vacuum integration patches in 5
weeks. Personally, I'd like to keep it as a separate patch for now and
focus on completing the main three patches, since we might face some
issues after pushing them. With the 0007 patch it's a big win, but
it's still a win even without it.

>
> 2. Management of memory contexts. It's pretty verbose and messy. I
> think the abstraction could be better:
> A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we
> can't destroy or reset it. That means we have to do a lot of manual
> work.
> B: Passing "max_bytes" to the radix tree was my idea, I believe, but
> it seems the wrong responsibility. Not all uses will have a
> work_mem-type limit, I'm guessing. We only use it for limiting the max
> block size, and aset's default 8MB is already plenty small for
> vacuum's large limit anyway. tidbitmap.c's limit is work_mem, so
> smaller, and there it makes sense to limit the max blocksize this way.
> C: The context for values has complex #ifdefs based on the value
> length/varlen, but it's both too much and not enough. If we get a bump
> context, how would we shoehorn that in for values for vacuum but not
> for tidbitmap?
>
> Here's an idea: Have vacuum (or tidbitmap etc.) pass a context to
> TidStoreCreate(), and then to RT_CREATE. That context will contain the
> values (for local mem), and the node slabs will be children of the
> value context. That way, measuring memory usage and free-ing can just
> call with this parent context, and let recursion handle the rest.
> Perhaps the passed context can also hold the radix-tree struct, but
> I'm not sure since I haven't tried it. What do you think?

If I understand your idea correctly, RT_CREATE() creates the context
for values as a child of the passed context and the node slabs as
children of the value context. That way, measuring memory usage can
just call with the value context. It sounds like a good idea. But it
was not clear to me how to address points B and C.

Another variant of this idea would be for RT_CREATE() to create a
parent context above the value context to store the radix-tree struct.
That is, the hierarchy would look like:

A MemoryContext (passed by vacuum through tidstore)
    - radix tree memory context (stores the radix-tree struct, control
struct, and iterator)
        - value context (aset, slab, or bump)
            - node slab contexts

Freeing can just call with the radix tree memory context. And perhaps
it works even if tidstore passes CurrentMemoryContext to RT_CREATE()?

>
> With this resolved, I think the radix tree is pretty close to
> committable. The tid store will likely need some polish yet, but no
> major issues I know of.

Agreed.

>
> (And, finally, a small thing I that I wanted to share just so I don't
> forget, but maybe not worth the attention: In Andres's prototype,
> there is a comment wondering if an update can skip checking if it
> first need to create a root node. This is pretty easy, and done in
> v61-0008.)

LGTM, thanks!

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Fri, Feb 16, 2024 at 10:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > v61-0007: Runtime-embeddable tids -- Optional for v17, but should
> > reduce memory regressions, so should be considered. Up to 3 tids can
> > be stored in the last level child pointer. It's not polished, but I'll
> > only proceed with that if we think we need this. "flags" is called
> > that because it could hold tidbitmap.c booleans (recheck, lossy) in
> > the future, in addition to reserving space for the pointer tag. Note:
> > I hacked the tests to only have 2 offsets per block to demo, but of
> > course both paths should be tested.
>
> Interesting. I've run the same benchmark tests we did[1][2] (the
> median of 3 runs):
>
> monotonically ordered int column index:
>
> master: system usage: CPU: user: 14.91 s, system: 0.80 s, elapsed: 15.73 s
> v-59: system usage: CPU: user: 9.67 s, system: 0.81 s, elapsed: 10.50 s
> v-62: system usage: CPU: user: 1.94 s, system: 0.69 s, elapsed: 2.64 s

Hmm, that's strange -- this test is intended to delete all records
from the last 20% of the blocks, so I wouldn't expect any improvement
here, only in the sparse case. Maybe something is wrong. All the more
reason to put it off...

> I'm happy to see a huge improvement. While it's really fascinating to
> me, I'm concerned about the time left until the feature freeze. We
> need to polish both the tidstore and vacuum integration patches in 5
> weeks. Personally, I'd like to keep it as a separate patch for now and
> focus on completing the main three patches, since we might face some
> issues after pushing them. With the 0007 patch it's a big win, but
> it's still a win even without it.

Agreed to not consider it for initial commit. I'll hold on to it for
some future time.

> > 2. Management of memory contexts. It's pretty verbose and messy. I
> > think the abstraction could be better:
> > A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we
> > can't destroy or reset it. That means we have to do a lot of manual
> > work.
> > B: Passing "max_bytes" to the radix tree was my idea, I believe, but
> > it seems the wrong responsibility. Not all uses will have a
> > work_mem-type limit, I'm guessing. We only use it for limiting the max
> > block size, and aset's default 8MB is already plenty small for
> > vacuum's large limit anyway. tidbitmap.c's limit is work_mem, so
> > smaller, and there it makes sense to limit the max blocksize this way.
> > C: The context for values has complex #ifdefs based on the value
> > length/varlen, but it's both too much and not enough. If we get a bump
> > context, how would we shoehorn that in for values for vacuum but not
> > for tidbitmap?
> >
> > Here's an idea: Have vacuum (or tidbitmap etc.) pass a context to
> > TidStoreCreate(), and then to RT_CREATE. That context will contain the
> > values (for local mem), and the node slabs will be children of the
> > value context. That way, measuring memory usage and free-ing can just
> > call with this parent context, and let recursion handle the rest.
> > Perhaps the passed context can also hold the radix-tree struct, but
> > I'm not sure since I haven't tried it. What do you think?
>
> If I understand your idea correctly, RT_CREATE() creates the context
> for values as a child of the passed context and the node slabs as
> children of the value context. That way, measuring memory usage can
> just call with the value context. It sounds like a good idea. But it
> was not clear to me how to address points B and C.

For B & C, vacuum would create a context to pass to TidStoreCreate,
and it wouldn't need to bother changing max block size. RT_CREATE
would use that directly for leaves (if any), and would only create
child slab contexts under it. It would not need to know about
max_bytes. Modifying your diagram a bit, something like:

- caller-supplied radix tree memory context (the 3 structs -- and
leaves, if any) (aset (or future bump?))
    - node slab contexts

This might only be workable with aset, if we need to individually free
the structs. (I haven't studied this, it was a recent idea)
It's simpler, because with small fixed length values, we don't need to
detect that and avoid creating a leaf context. All leaves would live
in the same context as the structs.
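
In code, the caller's side could look roughly like this (the
TidStoreCreate() arguments are hypothetical; the memory context calls
are the real APIs):

    MemoryContext ts_ctx = AllocSetContextCreate(CurrentMemoryContext,
                                                 "TID storage",
                                                 ALLOCSET_DEFAULT_SIZES);
    TidStore   *ts = TidStoreCreate(ts_ctx /* , ... */);

    /* recurses into the node slab contexts */
    Size        used = MemoryContextMemAllocated(ts_ctx, true);

    /* frees the structs, leaves, and node slabs all at once */
    MemoryContextDelete(ts_ctx);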

> Another variant of this idea would be for RT_CREATE() to create a
> parent context above the value context to store the radix-tree struct.
> That is, the hierarchy would look like:
>
> A MemoryContext (passed by vacuum through tidstore)
>     - radix tree memory context (stores the radix-tree struct, control
> struct, and iterator)
>         - value context (aset, slab, or bump)
>             - node slab contexts

The template handling the value context here is complex, and is what I
meant by 'C' above. Most fixed-length allocations throughout the
backend use aset, so it seems fine to always use it.

> Freeing can just call with the radix tree memory context. And perhaps
> it works even if tidstore passes CurrentMemoryContext to RT_CREATE()?

Seems like it would, but would keep some complexity, as I mentioned.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Feb 16, 2024 at 12:41 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Fri, Feb 16, 2024 at 10:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > v61-0007: Runtime-embeddable tids -- Optional for v17, but should
> > > reduce memory regressions, so should be considered. Up to 3 tids can
> > > be stored in the last level child pointer. It's not polished, but I'll
> > > only proceed with that if we think we need this. "flags" is called
> > > that because it could hold tidbitmap.c booleans (recheck, lossy) in
> > > the future, in addition to reserving space for the pointer tag. Note:
> > > I hacked the tests to only have 2 offsets per block to demo, but of
> > > course both paths should be tested.
> >
> > Interesting. I've run the same benchmark tests we did[1][2] (the
> > median of 3 runs):
> >
> > monotonically ordered int column index:
> >
> > master: system usage: CPU: user: 14.91 s, system: 0.80 s, elapsed: 15.73 s
> > v-59: system usage: CPU: user: 9.67 s, system: 0.81 s, elapsed: 10.50 s
> > v-62: system usage: CPU: user: 1.94 s, system: 0.69 s, elapsed: 2.64 s
>
> Hmm, that's strange -- this test is intended to delete all records
> from the last 20% of the blocks, so I wouldn't expect any improvement
> here, only in the sparse case. Maybe something is wrong. All the more
> reason to put it off...

Okay, let's dig into it deeper later.

>
> > I'm happy to see a huge improvement. While it's really fascinating to
> > me, I'm concerned about the time left until the feature freeze. We
> > need to polish both tidstore and vacuum integration patches in 5
> > weeks. Personally I'd like to have it as a separate patch for now, and
> > focus on completing the main three patches since we might face some
> > issues after pushing these patches. I think with 0007 patch it's a big
> > win but it's still a win even without 0007 patch.
>
> Agreed to not consider it for initial commit. I'll hold on to it for
> some future time.
>
> > > 2. Management of memory contexts. It's pretty verbose and messy. I
> > > think the abstraction could be better:
> > > A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we
> > > can't destroy or reset it. That means we have to do a lot of manual
> > > work.
> > > B: Passing "max_bytes" to the radix tree was my idea, I believe, but
> > > it seems the wrong responsibility. Not all uses will have a
> > > work_mem-type limit, I'm guessing. We only use it for limiting the max
> > > block size, and aset's default 8MB is already plenty small for
> > > vacuum's large limit anyway. tidbitmap.c's limit is work_mem, so
> > > smaller, and there it makes sense to limit the max blocksize this way.
> > > C: The context for values has complex #ifdefs based on the value
> > > length/varlen, but it's both too much and not enough. If we get a bump
> > > context, how would we shoehorn that in for values for vacuum but not
> > > for tidbitmap?
> > >
> > > Here's an idea: Have vacuum (or tidbitmap etc.) pass a context to
> > > TidStoreCreate(), and then to RT_CREATE. That context will contain the
> > > values (for local mem), and the node slabs will be children of the
> > > value context. That way, measuring memory usage and free-ing can just
> > > call with this parent context, and let recursion handle the rest.
> > > Perhaps the passed context can also hold the radix-tree struct, but
> > > I'm not sure since I haven't tried it. What do you think?
> >
> > If I understand your idea correctly, RT_CREATE() creates the context
> > for values as a child of the passed context and the node slabs as
> > children of the value context. That way, measuring memory usage can
> > just call with the value context. It sounds like a good idea. But it
> > was not clear to me how to address points B and C.
>
> For B & C, vacuum would create a context to pass to TidStoreCreate,
> and it wouldn't need to bother changing max block size. RT_CREATE
> would use that directly for leaves (if any), and would only create
> child slab contexts under it. It would not need to know about
> max_bytes. Modifying your diagram a bit, something like:
>
> - caller-supplied radix tree memory context (the 3 structs -- and
> leaves, if any) (aset (or future bump?))
>     - node slab contexts
>
> This might only be workable with aset, if we need to individually free
> the structs. (I haven't studied this, it was a recent idea)
> It's simpler, because with small fixed length values, we don't need to
> detect that and avoid creating a leaf context. All leaves would live
> in the same context as the structs.

Thank you for the explanation.

I think that vacuum and tidbitmap (and future users) would end up
having the same max block size calculation. And it seems slightly odd
layering to me that the max-block-size-specified context is created at
the vacuum (or tidbitmap) layer, a varlen-value radix tree is created by
the tidstore layer, and the passed context is used for leaves (if a
varlen value is used) at the radix tree layer. Another idea is to create a
max-block-size-specified context on the tidstore layer. That is,
vacuum and tidbitmap pass a work_mem and a flag indicating whether the
tidstore can use the bump context, and tidstore creates an (aset or
bump) memory context with the calculated max block size and passes it
to the radix tree.
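
A rough sketch of that idea (the function names and the block size
calculation here are just assumptions for illustration):

/* in TidStoreCreate(max_bytes, ...) */
Size    maxBlockSize = pg_prevpower2_size_t(max_bytes / 8);

maxBlockSize = Min(maxBlockSize, ALLOCSET_DEFAULT_MAXSIZE);
maxBlockSize = Max(maxBlockSize, ALLOCSET_DEFAULT_INITSIZE);

ts->rt_context = AllocSetContextCreate(CurrentMemoryContext,
                                       "TID storage",
                                       ALLOCSET_DEFAULT_MINSIZE,
                                       ALLOCSET_DEFAULT_INITSIZE,
                                       maxBlockSize);
ts->tree = local_rt_create(ts->rt_context);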

As for using the bump memory context, I feel that we need to store
the iterator struct in an aset context at least, as it can be individually
freed and re-created. Or it might not be necessary to allocate the
iterator struct in the same context as the radix tree.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Feb 19, 2024 at 9:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I think that vacuum and tidbitmap (and future users) would end up
> having the same max block size calculation. And it seems slightly odd
> layering to me that the max-block-size-specified context is created at
> the vacuum (or tidbitmap) layer, a varlen-value radix tree is created by
> the tidstore layer, and the passed context is used for leaves (if a
> varlen value is used) at the radix tree layer.

That sounds slightly more complicated than I was thinking of, but we
could actually be talking about the same thing: I'm drawing a
distinction between "used = must be detected / #ifdef'd" and "used =
actually happens to call allocation". I meant that the passed context
would _always_ be used for leaves, regardless of varlen or not. So
with fixed-length values short enough to live in child pointer slots,
that context would still be used for iteration etc.

> Another idea is to create a
> max-block-size-specified context on the tidstore layer. That is,
> vacuum and tidbitmap pass a work_mem and a flag indicating whether the
> tidstore can use the bump context, and tidstore creates an (aset or
> bump) memory context with the calculated max block size and passes it
> to the radix tree.

That might be a better abstraction since both uses have some memory limit.

> As for using the bump memory context, I feel that we need to store
> the iterator struct in an aset context at least, as it can be individually
> freed and re-created. Or it might not be necessary to allocate the
> iterator struct in the same context as the radix tree.

Okay, that's one thing I was concerned about. Since we don't actually
have a bump context yet, it seems simple to assume aset for non-nodes,
and if we do get it, we can adjust slightly. Anyway, this seems like a
good thing to try to clean up, but it's also not a show-stopper.

On that note: I will be going on honeymoon shortly, and then to PGConf
India, so I will have sporadic connectivity for the next 10 days and
won't be doing any hacking during that time.

Andres, did you want to take a look at the radix tree patch 0003?
Aside from the above possible cleanup, most of it should be stable.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Feb 19, 2024 at 7:47 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Mon, Feb 19, 2024 at 9:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I think that vacuum and tidbitmap (and future users) would end up
> > having the same max block size calculation. And it seems slightly odd
> > layering to me that the max-block-size-specified context is created at
> > the vacuum (or tidbitmap) layer, a varlen-value radix tree is created by
> > the tidstore layer, and the passed context is used for leaves (if a
> > varlen value is used) at the radix tree layer.
>
> That sounds slightly more complicated than I was thinking of, but we
> could actually be talking about the same thing: I'm drawing a
> distinction between "used = must be detected / #ifdef'd" and "used =
> actually happens to call allocation". I meant that the passed context
> would _always_ be used for leaves, regardless of varlen or not. So
> with fixed-length values short enough to live in child pointer slots,
> that context would still be used for iteration etc.
>
> > Another idea is to create a
> > max-block-size-specified context on the tidstore layer. That is,
> > vacuum and tidbitmap pass a work_mem and a flag indicating whether the
> > tidstore can use the bump context, and tidstore creates an (aset or
> > bump) memory context with the calculated max block size and passes it
> > to the radix tree.
>
> That might be a better abstraction since both uses have some memory limit.

I've drafted this idea, and fixed a bug in tidstore.c. Here is the
summary of updates from v62:

- removed v62-0007 patch as we discussed
- squashed v62-0006 and v62-0008 patches into 0003 patch
- v63-0008 patch fixes a bug in tidstore.
- v63-0009 patch is a draft idea of cleanup memory context handling.

>
> > As for using the bump memory context, I feel that we need to store
> > the iterator struct in an aset context at least, as it can be individually
> > freed and re-created. Or it might not be necessary to allocate the
> > iterator struct in the same context as the radix tree.
>
> Okay, that's one thing I was concerned about. Since we don't actually
> have a bump context yet, it seems simple to assume aset for non-nodes,
> and if we do get it, we can adjust slightly. Anyway, this seems like a
> good thing to try to clean up, but it's also not a show-stopper.
>
> On that note: I will be going on honeymoon shortly, and then to PGConf
> India, so I will have sporadic connectivity for the next 10 days and
> won't be doing any hacking during that time.

Thank you for letting us know. Enjoy yourself!

Regards

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Tue, Feb 20, 2024 at 1:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> - v63-0008 patch fixes a bug in tidstore.

- page->nwords = wordnum + 1;
- Assert(page->nwords = WORDS_PER_PAGE(offsets[num_offsets - 1]));
+ page->nwords = wordnum;
+ Assert(page->nwords == WORDS_PER_PAGE(offsets[num_offsets - 1]));

Yikes, I'm guessing this failed in a non-assert build? I wonder why
my compiler didn't yell at me... Have you tried a tidstore-debug build
without asserts?

> - v63-0009 patch is a draft idea of cleanup memory context handling.

Thanks, looks pretty good!

+ ts->rt_context = AllocSetContextCreate(CurrentMemoryContext,
+    "tidstore storage",

"tidstore storage" sounds a bit strange -- maybe look at some other
context names for ideas.

- leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx, allocsize);
+ leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx != NULL
+    ? tree->leaf_ctx
+    : tree->context, allocsize);

Instead of branching here, can we copy "context" to "leaf_ctx" when
necessary (those names should look more like each other, btw)? I think
that means anything not covered by this case:

+#ifndef RT_VARLEN_VALUE_SIZE
+ if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC))
+ tree->leaf_ctx = SlabContextCreate(ctx,
+    RT_STR(RT_PREFIX) "radix_tree leaf contex",
+    RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
+    sizeof(RT_VALUE_TYPE));
+#endif /* !RT_VARLEN_VALUE_SIZE */

...also, we should document why we're using slab here. On that, I
don't recall why we are? We've never had a fixed-length type test case
on 64-bit, so it wasn't because it won through benchmarking. It seems
a hold-over from the days of "multi-value leaves". Is it to avoid the
possibility of space wastage with non-power-of-two size types?

For this stanza that remains unchanged:

for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
  MemoryContextDelete(tree->node_slabs[i]);
}

if (tree->leaf_ctx)
{
  MemoryContextDelete(tree->leaf_ctx);
}

...is there a reason we can't just delete tree->ctx, and let that
recursively delete child contexts?

Secondly, I thought about my recent work to skip checking if we first
need to create a root node, and that has a harmless (for vacuum at
least) but slightly untidy behavior: When RT_SET is first called, and
the key is bigger than 255, new nodes will go on top of the root node.
These have chunk '0'. If all subsequent keys are big enough, the
original root node will stay empty. If all keys are deleted, there will
be a chain of empty nodes remaining. Again, I believe this is
harmless, but to make this tidy, it should be easy to teach RT_EXTEND_UP to
call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work
on this, but likely not today.

Thirdly, cosmetic: With the introduction of single-value leaves, it
seems we should do s/RT_NODE_PTR/RT_CHILD_PTR/ -- what do you think?



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
I'm looking at RT_FREE_RECURSE again (only used for DSA memory), and
I'm not convinced it's freeing all the memory. It's been many months
since we discussed this last, but IIRC we cannot just tell DSA to free
all its segments, right? Is there currently anything preventing us
from destroying the whole DSA area at once?

+ /* The last level node has pointers to values */
+ if (shift == 0)
+ {
+   dsa_free(tree->dsa, ptr);
+   return;
+ }

IIUC, this doesn't actually free leaves, it only frees the last-level
node. And, this function is unaware of whether children could be
embedded values. I'm thinking we need to get rid of the above
pre-check and instead have each node kind do something like (e.g.
node4):

RT_PTR_ALLOC child = n4->children[i];

if (shift > 0)
  RT_FREE_RECURSE(tree, child, shift - RT_SPAN);
else if (!RT_CHILDPTR_IS_VALUE(child))
  dsa_free(tree->dsa, child);

...or am I missing something?
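
For reference, the RT_CHILDPTR_IS_VALUE() test assumed above could be a
simple tagged-pointer check along these lines (hypothetical sketch; the
template's actual definition may differ):

static inline bool
RT_CHILDPTR_IS_VALUE(RT_PTR_ALLOC child)
{
	/* child allocations are MAXALIGNed, so the lowest bit is free
	 * to tag a value embedded directly in the child slot */
	return (uintptr_t) child & 1;
}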



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
I wrote:

> Secondly, I thought about my recent work to skip checking if we first
> need to create a root node, and that has a harmless (for vacuum at
> least) but slightly untidy behavior: When RT_SET is first called, and
> the key is bigger than 255, new nodes will go on top of the root node.
> These have chunk '0'. If all subsequent keys are big enough, the
> original root node will stay empty. If all keys are deleted, there will
> be a chain of empty nodes remaining. Again, I believe this is
> harmless, but to make this tidy, it should be easy to teach RT_EXTEND_UP to
> call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work
> on this, but likely not today.

This turns out to be a lot trickier than it looked, so it seems best
to allow a trivial amount of waste, as long as it's documented
somewhere. It also wouldn't be terrible to re-add those branches,
since they're highly predictable.

I just noticed there are a lot of unused function parameters
(referring to parent slots) leftover from a few weeks ago. Those are
removed in v64-0009. 0010 makes the obvious name change in those
remaining to "parent_slot". 0011 is a simplification in two places
regarding reserving slots. This should be a bit easier to read and
possibly makes it easier on the compiler.

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Feb 29, 2024 at 8:43 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Tue, Feb 20, 2024 at 1:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > - v63-0008 patch fixes a bug in tidstore.
>
> - page->nwords = wordnum + 1;
> - Assert(page->nwords = WORDS_PER_PAGE(offsets[num_offsets - 1]));
> + page->nwords = wordnum;
> + Assert(page->nwords == WORDS_PER_PAGE(offsets[num_offsets - 1]));
>
> Yikes, I'm guessing this failed in a non-assert build? I wonder why
> my compiler didn't yell at me... Have you tried a tidstore-debug build
> without asserts?

Yes. I didn't get any failures.

>
> > - v63-0009 patch is a draft idea of cleanup memory context handling.
>
> Thanks, looks pretty good!
>
> + ts->rt_context = AllocSetContextCreate(CurrentMemoryContext,
> +    "tidstore storage",
>
> "tidstore storage" sounds a bit strange -- maybe look at some other
> context names for ideas.

Agreed. How about "tidstore's radix tree"?

>
> - leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx, allocsize);
> + leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx != NULL
> +    ? tree->leaf_ctx
> +    : tree->context, allocsize);
>
> Instead of branching here, can we copy "context" to "leaf_ctx" when
> necessary (those names should look more like each other, btw)? I think
> that means anything not covered by this case:
>
> +#ifndef RT_VARLEN_VALUE_SIZE
> + if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC))
> + tree->leaf_ctx = SlabContextCreate(ctx,
> +    RT_STR(RT_PREFIX) "radix_tree leaf contex",
> +    RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
> +    sizeof(RT_VALUE_TYPE));
> +#endif /* !RT_VARLEN_VALUE_SIZE */
>
> ...also, we should document why we're using slab here. On that, I
> don't recall why we are? We've never had a fixed-length type test case
> on 64-bit, so it wasn't because it won through benchmarking. It seems
> a hold-over from the days of "multi-value leaves". Is it to avoid the
> possibility of space wastage with non-power-of-two size types?

Yes, it matches my understanding.

>
> For this stanza that remains unchanged:
>
> for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
> {
>   MemoryContextDelete(tree->node_slabs[i]);
> }
>
> if (tree->leaf_ctx)
> {
>   MemoryContextDelete(tree->leaf_ctx);
> }
>
> ...is there a reason we can't just delete tree->ctx, and let that
> recursively delete child contexts?

I thought that since RT_CREATE doesn't create its own memory
context but just uses the passed context, it might be a bit unsuitable
to delete the passed context in the radix tree code. For example, if a
caller creates a radix tree (or tidstore) on a memory context and
wants to recreate it again and again, they also need to re-create the
memory context together. It might be okay if we leave comments on
RT_CREATE as a side effect, though. This is the same reason why we
don't destroy tree->dsa in RT_FREE(). And, as for RT_FREE_RECURSE(),

On Fri, Mar 1, 2024 at 1:15 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> I'm looking at RT_FREE_RECURSE again (only used for DSA memory), and
> I'm not convinced it's freeing all the memory. It's been many months
> since we discussed this last, but IIRC we cannot just tell DSA to free
> all its segments, right?

Right.

>  Is there currently anything preventing us
> from destroying the whole DSA area at once?

When it comes to tidstore and parallel vacuum, we initialize DSA and
create a tidstore there at the beginning of the lazy vacuum, and
recreate the tidstore again after the heap vacuum. So I don't want to
destroy the whole DSA when destroying the tidstore. Otherwise, we will
need to create a new DSA and pass its handle somehow.

Probably the bitmap scan case is similar. Given that bitmap scan
(re)creates tidbitmap in the same DSA multiple times, it's better to
avoid freeing the whole DSA.
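
To sketch the lifecycle I have in mind (function names here are
placeholders, not the actual patch API):

dsa = dsa_create(tranche_id);       /* once, at the start of lazy vacuum */
ts = TidStoreCreateShared(max_bytes, dsa);

/* ... heap scan fills ts, then index vacuum and heap vacuum ... */

TidStoreDestroy(ts);                /* frees the tree, keeps the DSA area */
ts = TidStoreCreateShared(max_bytes, dsa);   /* reuse the same area */

/* only at the very end of vacuum */
dsa_detach(dsa);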

>
> + /* The last level node has pointers to values */
> + if (shift == 0)
> + {
> +   dsa_free(tree->dsa, ptr);
> +   return;
> + }
>
> IIUC, this doesn't actually free leaves, it only frees the last-level
> node. And, this function is unaware of whether children could be
> embedded values. I'm thinking we need to get rid of the above
> pre-check and instead have each node kind do something like (e.g.
> node4):
>
> RT_PTR_ALLOC child = n4->children[i];
>
> if (shift > 0)
>   RT_FREE_RECURSE(tree, child, shift - RT_SPAN);
> else if (!RT_CHILDPTR_IS_VALUE(child))
>   dsa_free(tree->dsa, child);
>
> ...or am I missing something?

You're not missing anything. RT_FREE_RECURSE() has not been updated
for a long time. If we still need to use RT_FREE_RECURSE(), it should
be updated.

> Thirdly, cosmetic: With the introduction of single-value leaves, it
> seems we should do s/RT_NODE_PTR/RT_CHILD_PTR/ -- what do you think?

Agreed.

On Fri, Mar 1, 2024 at 3:58 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> I wrote:
>
> > Secondly, I thought about my recent work to skip checking if we first
> > need to create a root node, and that has a harmless (for vacuum at
> > least) but slightly untidy behavior: When RT_SET is first called, and
> > the key is bigger than 255, new nodes will go on top of the root node.
> > These have chunk '0'. If all subsequent keys are big enough, the
> > original root node will stay empty. If all keys are deleted, there will
> > be a chain of empty nodes remaining. Again, I believe this is
> > harmless, but to make this tidy, it should be easy to teach RT_EXTEND_UP to
> > call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work
> > on this, but likely not today.
>
> This turns out to be a lot trickier than it looked, so it seems best
> to allow a trivial amount of waste, as long as it's documented
> somewhere. It also wouldn't be terrible to re-add those branches,
> since they're highly predictable.
>
> I just noticed there are a lot of unused function parameters
> (referring to parent slots) leftover from a few weeks ago. Those are
> removed in v64-0009. 0010 makes the obvious name change in those
> remaining to "parent_slot". 0011 is a simplification in two places
> regarding reserving slots. This should be a bit easier to read and
> possibly makes it easier on the compiler.

Thank you for the updates. I've briefly looked at these changes and
they look good to me. I'm going to review them again in depth.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Fri, Mar 1, 2024 at 3:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Feb 29, 2024 at 8:43 PM John Naylor <johncnaylorls@gmail.com> wrote:

> > + ts->rt_context = AllocSetContextCreate(CurrentMemoryContext,
> > +    "tidstore storage",
> >
> > "tidstore storage" sounds a bit strange -- maybe look at some other
> > context names for ideas.
>
> Agreed. How about "tidstore's radix tree"?

That might be okay. I'm now thinking "TID storage". On that note, one
improvement needed when we polish tidstore.c is to make sure it's
spelled "TID" in comments, like other files do already.

> > - leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx, allocsize);
> > + leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx != NULL
> > +    ? tree->leaf_ctx
> > +    : tree->context, allocsize);
> >
> > Instead of branching here, can we copy "context" to "leaf_ctx" when
> > necessary (those names should look more like each other, btw)? I think
> > that means anything not covered by this case:
> >
> > +#ifndef RT_VARLEN_VALUE_SIZE
> > + if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC))
> > + tree->leaf_ctx = SlabContextCreate(ctx,
> > +    RT_STR(RT_PREFIX) "radix_tree leaf contex",
> > +    RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
> > +    sizeof(RT_VALUE_TYPE));
> > +#endif /* !RT_VARLEN_VALUE_SIZE */
> >
> > ...also, we should document why we're using slab here. On that, I
> > don't recall why we are? We've never had a fixed-length type test case
> > on 64-bit, so it wasn't because it won through benchmarking. It seems
> > a hold-over from the days of "multi-value leaves". Is it to avoid the
> > possibility of space wastage with non-power-of-two size types?
>
> Yes, it matches my understanding.

There are two issues quoted here, so not sure if you mean both or only
the last one...

For the latter, I'm not sure it makes sense to have code and #ifdef's
to force slab for large-enough fixed-length values just because we
can. There may never be such a use-case anyway. I'm also not against
it, either, but it seems like a premature optimization.

> > For this stanza that remains unchanged:
> >
> > for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
> > {
> >   MemoryContextDelete(tree->node_slabs[i]);
> > }
> >
> > if (tree->leaf_ctx)
> > {
> >   MemoryContextDelete(tree->leaf_ctx);
> > }
> >
> > ...is there a reason we can't just delete tree->ctx, and let that
> > recursively delete child contexts?
>
> I thought that since RT_CREATE doesn't create its own memory
> context but just uses the passed context, it might be a bit unsuitable
> to delete the passed context in the radix tree code. For example, if a
> caller creates a radix tree (or tidstore) on a memory context and
> wants to recreate it again and again, they also need to re-create the
> memory context together. It might be okay if we leave comments on
> RT_CREATE as a side effect, though. This is the same reason why we
> don't destroy tree->dsa in RT_FREE(). And, as for RT_FREE_RECURSE(),

Right, I should have said "reset". Resetting a context will delete
its children as well, and it seems like it should work to reset the tree
context, and we don't have to know whether that context actually
contains leaves at all. That should allow copying "tree context" to
"leaf context" in the case where we have no special context for
leaves.
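
In other words, the whole cleanup could reduce to something like this
sketch (assuming the struct member name; untested):

/* deletes all child contexts (node slabs, and any leaf context)
 * and releases the structs and leaves in one go */
MemoryContextReset(tree->context);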



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Sun, Mar 3, 2024 at 2:43 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Fri, Mar 1, 2024 at 3:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Feb 29, 2024 at 8:43 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> > > + ts->rt_context = AllocSetContextCreate(CurrentMemoryContext,
> > > +    "tidstore storage",
> > >
> > > "tidstore storage" sounds a bit strange -- maybe look at some other
> > > context names for ideas.
> >
> > Agreed. How about "tidstore's radix tree"?
>
> That might be okay. I'm now thinking "TID storage". On that note, one
> improvement needed when we polish tidstore.c is to make sure it's
> spelled "TID" in comments, like other files do already.
>
> > > - leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx, allocsize);
> > > + leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx != NULL
> > > +    ? tree->leaf_ctx
> > > +    : tree->context, allocsize);
> > >
> > > Instead of branching here, can we copy "context" to "leaf_ctx" when
> > > necessary (those names should look more like each other, btw)? I think
> > > that means anything not covered by this case:
> > >
> > > +#ifndef RT_VARLEN_VALUE_SIZE
> > > + if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC))
> > > + tree->leaf_ctx = SlabContextCreate(ctx,
> > > +    RT_STR(RT_PREFIX) "radix_tree leaf contex",
> > > +    RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
> > > +    sizeof(RT_VALUE_TYPE));
> > > +#endif /* !RT_VARLEN_VALUE_SIZE */
> > >
> > > ...also, we should document why we're using slab here. On that, I
> > > don't recall why we are? We've never had a fixed-length type test case
> > > on 64-bit, so it wasn't because it won through benchmarking. It seems
> > > a hold-over from the days of "multi-value leaves". Is it to avoid the
> > > possibility of space wastage with non-power-of-two size types?
> >
> > Yes, it matches my understanding.
>
> There are two issues quoted here, so not sure if you mean both or only
> the last one...

I meant only the last one.

>
> For the latter, I'm not sure it makes sense to have code and #ifdef's
> to force slab for large-enough fixed-length values just because we
> can. There may never be such a use-case anyway. I'm also not against
> it, either, but it seems like a premature optimization.

Reading the old threads, I found that the reason for using a slab
context for leaves, which originally came from Andres's prototype
patch, was to avoid aset.c rounding the bytes up to a power of 2. It
makes sense to me to use a slab context for this case. To measure the
effect of using a
slab, I've updated bench_radix_tree so it uses a large fixed-length
value. The struct I used is:

typedef struct mytype
{
   uint64  a;
   uint64  b;
   uint64  c;
   uint64  d;
   char    e[100];
} mytype;

The struct size is 136 bytes with padding, just above a power-of-2.
The simple benchmark test showed using a slab context for leaves is
more space efficient. The results are:

slab:
=# select * from bench_load_random_int(1000000);
 mem_allocated | load_ms
---------------+---------
     405643264 |     560
(1 row)

aset:
=# select * from bench_load_random_int(1000000);
 mem_allocated | load_ms
---------------+---------
     527777792 |     576
(1 row)
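
(The difference of about 122MB is roughly what you would expect from
the leaf allocations alone: aset rounds each 136-byte chunk up to 256
bytes, wasting about 120 bytes per value, or ~120MB for the 1 million
values.)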

>
> > > For this stanza that remains unchanged:
> > >
> > > for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
> > > {
> > >   MemoryContextDelete(tree->node_slabs[i]);
> > > }
> > >
> > > if (tree->leaf_ctx)
> > > {
> > >   MemoryContextDelete(tree->leaf_ctx);
> > > }
> > >
> > > ...is there a reason we can't just delete tree->ctx, and let that
> > > recursively delete child contexts?
> >
> > I thought that since RT_CREATE doesn't create its own memory
> > context but just uses the passed context, it might be a bit unsuitable
> > to delete the passed context in the radix tree code. For example, if a
> > caller creates a radix tree (or tidstore) on a memory context and
> > wants to recreate it again and again, they also need to re-create the
> > memory context together. It might be okay if we leave comments on
> > RT_CREATE as a side effect, though. This is the same reason why we
> > don't destroy tree->dsa in RT_FREE(). And, as for RT_FREE_RECURSE(),
>
> Right, I should have said "reset". Resetting a context will delete
> its children as well, and it seems like it should work to reset the tree
> context, and we don't have to know whether that context actually
> contains leaves at all. That should allow copying "tree context" to
> "leaf context" in the case where we have no special context for
> leaves.

Resetting the tree->context seems to work. But I think we should note
for callers that the dsa_area passed to RT_CREATE should be created in
a different context than the context passed to RT_CREATE because
otherwise RT_FREE() will also free the dsa_area. For example, the
following code in test_radixtree.c will no longer work:

dsa = dsa_create(tranche_id);
radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
:
rt_free(radixtree);
dsa_detach(dsa); // dsa is already freed.

So I think that a practical usage of the radix tree will be that the
caller creates a memory context for a radix tree and passes it to
RT_CREATE().
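
That is, something like the following usage (a sketch; untested):

rt_ctx = AllocSetContextCreate(CurrentMemoryContext,
                               "radix tree",
                               ALLOCSET_DEFAULT_SIZES);
dsa = dsa_create(tranche_id);       /* not under rt_ctx */
radixtree = rt_create(rt_ctx, dsa, tranche_id);
:
rt_free(radixtree);                 /* resets rt_ctx only */
dsa_detach(dsa);                    /* dsa is still valid here */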

I've attached an update patch set:

- 0008 updates RT_FREE_RECURSE().
- 0009 patch is an updated version of the radix tree memory handling cleanup.
- 0010 updates comments in tidstore.c such as replacing "Tid" with "TID".
- 0011 renames TidStore to TIDSTORE in all places.
- 0012 updates bench_radix_tree so it uses a (possibly large) struct
instead of uint64.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Mar 4, 2024 at 1:05 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sun, Mar 3, 2024 at 2:43 PM John Naylor <johncnaylorls@gmail.com> wrote:

> > Right, I should have said "reset". Resetting a context will delete
> > its children as well, and it seems like it should work to reset the tree
> > context, and we don't have to know whether that context actually
> > contains leaves at all. That should allow copying "tree context" to
> > "leaf context" in the case where we have no special context for
> > leaves.
>
> Resetting the tree->context seems to work. But I think we should note
> for callers that the dsa_area passed to RT_CREATE should be created in
> a different context than the context passed to RT_CREATE because
> otherwise RT_FREE() will also free the dsa_area. For example, the
> following code in test_radixtree.c will no longer work:
>
> dsa = dsa_create(tranche_id);
> radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
> :
> rt_free(radixtree);
> dsa_detach(dsa); // dsa is already freed.
>
> So I think that a practical usage of the radix tree will be that the
> caller creates a memory context for a radix tree and passes it to
> RT_CREATE().

That sounds workable to me.

> I've attached an update patch set:
>
> - 0008 updates RT_FREE_RECURSE().

Thanks!

> - 0009 patch is an updated version of the radix tree memory handling cleanup.

Looks pretty good, as does the rest. I'm going through again,
squashing and making tiny adjustments to the template. The only thing
not done is changing the test with many values to resemble the perf
test more.

I wrote:
> > Secondly, I thought about my recent work to skip checking if we first
> > need to create a root node, and that has a harmless (for vacuum at
> > least) but slightly untidy behavior: When RT_SET is first called, and
> > the key is bigger than 255, new nodes will go on top of the root node.
> > These have chunk '0'. If all subsequent keys are big enough, the
> > original root node will stay empty. If all keys are deleted, there will
> > be a chain of empty nodes remaining. Again, I believe this is
> > harmless, but to make this tidy, it should be easy to teach RT_EXTEND_UP to
> > call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work
> > on this, but likely not today.
>
> This turns out to be a lot trickier than it looked, so it seems best
> to allow a trivial amount of waste, as long as it's documented
> somewhere. It also wouldn't be terrible to re-add those branches,
> since they're highly predictable.

I put a little more work into this, and got it working, just needs a
small amount of finicky coding. I'll share tomorrow.

I have a question about RT_FREE_RECURSE:

+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();

I'm not sure why these are here: The first seems overly paranoid,
although harmless, but the second is probably a bad idea. Why should
the user be able to interrupt the freeing of memory?

Also, I'm not quite happy that RT_ITER has a copy of a pointer to the
tree, leading to coding like "iter->tree->ctl->root". I *think* it
would be easier to read if the tree was a parameter to these iteration
functions. That would require an API change, so the tests/tidstore
would have some churn. I can do that, but before trying I wanted to
see what you think -- is there some reason to keep the current way?
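
To show the difference (hypothetical fragment; variable declarations
omitted):

/* currently: the tree pointer is stashed inside the iterator */
iter = rt_begin_iterate(tree);
while ((val = rt_iterate_next(iter, &key)) != NULL)
    ...

/* proposed: pass the tree explicitly each time */
iter = rt_begin_iterate(tree);
while ((val = rt_iterate_next(tree, iter, &key)) != NULL)
    ...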



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Mar 4, 2024 at 8:48 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Mon, Mar 4, 2024 at 1:05 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Sun, Mar 3, 2024 at 2:43 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> > > Right, I should have said "reset". Resetting a context will delete
> > > its children as well, and it seems like it should work to reset the tree
> > > context, and we don't have to know whether that context actually
> > > contains leaves at all. That should allow copying "tree context" to
> > > "leaf context" in the case where we have no special context for
> > > leaves.
> >
> > Resetting the tree->context seems to work. But I think we should note
> > for callers that the dsa_area passed to RT_CREATE should be created in
> > a different context than the context passed to RT_CREATE because
> > otherwise RT_FREE() will also free the dsa_area. For example, the
> > following code in test_radixtree.c will no longer work:
> >
> > dsa = dsa_create(tranche_id);
> > radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
> > :
> > rt_free(radixtree);
> > dsa_detach(dsa); // dsa is already freed.
> >
> > So I think that a practical usage of the radix tree will be that the
> > caller creates a memory context for a radix tree and passes it to
> > RT_CREATE().
>
> That sounds workable to me.
>
> > I've attached an update patch set:
> >
> > - 0008 updates RT_FREE_RECURSE().
>
> Thanks!
>
> > - 0009 patch is an updated version of cleanup radix tree memory handling.
>
> Looks pretty good, as does the rest. I'm going through again,
> squashing and making tiny adjustments to the template. The only thing
> not done is changing the test with many values to resemble the perf
> test more.
>
> I wrote:
> > > Secondly, I thought about my recent work to skip checking if we first
> > > need to create a root node, and that has a harmless (for vacuum at
> > > least) but slightly untidy behavior: When RT_SET is first called, and
> > > the key is bigger than 255, new nodes will go on top of the root node.
> > > These have chunk '0'. If all subsequent keys are big enough, the
> > > original root node will stay empty. If all keys are deleted, there will
> > > be a chain of empty nodes remaining. Again, I believe this is
> > > harmless, but to make this tidy, it should be easy to teach RT_EXTEND_UP to
> > > call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work
> > > on this, but likely not today.
> >
> > This turns out to be a lot trickier than it looked, so it seems best
> > to allow a trivial amount of waste, as long as it's documented
> > somewhere. It also wouldn't be terrible to re-add those branches,
> > since they're highly predictable.
>
> I put a little more work into this, and got it working, just needs a
> small amount of finicky coding. I'll share tomorrow.
>
> I have a question about RT_FREE_RECURSE:
>
> + check_stack_depth();
> + CHECK_FOR_INTERRUPTS();
>
> I'm not sure why these are here: The first seems overly paranoid,
> although harmless, but the second is probably a bad idea. Why should
> the user be able to interrupt the freeing of memory?

Good catch. We should not check the interruption there.

> Also, I'm not quite happy that RT_ITER has a copy of a pointer to the
> tree, leading to coding like "iter->tree->ctl->root". I *think* it
> would be easier to read if the tree was a parameter to these iteration
> functions. That would require an API change, so the tests/tidstore
> would have some churn. I can do that, but before trying I wanted to
> see what you think -- is there some reason to keep the current way?

I considered both usages; there are two reasons for the current style.
I'm concerned that if we pass both the tree and RT_ITER to iteration
functions, the caller could mistakenly pass a different tree than the
one that was specified to create the RT_ITER. And the second reason is
just to make it consistent with other data structures such as
dynahash.c and dshash.c, but I now realized that in simplehash.h we
pass both the hash table and the iterator.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Tue, Mar 5, 2024 at 8:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Mar 4, 2024 at 8:48 PM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > On Mon, Mar 4, 2024 at 1:05 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> > > Resetting the tree->context seems to work. But I think we should note
> > > for callers that the dsa_area passed to RT_CREATE should be created in
> > > a different context than the context passed to RT_CREATE because
> > > otherwise RT_FREE() will also free the dsa_area. For example, the
> > > following code in test_radixtree.c will no longer work:

I've added a comment in v66-0004, which contains a number of other
small corrections and edits.

On Fri, Mar 1, 2024 at 3:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Thirdly, cosmetic: With the introduction of single-value leaves, it
> > seems we should do s/RT_NODE_PTR/RT_CHILD_PTR/ -- what do you think?
>
> Agreed.

Done in v66-0005.

v66-0006 removes outdated tests for invalid root that somehow got left over.

> > I wrote:
> > > > Secondly, I thought about my recent work to skip checking if we first
> > > > need to create a root node, and that has a harmless (for vacuum at
> > > > least) but slightly untidy behavior: When RT_SET is first called, and
> > > > the key is bigger than 255, new nodes will go on top of the root node.
> > > > These have chunk '0'. If all subsequent keys are big enough, the
> > > > original root node will stay empty. If all keys are deleted, there will
> > > > be a chain of empty nodes remaining. Again, I believe this is
> > > > harmless, but to make this tidy, it should be easy to teach RT_EXTEND_UP to
> > > > call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work
> > > > on this, but likely not today.

> > I put a little more work into this, and got it working, just needs a
> > small amount of finicky coding. I'll share tomorrow.

Done in v66-0007. I'm a bit disappointed in the extra messiness this
adds, although it's not a lot.

> > + check_stack_depth();
> > + CHECK_FOR_INTERRUPTS();
> >
> > I'm not sure why these are here: The first seems overly paranoid,
> > although harmless, but the second is probably a bad idea. Why should
> > the user be able to interrupt the freeing of memory?
>
> Good catch. We should not check the interruption there.

Removed in v66-0008.

> > Also, I'm not quite happy that RT_ITER has a copy of a pointer to the
> > tree, leading to coding like "iter->tree->ctl->root". I *think* it
> > would be easier to read if the tree was a parameter to these iteration
> > functions. That would require an API change, so the tests/tidstore
> > would have some churn. I can do that, but before trying I wanted to
> > see what you think -- is there some reason to keep the current way?
>
> I considered both usages; there are two reasons for the current style.
> I'm concerned that if we pass both the tree and RT_ITER to iteration
> functions, the caller could mistakenly pass a different tree than the
> one that was specified to create the RT_ITER. And the second reason is
> just to make it consistent with other data structures such as
> dynahash.c and dshash.c, but I now realized that in simplehash.h we
> pass both the hash table and the iterator.

Okay, then I don't think it's worth messing with at this point.

On Tue, Feb 6, 2024 at 9:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Feb 2, 2024 at 8:47 PM John Naylor <johncnaylorls@gmail.com> wrote:

> > It's pretty hard to see what test_pattern() is doing, or why it's
> > useful. I wonder if instead the test could use something like the
> > benchmark where random integers are masked off. That seems simpler. I
> > can work on that, but I'd like to hear your side about test_pattern().
>
> Yeah, test_pattern() is originally created for the integerset so it
> doesn't necessarily fit the radixtree. I agree to use some tests from
> benchmarks.

Done in v66-0009. I'd be curious to hear any feedback. I like the
aspect that the random numbers come from a different seed every time
the test runs.

v66-0010/0011 run pgindent, the latter with one typedef added for the
test module. 0012 - 0017 are copied from v65, and I haven't done any
work on tidstore or vacuum, except for squashing most v65 follow-up
patches.

I'd like to push 0001 and 0002 shortly, and then do another sweep over
0003, with remaining feedback, and get that in so we get some
buildfarm testing before the remaining polishing work on
tidstore/vacuum.

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Mar 5, 2024 at 6:41 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Tue, Feb 6, 2024 at 9:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Feb 2, 2024 at 8:47 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> > > It's pretty hard to see what test_pattern() is doing, or why it's
> > > useful. I wonder if instead the test could use something like the
> > > benchmark where random integers are masked off. That seems simpler. I
> > > can work on that, but I'd like to hear your side about test_pattern().
> >
> > Yeah, test_pattern() is originally created for the integerset so it
> > doesn't necessarily fit the radixtree. I agree to use some tests from
> > benchmarks.
>
> Done in v66-0009. I'd be curious to hear any feedback. I like the
> aspect that the random numbers come from a different seed every time
> the test runs.

The new tests look good. Here are some comments:

---
+               expected = keys[i];
+               iterval = rt_iterate_next(iter, &iterkey);

-               ndeleted++;
+               EXPECT_TRUE(iterval != NULL);
+               EXPECT_EQ_U64(iterkey, expected);
+               EXPECT_EQ_U64(*iterval, expected);

Can we verify that the iteration returns keys in ascending order?

---
+     /* reset random number generator for deletion */
+     pg_prng_seed(&state, seed);

Why is resetting the seed required here?

---
The radix tree (and dsa in TSET_SHARED_RT case) should be freed at the end.

---
    radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
                                          "test_radix_tree",
                                          ALLOCSET_DEFAULT_SIZES);

We use a mix of ALLOCSET_DEFAULT_SIZES and ALLOCSET_SMALL_SIZES. I
think it's better to use either one for consistency.

> I'd like to push 0001 and 0002 shortly, and then do another sweep over
> 0003, with remaining feedback, and get that in so we get some
> buildfarm testing before the remaining polishing work on
> tidstore/vacuum.

Sounds like a reasonable plan. 0001 and 0002 look good to me. I'm going to
polish tidstore and vacuum patches and update commit messages.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Tue, Mar 5, 2024 at 11:12 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Mar 5, 2024 at 6:41 PM John Naylor <johncnaylorls@gmail.com> wrote:

> > Done in v66-0009. I'd be curious to hear any feedback. I like the
> > aspect that the random numbers come from a different seed every time
> > the test runs.
>
> The new tests look good. Here are some comments:
>
> ---
> +               expected = keys[i];
> +               iterval = rt_iterate_next(iter, &iterkey);
>
> -               ndeleted++;
> +               EXPECT_TRUE(iterval != NULL);
> +               EXPECT_EQ_U64(iterkey, expected);
> +               EXPECT_EQ_U64(*iterval, expected);
>
> Can we verify that the iteration returns keys in ascending order?

We get the "expected" value from the keys we saved in the now-sorted
array, so we do already. Unless I misunderstand you.

> ---
> +     /* reset random number generator for deletion */
> +     pg_prng_seed(&state, seed);
>
> Why is resetting the seed required here?

Good catch -- my intention was to delete in the same random order we
inserted with. We still have the keys in the array, but they're sorted
by now. I forgot to go the extra step and use the prng when generating
the keys for deletion -- will fix.
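
Something like this sketch, with "filter" standing in for however the
keys were masked at insert time:

/* reset random number generator for deletion */
pg_prng_seed(&state, seed);

for (int i = 0; i < num_keys; i++)
{
	uint64		key = pg_prng_uint64(&state) & filter;

	EXPECT_TRUE(rt_delete(radixtree, key));
}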

> ---
> The radix tree (and dsa in TSET_SHARED_RT case) should be freed at the end.

Will fix.

> ---
>     radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
>                                           "test_radix_tree",
>                                           ALLOCSET_DEFAULT_SIZES);
>
> We use a mix of ALLOCSET_DEFAULT_SIZES and ALLOCSET_SMALL_SIZES. I
> think it's better to use either one for consistency.

Will change to "small", since 32-bit platforms will use slab for leaves.

I'll look at the memory usage and estimate what 32-bit platforms will
use, and maybe adjust the number of keys. A few megabytes is fine, but
not many megabytes.

> > I'd like to push 0001 and 0002 shortly, and then do another sweep over
> > 0003, with remaining feedback, and get that in so we get some
> > buildfarm testing before the remaining polishing work on
> > tidstore/vacuum.
>
> Sounds like a reasonable plan. 0001 and 0002 look good to me. I'm going to
> polish tidstore and vacuum patches and update commit messages.

Sounds good.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Mar 6, 2024 at 12:59 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Tue, Mar 5, 2024 at 11:12 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Mar 5, 2024 at 6:41 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> > > Done in v66-0009. I'd be curious to hear any feedback. I like the
> > > aspect that the random numbers come from a different seed every time
> > > the test runs.
> >
> > The new tests look good. Here are some comments:
> >
> > ---
> > +               expected = keys[i];
> > +               iterval = rt_iterate_next(iter, &iterkey);
> >
> > -               ndeleted++;
> > +               EXPECT_TRUE(iterval != NULL);
> > +               EXPECT_EQ_U64(iterkey, expected);
> > +               EXPECT_EQ_U64(*iterval, expected);
> >
> > Can we verify that the iteration returns keys in ascending order?
>
> We get the "expected" value from the keys we saved in the now-sorted
> array, so we do already. Unless I misunderstand you.

Ah, you're right. Please ignore this comment.

>
> > ---
> >     radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
> >                                           "test_radix_tree",
> >                                           ALLOCSET_DEFAULT_SIZES);
> >
> > We use a mix of ALLOCSET_DEFAULT_SIZES and ALLOCSET_SMALL_SIZES. I
> > think it's better to use either one for consistency.
>
> Will change to "small", since 32-bit platforms will use slab for leaves.

Agreed.

>
> I'll look at the memory usage and estimate what 32-bit platforms will
> use, and maybe adjust the number of keys. A few megabytes is fine, but
> not many megabytes.

Thanks, sounds good.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
Hi,

On 2024-03-05 16:41:30 +0700, John Naylor wrote:
> I'd like to push 0001 and 0002 shortly, and then do another sweep over
> 0003, with remaining feedback, and get that in so we get some
> buildfarm testing before the remaining polishing work on
> tidstore/vacuum.

A few ARM buildfarm animals are complaining:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-06%2007%3A34%3A02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-03-06%2007%3A34%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=massasauga&dt=2024-03-06%2007%3A33%3A18

Greetings,

Andres Freund



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Mar 6, 2024 at 4:41 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2024-03-05 16:41:30 +0700, John Naylor wrote:
> > I'd like to push 0001 and 0002 shortly, and then do another sweep over
> > 0003, with remaining feedback, and get that in so we get some
> > buildfarm testing before the remaining polishing work on
> > tidstore/vacuum.
>
> A few ARM buildfarm animals are complaining:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-06%2007%3A34%3A02
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-03-06%2007%3A34%3A03
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=massasauga&dt=2024-03-06%2007%3A33%3A18
>

The error message we got is:

../../src/include/port/simd.h:326:71: error: incompatible type for
argument 1 of ‘vshrq_n_s8’
  uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
                                                                       ^

Since 'v' is uint8x16_t I think we should have used vshrq_n_u8() instead.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Wed, Mar 6, 2024 at 3:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Mar 6, 2024 at 4:41 PM Andres Freund <andres@anarazel.de> wrote:

> > A few ARM buildfarm animals are complaining:
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-06%2007%3A34%3A02
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-03-06%2007%3A34%3A03
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=massasauga&dt=2024-03-06%2007%3A33%3A18
> >
>
> The error message we got is:
>
> ../../src/include/port/simd.h:326:71: error: incompatible type for
> argument 1 of ‘vshrq_n_s8’
>   uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
>                                                                        ^
>
> Since 'v' is uint8x16_t I think we should have used vshrq_n_u8() instead.

That sounds plausible, and I'll look further.

(Hmm, I thought we had run this code on Arm already...)



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
Hi,

On March 6, 2024 9:06:50 AM GMT+01:00, John Naylor <johncnaylorls@gmail.com> wrote:
>On Wed, Mar 6, 2024 at 3:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>> On Wed, Mar 6, 2024 at 4:41 PM Andres Freund <andres@anarazel.de> wrote:
>
>> > A few ARM buildfarm animals are complaining:
>> >
>> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-06%2007%3A34%3A02
>> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-03-06%2007%3A34%3A03
>> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=massasauga&dt=2024-03-06%2007%3A33%3A18
>> >
>>
>> The error message we got is:
>>
>> ../../src/include/port/simd.h:326:71: error: incompatible type for
>> argument 1 of ‘vshrq_n_s8’
>>   uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
>>                                                                        ^
>>
>> Since 'v' is uint8x16_t I think we should have used vshrq_n_u8() instead.
>
>That sounds plausible, and I'll look further.
>
>(Hmm, I thought we had run this code on Arm already...)

Perhaps we should switch one of the CI jobs to ARM...

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Wed, Mar 6, 2024 at 3:06 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> (Hmm, I thought we had run this code on Arm already...)

CI MacOS uses Clang on aarch64, which has been working fine. The
failing animals are on gcc 7.3...



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Wed, Mar 6, 2024 at 3:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> ../../src/include/port/simd.h:326:71: error: incompatible type for
> argument 1 of ‘vshrq_n_s8’
>   uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
>                                                                        ^
>
> Since 'v' is uint8x16_t I think we should have used vshrq_n_u8() instead.

I've looked around and it seems clang is more lax on conversions.
Since it works fine for clang, I think we just need a cast here for
gcc. I've attached a blind attempt at a fix -- I'll apply shortly
unless someone happens to test and find it doesn't work.

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Mar 6, 2024 at 5:33 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Wed, Mar 6, 2024 at 3:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > ../../src/include/port/simd.h:326:71: error: incompatible type for
> > argument 1 of ‘vshrq_n_s8’
> >   uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
> >                                                                        ^
> >
> > Since 'v' is uint8x16_t I think we should have used vshrq_n_u8() instead.
>
> I've looked around and it seems clang is more lax on conversions.
> Since it works fine for clang, I think we just need a cast here for
> gcc. I've attached a blind attempt at a fix -- I'll apply shortly
> unless someone happens to test and find it doesn't work.

I've reproduced the same error on my Raspberry Pi, and confirmed the
patch fixes the error.

My previous idea was wrong. With my proposal, the regression test for
radix tree failed on my Raspberry Pi. On the other hand, with your
patch the tests passed.
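
That makes sense in hindsight: the code presumably relies on the
arithmetic shift broadcasting each lane's sign bit to all 8 bits (0x00
or 0xFF) before ANDing with the mask, whereas vshrq_n_u8() is a logical
shift that would produce 0x00 or 0x01. So the fix is presumably to keep
vshrq_n_s8() and only add a cast for gcc, along the lines of:

  uint8x16_t masked = vandq_u8(vld1q_u8(mask),
                               (uint8x16_t) vshrq_n_s8((int8x16_t) v, 7));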

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Wed, Mar 6, 2024 at 3:40 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Mar 6, 2024 at 5:33 PM John Naylor <johncnaylorls@gmail.com> wrote:

> > I've looked around and it seems clang is more lax on conversions.
> > Since it works fine for clang, I think we just need a cast here for
> > gcc. I've attached a blind attempt at a fix -- I'll apply shortly
> > unless someone happens to test and find it doesn't work.
>
> I've reproduced the same error on my raspberry pi, and confirmed the
> patch fixes the error.
>
> My previous idea was wrong. With my proposal, the regression test for
> radix tree failed on my raspberry pi. On the other hand, with your
> patch the tests passed.

Pushed, and at least parula's green now, thanks for testing! And
thanks, Andres, for the ping!



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Tue, Mar 5, 2024 at 11:12 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > I'd like to push 0001 and 0002 shortly, and then do another sweep over
> > 0003, with remaining feedback, and get that in so we get some
> > buildfarm testing before the remaining polishing work on
> > tidstore/vacuum.
>
> Sounds like a reasonable plan. 0001 and 0002 look good to me. I'm going to
> polish tidstore and vacuum patches and update commit messages.

I don't think v66 got a CI run because of vacuumlazy.c bitrot, so I'm
attaching v67 which fixes that and has some small cosmetic adjustments
to the template. One functional change for debug builds is that
RT_STATS now prints out the number of leaves. I'll squash and push
0001 tomorrow morning unless there are further comments.

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
Actually, I forgot -- I had one more question: Masahiko, is there a
reason for this extra local variable, which uses the base type, rather
than the typedef'd parameter?

+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+ /* Find the control object in shared memory */
+ control = handle;



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Mar 6, 2024 at 8:25 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> Actually, I forgot -- I had one more question: Masahiko, is there a
> reason for this extra local variable, which uses the base type, rather
> than the typedef'd parameter?
>
> +RT_SCOPE RT_RADIX_TREE *
> +RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
> +{
> + RT_RADIX_TREE *tree;
> + dsa_pointer control;
> +
> + tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
> +
> + /* Find the control object in shared memory */
> + control = handle;

I think it's mostly because of readability; it makes clear that the
handle should be castable to dsa_pointer and it's a control object. I
borrowed it from dshash_attach().

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Mar 6, 2024 at 8:20 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Tue, Mar 5, 2024 at 11:12 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > > I'd like to push 0001 and 0002 shortly, and then do another sweep over
> > > 0003, with remaining feedback, and get that in so we get some
> > > buildfarm testing before the remaining polishing work on
> > > tidstore/vacuum.
> >
> > Sounds like a reasonable plan. 0001 and 0002 look good to me. I'm going to
> > polish tidstore and vacuum patches and update commit messages.
>
> I don't think v66 got a CI run because of vacuumlazy.c bitrot, so I'm
> attaching v67 which fixes that and has some small cosmetic adjustments
> to the template.

Thank you for updating the patch.

> One functional change for debug builds is that
> RT_STATS now prints out the number of leaves. I'll squash and push
> 0001 tomorrow morning unless there are further comments.

The 0001 patch looks good to me. I have some minor comments:

--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+   $(WIN32RES) \
+   test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+

"src/backend/lib/radixtree.c" should be updated to
"src/include/lib/radixtree.h".

---
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_integerset contains unit tests for testing the integer set implementation
+in src/backend/lib/integerset.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark.  If you set the 'intset_test_stats' flag in
+test_integerset.c, the tests will print extra information about execution time
+and memory usage.

This file is not updated for test_radixtree. I think we can remove it
as the test cases in test_radixtree are clear.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Wed, Mar 6, 2024 at 6:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > + /* Find the control object in shared memory */
> > + control = handle;
>
> I think it's mostly because of readability; it makes clear that the
> handle should be castable to dsa_pointer and it's a control object. I
> borrowed it from dshash_attach().

I find that a bit strange, but I went ahead and kept it.



On Wed, Mar 6, 2024 at 9:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> The 0001 patch looks good to me. I have some minor comments:

> +PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
> +
>
> "src/backend/lib/radixtree.c" should be updated to
> "src/include/lib/radixtree.h".

Done.

> --- /dev/null
> +++ b/src/test/modules/test_radixtree/README
> @@ -0,0 +1,7 @@
> +test_integerset contains unit tests for testing the integer set implementation
> +in src/backend/lib/integerset.c.
> +
> +The tests verify the correctness of the implementation, but they can also be
> +used as a micro-benchmark.  If you set the 'intset_test_stats' flag in
> +test_integerset.c, the tests will print extra information about execution time
> +and memory usage.
>
> This file is not updated for test_radixtree. I think we can remove it
> as the test cases in test_radixtree are clear.

Done. I pushed this with a few last-minute cosmetic adjustments. This
has been a very long time coming, but we're finally in the home
stretch!

Already, I see sifaka doesn't like this, and I'm looking now...



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Mar 7, 2024 at 12:55 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Wed, Mar 6, 2024 at 6:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > > + /* Find the control object in shared memory */
> > > + control = handle;
> >
> > I think it's mostly because of readability; it makes clear that the
> > handle should be castable to dsa_pointer and it's a control object. I
> > borrowed it from dshash_attach().
>
> I find that a bit strange, but I went ahead and kept it.
>
>
>
> On Wed, Mar 6, 2024 at 9:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > The 0001 patch looks good to me. I have some minor comments:
>
> > +PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
> > +
> >
> > "src/backend/lib/radixtree.c" should be updated to
> > "src/include/lib/radixtree.h".
>
> Done.
>
> > --- /dev/null
> > +++ b/src/test/modules/test_radixtree/README
> > @@ -0,0 +1,7 @@
> > +test_integerset contains unit tests for testing the integer set implementation
> > +in src/backend/lib/integerset.c.
> > +
> > +The tests verify the correctness of the implementation, but they can also be
> > +used as a micro-benchmark.  If you set the 'intset_test_stats' flag in
> > +test_integerset.c, the tests will print extra information about execution time
> > +and memory usage.
> >
> > This file is not updated for test_radixtree. I think we can remove it
> > as the test cases in test_radixtree are clear.
>
> Done. I pushed this with a few last-minute cosmetic adjustments. This
> has been a very long time coming, but we're finally in the home
> stretch!
>
> Already, I see sifaka doesn't like this, and I'm looking now...

It's complaining that these forward declarations...

/* generate forward declarations necessary to use the radix tree */
#ifdef RT_DECLARE

typedef struct RT_RADIX_TREE RT_RADIX_TREE;
typedef struct RT_ITER RT_ITER;

... cause "error: redefinition of typedef 'rt_radix_tree' is a C11
feature [-Werror,-Wtypedef-redefinition]"

I'll look in the other templates to see what they do.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Mar 7, 2024 at 12:59 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Mar 7, 2024 at 12:55 PM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > On Wed, Mar 6, 2024 at 6:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > > + /* Find the control object in shared memory */
> > > > + control = handle;
> > >
> > > I think it's mostly because of readability; it makes clear that the
> > > handle should be castable to dsa_pointer and it's a control object. I
> > > borrowed it from dshash_attach().
> >
> > I find that a bit strange, but I went ahead and kept it.
> >
> >
> >
> > On Wed, Mar 6, 2024 at 9:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > > The 0001 patch looks good to me. I have some minor comments:
> >
> > > +PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
> > > +
> > >
> > > "src/backend/lib/radixtree.c" should be updated to
> > > "src/include/lib/radixtree.h".
> >
> > Done.
> >
> > > --- /dev/null
> > > +++ b/src/test/modules/test_radixtree/README
> > > @@ -0,0 +1,7 @@
> > > +test_integerset contains unit tests for testing the integer set implementation
> > > +in src/backend/lib/integerset.c.
> > > +
> > > +The tests verify the correctness of the implementation, but they can also be
> > > +used as a micro-benchmark.  If you set the 'intset_test_stats' flag in
> > > +test_integerset.c, the tests will print extra information about execution time
> > > +and memory usage.
> > >
> > > This file is not updated for test_radixtree. I think we can remove it
> > > as the test cases in test_radixtree are clear.
> >
> > Done. I pushed this with a few last-minute cosmetic adjustments. This
> > has been a very long time coming, but we're finally in the home
> > stretch!
> >
> > Already, I see sifaka doesn't like this, and I'm looking now...
>
> It's complaining that these forward declarations...
>
> /* generate forward declarations necessary to use the radix tree */
> #ifdef RT_DECLARE
>
> typedef struct RT_RADIX_TREE RT_RADIX_TREE;
> typedef struct RT_ITER RT_ITER;
>
> ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11
> feature [-Werror,-Wtypedef-redefinition]"
>
> I'll look in the other templates to see what they do.

Their "declare" sections have full typedefs. I found it works to leave
out the typedef for the "define" section, but I first want to
reproduce the build failure.

In addition, olingo and grassquit are showing different kinds of
"AddressSanitizer: odr-violation" errors, which I'm not sure what to
make of -- example:

==1862767==ERROR: AddressSanitizer: odr-violation (0x7fc257476b60):
  [1] size=256 'pg_leftmost_one_pos'
/home/bf/bf-build/olingo/HEAD/pgsql.build/../pgsql/src/port/pg_bitutils.c:34
  [2] size=256 'pg_leftmost_one_pos'
/home/bf/bf-build/olingo/HEAD/pgsql.build/../pgsql/src/port/pg_bitutils.c:34
These globals were registered at these points:
  [1]:
    #0 0x563564b97bf6 in __asan_register_globals
(/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e2bf6)
(BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8)
    #1 0x563564b98d1d in __asan_register_elf_globals
(/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e3d1d)
(BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8)
    #2 0x7fc265c3fe3d in call_init elf/dl-init.c:74:3
    #3 0x7fc265c3fe3d in call_init elf/dl-init.c:26:1

  [2]:
    #0 0x563564b97bf6 in __asan_register_globals
(/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e2bf6)
(BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8)
    #1 0x563564b98d1d in __asan_register_elf_globals
(/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e3d1d)
(BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8)
    #2 0x7fc2649847f5 in call_init csu/../csu/libc-start.c:145:3
    #3 0x7fc2649847f5 in __libc_start_main csu/../csu/libc-start.c:347:5



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Mar 7, 2024 at 3:20 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Mar 7, 2024 at 12:59 PM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > On Thu, Mar 7, 2024 at 12:55 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > >
> > > On Wed, Mar 6, 2024 at 6:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > > + /* Find the control object in shared memory */
> > > > > + control = handle;
> > > >
> > > > I think it's mostly because of readability; it makes clear that the
> > > > handle should be castable to dsa_pointer and it's a control object. I
> > > > borrowed it from dshash_attach().
> > >
> > > I find that a bit strange, but I went ahead and kept it.
> > >
> > >
> > >
> > > On Wed, Mar 6, 2024 at 9:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > > The 0001 patch looks good to me. I have some minor comments:
> > >
> > > > +PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
> > > > +
> > > >
> > > > "src/backend/lib/radixtree.c" should be updated to
> > > > "src/include/lib/radixtree.h".
> > >
> > > Done.
> > >
> > > > --- /dev/null
> > > > +++ b/src/test/modules/test_radixtree/README
> > > > @@ -0,0 +1,7 @@
> > > > +test_integerset contains unit tests for testing the integer set implementation
> > > > +in src/backend/lib/integerset.c.
> > > > +
> > > > +The tests verify the correctness of the implementation, but they can also be
> > > > +used as a micro-benchmark.  If you set the 'intset_test_stats' flag in
> > > > +test_integerset.c, the tests will print extra information about execution time
> > > > +and memory usage.
> > > >
> > > > This file is not updated for test_radixtree. I think we can remove it
> > > > as the test cases in test_radixtree are clear.
> > >
> > > Done. I pushed this with a few last-minute cosmetic adjustments. This
> > > has been a very long time coming, but we're finally in the home
> > > stretch!
> > >
> > > Already, I see sifaka doesn't like this, and I'm looking now...
> >
> > It's complaining that these forward declarations...
> >
> > /* generate forward declarations necessary to use the radix tree */
> > #ifdef RT_DECLARE
> >
> > typedef struct RT_RADIX_TREE RT_RADIX_TREE;
> > typedef struct RT_ITER RT_ITER;
> >
> > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11
> > feature [-Werror,-Wtypedef-redefinition]"
> >
> > I'll look in the other templates to see what they do.
>
> Their "declare" sections have full typedefs. I found it works to leave
> out the typedef for the "define" section, but I first want to
> reproduce the build failure.

Right. I've reproduced this build failure on my machine by specifying
flags "-Wtypedef-redefinition -std=gnu99" to clang. Something the
below change seems to fix the problem:

--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -676,7 +676,7 @@ typedef struct RT_RADIX_TREE_CONTROL
 }          RT_RADIX_TREE_CONTROL;

 /* Entry point for allocating and accessing the tree */
-typedef struct RT_RADIX_TREE
+struct RT_RADIX_TREE
 {
    MemoryContext context;

@@ -691,7 +691,7 @@ typedef struct RT_RADIX_TREE
    /* leaf_context is used only for single-value leaves */
    MemoryContextData *leaf_context;
 #endif
-}          RT_RADIX_TREE;
+};

 /*
  * Iteration support.
@@ -714,7 +714,7 @@ typedef struct RT_NODE_ITER
 }          RT_NODE_ITER;

 /* state for iterating over the whole radix tree */
-typedef struct RT_ITER
+struct RT_ITER
 {
    RT_RADIX_TREE *tree;

@@ -728,7 +728,7 @@ typedef struct RT_ITER

    /* The key constructed during iteration */
    uint64      key;
-}          RT_ITER;
+};


 /* verification (available only in assert-enabled builds) */

>
> In addition, olingo and grassquit are showing different kinds of
> "AddressSanitizer: odr-violation" errors, which I'm not sure what to
> make of -- example:
>
> ==1862767==ERROR: AddressSanitizer: odr-violation (0x7fc257476b60):
>   [1] size=256 'pg_leftmost_one_pos'
> /home/bf/bf-build/olingo/HEAD/pgsql.build/../pgsql/src/port/pg_bitutils.c:34
>   [2] size=256 'pg_leftmost_one_pos'
> /home/bf/bf-build/olingo/HEAD/pgsql.build/../pgsql/src/port/pg_bitutils.c:34
> These globals were registered at these points:
>   [1]:
>     #0 0x563564b97bf6 in __asan_register_globals
> (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e2bf6)
> (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8)
>     #1 0x563564b98d1d in __asan_register_elf_globals
> (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e3d1d)
> (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8)
>     #2 0x7fc265c3fe3d in call_init elf/dl-init.c:74:3
>     #3 0x7fc265c3fe3d in call_init elf/dl-init.c:26:1
>
>   [2]:
>     #0 0x563564b97bf6 in __asan_register_globals
> (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e2bf6)
> (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8)
>     #1 0x563564b98d1d in __asan_register_elf_globals
> (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e3d1d)
> (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8)
>     #2 0x7fc2649847f5 in call_init csu/../csu/libc-start.c:145:3
>     #3 0x7fc2649847f5 in __libc_start_main csu/../csu/libc-start.c:347:5

I'll look at them too.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Mar 7, 2024 at 3:27 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Mar 7, 2024 at 3:20 PM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> >
> > In addition, olingo and grassquit are showing different kinds of
> > "AddressSanitizer: odr-violation" errors, which I'm not sure what to
> > make of -- example:

odr-violation seems to refer to One Definition Rule (ODR). According
to Wikipedia[1]:

The One Definition Rule (ODR) is an important rule of the C++
programming language that prescribes that classes/structs and
non-inline functions cannot have more than one definition in the
entire program and template and types cannot have more than one
definition by translation unit. It is defined in the ISO C++ Standard
(ISO/IEC 14882) 2003, at section 3.2. Some other programming languages
have similar but differently defined rules towards the same objective.

I don't fully understand this concept yet but are these two different
build failures related?

Regards,

[1] https://en.wikipedia.org/wiki/One_Definition_Rule

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Mar 7, 2024 at 1:27 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Mar 7, 2024 at 3:20 PM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > On Thu, Mar 7, 2024 at 12:59 PM John Naylor <johncnaylorls@gmail.com> wrote:

> > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11
> > > feature [-Werror,-Wtypedef-redefinition]"
> > >
> > > I'll look in the other templates to see what they do.
> >
> > Their "declare" sections have full typedefs. I found it works to leave
> > out the typedef for the "define" section, but I first want to
> > reproduce the build failure.
>
> Right. I've reproduced this build failure on my machine by specifying
> flags "-Wtypedef-redefinition -std=gnu99" to clang. Something the
> below change seems to fix the problem:

Confirmed, will push shortly.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Mar 7, 2024 at 4:01 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Mar 7, 2024 at 1:27 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Mar 7, 2024 at 3:20 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > >
> > > On Thu, Mar 7, 2024 at 12:59 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> > > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11
> > > > feature [-Werror,-Wtypedef-redefinition]"
> > > >
> > > > I'll look in the other templates to see what they do.
> > >
> > > Their "declare" sections have full typedefs. I found it works to leave
> > > out the typedef for the "define" section, but I first want to
> > > reproduce the build failure.
> >
> > Right. I've reproduced this build failure on my machine by specifying
> > flags "-Wtypedef-redefinition -std=gnu99" to clang. Something the
> > below change seems to fix the problem:
>
> Confirmed, will push shortly.

mamba complained different build errors[1]:

 2740 |  fprintf(stderr, "num_keys = %ld\n", tree->ctl->num_keys);
      |                              ~~^     ~~~~~~~~~~~~~~~~~~~
      |                                |              |
      |                                long int       int64 {aka long long int}
      |                              %lld
../../../../src/include/lib/radixtree.h:2752:30: error: format '%ld'
expects argument of type 'long int', but argument 4 has type 'int64'
{aka 'long long int'} [-Werror=format=]
 2752 |   fprintf(stderr, ", n%d = %ld", size_class.fanout, tree->ctl->num_nodes[i]);
      |                            ~~^   ~~~~~~~~~~~~~~~~~~~~~~~
      |                              |             |
      |                              long int      int64 {aka long long int}
      |                            %lld
../../../../src/include/lib/radixtree.h:2755:32: error: format '%ld'
expects argument of type 'long int', but argument 3 has type 'int64'
{aka 'long long int'} [-Werror=format=]
 2755 |  fprintf(stderr, ", leaves = %ld", tree->ctl->num_leaves);
      |                              ~~^   ~~~~~~~~~~~~~~~~~~~~~
      |                                |            |
      |                                long int     int64 {aka long long int}
      |                              %lld

Regards,

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2024-03-07%2006%3A05%3A18

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Mar 7, 2024 at 2:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Mar 7, 2024 at 4:01 PM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > On Thu, Mar 7, 2024 at 1:27 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Thu, Mar 7, 2024 at 3:20 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > > >
> > > > On Thu, Mar 7, 2024 at 12:59 PM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > > > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11
> > > > > feature [-Werror,-Wtypedef-redefinition]"
> > > > >
> > > > > I'll look in the other templates to see what they do.
> > > >
> > > > Their "declare" sections have full typedefs. I found it works to leave
> > > > out the typedef for the "define" section, but I first want to
> > > > reproduce the build failure.
> > >
> > > Right. I've reproduced this build failure on my machine by specifying
> > > flags "-Wtypedef-redefinition -std=gnu99" to clang. Something the
> > > below change seems to fix the problem:
> >
> > Confirmed, will push shortly.
>
> mamba complained different build errors[1]:
>
>  2740 |  fprintf(stderr, "num_keys = %ld\n", tree->ctl->num_keys);
>       |                              ~~^     ~~~~~~~~~~~~~~~~~~~
>       |                                |              |
>       |                                long int       int64 {aka long long int}
>       |                              %lld
> ../../../../src/include/lib/radixtree.h:2752:30: error: format '%ld'
> expects argument of type 'long int', but argument 4 has type 'int64'
> {aka 'long long int'} [-Werror=format=]
>  2752 |   fprintf(stderr, ", n%d = %ld", size_class.fanout, tree->ctl->num_nodes[i]);
>       |                            ~~^   ~~~~~~~~~~~~~~~~~~~~~~~
>       |                              |             |
>       |                              long int      int64 {aka long long int}
>       |                            %lld
> ../../../../src/include/lib/radixtree.h:2755:32: error: format '%ld'
> expects argument of type 'long int', but argument 3 has type 'int64'
> {aka 'long long int'} [-Werror=format=]
>  2755 |  fprintf(stderr, ", leaves = %ld", tree->ctl->num_leaves);
>       |                              ~~^   ~~~~~~~~~~~~~~~~~~~~~
>       |                                |            |
>       |                                long int     int64 {aka long long int}
>       |                              %lld
>
> Regards,
>
> [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2024-03-07%2006%3A05%3A18

Yeah, the attached fixes it for me.
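
Presumably something along these lines -- a sketch assuming the usual
idiom of casting to long long for %lld, not necessarily the attached
patch:

fprintf(stderr, "num_keys = %lld\n", (long long) tree->ctl->num_keys);

Using the INT64_FORMAT macro would be another option.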

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Mar 7, 2024 at 4:21 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Mar 7, 2024 at 2:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Mar 7, 2024 at 4:01 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > >
> > > On Thu, Mar 7, 2024 at 1:27 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Thu, Mar 7, 2024 at 3:20 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > > > >
> > > > > On Thu, Mar 7, 2024 at 12:59 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > >
> > > > > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11
> > > > > > feature [-Werror,-Wtypedef-redefinition]"
> > > > > >
> > > > > > I'll look in the other templates to see what they do.
> > > > >
> > > > > Their "declare" sections have full typedefs. I found it works to leave
> > > > > out the typedef for the "define" section, but I first want to
> > > > > reproduce the build failure.
> > > >
> > > > Right. I've reproduced this build failure on my machine by specifying
> > > > flags "-Wtypedef-redefinition -std=gnu99" to clang. Something the
> > > > below change seems to fix the problem:
> > >
> > > Confirmed, will push shortly.
> >
> > mamba complained different build errors[1]:
> >
> >  2740 |  fprintf(stderr, "num_keys = %ld\n", tree->ctl->num_keys);
> >       |                              ~~^     ~~~~~~~~~~~~~~~~~~~
> >       |                                |              |
> >       |                                long int       int64 {aka long long int}
> >       |                              %lld
> > ../../../../src/include/lib/radixtree.h:2752:30: error: format '%ld'
> > expects argument of type 'long int', but argument 4 has type 'int64'
> > {aka 'long long int'} [-Werror=format=]
> >  2752 |   fprintf(stderr, ", n%d = %ld", size_class.fanout, tree->ctl->num_nodes[i]);
> >       |                            ~~^   ~~~~~~~~~~~~~~~~~~~~~~~
> >       |                              |             |
> >       |                              long int      int64 {aka long long int}
> >       |                            %lld
> > ../../../../src/include/lib/radixtree.h:2755:32: error: format '%ld'
> > expects argument of type 'long int', but argument 3 has type 'int64'
> > {aka 'long long int'} [-Werror=format=]
> >  2755 |  fprintf(stderr, ", leaves = %ld", tree->ctl->num_leaves);
> >       |                              ~~^   ~~~~~~~~~~~~~~~~~~~~~
> >       |                                |            |
> >       |                                long int     int64 {aka long long int}
> >       |                              %lld
> >
> > Regards,
> >
> > [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2024-03-07%2006%3A05%3A18
>
> Yeah, the attached fixes it for me.

Thanks, LGTM.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Mar 7, 2024 at 1:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> odr-violation seems to refer to One Definition Rule (ODR). According
> to Wikipedia[1]:
>
> The One Definition Rule (ODR) is an important rule of the C++
> programming language that prescribes that classes/structs and
> non-inline functions cannot have more than one definition in the
> entire program and template and types cannot have more than one
> definition by translation unit. It is defined in the ISO C++ Standard
> (ISO/IEC 14882) 2003, at section 3.2. Some other programming languages
> have similar but differently defined rules towards the same objective.
>
> I don't fully understand this concept yet but are these two different
> build failures related?

I thought it may have something to do with the prerequisite commit
that moved some symbols from bitmapset.c to .h:

/* Select appropriate bit-twiddling functions for bitmap word size */
#if BITS_PER_BITMAPWORD == 32
#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
#define bmw_popcount(w) pg_popcount32(w)
#elif BITS_PER_BITMAPWORD == 64
#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
#define bmw_popcount(w) pg_popcount64(w)
#else
#error "invalid BITS_PER_BITMAPWORD"
#endif

...but olingo's error seems strange to me, because it is complaining
of pg_leftmost_one_pos, which refers to the lookup table in
pg_bitutils.c -- I thought all buildfarm members used the bitscan
instructions.

grassquit is complaining of pg_popcount64, which is a global function,
also in pg_bitutils.c. Not sure what to make of this, since we're just
pointing symbols at things which should have a single definition...



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Mar 7, 2024 at 1:19 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> In addition, olingo and grassquit are showing different kinds of
> "AddressSanitizer: odr-violation" errors, which I'm not sure what to
> make of -- example:

This might be relevant:

$ git grep 'link_with: pgport_srv'
src/test/modules/test_radixtree/meson.build:  link_with: pgport_srv,

No other test module uses this directive, and indeed, removing this
still builds fine for me. Thoughts?



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Mar 7, 2024 at 6:37 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Mar 7, 2024 at 1:19 PM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > In addition, olingo and grassquit are showing different kinds of
> > "AddressSanitizer: odr-violation" errors, which I'm not sure what to
> > make of -- example:
>
> This might be relevant:
>
> $ git grep 'link_with: pgport_srv'
> src/test/modules/test_radixtree/meson.build:  link_with: pgport_srv,
>
> No other test module uses this directive, and indeed, removing this
> still builds fine for me. Thoughts?

Yeah, it could be the culprit. The test_radixtree/meson.build is the
sole extension that explicitly specifies a link with pgport_srv. I
think we can get rid of it, as I've also confirmed the build is still fine
even without it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Mar 7, 2024 at 4:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Mar 7, 2024 at 6:37 PM John Naylor <johncnaylorls@gmail.com> wrote:

> > $ git grep 'link_with: pgport_srv'
> > src/test/modules/test_radixtree/meson.build:  link_with: pgport_srv,
> >
> > No other test module uses this directive, and indeed, removing this
> > still builds fine for me. Thoughts?
>
> Yeah, it could be the culprit. The test_radixtree/meson.build is the
> sole extension that explicitly specifies a link with pgport_srv. I
> think we can get rid of it, as I've also confirmed the build is still fine
> even without it.

olingo and grassquit have turned green, so that must have been it.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Mar 7, 2024 at 8:06 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Mar 7, 2024 at 4:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Mar 7, 2024 at 6:37 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> > > $ git grep 'link_with: pgport_srv'
> > > src/test/modules/test_radixtree/meson.build:  link_with: pgport_srv,
> > >
> > > No other test module uses this directive, and indeed, removing this
> > > still builds fine for me. Thoughts?
> >
> > Yeah, it could be the culprit. The test_radixtree/meson.build is the
> > sole extension that explicitly specifies a link with pgport_srv. I
> > think we can get rid of it, as I've also confirmed the build is still fine
> > even without it.
>
> olingo and grassquit have turned green, so that must have been it.

Cool!

I've attached the remaining patches for CI. I've made some minor
changes in separate patches and drafted the commit message for
the tidstore patch.

While reviewing the tidstore code, I thought that it would be more
appropriate to place tidstore.c under src/backend/lib instead of
src/backend/access/common since the tidstore is no longer implemented
only for heap or other access methods, and it might also be used by
executor nodes in the future. What do you think?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Mar 7, 2024 at 8:06 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Mar 7, 2024 at 4:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Mar 7, 2024 at 6:37 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> > > $ git grep 'link_with: pgport_srv'
> > > src/test/modules/test_radixtree/meson.build:  link_with: pgport_srv,
> > >
> > > No other test module uses this directive, and indeed, removing this
> > > still builds fine for me. Thoughts?
> >
> > Yeah, it could be the culprit. The test_radixtree/meson.build is the
> > sole extension that explicitly specifies a link with pgport_srv. I
> > think we can get rid of it, as I've also confirmed the build is still fine
> > even without it.
>
> olingo and grassquit have turned green, so that must have been it.

fairywren is complaining about another build failure[1]:

[1931/2156] "gcc"  -o
src/test/modules/test_radixtree/test_radixtree.dll
src/test/modules/test_radixtree/test_radixtree.dll.p/win32ver.obj
src/test/modules/test_radixtree/test_radixtree.dll.p/test_radixtree.c.obj
"-Wl,--allow-shlib-undefined" "-shared" "-Wl,--start-group"
"-Wl,--out-implib=src/test\\modules\\test_radixtree\\test_radixtree.dll.a"
"-Wl,--stack,4194304" "-Wl,--allow-multiple-definition"
"-Wl,--disable-auto-import" "-fvisibility=hidden"
"C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/src/backend/libpostgres.exe.a"
"-pthread" "C:/tools/nmsys64/ucrt64/bin/../lib/libssl.dll.a"
"C:/tools/nmsys64/ucrt64/bin/../lib/libcrypto.dll.a"
"C:/tools/nmsys64/ucrt64/bin/../lib/libz.dll.a" "-lws2_32" "-lm"
"-lkernel32" "-luser32" "-lgdi32" "-lwinspool" "-lshell32" "-lole32"
"-loleaut32" "-luuid" "-lcomdlg32" "-ladvapi32" "-Wl,--end-group"
FAILED: src/test/modules/test_radixtree/test_radixtree.dll
"gcc"  -o src/test/modules/test_radixtree/test_radixtree.dll
src/test/modules/test_radixtree/test_radixtree.dll.p/win32ver.obj
src/test/modules/test_radixtree/test_radixtree.dll.p/test_radixtree.c.obj
"-Wl,--allow-shlib-undefined" "-shared" "-Wl,--start-group"
"-Wl,--out-implib=src/test\\modules\\test_radixtree\\test_radixtree.dll.a"
"-Wl,--stack,4194304" "-Wl,--allow-multiple-definition"
"-Wl,--disable-auto-import" "-fvisibility=hidden"
"C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/src/backend/libpostgres.exe.a"
"-pthread" "C:/tools/nmsys64/ucrt64/bin/../lib/libssl.dll.a"
"C:/tools/nmsys64/ucrt64/bin/../lib/libcrypto.dll.a"
"C:/tools/nmsys64/ucrt64/bin/../lib/libz.dll.a" "-lws2_32" "-lm"
"-lkernel32" "-luser32" "-lgdi32" "-lwinspool" "-lshell32" "-lole32"
"-loleaut32" "-luuid" "-lcomdlg32" "-ladvapi32" "-Wl,--end-group"
C:/tools/nmsys64/ucrt64/bin/../lib/gcc/x86_64-w64-mingw32/12.2.0/../../../../x86_64-w64-mingw32/bin/ld.exe:

src/test/modules/test_radixtree/test_radixtree.dll.p/test_radixtree.c.obj:test_radixtree:(.rdata$.refptr.pg_popcount64[.refptr.pg_popcount64]+0x0):
undefined reference to `pg_popcount64'

It looks like it requires a link with pgport_srv but I'm not sure. It
seems that the recent commit 1f1d73a8b breaks CI, Windows - Server
2019, VS 2019 - Meson & ninja, too.

Regards,

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-03-07%2012%3A53%3A20

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Mar 7, 2024 at 11:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> It looks like it requires a link with pgport_srv but I'm not sure. It
> seems that the recent commit 1f1d73a8b breaks CI, Windows - Server
> 2019, VS 2019 - Meson & ninja, too.

Unfortunately, none of the Windows animals happened to run between
the initial commit and the removal of the (seemingly useless on our
daily platforms) link. I'll confirm on my own CI branch in a few
minutes.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Mar 8, 2024 at 10:04 AM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Mar 7, 2024 at 11:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > It looks like it requires a link with pgport_srv but I'm not sure. It
> > seems that the recent commit 1f1d73a8b breaks CI, Windows - Server
> > 2019, VS 2019 - Meson & ninja, too.
>
> Unfortunately, none of the Windows animals happened to run between
> the initial commit and the removal of the (seemingly useless on our
> daily platforms) link. I'll confirm on my own CI branch in a few
> minutes.

Yesterday I confirmed that something like the below fixes the
problem that happened in Windows CI:

--- a/src/test/modules/test_radixtree/meson.build
+++ b/src/test/modules/test_radixtree/meson.build
@@ -12,6 +12,7 @@ endif

 test_radixtree = shared_module('test_radixtree',
   test_radixtree_sources,
+  link_with: host_system == 'windows' ? pgport_srv : [],
   kwargs: pg_test_mod_args,
 )
 test_install_libs += test_radixtree

But I'm not sure it's the right fix, especially because I guess it
could raise an "AddressSanitizer: odr-violation" error on Windows.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Fri, Mar 8, 2024 at 8:09 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Yesterday I confirmed that something like the below fixes the
> problem that happened in Windows CI:

Glad you shared before I went and did it.

> --- a/src/test/modules/test_radixtree/meson.build
> +++ b/src/test/modules/test_radixtree/meson.build
> @@ -12,6 +12,7 @@ endif
>
>  test_radixtree = shared_module('test_radixtree',
>    test_radixtree_sources,
> +  link_with: host_system == 'windows' ? pgport_srv : [],

I don't see any similar coding elsewhere, so that leaves me wondering
if we're missing something. On the other hand, maybe no test modules
use files in src/port ...

>    kwargs: pg_test_mod_args,
>  )
>  test_install_libs += test_radixtree
>
> But I'm not sure it's the right fix, especially because I guess it
> could raise an "AddressSanitizer: odr-violation" error on Windows.

Well, it's now at zero definitions that it can see, so I imagine it's
possible that adding the above would not cause more than one. In any
case, we might not know since as far as I can tell the MSVC animals
don't have address sanitizer. I'll look around some more, and if I
don't get any revelations, I guess we should go with the above.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Fri, Mar 8, 2024 at 8:09 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Yesterday I confirmed that something like the below fixes the
> problem that happened in Windows CI:
>
> --- a/src/test/modules/test_radixtree/meson.build
> +++ b/src/test/modules/test_radixtree/meson.build
> @@ -12,6 +12,7 @@ endif
>
>  test_radixtree = shared_module('test_radixtree',
>    test_radixtree_sources,
> +  link_with: host_system == 'windows' ? pgport_srv : [],
>    kwargs: pg_test_mod_args,
>  )
>  test_install_libs += test_radixtree

pgport_srv is for the backend; shared libraries should be using pgport_shlib

Further, the top level meson.build has:

# all shared libraries not part of the backend should depend on this
frontend_shlib_code = declare_dependency(
  include_directories: [postgres_inc],
  link_with: [common_shlib, pgport_shlib],
  sources: generated_headers,
  dependencies: [shlib_code, os_deps, libintl],
)

...but the only things that declare needing frontend_shlib_code are in
src/interfaces/.

In any case, I'm trying it in CI branch with pgport_shlib now.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Fri, Mar 8, 2024 at 9:53 AM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Fri, Mar 8, 2024 at 8:09 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Yesterday I confirmed that something like the below fixes the
> > problem that happened in Windows CI:
> >
> > --- a/src/test/modules/test_radixtree/meson.build
> > +++ b/src/test/modules/test_radixtree/meson.build
> > @@ -12,6 +12,7 @@ endif
> >
> >  test_radixtree = shared_module('test_radixtree',
> >    test_radixtree_sources,
> > +  link_with: host_system == 'windows' ? pgport_srv : [],
> >    kwargs: pg_test_mod_args,
> >  )
> >  test_install_libs += test_radixtree
>
> pgport_srv is for the backend; shared libraries should be using pgport_shlib

> In any case, I'm trying it in CI branch with pgport_shlib now.

That seems to work, so I'll push that just to get things green again.
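
For the archives, the change presumably amounts to something like this
in src/test/modules/test_radixtree/meson.build (a sketch, not the exact
committed diff):

test_radixtree = shared_module('test_radixtree',
  test_radixtree_sources,
  link_with: pgport_shlib,
  kwargs: pg_test_mod_args,
)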



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Mar 7, 2024 at 10:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I've attached the remaining patches for CI. I've made some minor
> changes in separate patches and drafted the commit message for
> the tidstore patch.
>
> While reviewing the tidstore code, I thought that it would be more
> appropriate to place tidstore.c under src/backend/lib instead of
> src/backend/access/common since the tidstore is no longer implemented
> only for heap or other access methods, and it might also be used by
> executor nodes in the future. What do you think?

That's a heck of a good question. I don't think src/backend/lib is
right -- it seems that's for general-purpose data structures.
Something like backend/utils is also too general.
src/backend/access/common has things for tuple descriptors, toast,
sessions, and I don't think tidstore is out of place here. I'm not
sure there's a better place, but I could be convinced otherwise.

v68-0001:

I'm not sure if commit messages are much a subject of review, and it's
up to the committer, but I'll share a couple comments just as
something to think about, not something I would ask you to change: I
think it's a bit distracting that the commit message talks about the
justification to use it for vacuum. Let's save that for the commit
with actual vacuum changes. Also, I suspect saying there are a "wide
range" of uses is over-selling it a bit, and that paragraph is a bit
awkward aside from that.

+ /* Collect TIDs extracted from the key-value pair */
+ result->num_offsets = 0;
+

This comment has nothing at all to do with this line. If the comment
is for several lines following, some of which are separated by blank
lines, there should be a blank line after the comment. Also, why isn't
tidstore_iter_extract_tids() responsible for setting that to zero?

+ ts->context = CurrentMemoryContext;

As far as I can tell, this member is never accessed again -- am I
missing something?

+ /* DSA for tidstore will be detached at the end of session */

No other test module pins the mapping, but that doesn't necessarily
mean it's wrong. Is there some advantage over explicitly detaching?

+-- Add tids in random order.

I don't see any randomization here. I do remember adding row_number to
remove whitespace in the output, but I don't remember a random order.
On that subject, the row_number was an easy trick to avoid extra
whitespace, but maybe we should just teach the setting function to
return blocknumber rather than null?

+Datum
+tidstore_create(PG_FUNCTION_ARGS)
+{
...
+ tidstore = TidStoreCreate(max_bytes, dsa);

+Datum
+tidstore_set_block_offsets(PG_FUNCTION_ARGS)
+{
....
+ TidStoreSetBlockOffsets(tidstore, blkno, offs, noffs);

These names are too similar. Maybe the test module should do
s/tidstore_/test_/ or similar.

+/* Sanity check if we've called tidstore_create() */
+static void
+check_tidstore_available(void)
+{
+ if (tidstore == NULL)
+ elog(ERROR, "tidstore is not initialized");
+}

I don't find this very helpful. If a developer wiped out the create
call, wouldn't the test crash and burn pretty obviously?

In general, the .sql file is still very hard-coded. Functions are
created that contain a VALUES statement. Maybe it's okay for now, but
wanted to mention it. Ideally, we'd have some randomized tests,
without having to display it. That could be in addition to (not
replacing) the small tests we have that display input. (see below)


v68-0002:

@@ -329,6 +381,13 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid)

  ret = (page->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0;

+#ifdef TIDSTORE_DEBUG
+ if (!TidStoreIsShared(ts))
+ {
+ bool ret_debug = ts_debug_is_member(ts, tid);
+ Assert(ret == ret_debug);
+ }
+#endif

This is only checking the case where we haven't returned already. In particular...

+ /* no entry for the blk */
+ if (page == NULL)
+ return false;
+
+ wordnum = WORDNUM(off);
+ bitnum = BITNUM(off);
+
+ /* no bitmap for the off */
+ if (wordnum >= page->nwords)
+ return false;

...these results are not checked.
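
One way to cover those early-return paths too would be to hoist the
cross-check out into a small wrapper instead -- a hypothetical sketch,
reusing the patch's names:

static bool
ts_is_member_checked(TidStore *ts, ItemPointer tid)
{
    bool    ret = TidStoreIsMember(ts, tid);

#ifdef TIDSTORE_DEBUG
    /* cross-check the result on every path, not just the last one */
    if (!TidStoreIsShared(ts))
        Assert(ret == ts_debug_is_member(ts, tid));
#endif

    return ret;
}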

More broadly, it seems like the test module should be able to test
everything that the debug-build array would complain about. Including
ordered iteration. This may require first saving our test input to a
table. We could create a cursor on a query that fetches the ordered
input from the table and verifies that the tid store iterate produces
the same ordered set, maybe with pl/pgSQL. Or something like that.
Seems like not a whole lot of work. I can try later in the week, if
you like.
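
Roughly along these lines, assuming a hypothetical saved-input table
test_input(blk, off) and a hypothetical SRF test_iterate_tids() that
returns the store's contents in iteration order:

DO $$
DECLARE
    c        CURSOR FOR SELECT blk, off FROM test_input ORDER BY blk, off;
    expected record;
    actual   record;
BEGIN
    OPEN c;
    FOR actual IN SELECT * FROM test_iterate_tids() LOOP
        FETCH c INTO expected;
        IF expected.blk IS DISTINCT FROM actual.blk OR
           expected.off IS DISTINCT FROM actual.off THEN
            RAISE EXCEPTION 'mismatch: expected (%, %), got (%, %)',
                expected.blk, expected.off, actual.blk, actual.off;
        END IF;
    END LOOP;
    CLOSE c;
END $$;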

v68-0005/6 look ready to squash

v68-0008 - I'm not a fan of capitalizing short comment fragments. I use
the style of either: short lower-case phrases, or full sentences
including capitalization, correct grammar and period. I see these two
styles all over the code base, as appropriate.

+ /* Remain attached until end of backend */

We'll probably want this comment, if in fact we want this behavior.

+ /*
+ * Note that funcctx->call_cntr is incremented in SRF_RETURN_NEXT
+ * before return.
+ */

I'm not sure what this is trying to say or why it's relevant, since
it's been a while since I've written a SRF in C.

That's all I have for now, and I haven't looked at the vacuum changes this time.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Fri, Feb 16, 2024 at 10:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Feb 15, 2024 at 8:26 PM John Naylor <johncnaylorls@gmail.com> wrote:

> > v61-0007: Runtime-embeddable tids -- Optional for v17, but should
> > reduce memory regressions, so should be considered. Up to 3 tids can
> > be stored in the last level child pointer. It's not polished, but I'll
> > only proceed with that if we think we need this. "flags" iis called
> > that because it could hold tidbitmap.c booleans (recheck, lossy) in
> > the future, in addition to reserving space for the pointer tag. Note:
> > I hacked the tests to only have 2 offsets per block to demo, but of
> > course both paths should be tested.
>
> Interesting. I've run the same benchmark tests we did[1][2] (the
> median of 3 runs):
[found a big speed-up where we don't expect one]

I tried to reproduce this (similar patch, but rebased on top of a bug
you recently fixed (possibly related?) -- attached, and also shows one
way to address some lack of coverage in the debug build, for as long
as we test that with CI).

Fortunately I cannot see a difference, so I believe it's not affecting
the case in this test at all, as expected:

v68:

INFO:  finished vacuuming "john.public.test": index scans: 1
pages: 0 removed, 442478 remain, 88478 scanned (20.00% of total)
tuples: 19995999 removed, 80003979 remain, 0 are dead but not yet removable
removable cutoff: 770, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan needed: 88478 pages from table (20.00% of total) had
19995999 dead item identifiers removed
index "test_x_idx": pages: 274194 in total, 54822 newly deleted, 54822
currently deleted, 0 reusable
avg read rate: 620.356 MB/s, avg write rate: 124.105 MB/s
buffer usage: 758236 hits, 274196 misses, 54854 dirtied
WAL usage: 2 records, 0 full page images, 425 bytes

system usage: CPU: user: 3.74 s, system: 0.68 s, elapsed: 4.45 s
system usage: CPU: user: 3.02 s, system: 0.42 s, elapsed: 3.47 s
system usage: CPU: user: 3.09 s, system: 0.38 s, elapsed: 3.49 s
system usage: CPU: user: 3.00 s, system: 0.43 s, elapsed: 3.45 s

v68 + emb values (that cannot be used because > 3 tids per block):

INFO:  finished vacuuming "john.public.test": index scans: 1
pages: 0 removed, 442478 remain, 88478 scanned (20.00% of total)
tuples: 19995999 removed, 80003979 remain, 0 are dead but not yet removable
removable cutoff: 775, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan needed: 88478 pages from table (20.00% of total) had
19995999 dead item identifiers removed
index "test_x_idx": pages: 274194 in total, 54822 newly deleted, 54822
currently deleted, 0 reusable
avg read rate: 570.808 MB/s, avg write rate: 114.192 MB/s
buffer usage: 758236 hits, 274196 misses, 54854 dirtied
WAL usage: 2 records, 0 full page images, 425 bytes

system usage: CPU: user: 3.11 s, system: 0.62 s, elapsed: 3.75 s
system usage: CPU: user: 3.04 s, system: 0.41 s, elapsed: 3.46 s
system usage: CPU: user: 3.05 s, system: 0.41 s, elapsed: 3.47 s
system usage: CPU: user: 3.04 s, system: 0.43 s, elapsed: 3.49 s

I'll continue polishing the runtime-embeddable values patch as time
permits, for later consideration.

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Mar 11, 2024 at 12:20 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Mar 7, 2024 at 10:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I've attached the remaining patches for CI. I've made some minor
> > changes in separate patches and drafted the commit message for
> > the tidstore patch.
> >
> > While reviewing the tidstore code, I thought that it would be more
> > appropriate to place tidstore.c under src/backend/lib instead of
> > src/backend/access/common since the tidstore is no longer implemented
> > only for heap or other access methods, and it might also be used by
> > executor nodes in the future. What do you think?
>
> That's a heck of a good question. I don't think src/backend/lib is
> right -- it seems that's for general-purpose data structures.
> Something like backend/utils is also too general.
> src/backend/access/common has things for tuple descriptors, toast,
> sessions, and I don't think tidstore is out of place here. I'm not
> sure there's a better place, but I could be convinced otherwise.

Yeah, I agree that src/backend/lib doesn't seem to be the place for
tidstore. Let's keep it in src/backend/access/common. If others think
differently, we can move it later.

>
> v68-0001:
>
> I'm not sure if commit messages are much a subject of review, and it's
> up to the committer, but I'll share a couple comments just as
> something to think about, not something I would ask you to change: I
> think it's a bit distracting that the commit message talks about the
> justification to use it for vacuum. Let's save that for the commit
> with actual vacuum changes. Also, I suspect saying there are a "wide
> range" of uses is over-selling it a bit, and that paragraph is a bit
> awkward aside from that.

Thank you for the comment, and I agreed. I've updated the commit message.

>
> + /* Collect TIDs extracted from the key-value pair */
> + result->num_offsets = 0;
> +
>
> This comment has nothing at all to do with this line. If the comment
> is for several lines following, some of which are separated by blank
> lines, there should be a blank line after the comment. Also, why isn't
> tidstore_iter_extract_tids() responsible for setting that to zero?

Agreed, fixed.

I also updated this part so we set result->blkno in
tidstore_iter_extract_tids() too, which seems more readable.

>
> + ts->context = CurrentMemoryContext;
>
> As far as I can tell, this member is never accessed again -- am I
> missing something?

You're right. It was used to re-create the tidstore in the same
context again while resetting it, but we no longer support the reset
API. Considering it again, would it be better to allocate the iterator
struct in the same context as we store the tidstore struct?

>
> + /* DSA for tidstore will be detached at the end of session */
>
> No other test module pins the mapping, but that doesn't necessarily
> mean it's wrong. Is there some advantage over explicitly detaching?

One small benefit of not explicitly detaching dsa_area in
tidstore_destroy() would be simplicity; IIUC if we want to do that, we
need to remember the dsa_area using (for example) a static variable,
and free it if it's non-NULL. I've implemented this idea in the
attached patch.
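
The shape of that idea, as a rough sketch (variable name hypothetical;
the attached patch is authoritative):

static dsa_area *test_dsa = NULL;   /* remembered at create time */

/* ... and in the destroy path: */
if (test_dsa != NULL)
{
    dsa_detach(test_dsa);
    test_dsa = NULL;
}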

>
> +-- Add tids in random order.
>
> I don't see any randomization here. I do remember adding row_number to
> remove whitespace in the output, but I don't remember a random order.
> On that subject, the row_number was an easy trick to avoid extra
> whitespace, but maybe we should just teach the setting function to
> return blocknumber rather than null?

Good idea, fixed.

>
> +Datum
> +tidstore_create(PG_FUNCTION_ARGS)
> +{
> ...
> + tidstore = TidStoreCreate(max_bytes, dsa);
>
> +Datum
> +tidstore_set_block_offsets(PG_FUNCTION_ARGS)
> +{
> ....
> + TidStoreSetBlockOffsets(tidstore, blkno, offs, noffs);
>
> These names are too similar. Maybe the test module should do
> s/tidstore_/test_/ or similar.

Agreed.

>
> +/* Sanity check if we've called tidstore_create() */
> +static void
> +check_tidstore_available(void)
> +{
> + if (tidstore == NULL)
> + elog(ERROR, "tidstore is not initialized");
> +}
>
> I don't find this very helpful. If a developer wiped out the create
> call, wouldn't the test crash and burn pretty obviously?

Removed.

>
> In general, the .sql file is still very hard-coded. Functions are
> created that contain a VALUES statement. Maybe it's okay for now, but
> wanted to mention it. Ideally, we'd have some randomized tests,
> without having to display it. That could be in addition to (not
> replacing) the small tests we have that display input. (see below)
>

Agreed to add randomized tests in addition to the existing tests.

>
> v68-0002:
>
> @@ -329,6 +381,13 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid)
>
>   ret = (page->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0;
>
> +#ifdef TIDSTORE_DEBUG
> + if (!TidStoreIsShared(ts))
> + {
> + bool ret_debug = ts_debug_is_member(ts, tid);
> + Assert(ret == ret_debug);
> + }
> +#endif
>
> This is only checking the case where we haven't returned already. In particular...
>
> + /* no entry for the blk */
> + if (page == NULL)
> + return false;
> +
> + wordnum = WORDNUM(off);
> + bitnum = BITNUM(off);
> +
> + /* no bitmap for the off */
> + if (wordnum >= page->nwords)
> + return false;
>
> ...these results are not checked.
>
> More broadly, it seems like the test module should be able to test
> everything that the debug-build array would complain about. Including
> ordered iteration. This may require first saving our test input to a
> table. We could create a cursor on a query that fetches the ordered
> input from the table and verifies that the tid store iterate produces
> the same ordered set, maybe with pl/pgSQL. Or something like that.
> Seems like not a whole lot of work. I can try later in the week, if
> you like.

Sounds like a good idea. In fact, if there were bugs in tidstore, it's
likely that even initdb would fail in practice. Still, it's a very
good idea that we can test the tidstore anyway with such a check,
without a debug-build array.

Or, as another idea, I wonder if we could keep the debug-build array in
some form. For example, we could use the array under a particular build
flag and set up a buildfarm animal for it. That way, we can test the
tidstore in more realistic cases.
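
As an aside on the earlier point about the unchecked early returns: one
way to make the debug cross-check cover every path would be to funnel
the lookup through an internal helper. A rough sketch -- the *_rt_find
names and the BlocktableEntry details are assumptions based on the
patch, not its actual code:

static bool
tidstore_is_member_internal(TidStore *ts, ItemPointer tid)
{
	BlockNumber blk = ItemPointerGetBlockNumber(tid);
	OffsetNumber off = ItemPointerGetOffsetNumber(tid);
	BlocktableEntry *page;
	int		wordnum;
	int		bitnum;

	page = TidStoreIsShared(ts)
		? shared_rt_find(ts->tree.shared, blk)
		: local_rt_find(ts->tree.local, blk);

	/* no entry for the blk */
	if (page == NULL)
		return false;

	wordnum = WORDNUM(off);
	bitnum = BITNUM(off);

	/* no bitmap for the off */
	if (wordnum >= page->nwords)
		return false;

	return (page->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0;
}

bool
TidStoreIsMember(TidStore *ts, ItemPointer tid)
{
	bool	ret = tidstore_is_member_internal(ts, tid);

#ifdef TIDSTORE_DEBUG
	/* the cross-check now sees the early-return results too */
	if (!TidStoreIsShared(ts))
		Assert(ret == ts_debug_is_member(ts, tid));
#endif

	return ret;
}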

>
> v68-0005/6 look ready to squash

Done.

>
> v68-0008 - I'm not a fan of capitalizing short comment fragments. I use
> the style of either: short lower-case phrases, or full sentences
> including capitalization, correct grammar and period. I see these two
> styles all over the code base, as appropriate.

Agreed.

>
> + /* Remain attached until end of backend */
>
> We'll probably want this comment, if in fact we want this behavior.

Kept it.

>
> + /*
> + * Note that funcctx->call_cntr is incremented in SRF_RETURN_NEXT
> + * before return.
> + */
>
> I'm not sure what this is trying to say or why it's relevant, since
> it's been a while since I've written a SRF in C.

What I wanted to say is that we cannot do something like:

SRF_RETURN_NEXT(funcctx, PointerGetDatum(&(tids[funcctx->call_cntr])));

because funcctx->call_cntr is incremented *before* the return, so we
would end up accessing an index out of range. It took me some time to
realize this.
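
The safe pattern is to evaluate the element before invoking the macro,
since SRF_RETURN_NEXT increments call_cntr before its argument is used.
A minimal sketch, assuming hypothetical tids/num_tids variables in the
SRF's user context:

if (funcctx->call_cntr < num_tids)
{
	/* read call_cntr before the macro bumps it */
	ItemPointer tid = &tids[funcctx->call_cntr];

	SRF_RETURN_NEXT(funcctx, PointerGetDatum(tid));
}
else
	SRF_RETURN_DONE(funcctx);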

> That's all I have for now, and I haven't looked at the vacuum changes this time.

Thank you for the comments!

In the latest (v69) patch:

- squashed v68-0005 and v68-0006 patches.
- removed most of the changes in v68-0007 patch.
- addressed above review comments in v69-0002 patch.
- v69-0003, 0004, and 0005 are miscellaneous updates.

As for renaming TidStore to TIDStore, I dropped the patch for now
since it seems we're using "Tid" in some function names and variable
names. If we want to update it, we can do that later.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Mar 11, 2024 at 5:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> In the latest (v69) patch:
>
> - squashed v68-0005 and v68-0006 patches.
> - removed most of the changes in v68-0007 patch.
> - addressed above review comments in v69-0002 patch.
> - v69-0003, 0004, and 0005 are miscellaneous updates.

Since v69 conflicts with the current HEAD, I've rebased the patches. In
addition, v70-0008 is a new patch, which cleans up the vacuum
integration patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Mar 11, 2024 at 3:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Mar 11, 2024 at 12:20 PM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > On Thu, Mar 7, 2024 at 10:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> > + ts->context = CurrentMemoryContext;
> >
> > As far as I can tell, this member is never accessed again -- am I
> > missing something?
>
> You're right. It was used to re-create the tidstore in the same
> context again while resetting it, but we no longer support the reset
> API. Considering it again, would it be better to allocate the iterator
> struct in the same context as we store the tidstore struct?

That makes sense.

> > + /* DSA for tidstore will be detached at the end of session */
> >
> > No other test module pins the mapping, but that doesn't necessarily
> > mean it's wrong. Is there some advantage over explicitly detaching?
>
> One small benefit of not explicitly detaching dsa_area in
> tidstore_destroy() would be simplicity; IIUC if we want to do that, we
> need to remember the dsa_area using (for example) a static variable,
> and free it if it's non-NULL. I've implemented this idea in the
> attached patch.

Okay, I don't have a strong preference at this point.

> > +-- Add tids in random order.
> >
> > I don't see any randomization here. I do remember adding row_number to
> > remove whitespace in the output, but I don't remember a random order.
> > On that subject, the row_number was an easy trick to avoid extra
> > whitespace, but maybe we should just teach the setting function to
> > return blocknumber rather than null?
>
> Good idea, fixed.

+ test_set_block_offsets
+------------------------
+             2147483647
+                      0
+             4294967294
+                      1
+             4294967295

Hmm, was the earlier comment about randomness referring to this? I'm
not sure what other regression tests do in these cases, or how
reliable this is. If this is a problem, we could simply insert this
result into a temp table so it's not output.

> > +Datum
> > +tidstore_create(PG_FUNCTION_ARGS)
> > +{
> > ...
> > + tidstore = TidStoreCreate(max_bytes, dsa);
> >
> > +Datum
> > +tidstore_set_block_offsets(PG_FUNCTION_ARGS)
> > +{
> > ....
> > + TidStoreSetBlockOffsets(tidstore, blkno, offs, noffs);
> >
> > These names are too similar. Maybe the test module should do
> > s/tidstore_/test_/ or similar.
>
> Agreed.

Mostly okay, although a couple look a bit generic now. I'll leave it
up to you if you want to tweak things.

> > In general, the .sql file is still very hard-coded. Functions are
> > created that contain a VALUES statement. Maybe it's okay for now, but
> > wanted to mention it. Ideally, we'd have some randomized tests,
> > without having to display it. That could be in addition to (not
> > replacing) the small tests we have that display input. (see below)
> >
>
> Agreed to add randomized tests in addition to the existing tests.

I'll try something tomorrow.

> Sounds a good idea. In fact, if there are some bugs in tidstore, it's
> likely that even initdb would fail in practice. However, it's a very
> good idea that we can test the tidstore anyway with such a check
> without a debug-build array.
>
> Or as another idea, I wonder if we could keep the debug-build array in
> some form. For example, we use the array with the particular build
> flag and set a BF animal for that. That way, we can test the tidstore
> in more real cases.

I think the purpose of a debug flag is to help developers catch
mistakes. I don't think it's quite useful enough for that. For one, it
has the same 1GB limitation as vacuum's current array. For another,
it'd be a terrible way to debug moving tidbitmap.c from its hash table
to use TID store -- AND/OR operations and lossy pages are pretty much
undoable with a copy of vacuum's array. Last year, when I insisted on
trying a long-term realistic load that compared the results with the
array, the encoding scheme was much harder to understand in code. I
think it's now easier, and there are better tests.

> In the latest (v69) patch:
>
> - squashed v68-0005 and v68-0006 patches.
> - removed most of the changes in v68-0007 patch.
> - addressed above review comments in v69-0002 patch.
> - v69-0003, 0004, and 0005 are miscellaneous updates.
>
> As for renaming TidStore to TIDStore, I dropped the patch for now
> since it seems we're using "Tid" in some function names and variable
> names. If we want to update it, we can do that later.

I think we're not consistent across the codebase, and it's fine to
drop that patch.

v70-0008:

@@ -489,7 +489,7 @@ parallel_vacuum_reset_dead_items(ParallelVacuumState *pvs)
  /*
  * Free the current tidstore and return allocated DSA segments to the
  * operating system. Then we recreate the tidstore with the same max_bytes
- * limitation.
+ * limitation we just used.

Nowadays, max_bytes is more like a hint for tidstore, not a
limitation, right? Vacuum has the limitation. Maybe instead of "with",
we should say "passing the same limitation".

I wonder how "di_info" would look as "dead_items_info". I don't feel
too strongly about it, though.

I'm going to try additional regression tests, as mentioned, and try a
couple benchmarks. It should be only a couple more days.

One thing that occurred to me: The radix tree regression tests only
compile and run the local memory case. The tidstore commit would be
the first time the buildfarm has seen the shared memory case, so we
should look out for possible build failures of the same sort we saw
with the radix tree tests. I see you've already removed the
problematic link_with command -- that's the kind of thing to
double-check for.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Mar 12, 2024 at 7:34 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Mon, Mar 11, 2024 at 3:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Mar 11, 2024 at 12:20 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > >
> > > On Thu, Mar 7, 2024 at 10:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > > + ts->context = CurrentMemoryContext;
> > >
> > > As far as I can tell, this member is never accessed again -- am I
> > > missing something?
> >
> > You're right. It was used to re-create the tidstore in the same
> > context again while resetting it, but we no longer support the reset
> > API. Considering it again, would it be better to allocate the iterator
> > struct in the same context as we store the tidstore struct?
>
> That makes sense.
>
> > > + /* DSA for tidstore will be detached at the end of session */
> > >
> > > No other test module pins the mapping, but that doesn't necessarily
> > > mean it's wrong. Is there some advantage over explicitly detaching?
> >
> > One small benefit of not explicitly detaching dsa_area in
> > tidstore_destroy() would be simplicity; IIUC if we want to do that, we
> > need to remember the dsa_area using (for example) a static variable,
> > and free it if it's non-NULL. I've implemented this idea in the
> > attached patch.
>
> Okay, I don't have a strong preference at this point.

I'll keep that update, then.

>
> > > +-- Add tids in random order.
> > >
> > > I don't see any randomization here. I do remember adding row_number to
> > > remove whitespace in the output, but I don't remember a random order.
> > > On that subject, the row_number was an easy trick to avoid extra
> > > whitespace, but maybe we should just teach the setting function to
> > > return blocknumber rather than null?
> >
> > Good idea, fixed.
>
> + test_set_block_offsets
> +------------------------
> +             2147483647
> +                      0
> +             4294967294
> +                      1
> +             4294967295
>
> Hmm, was the earlier comment about randomness referring to this? I'm
> not sure what other regression tests do in these cases, or how
> relibale this is. If this is a problem we could simply insert this
> result into a temp table so it's not output.

I didn't address the comment about randomness.

I think that we will have both random-TIDs tests and fixed-TIDs tests
in test_tidstore, as we discussed, and we can probably do both tests
with similar steps: insert TIDs into both a temp table and the tidstore,
and check whether the tidstore returns the expected results by
comparing them against the temp table. We could have a common pl/pgsql
function that does that check and raises a WARNING or an ERROR. But
given that this is very similar to what we did in test_radixtree, do we
really want to implement it using a pl/pgsql function? When we
discussed it before, I found the current way made sense. But given that
we're adding more tests now and will add more in the future, doing the
tests in C will be more maintainable and faster. Also, I think we can
do the debug-build array checks in the test_tidstore code instead.

>
> > > +Datum
> > > +tidstore_create(PG_FUNCTION_ARGS)
> > > +{
> > > ...
> > > + tidstore = TidStoreCreate(max_bytes, dsa);
> > >
> > > +Datum
> > > +tidstore_set_block_offsets(PG_FUNCTION_ARGS)
> > > +{
> > > ....
> > > + TidStoreSetBlockOffsets(tidstore, blkno, offs, noffs);
> > >
> > > These names are too similar. Maybe the test module should do
> > > s/tidstore_/test_/ or similar.
> >
> > Agreed.
>
> Mostly okay, although a couple look a bit generic now. I'll leave it
> up to you if you want to tweak things.
>
> > > In general, the .sql file is still very hard-coded. Functions are
> > > created that contain a VALUES statement. Maybe it's okay for now, but
> > > wanted to mention it. Ideally, we'd have some randomized tests,
> > > without having to display it. That could be in addition to (not
> > > replacing) the small tests we have that display input. (see below)
> > >
> >
> > Agreed to add randomized tests in addition to the existing tests.
>
> I'll try something tomorrow.
>
> > Sounds a good idea. In fact, if there are some bugs in tidstore, it's
> > likely that even initdb would fail in practice. However, it's a very
> > good idea that we can test the tidstore anyway with such a check
> > without a debug-build array.
> >
> > Or as another idea, I wonder if we could keep the debug-build array in
> > some form. For example, we use the array with the particular build
> > flag and set a BF animal for that. That way, we can test the tidstore
> > in more real cases.
>
> I think the purpose of a debug flag is to help developers catch
> mistakes. I don't think it's quite useful enough for that. For one, it
> has the same 1GB limitation as vacuum's current array. For another,
> it'd be a terrible way to debug moving tidbitmap.c from its hash table
> to use TID store -- AND/OR operations and lossy pages are pretty much
> undoable with a copy of vacuum's array.

Valid points.

As I mentioned above, if we implement the test cases in C, we can use
the debug-build array in the test code. And we won't need it for the
AND/OR operation tests in the future.

>
> > In the latest (v69) patch:
> >
> > - squashed v68-0005 and v68-0006 patches.
> > - removed most of the changes in v68-0007 patch.
> > - addressed above review comments in v69-0002 patch.
> > - v69-0003, 0004, and 0005 are miscellaneous updates.
> >
> > As for renaming TidStore to TIDStore, I dropped the patch for now
> > since it seems we're using "Tid" in some function names and variable
> > names. If we want to update it, we can do that later.
>
> I think we're not consistent across the codebase, and it's fine to
> drop that patch.
>
> v70-0008:
>
> @@ -489,7 +489,7 @@ parallel_vacuum_reset_dead_items(ParallelVacuumState *pvs)
>   /*
>   * Free the current tidstore and return allocated DSA segments to the
>   * operating system. Then we recreate the tidstore with the same max_bytes
> - * limitation.
> + * limitation we just used.
>
> Nowadays, max_bytes is now more like a hint for tidstore, and not a
> limitation, right? Vacuum has the limitation.

Right.

>  Maybe instead of "with",
> we should say "passing the same limitation".

Will fix.

>
> I wonder how "di_info" would look as "dead_items_info". I don't feel
> too strongly about it, though.

Agreed.

>
> I'm going to try additional regression tests, as mentioned, and try a
> couple benchmarks. It should be only a couple more days.

Thank you!

> One thing that occurred to me: The radix tree regression tests only
> compile and run the local memory case. The tidstore commit would be
> the first time the buildfarm has seen the shared memory case, so we
> should look out for possible build failures of the same sort we saw
> with the the radix tree tests. I see you've already removed the
> problematic link_with command -- that's the kind of thing to
> double-check for.

Good point, agreed. I'll double-check it again.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Wed, Mar 13, 2024 at 8:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> As I mentioned above, if we implement the test cases in C, we can use
> the debug-build array in the test code. And we won't use it in AND/OR
> operations tests in the future.

That's a really interesting idea, so I went ahead and tried that for
v71. This seems like a good basis for testing larger, randomized
inputs, once we decide how best to hide that from the expected output.
The tests use SQL functions do_set_block_offsets() and
check_set_block_offsets(). The latter does two checks against a tid
array, and replaces test_dump_tids(). Funnily enough, the debug array
itself gave false failures when using a similar array in the test
harness, because it didn't know all the places where the array should
have been sorted -- it only worked by chance before, because of the
order in which things were done.

I squashed everything from v70 and also took the liberty of switching
on shared memory for tid store tests. The only reason we didn't do
this with the radix tree tests is that the static attach/detach
functions would raise warnings since they are not used.

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Mar 13, 2024 at 8:05 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Wed, Mar 13, 2024 at 8:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > As I mentioned above, if we implement the test cases in C, we can use
> > the debug-build array in the test code. And we won't use it in AND/OR
> > operations tests in the future.
>
> That's a really interesting idea, so I went ahead and tried that for
> v71. This seems like a good basis for testing larger, randomized
> inputs, once we decide how best to hide that from the expected output.
> The tests use SQL functions do_set_block_offsets() and
> check_set_block_offsets(). The latter does two checks against a tid
> array, and replaces test_dump_tids().

Great! I think that's a very good start.

lookup_test() (and test_lookup_tids()) also tests that the IsMember()
function returns false as expected if the TID doesn't exist in the
store, and we could probably do these checks in a C function too.
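
For illustration, a C-side version of that check could be as simple as
the following sketch (the variable names are assumptions):

ItemPointerData tid;

ItemPointerSet(&tid, blkno, off);
if (TidStoreIsMember(tidstore, &tid) != expected)
	elog(ERROR, "TidStoreIsMember for (%u, %u) returned unexpected result",
		 blkno, off);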

BTW do we still want to test the tidstore by using a combination of
SQL functions? We might no longer need to input TIDs via a SQL
function.

> Funnily enough, the debug array
> itself gave false failures when using a similar array in the test
> harness, because it didn't know all the places where the array should
> have been sorted -- it only worked by chance before because of what
> order things were done.

Good catch, thanks.

> I squashed everything from v70 and also took the liberty of switching
> on shared memory for tid store tests. The only reason we didn't do
> this with the radix tree tests is that the static attach/detach
> functions would raise warnings since they are not used.

Agreed to test the tidstore on shared memory.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Wed, Mar 13, 2024 at 9:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Mar 13, 2024 at 8:05 PM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > On Wed, Mar 13, 2024 at 8:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > > As I mentioned above, if we implement the test cases in C, we can use
> > > the debug-build array in the test code. And we won't use it in AND/OR
> > > operations tests in the future.
> >
> > That's a really interesting idea, so I went ahead and tried that for
> > v71. This seems like a good basis for testing larger, randomized
> > inputs, once we decide how best to hide that from the expected output.
> > The tests use SQL functions do_set_block_offsets() and
> > check_set_block_offsets(). The latter does two checks against a tid
> > array, and replaces test_dump_tids().
>
> Great! I think that's a very good starter.
>
> The lookup_test() (and test_lookup_tids()) do also test that the
> IsMember() function returns false as expected if the TID doesn't exist
> in it, and probably we can do these tests in a C function too.
>
> BTW do we still want to test the tidstore by using a combination of
> SQL functions? We might no longer need to input TIDs via a SQL
> function.

I'm not sure. I stopped short of doing that to get feedback on this
much. One advantage with SQL functions is we can use generate_series
to easily input lists of blocks with different numbers and strides,
and array literals for offsets are a bit easier. What do you think?



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Mar 14, 2024 at 9:59 AM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Wed, Mar 13, 2024 at 9:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Mar 13, 2024 at 8:05 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > >
> > > On Wed, Mar 13, 2024 at 8:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > > As I mentioned above, if we implement the test cases in C, we can use
> > > > the debug-build array in the test code. And we won't use it in AND/OR
> > > > operations tests in the future.
> > >
> > > That's a really interesting idea, so I went ahead and tried that for
> > > v71. This seems like a good basis for testing larger, randomized
> > > inputs, once we decide how best to hide that from the expected output.
> > > The tests use SQL functions do_set_block_offsets() and
> > > check_set_block_offsets(). The latter does two checks against a tid
> > > array, and replaces test_dump_tids().
> >
> > Great! I think that's a very good starter.
> >
> > The lookup_test() (and test_lookup_tids()) do also test that the
> > IsMember() function returns false as expected if the TID doesn't exist
> > in it, and probably we can do these tests in a C function too.
> >
> > BTW do we still want to test the tidstore by using a combination of
> > SQL functions? We might no longer need to input TIDs via a SQL
> > function.
>
> I'm not sure. I stopped short of doing that to get feedback on this
> much. One advantage with SQL functions is we can use generate_series
> to easily input lists of blocks with different numbers and strides,
> and array literals for offsets are a bit easier. What do you think?

While I'm not a fan of the following part, I agree that it makes sense
to use SQL functions for test data generation:

-- Constant values used in the tests.
\set maxblkno 4294967295
-- The maximum number of heap tuples (MaxHeapTuplesPerPage) in 8kB block is 291.
-- We use a higher number to test tidstore.
\set maxoffset 512

It would also make it easier for developers to test the tidstore with
their own data sets. So I agree with the current approach: use SQL
functions for data generation and do the actual tests inside C
functions. Would it be convenient for developers to have functions like
generate_tids() and generate_random_tids() to generate TIDs, so that
they can pass them to do_set_block_offsets() and then call
check_set_block_offsets() and the others for the actual lookup and
iteration tests?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Mar 14, 2024 at 8:53 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Mar 14, 2024 at 9:59 AM John Naylor <johncnaylorls@gmail.com> wrote:
> > > BTW do we still want to test the tidstore by using a combination of
> > > SQL functions? We might no longer need to input TIDs via a SQL
> > > function.
> >
> > I'm not sure. I stopped short of doing that to get feedback on this
> > much. One advantage with SQL functions is we can use generate_series
> > to easily input lists of blocks with different numbers and strides,
> > and array literals for offsets are a bit easier. What do you think?
>
> While I'm not a fan of the following part, I agree that it makes sense
> to use SQL functions for test data generation:
>
> -- Constant values used in the tests.
> \set maxblkno 4294967295
> -- The maximum number of heap tuples (MaxHeapTuplesPerPage) in 8kB block is 291.
> -- We use a higher number to test tidstore.
> \set maxoffset 512

I'm not really a fan of these either; they could be removed at some
point once we've done everything else nicely.

> It would also be easier for developers to test the tidstore with their
> own data set. So I agreed with the current approach; use SQL functions
> for data generation and do the actual tests inside C functions.

Okay, here's another idea: Change test_lookup_tids() to be more
general and put the validation down into C as well. First we save the
blocks from do_set_block_offsets() into a table, then with all those
blocks look up a sufficiently large range of possible offsets and save
the found values in another array. So the static items structure would
have 3 arrays: inserts, successful lookups, and iteration (currently
the iteration output is private to check_set_block_offsets()). Then
sort as needed and check they are all the same.
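
For illustration, the static structure might end up looking roughly
like this (a sketch; the names are assumptions, not the actual test
module code):

/* one growable TID array per phase, compared after sorting */
typedef struct ItemArray
{
	ItemPointerData *insert_tids;	/* passed to do_set_block_offsets() */
	ItemPointerData *lookup_tids;	/* found by TidStoreIsMember() */
	ItemPointerData *iter_tids;		/* produced by iteration */
	int			num_tids;
	int			max_tids;
} ItemArray;

static ItemArray items;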

Further thought: We may not really need to test block numbers that
vigorously, since the radix tree tests should cover keys/values pretty
well. The difference here is using bitmaps of tids and that should be
well covered.

Locally (not CI), we should try big inputs to make sure we can
actually go up to many GB -- it's easier and faster this way than
having vacuum give us a large data set.

> Is it
> convenient for developers if we have functions like generate_tids()
> and generate_random_tids() to generate TIDs so that they can pass them
> to do_set_block_offsets()?

I guess I don't see the advantage of adding a layer of indirection at
this point, but it could be useful at a later time.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Mar 14, 2024 at 1:29 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Mar 14, 2024 at 8:53 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Mar 14, 2024 at 9:59 AM John Naylor <johncnaylorls@gmail.com> wrote:
> > > > BTW do we still want to test the tidstore by using a combination of
> > > > SQL functions? We might no longer need to input TIDs via a SQL
> > > > function.
> > >
> > > I'm not sure. I stopped short of doing that to get feedback on this
> > > much. One advantage with SQL functions is we can use generate_series
> > > to easily input lists of blocks with different numbers and strides,
> > > and array literals for offsets are a bit easier. What do you think?
> >
> > While I'm not a fan of the following part, I agree that it makes sense
> > to use SQL functions for test data generation:
> >
> > -- Constant values used in the tests.
> > \set maxblkno 4294967295
> > -- The maximum number of heap tuples (MaxHeapTuplesPerPage) in 8kB block is 291.
> > -- We use a higher number to test tidstore.
> > \set maxoffset 512
>
> I'm not really a fan of these either, and could be removed a some
> point if we've done everything else nicely.
>
> > It would also be easier for developers to test the tidstore with their
> > own data set. So I agreed with the current approach; use SQL functions
> > for data generation and do the actual tests inside C functions.
>
> Okay, here's an another idea: Change test_lookup_tids() to be more
> general and put the validation down into C as well. First we save the
> blocks from do_set_block_offsets() into a table, then with all those
> blocks lookup a sufficiently-large range of possible offsets and save
> found values in another array. So the static items structure would
> have 3 arrays: inserts, successful lookups, and iteration (currently
> the iteration output is private to check_set_block_offsets(). Then
> sort as needed and check they are all the same.

That's a promising idea. We can use the same mechanism for randomized
tests too. If you're going to work on this, I'll do other tests in my
environment in the meantime.

>
> Further thought: We may not really need to test block numbers that
> vigorously, since the radix tree tests should cover keys/values pretty
> well.

Agreed. Probably boundary block numbers: 0, 1, MaxBlockNumber - 1, and
MaxBlockNumber, would be sufficient.

>  The difference here is using bitmaps of tids and that should be
> well covered.

Right. We would need to test offset numbers vigorously instead.

>
> Locally (not CI), we should try big inputs to make sure we can
> actually go up to many GB -- it's easier and faster this way than
> having vacuum give us a large data set.

I'll do these tests.

>
> > Is it
> > convenient for developers if we have functions like generate_tids()
> > and generate_random_tids() to generate TIDs so that they can pass them
> > to do_set_block_offsets()?
>
> I guess I don't see the advantage of adding a layer of indirection at
> this point, but it could be useful at a later time.

Agreed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Mar 14, 2024 at 1:29 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > Okay, here's an another idea: Change test_lookup_tids() to be more
> > general and put the validation down into C as well. First we save the
> > blocks from do_set_block_offsets() into a table, then with all those
> > blocks lookup a sufficiently-large range of possible offsets and save
> > found values in another array. So the static items structure would
> > have 3 arrays: inserts, successful lookups, and iteration (currently
> > the iteration output is private to check_set_block_offsets(). Then
> > sort as needed and check they are all the same.
>
> That's a promising idea. We can use the same mechanism for randomized
> tests too. If you're going to work on this, I'll do other tests on my
> environment in the meantime.

Some progress on this in v72 -- I tried first without using SQL to
save the blocks, just using the unique blocks from the verification
array. It seems to work fine. Some open questions on the test module:

- Since there are now three arrays we should reduce max bytes to
something smaller.
- Further on that, I'm not sure if the "is full" test is telling us
much. It seems we could make max bytes a static variable and set it to
the size of the empty store. I'm guessing it wouldn't take much to add
enough tids so that the contexts need to allocate some blocks, and
then it would appear full and we can test that. I've made it so all
arrays repalloc when needed, just in case.
- Why are we switching to TopMemoryContext? It's not explained -- the
comment only tells what the code is doing (which is obvious), but not
why.
- I'm not sure it's useful to keep test_lookup_tids() around. Since we
now have a separate lookup test, the only thing it can tell us is that
lookups fail on an empty store. I arranged it so that
check_set_block_offsets() works on an empty store. Although that's
even more trivial, it's just reusing what we already need.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Mar 14, 2024 at 6:55 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Mar 14, 2024 at 1:29 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > > Okay, here's an another idea: Change test_lookup_tids() to be more
> > > general and put the validation down into C as well. First we save the
> > > blocks from do_set_block_offsets() into a table, then with all those
> > > blocks lookup a sufficiently-large range of possible offsets and save
> > > found values in another array. So the static items structure would
> > > have 3 arrays: inserts, successful lookups, and iteration (currently
> > > the iteration output is private to check_set_block_offsets(). Then
> > > sort as needed and check they are all the same.
> >
> > That's a promising idea. We can use the same mechanism for randomized
> > tests too. If you're going to work on this, I'll do other tests on my
> > environment in the meantime.
>
> Some progress on this in v72 -- I tried first without using SQL to
> save the blocks, just using the unique blocks from the verification
> array. It seems to work fine.

Thanks!

>
> - Since there are now three arrays we should reduce max bytes to
> something smaller.

Agreed.

> - Further on that, I'm not sure if the "is full" test is telling us
> much. It seems we could make max bytes a static variable and set it to
> the size of the empty store. I'm guessing it wouldn't take much to add
> enough tids so that the contexts need to allocate some blocks, and
> then it would appear full and we can test that. I've made it so all
> arrays repalloc when needed, just in case.

How about using work_mem as max_bytes instead of having it as a static
variable? In test_tidstore.sql we set work_mem before creating the
tidstore. It would make the tidstore more controllable by SQL queries.

> - Why are we switching to TopMemoryContext? It's not explained -- the
> comment only tells what the code is doing (which is obvious), but not
> why.

This is because the tidstore needs to live across the transaction
boundary. We can use TopMemoryContext or CacheMemoryContext.

> - I'm not sure it's useful to keep test_lookup_tids() around. Since we
> now have a separate lookup test, the only thing it can tell us is that
> lookups fail on an empty store. I arranged it so that
> check_set_block_offsets() works on an empty store. Although that's
> even more trivial, it's just reusing what we already need.

Agreed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Mar 14, 2024 at 9:03 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Mar 14, 2024 at 6:55 PM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > > > Okay, here's an another idea: Change test_lookup_tids() to be more
> > > > general and put the validation down into C as well. First we save the
> > > > blocks from do_set_block_offsets() into a table, then with all those
> > > > blocks lookup a sufficiently-large range of possible offsets and save
> > > > found values in another array. So the static items structure would
> > > > have 3 arrays: inserts, successful lookups, and iteration (currently
> > > > the iteration output is private to check_set_block_offsets(). Then
> > > > sort as needed and check they are all the same.
> > >
> > > That's a promising idea. We can use the same mechanism for randomized
> > > tests too. If you're going to work on this, I'll do other tests on my
> > > environment in the meantime.
> >
> > Some progress on this in v72 -- I tried first without using SQL to
> > save the blocks, just using the unique blocks from the verification
> > array. It seems to work fine.
>
> Thanks!
>
> >
> > - Since there are now three arrays we should reduce max bytes to
> > something smaller.
>
> Agreed.
>
> > - Further on that, I'm not sure if the "is full" test is telling us
> > much. It seems we could make max bytes a static variable and set it to
> > the size of the empty store. I'm guessing it wouldn't take much to add
> > enough tids so that the contexts need to allocate some blocks, and
> > then it would appear full and we can test that. I've made it so all
> > arrays repalloc when needed, just in case.
>
> How about using work_mem as max_bytes instead of having it as a static
> variable? In test_tidstore.sql we set work_mem before creating the
> tidstore. It would make the tidstore more controllable by SQL queries.
>
> > - Why are we switching to TopMemoryContext? It's not explained -- the
> > comment only tells what the code is doing (which is obvious), but not
> > why.
>
> This is because the tidstore needs to live across the transaction
> boundary. We can use TopMemoryContext or CacheMemoryContext.
>
> > - I'm not sure it's useful to keep test_lookup_tids() around. Since we
> > now have a separate lookup test, the only thing it can tell us is that
> > lookups fail on an empty store. I arranged it so that
> > check_set_block_offsets() works on an empty store. Although that's
> > even more trivial, it's just reusing what we already need.
>
> Agreed.
>

I have two questions on tidstore.c:

+/*
+ * Set the given TIDs on the blkno to TidStore.
+ *
+ * NB: the offset numbers in offsets must be sorted in ascending order.
+ */

Do we need some assertions to check that the given offset numbers are
sorted as expected?
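
Something like this, perhaps -- a sketch, assuming num_offsets is the
array-length parameter:

#ifdef USE_ASSERT_CHECKING
	/* enforce the documented contract in assert-enabled builds */
	for (int i = 1; i < num_offsets; i++)
		Assert(offsets[i] > offsets[i - 1]);
#endif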

---
+   if (TidStoreIsShared(ts))
+       found = shared_rt_set(ts->tree.shared, blkno, page);
+   else
+       found = local_rt_set(ts->tree.local, blkno, page);
+
+   Assert(!found);

Given TidStoreSetBlockOffsets() is designed to always set (i.e.
overwrite) the value, I think we should not expect that found is
always false.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Mar 14, 2024 at 6:55 PM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > > > Okay, here's an another idea: Change test_lookup_tids() to be more
> > > > general and put the validation down into C as well. First we save the
> > > > blocks from do_set_block_offsets() into a table, then with all those
> > > > blocks lookup a sufficiently-large range of possible offsets and save
> > > > found values in another array. So the static items structure would
> > > > have 3 arrays: inserts, successful lookups, and iteration (currently
> > > > the iteration output is private to check_set_block_offsets(). Then
> > > > sort as needed and check they are all the same.
> > >
> > > That's a promising idea. We can use the same mechanism for randomized
> > > tests too. If you're going to work on this, I'll do other tests on my
> > > environment in the meantime.
> >
> > Some progress on this in v72 -- I tried first without using SQL to
> > save the blocks, just using the unique blocks from the verification
> > array. It seems to work fine.
>
> Thanks!

Seems I forgot the attachment last time...there's more stuff now
anyway, based on discussion.

> > - Since there are now three arrays we should reduce max bytes to
> > something smaller.
>
> Agreed.

I went further than this, see below.

> > - Further on that, I'm not sure if the "is full" test is telling us
> > much. It seems we could make max bytes a static variable and set it to
> > the size of the empty store. I'm guessing it wouldn't take much to add
> > enough tids so that the contexts need to allocate some blocks, and
> > then it would appear full and we can test that. I've made it so all
> > arrays repalloc when needed, just in case.
>
> How about using work_mem as max_bytes instead of having it as a static
> variable? In test_tidstore.sql we set work_mem before creating the
> tidstore. It would make the tidstore more controllable by SQL queries.

My complaint is that the "is full" test is trivial, and also strange
in that max_bytes is used for two unrelated things:

- the initial size of the verification arrays, which was always larger
than necessary, and now there are three of them
- the hint to TidStoreCreate to calculate its max block size / the
threshold for being "full"

To make the "is_full" test slightly less trivial, my idea is to save
the empty store size and later add enough tids so that it has to
allocate new blocks/DSA segments, which is not that many, and then it
will appear full. I've done this and also separated the purpose of
various sizes in v72-0009/10.

Using actual work_mem seems a bit more difficult to make this work.
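
In pseudo-C, the shape of the test is something like this -- a sketch,
where TidStoreMemoryUsage() as the name of the size-reporting function
is an assumption:

/* remember the size of an empty store and use it as the threshold */
size_t	threshold = TidStoreMemoryUsage(tidstore);

/* add tids until the contexts must allocate new blocks/DSA segments */
while (TidStoreMemoryUsage(tidstore) <= threshold)
	TidStoreSetBlockOffsets(tidstore, blkno++, offsets, num_offsets);

/* the store now reports more memory than the empty baseline: "full" */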

> > - I'm not sure it's useful to keep test_lookup_tids() around. Since we
> > now have a separate lookup test, the only thing it can tell us is that
> > lookups fail on an empty store. I arranged it so that
> > check_set_block_offsets() works on an empty store. Although that's
> > even more trivial, it's just reusing what we already need.
>
> Agreed.

Removed in v72-0007

On Fri, Mar 15, 2024 at 9:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I have two questions on tidstore.c:
>
> +/*
> + * Set the given TIDs on the blkno to TidStore.
> + *
> + * NB: the offset numbers in offsets must be sorted in ascending order.
> + */
>
> Do we need some assertions to check if the given offset numbers are
> sorted expectedly?

Done in v72-0008

> ---
> +   if (TidStoreIsShared(ts))
> +       found = shared_rt_set(ts->tree.shared, blkno, page);
> +   else
> +       found = local_rt_set(ts->tree.local, blkno, page);
> +
> +   Assert(!found);
>
> Given TidStoreSetBlockOffsets() is designed to always set (i.e.
> overwrite) the value, I think we should not expect that found is
> always false.

I find that a puzzling statement, since 1) it was designed for
insert-only workloads, not actual overwrite IIRC and 2) the tests will
now fail if the same block is set twice, since we just switched the
tests to use a remnant of vacuum's old array. Having said that, I
don't object to removing artificial barriers to using it for purposes
not yet imagined, as long as test_tidstore.sql warns against that.

Given the above two things, I think this function's comment needs
stronger language about its limitations. Perhaps even mention that
it's intended for, and optimized for, vacuum. You and I have long
known that tidstore would need a separate, more complex, function to
add or remove individual tids from existing entries, but it might be
good to have that documented.

Other things:

v72-0011: Test that zero offset raises an error.

v72-0013: I had wanted to microbenchmark this, but since we are
running short of time I decided to skip that, so I want to revert some
code to make it again more similar to the equivalent in tidbitmap.c.
In the absence of evidence, it seems better to do it this way.

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Mar 15, 2024 at 4:36 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Mar 14, 2024 at 6:55 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > >
> > > On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > > > > Okay, here's an another idea: Change test_lookup_tids() to be more
> > > > > general and put the validation down into C as well. First we save the
> > > > > blocks from do_set_block_offsets() into a table, then with all those
> > > > > blocks lookup a sufficiently-large range of possible offsets and save
> > > > > found values in another array. So the static items structure would
> > > > > have 3 arrays: inserts, successful lookups, and iteration (currently
> > > > > the iteration output is private to check_set_block_offsets(). Then
> > > > > sort as needed and check they are all the same.
> > > >
> > > > That's a promising idea. We can use the same mechanism for randomized
> > > > tests too. If you're going to work on this, I'll do other tests on my
> > > > environment in the meantime.
> > >
> > > Some progress on this in v72 -- I tried first without using SQL to
> > > save the blocks, just using the unique blocks from the verification
> > > array. It seems to work fine.
> >
> > Thanks!
>
> Seems I forgot the attachment last time...there's more stuff now
> anyway, based on discussion.

Thank you for updating the patches!

The idea of using three TID arrays for the lookup test and iteration
test looks good to me. I think we can add random-TIDs tests on top of
it.

>
> > > - Since there are now three arrays we should reduce max bytes to
> > > something smaller.
> >
> > Agreed.
>
> I went further than this, see below.
>
> > > - Further on that, I'm not sure if the "is full" test is telling us
> > > much. It seems we could make max bytes a static variable and set it to
> > > the size of the empty store. I'm guessing it wouldn't take much to add
> > > enough tids so that the contexts need to allocate some blocks, and
> > > then it would appear full and we can test that. I've made it so all
> > > arrays repalloc when needed, just in case.
> >
> > How about using work_mem as max_bytes instead of having it as a static
> > variable? In test_tidstore.sql we set work_mem before creating the
> > tidstore. It would make the tidstore more controllable by SQL queries.
>
> My complaint is that the "is full" test is trivial, and also strange
> in that max_bytes is used for two unrelated things:
>
> - the initial size of the verification arrays, which was always larger
> than necessary, and now there are three of them
> - the hint to TidStoreCreate to calculate its max block size / the
> threshold for being "full"
>
> To make the "is_full" test slightly less trivial, my idea is to save
> the empty store size and later add enough tids so that it has to
> allocate new blocks/DSA segments, which is not that many, and then it
> will appear full. I've done this and also separated the purpose of
> various sizes in v72-0009/10.

I see your point and the changes look good to me.

> Using actual work_mem seems a bit more difficult to make this work.

Agreed.

>
>
> > ---
> > +   if (TidStoreIsShared(ts))
> > +       found = shared_rt_set(ts->tree.shared, blkno, page);
> > +   else
> > +       found = local_rt_set(ts->tree.local, blkno, page);
> > +
> > +   Assert(!found);
> >
> > Given TidStoreSetBlockOffsets() is designed to always set (i.e.
> > overwrite) the value, I think we should not expect that found is
> > always false.
>
> I find that a puzzling statement, since 1) it was designed for
> insert-only workloads, not actual overwrite IIRC and 2) the tests will
> now fail if the same block is set twice, since we just switched the
> tests to use a remnant of vacuum's old array. Having said that, I
> don't object to removing artificial barriers to using it for purposes
> not yet imagined, as long as test_tidstore.sql warns against that.

I think that if it supports only insert-only workloads and expects
each block to be set only once, it should raise an error rather than an
assertion failure. It's odd to me that the function fails only in
assertion-enabled builds even though it actually works fine in that
case.

As for test_tidstore, you're right that the test code doesn't handle
the case of setting the same block twice. I think that there is no
problem in the fixed-TIDs tests, but we would need something for the
random-TIDs tests so that we don't set the same block twice. I guess
it could be trivial since we can use SQL queries to generate TIDs. I'm
not sure what the random-TIDs tests would look like, but I think we can
use SELECT DISTINCT to eliminate duplicate block numbers.

>
> Given the above two things, I think this function's comment needs
> stronger language about its limitations. Perhaps even mention that
> it's intended for, and optimized for, vacuum. You and I have long
> known that tidstore would need a separate, more complex, function to
> add or remove individual tids from existing entries, but it might be
> good to have that documented.

Agreed.

>
> Other things:
>
> v72-0011: Test that zero offset raises an error.
>
> v72-0013: I had wanted to microbenchmark this, but since we are
> running short of time I decided to skip that, so I want to revert some
> code to make it again more similar to the equivalent in tidbitmap.c.
> In the absence of evidence, it seems better to do it this way.

LGTM.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Fri, Mar 15, 2024 at 9:17 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Mar 15, 2024 at 4:36 PM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> > > Given TidStoreSetBlockOffsets() is designed to always set (i.e.
> > > overwrite) the value, I think we should not expect that found is
> > > always false.
> >
> > I find that a puzzling statement, since 1) it was designed for
> > insert-only workloads, not actual overwrite IIRC and 2) the tests will
> > now fail if the same block is set twice, since we just switched the
> > tests to use a remnant of vacuum's old array. Having said that, I
> > don't object to removing artificial barriers to using it for purposes
> > not yet imagined, as long as test_tidstore.sql warns against that.
>
> I think that if it supports only insert-only workload and expects the
> same block is set only once, it should raise an error rather than an
> assertion. It's odd to me that the function fails only with an
> assertion build assertions even though it actually works fine even in
> that case.

After thinking some more, I think you're right -- it's too
heavy-handed to throw an error/assert and a public function shouldn't
make assumptions about the caller. It's probably just a matter of
documenting the function (and its lack of generality), and the tests
(which are based on the thing we're replacing).

> As for test_tidstore you're right that the test code doesn't handle
> the case where setting the same block twice. I think that there is no
> problem in the fixed-TIDs tests, but we would need something for
> random-TIDs tests so that we don't set the same block twice. I guess
> it could be trivial since we can use SQL queries to generate TIDs. I'm
> not sure how the random-TIDs tests would be like, but I think we can
> use SELECT DISTINCT to eliminate the duplicates of block numbers to
> use.

Also, I don't think we need random blocks, since the radix tree tests
exercise that heavily already.

Random offsets are what I was thinking of (if made distinct and
ordered), but even there the code is fairly trivial, so I don't have a
strong feeling about it.

> > Given the above two things, I think this function's comment needs
> > stronger language about its limitations. Perhaps even mention that
> > it's intended for, and optimized for, vacuum. You and I have long
> > known that tidstore would need a separate, more complex, function to
> > add or remove individual tids from existing entries, but it might be
> > good to have that documented.
>
> Agreed.

How about this:

 /*
- * Set the given TIDs on the blkno to TidStore.
+ * Create or replace an entry for the given block and array of offsets
  *
- * NB: the offset numbers in offsets must be sorted in ascending order.
+ * NB: This function is designed and optimized for vacuum's heap scanning
+ * phase, so has some limitations:
+ * - The offset numbers in "offsets" must be sorted in ascending order.
+ * - If the block number already exists, the entry will be replaced --
+ *   there is no way to add or remove offsets from an entry.
  */
 void
 TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,

I think we can stop including the debug-tid-store patch for CI now.
That would allow getting rid of some unnecessary variables. More
comments:

+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, TidStoreEndIterate() needs to be called when finished.

+ * Concurrent updates during the iteration will be blocked when inserting a
+ * key-value to the radix tree.

This is outdated. Locking is optional. The remaining real reason now
is that TidStoreEndIterate needs to free memory. We probably need to
say something about locking, too, but not this.

+ * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs
+ * in one block. We return the block numbers in ascending order and the offset
+ * numbers in each result is also sorted in ascending order.
+ */
+TidStoreIterResult *
+TidStoreIterateNext(TidStoreIter *iter)

The wording is a bit awkward.

+/*
+ * Finish an iteration over TidStore. This needs to be called after finishing
+ * or when existing an iteration.
+ */

s/existing/exiting/ ?

It seems to say we need to finish after finishing. Maybe more precise wording.
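
(For reference, the intended call pattern is roughly the following
sketch; the Begin/End function names and the result fields are
assumptions based on the patch:)

TidStoreIter *iter = TidStoreBeginIterate(ts);
TidStoreIterResult *result;

while ((result = TidStoreIterateNext(iter)) != NULL)
{
	/* one block per result; offsets sorted in ascending order */
	for (int i = 0; i < result->num_offsets; i++)
		process_tid(result->blkno, result->offsets[i]); /* hypothetical */
}

/* required, since it frees memory allocated for the iterator */
TidStoreEndIterate(iter);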

+/* Extract TIDs from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key,
BlocktableEntry *page)

This is a leftover from the old encoding scheme. This should really
take a "BlockNumber blockno" not a "key", and the only call site
should probably cast the uint64 to BlockNumber.

+ * tidstore.h
+ *   Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group

Update year.

+typedef struct BlocktableEntry
+{
+ uint16 nwords;
+ bitmapword words[FLEXIBLE_ARRAY_MEMBER];
+} BlocktableEntry;

In my WIP for runtime-embeddable offsets, nwords needs to be one byte.
That doesn't have any real-world effect on the largest offset
encountered, and only in 32-bit builds with 32kB block size would the
theoretical max change at all. To be precise, we could use the
following in the MaxBlocktableEntrySize calculation:

Min(MaxOffsetNumber, BITS_PER_BITMAPWORD * PG_INT8_MAX - 1);

Tests: I never got rid of maxblkno and maxoffset, in case you wanted
to do that. And as discussed above, maybe

-- Note: The test code uses an array of TIDs for verification similar
-- to vacuum's dead item array pre-PG17. To avoid adding duplicates,
-- each call to do_set_block_offsets() should use different block
-- numbers.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Sun, Mar 17, 2024 at 11:46 AM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Fri, Mar 15, 2024 at 9:17 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Mar 15, 2024 at 4:36 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > >
> > > On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > > > Given TidStoreSetBlockOffsets() is designed to always set (i.e.
> > > > overwrite) the value, I think we should not expect that found is
> > > > always false.
> > >
> > > I find that a puzzling statement, since 1) it was designed for
> > > insert-only workloads, not actual overwrite IIRC and 2) the tests will
> > > now fail if the same block is set twice, since we just switched the
> > > tests to use a remnant of vacuum's old array. Having said that, I
> > > don't object to removing artificial barriers to using it for purposes
> > > not yet imagined, as long as test_tidstore.sql warns against that.
> >
> > I think that if it supports only insert-only workload and expects the
> > same block is set only once, it should raise an error rather than an
> > assertion. It's odd to me that the function fails only with an
> > assertion build assertions even though it actually works fine even in
> > that case.
>
> After thinking some more, I think you're right -- it's too
> heavy-handed to throw an error/assert and a public function shouldn't
> make assumptions about the caller. It's probably just a matter of
> documenting the function (and it's lack of generality), and the tests
> (which are based on the thing we're replacing).

Removed 'found' in 0003 patch.
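
In essence, the set path now ignores the return value -- reconstructing
the change as a sketch, not the literal diff:

-	if (TidStoreIsShared(ts))
-		found = shared_rt_set(ts->tree.shared, blkno, page);
-	else
-		found = local_rt_set(ts->tree.local, blkno, page);
-
-	Assert(!found);
+	if (TidStoreIsShared(ts))
+		shared_rt_set(ts->tree.shared, blkno, page);
+	else
+		local_rt_set(ts->tree.local, blkno, page);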

>
> > As for test_tidstore you're right that the test code doesn't handle
> > the case where setting the same block twice. I think that there is no
> > problem in the fixed-TIDs tests, but we would need something for
> > random-TIDs tests so that we don't set the same block twice. I guess
> > it could be trivial since we can use SQL queries to generate TIDs. I'm
> > not sure how the random-TIDs tests would be like, but I think we can
> > use SELECT DISTINCT to eliminate the duplicates of block numbers to
> > use.
>
> Also, I don't think we need random blocks, since the radix tree tests
> excercise that heavily already.
>
> Random offsets is what I was thinking of (if made distinct and
> ordered), but even there the code is fairy trivial, so I don't have a
> strong feeling about it.

Agreed.

>
> > > Given the above two things, I think this function's comment needs
> > > stronger language about its limitations. Perhaps even mention that
> > > it's intended for, and optimized for, vacuum. You and I have long
> > > known that tidstore would need a separate, more complex, function to
> > > add or remove individual tids from existing entries, but it might be
> > > good to have that documented.
> >
> > Agreed.
>
> How about this:
>
>  /*
> - * Set the given TIDs on the blkno to TidStore.
> + * Create or replace an entry for the given block and array of offsets
>   *
> - * NB: the offset numbers in offsets must be sorted in ascending order.
> + * NB: This function is designed and optimized for vacuum's heap scanning
> + * phase, so has some limitations:
> + * - The offset numbers in "offsets" must be sorted in ascending order.
> + * - If the block number already exists, the entry will be replaced --
> + *   there is no way to add or remove offsets from an entry.
>   */
>  void
>  TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,

Looks good.

>
> I think we can stop including the debug-tid-store patch for CI now.
> That would allow getting rid of some unnecessary variables.

Agreed.

>
> + * Prepare to iterate through a TidStore. Since the radix tree is locked during
> + * the iteration, TidStoreEndIterate() needs to be called when finished.
>
> + * Concurrent updates during the iteration will be blocked when inserting a
> + * key-value to the radix tree.
>
> This is outdated. Locking is optional. The remaining real reason now
> is that TidStoreEndIterate needs to free memory. We probably need to
> say something about locking, too, but not this.

Fixed.

>
> + * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs
> + * in one block. We return the block numbers in ascending order and the offset
> + * numbers in each result is also sorted in ascending order.
> + */
> +TidStoreIterResult *
> +TidStoreIterateNext(TidStoreIter *iter)
>
> The wording is a bit awkward.

Fixed.

>
> +/*
> + * Finish an iteration over TidStore. This needs to be called after finishing
> + * or when existing an iteration.
> + */
>
> s/existing/exiting/ ?
>
> It seems to say we need to finish after finishing. Maybe more precise wording.

Fixed.

>
> +/* Extract TIDs from the given key-value pair */
> +static void
> +tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key,
> BlocktableEntry *page)
>
> This is a leftover from the old encoding scheme. This should really
> take a "BlockNumber blockno" not a "key", and the only call site
> should probably cast the uint64 to BlockNumber.

Fixed.

>
> + * tidstore.h
> + *   Tid storage.
> + *
> + *
> + * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
>
> Update year.

Updated.

>
> +typedef struct BlocktableEntry
> +{
> + uint16 nwords;
> + bitmapword words[FLEXIBLE_ARRAY_MEMBER];
> +} BlocktableEntry;
>
> In my WIP for runtime-embeddable offsets, nwords needs to be one byte.
> That doesn't have any real-world effect on the largest offset
> encountered, and only in 32-bit builds with 32kB block size would the
> theoretical max change at all. To be precise, we could use the
> following in the MaxBlocktableEntrySize calculation:
>
> Min(MaxOffsetNumber, BITS_PER_BITMAPWORD * PG_INT8_MAX - 1);

I don't get this expression. Does making nwords one byte really work?
With 8kB blocks, MaxOffsetNumber is 2048 and it requires 256
bitmapword entries on 64-bit OS or 512 bitmapword entries on 32-bit
OS, respectively. A one-byte nwords variable seems insufficient
for both cases. Also, where does the expression "BITS_PER_BITMAPWORD *
PG_INT8_MAX - 1" come from?

>
> Tests: I never got rid of maxblkno and maxoffset, in case you wanted
> to do that. And as discussed above, maybe
>
> -- Note: The test code uses an array of TIDs for verification similar
> -- to vacuum's dead item array pre-PG17. To avoid adding duplicates,
> -- each call to do_set_block_offsets() should use different block
> -- numbers.

I've added this comment at the top of the .sql file.

I've attached the new patch sets. The summary of updates is:

- Squashed all updates of v72
- 0004 and 0005 are updates for test_tidstore.sql. In particular, the
0005 patch adds randomized TID tests.
- 0006 addresses review comments above.
- 0007 and 0008 patches are pgindent stuff.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sun, Mar 17, 2024 at 11:46 AM John Naylor <johncnaylorls@gmail.com> wrote:

> > Random offsets is what I was thinking of (if made distinct and
> > ordered), but even there the code is fairly trivial, so I don't have a
> > strong feeling about it.
>
> Agreed.

Looks good.

A related thing I should mention is that the tests which look up all
possible offsets are really expensive with the number of blocks we're
using now (assert build):

v70 0.33s
v72 1.15s
v73 1.32s

To trim that back, I think we should give up on using shared memory
for the is-full test: We can cause aset to malloc a new block with a
lot fewer entries. In the attached, this brings it back down to 0.43s.
It might also be worth reducing the number of blocks in the random
test -- multiple runs will have different offsets anyway.

> > I think we can stop including the debug-tid-store patch for CI now.
> > That would allow getting rid of some unnecessary variables.
>
> Agreed.

Okay, all that remains here is to get rid of those variables (might be
just one).

> > + * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs
> > + * in one block. We return the block numbers in ascending order and the offset
> > + * numbers in each result is also sorted in ascending order.
> > + */
> > +TidStoreIterResult *
> > +TidStoreIterateNext(TidStoreIter *iter)
> >
> > The wording is a bit awkward.
>
> Fixed.

- * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs
- * in one block. We return the block numbers in ascending order and the offset
- * numbers in each result is also sorted in ascending order.
+ * Scan the TidStore and return the TIDs of the next block. The returned block
+ * numbers is sorted in ascending order, and the offset numbers in each result
+ * is also sorted in ascending order.

Better, but it's still not very clear. Maybe "The offsets in each
iteration result are ordered, as are the block numbers over all
iterations."

> > +/* Extract TIDs from the given key-value pair */
> > +static void
> > +tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key,
> > BlocktableEntry *page)
> >
> > This is a leftover from the old encoding scheme. This should really
> > take a "BlockNumber blockno" not a "key", and the only call site
> > should probably cast the uint64 to BlockNumber.
>
> Fixed.

This part looks good. I didn't notice earlier, but this comment has a
similar issue

@@ -384,14 +391,15 @@ TidStoreIterateNext(TidStoreIter *iter)
  return NULL;

  /* Collect TIDs extracted from the key-value pair */
- tidstore_iter_extract_tids(iter, key, page);
+ tidstore_iter_extract_tids(iter, (BlockNumber) key, page);

..."extracted" was once a separate operation. I think just removing
that one word is enough to update it.

Some other review on code comments:

v73-0001:

+ /* Enlarge the TID array if necessary */

It's "arrays" now.

v73-0005:

+-- Random TIDs test. We insert TIDs for 1000 blocks. Each block has
+-- different randon 100 offset numbers each other.

The numbers are obvious from the query. Maybe just mention that the
offsets are randomized and must be unique and ordered.

+ * The caller is responsible for release any locks.

"releasing"

> > +typedef struct BlocktableEntry
> > +{
> > + uint16 nwords;
> > + bitmapword words[FLEXIBLE_ARRAY_MEMBER];
> > +} BlocktableEntry;
> >
> > In my WIP for runtime-embeddable offsets, nwords needs to be one byte.

I should be more clear here: nwords fitting into one byte allows 3
embedded offsets (1 on 32-bit platforms, which is good for testing at
least). With uint16 nwords that reduces to 2 (none on 32-bit
platforms). Further, after the current patch series is fully
committed, I plan to split the embedded-offset patch into two parts:
The first would store the offsets in the header, but would still need
a (smaller) allocation. The second would embed them in the child
pointer. Only the second patch will care about the size of nwords
because it needs to reserve a byte for the pointer tag.
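
To illustrate the arithmetic (my own back-of-the-envelope numbers for
the WIP, so treat the exact layout as provisional):

/*
 * The header is pointer-sized, and the embedding patch reserves one
 * byte of it for the pointer tag.  On 64-bit:
 *   1-byte tag + 1-byte nwords -> 6 bytes free -> 3 OffsetNumbers
 *   1-byte tag + 2-byte nwords -> at most 2 OffsetNumbers
 * On 32-bit:
 *   1-byte tag + 1-byte nwords -> 1 OffsetNumber; with uint16 nwords, none
 */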

> > That doesn't have any real-world effect on the largest offset
> > encountered, and only in 32-bit builds with 32kB block size would the
> > theoretical max change at all. To be precise, we could use the
> > following in the MaxBlocktableEntrySize calculation:
> >
> > Min(MaxOffsetNumber, BITS_PER_BITMAPWORD * PG_INT8_MAX - 1);
>
> I don't get this expression. Does making nwords one byte really work?
> With 8kB blocks, MaxOffsetNumber is 2048 and it requires 256
> bitmapword entries on 64-bit OS or 512 bitmapword entries on 32-bit
> OS, respectively. A one-byte nwords variable seems insufficient

I believe there is confusion between bitmap words and bytes:
2048 / 64 = 32 words = 256 bytes

It used to be max tuples per (heap) page, but we wanted a simple way
to make this independent of heap. I believe we won't need to ever
store the actual MaxOffsetNumber, although we technically still could
with a one-byte type and 32kB pages, at least on 64-bit platforms.

> for both cases. Also, where does the expression "BITS_PER_BITMAPWORD *
> PG_INT8_MAX - 1" come from?

127 words, each with 64 (or 32) bits. The zero bit is not a valid
offset, so subtract one. And I used a signed type in case there was a
need for -1 to mean something.
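
Spelling out the numbers for the archive (my arithmetic, not code from
the patch):

/*
 * BITS_PER_BITMAPWORD * PG_INT8_MAX - 1:
 *   64-bit: 64 * 127 - 1 = 8127
 *   32-bit: 32 * 127 - 1 = 4063
 * MaxOffsetNumber is 2048 with 8kB blocks and 8192 with 32kB blocks,
 * so the Min() only bites for 32-bit builds with 32kB blocks
 * (4063 < 8192) -- hence "only in 32-bit builds with 32kB block size".
 */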

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Mar 19, 2024 at 8:35 AM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Sun, Mar 17, 2024 at 11:46 AM John Naylor <johncnaylorls@gmail.com> wrote:
>
> > > Random offsets is what I was thinking of (if made distinct and
> > > ordered), but even there the code is fairly trivial, so I don't have a
> > > strong feeling about it.
> >
> > Agreed.
>
> Looks good.
>
> A related thing I should mention is that the tests which look up all
> possible offsets are really expensive with the number of blocks we're
> using now (assert build):
>
> v70 0.33s
> v72 1.15s
> v73 1.32s
>
> To trim that back, I think we should give up on using shared memory
> for the is-full test: We can cause aset to malloc a new block with a
> lot fewer entries. In the attached, this brings it back down to 0.43s.

Looks good. Agreed with this change.

> It might also be worth reducing the number of blocks in the random
> test -- multiple runs will have different offsets anyway.

Yes. If we reduce the number of blocks from 1000 to 100, the
regression test takes the following times on my environment:

1000 blocks : 516 ms
100 blocks  : 228 ms

>
> > > I think we can stop including the debug-tid-store patch for CI now.
> > > That would allow getting rid of some unnecessary variables.
> >
> > Agreed.
>
> Okay, all that remains here is to get rid of those variables (might be
> just one).

Removed some unnecessary variables in 0002 patch.

>
> > > + * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs
> > > + * in one block. We return the block numbers in ascending order and the offset
> > > + * numbers in each result is also sorted in ascending order.
> > > + */
> > > +TidStoreIterResult *
> > > +TidStoreIterateNext(TidStoreIter *iter)
> > >
> > > The wording is a bit awkward.
> >
> > Fixed.
>
> - * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs
> - * in one block. We return the block numbers in ascending order and the offset
> - * numbers in each result is also sorted in ascending order.
> + * Scan the TidStore and return the TIDs of the next block. The returned block
> + * numbers is sorted in ascending order, and the offset numbers in each result
> + * is also sorted in ascending order.
>
> Better, but it's still not very clear. Maybe "The offsets in each
> iteration result are ordered, as are the block numbers over all
> iterations."

Thanks, fixed.

>
> > > +/* Extract TIDs from the given key-value pair */
> > > +static void
> > > +tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key,
> > > BlocktableEntry *page)
> > >
> > > This is a leftover from the old encoding scheme. This should really
> > > take a "BlockNumber blockno" not a "key", and the only call site
> > > should probably cast the uint64 to BlockNumber.
> >
> > Fixed.
>
> This part looks good. I didn't notice earlier, but this comment has a
> similar issue
>
> @@ -384,14 +391,15 @@ TidStoreIterateNext(TidStoreIter *iter)
>   return NULL;
>
>   /* Collect TIDs extracted from the key-value pair */
> - tidstore_iter_extract_tids(iter, key, page);
> + tidstore_iter_extract_tids(iter, (BlockNumber) key, page);
>
> ..."extracted" was once a separate operation. I think just removing
> that one word is enough to update it.

Fixed.

>
> Some other review on code comments:
>
> v73-0001:
>
> + /* Enlarge the TID array if necessary */
>
> It's "arrays" now.
>
> v73-0005:
>
> +-- Random TIDs test. We insert TIDs for 1000 blocks. Each block has
> +-- different randon 100 offset numbers each other.
>
> The numbers are obvious from the query. Maybe just mention that the
> offsets are randomized and must be unique and ordered.
>
> + * The caller is responsible for release any locks.
>
> "releasing"

Fixed.

>
> > > +typedef struct BlocktableEntry
> > > +{
> > > + uint16 nwords;
> > > + bitmapword words[FLEXIBLE_ARRAY_MEMBER];
> > > +} BlocktableEntry;
> > >
> > > In my WIP for runtime-embeddable offsets, nwords needs to be one byte.
>
> I should be more clear here: nwords fitting into one byte allows 3
> embedded offsets (1 on 32-bit platforms, which is good for testing at
> least). With uint16 nwords that reduces to 2 (none on 32-bit
> platforms). Further, after the current patch series is fully
> committed, I plan to split the embedded-offset patch into two parts:
> The first would store the offsets in the header, but would still need
> a (smaller) allocation. The second would embed them in the child
> pointer. Only the second patch will care about the size of nwords
> because it needs to reserve a byte for the pointer tag.

Thank you for the clarification.

>
> > > That doesn't have any real-world effect on the largest offset
> > > encountered, and only in 32-bit builds with 32kB block size would the
> > > theoretical max change at all. To be precise, we could use the
> > > following in the MaxBlocktableEntrySize calculation:
> > >
> > > Min(MaxOffsetNumber, BITS_PER_BITMAPWORD * PG_INT8_MAX - 1);
> >
> > I don't get this expression. Does making nwords one byte really work?
> > With 8kB blocks, MaxOffsetNumber is 2048 and it requires 256
> > bitmapword entries on 64-bit OS or 512 bitmapword entries on 32-bit
> > OS, respectively. A one-byte nwords variable seems insufficient
>
> I believe there is confusion between bitmap words and bytes:
> 2048 / 64 = 32 words = 256 bytes

Oops, you're right.

>
> It used to be max tuples per (heap) page, but we wanted a simple way
> to make this independent of heap. I believe we won't need to ever
> store the actual MaxOffsetNumber, although we technically still could
> with a one-byte type and 32kB pages, at least on 64-bit platforms.
>
> > for both cases. Also, where does the expression "BITS_PER_BITMAPWORD *
> > PG_INT8_MAX - 1" come from?
>
> 127 words, each with 64 (or 32) bits. The zero bit is not a valid
> offset, so subtract one. And I used a signed type in case there was a
> need for -1 to mean something.

Okay, I missed that we want to change nwords from uint8 to int8.

So the MaxBlocktableEntrySize calculation would be as follows?

#define MaxBlocktableEntrySize \
    offsetof(BlocktableEntry, words) + \
        (sizeof(bitmapword) * \
        WORDS_PER_PAGE(Min(MaxOffsetNumber, \
                           BITS_PER_BITMAPWORD * PG_INT8_MAX - 1)))

I've made this change in the 0003 patch.
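
For the archive, the sizes this works out to, assuming
WORDS_PER_PAGE(n) is "n / BITS_PER_BITMAPWORD + 1":

/*
 * 64-bit, 8kB blocks: Min(2048, 8127) = 2048 -> 33 bitmapwords
 *   -> 33 * 8 = 264 bytes plus the header
 * 32-bit, 8kB blocks: Min(2048, 4063) = 2048 -> 65 bitmapwords
 *   -> 65 * 4 = 260 bytes plus the header
 */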

While reviewing the vacuum patch, I realized that we always pass
LWTRANCHE_SHARED_TIDSTORE to RT_CREATE(), and the wait event related
to the tidstore is therefore always the same. I think it would be
better to make the caller of TidStoreCreate() specify the tranche_id
and pass it to RT_CREATE(). That way, the caller can specify its own
wait event for the tidstore. The 0008 patch tries this idea; dshash.c
does the same.
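
To sketch what I mean (a hypothetical signature, not the final patch):

  /* caller supplies the tranche instead of a hard-coded one */
  TidStore *
  TidStoreCreate(size_t max_bytes, dsa_area *area, int tranche_id);

  /* ...which, for the shared case, is passed down like: */
  ts->tree.shared = shared_rt_create(ts->context, area, tranche_id);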

Other patches are minor updates for tidstore and vacuum patches.


Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Tue, Mar 19, 2024 at 10:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Mar 19, 2024 at 8:35 AM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Sun, Mar 17, 2024 at 11:46 AM John Naylor <johncnaylorls@gmail.com> wrote:

> > It might also be worth reducing the number of blocks in the random
> > test -- multiple runs will have different offsets anyway.
>
> Yes. If we reduce the number of blocks from 1000 to 100, the
> regression test takes the following times on my environment:
>
> 1000 blocks : 516 ms
> 100 blocks  : 228 ms

Sounds good.

> Removed some unnecessary variables in 0002 patch.

Looks good.

> So the MaxBlocktableEntrySize calculation would be as follows?
>
> #define MaxBlocktableEntrySize \
>     offsetof(BlocktableEntry, words) + \
>         (sizeof(bitmapword) * \
>         WORDS_PER_PAGE(Min(MaxOffsetNumber, \
>                            BITS_PER_BITMAPWORD * PG_INT8_MAX - 1)))
>
> I've made this change in the 0003 patch.

This is okay, but one side effect is that we have both an assert and
an elog, for different limits. I think we'll need a separate #define
to help. But for now, I don't want to hold up tidstore further with
this because I believe almost everything else in v74 is in pretty good
shape. I'll save this for later as a part of the optimization I
proposed.

Remaining things I noticed:

+#define RT_PREFIX local_rt
+#define RT_PREFIX shared_rt

Prefixes for simplehash, for example, don't have "sh" -- maybe "local/shared_ts"

+ /* MemoryContext where the radix tree uses */

s/where/that/

+/*
+ * Lock support functions.
+ *
+ * We can use the radix tree's lock for shared TidStore as the data we
+ * need to protect is only the shared radix tree.
+ */
+void
+TidStoreLockExclusive(TidStore *ts)

Talking about multiple things, so maybe a blank line after the comment.

With those, I think you can go ahead and squash all the tidstore
patches except for 0003 and commit it.

> While reviewing the vacuum patch, I realized that we always pass
> LWTRANCHE_SHARED_TIDSTORE to RT_CREATE(), and the wait event related
> to the tidstore is therefore always the same. I think it would be
> better to make the caller of TidStoreCreate() specify the tranche_id
> and pass it to RT_CREATE(). That way, the caller can specify its own
> wait event for the tidstore. The 0008 patch tries this idea; dshash.c
> does the same.

Sounds reasonable. I'll just note that src/include/storage/lwlock.h
still has an entry for LWTRANCHE_SHARED_TIDSTORE.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Tue, Mar 19, 2024 at 6:40 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Tue, Mar 19, 2024 at 10:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Mar 19, 2024 at 8:35 AM John Naylor <johncnaylorls@gmail.com> wrote:
> > >
> > > On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Sun, Mar 17, 2024 at 11:46 AM John Naylor <johncnaylorls@gmail.com> wrote:
>
> > > It might also be worth reducing the number of blocks in the random
> > > test -- multiple runs will have different offsets anyway.
> >
> > Yes. If we reduce the number of blocks from 1000 to 100, the
> > regression test takes the following times on my environment:
> >
> > 1000 blocks : 516 ms
> > 100 blocks  : 228 ms
>
> Sounds good.
>
> > Removed some unnecessary variables in 0002 patch.
>
> Looks good.
>
> > So the MaxBlocktableEntrySize calculation would be as follows?
> >
> > #define MaxBlocktableEntrySize \
> >     offsetof(BlocktableEntry, words) + \
> >         (sizeof(bitmapword) * \
> >         WORDS_PER_PAGE(Min(MaxOffsetNumber, \
> >                            BITS_PER_BITMAPWORD * PG_INT8_MAX - 1)))
> >
> > I've made this change in the 0003 patch.
>
> This is okay, but one side effect is that we have both an assert and
> an elog, for different limits. I think we'll need a separate #define
> to help. But for now, I don't want to hold up tidstore further with
> this because I believe almost everything else in v74 is in pretty good
> shape. I'll save this for later as a part of the optimization I
> proposed.
>
> Remaining things I noticed:
>
> +#define RT_PREFIX local_rt
> +#define RT_PREFIX shared_rt
>
> Prefixes for simplehash, for example, don't have "sh" -- maybe "local/shared_ts"
>
> + /* MemoryContext where the radix tree uses */
>
> s/where/that/
>
> +/*
> + * Lock support functions.
> + *
> + * We can use the radix tree's lock for shared TidStore as the data we
> + * need to protect is only the shared radix tree.
> + */
> +void
> +TidStoreLockExclusive(TidStore *ts)
>
> Talking about multiple things, so maybe a blank line after the comment.
>
> With those, I think you can go ahead and squash all the tidstore
> patches except for 0003 and commit it.
>
> > While reviewing the vacuum patch, I realized that we always pass
> > LWTRANCHE_SHARED_TIDSTORE to RT_CREATE(), and the wait event related
> > to the tidstore is therefore always the same. I think it would be
> > better to make the caller of TidStoreCreate() specify the tranche_id
> > and pass it to RT_CREATE(). That way, the caller can specify its own
> > wait event for the tidstore. The 0008 patch tries this idea; dshash.c
> > does the same.
>
> Sounds reasonable. I'll just note that src/include/storage/lwlock.h
> still has an entry for LWTRANCHE_SHARED_TIDSTORE.

Thank you. I've incorporated all the comments above. I've attached the
latest patches, and am going to push them (one by one) after another
round of self-review.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Mar 14, 2024 at 1:29 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > Locally (not CI), we should try big inputs to make sure we can
> > actually go up to many GB -- it's easier and faster this way than
> > having vacuum give us a large data set.
>
> I'll do these tests.

I just remembered this -- did any of this kind of testing happen? I
can do it as well.

> Thank you. I've incorporated all the comments above. I've attached the
> latest patches, and am going to push them (one by one) after another
> round of self-review.

One more cosmetic thing in 0001 that caught my eye:

diff --git a/src/backend/access/common/Makefile
b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
  syncscan.o \
  toast_compression.o \
  toast_internals.o \
+ tidstore.o \
  tupconvert.o \
  tupdesc.o

diff --git a/src/backend/access/common/meson.build
b/src/backend/access/common/meson.build
index 725041a4ce..a02397855e 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
   'syncscan.c',
   'toast_compression.c',
   'toast_internals.c',
+  'tidstore.c',
   'tupconvert.c',
   'tupdesc.c',
 )

These aren't in alphabetical order.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Mar 20, 2024 at 3:48 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Mar 14, 2024 at 1:29 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > > Locally (not CI), we should try big inputs to make sure we can
> > > actually go up to many GB -- it's easier and faster this way than
> > > having vacuum give us a large data set.
> >
> > I'll do these tests.
>
> I just remembered this -- did any of this kind of testing happen? I
> can do it as well.

I forgot to report the results. Yes, I did some tests where I inserted
many TIDs to make the tidstore use several GB of memory. I did two cases:

1. insert 100M blocks of TIDs with an offset of 100.
2. insert 10M blocks of TIDs with an offset of 2048.

The tidstore used about 4.8GB and 5.2GB, respectively, and all lookup
and iteration results were as expected.
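
For reference, the driver was essentially a loop like the following (a
reconstruction for illustration, not the exact test code; case 1 shown):

  OffsetNumber offsets[100];

  /* e.g. offsets 1..100 for every block */
  for (OffsetNumber i = 1; i <= 100; i++)
      offsets[i - 1] = i;

  /* case 1: 100 million blocks */
  for (BlockNumber blkno = 0; blkno < 100000000; blkno++)
      TidStoreSetBlockOffsets(ts, blkno, offsets, 100);

  /* then verify with TidStoreIsMember() and a full iteration pass */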

>
> > Thank you. I've incorporated all the comments above. I've attached the
> > latest patches, and am going to push them (one by one) after another
> > round of self-review.
>
> One more cosmetic thing in 0001 that caught my eye:
>
> diff --git a/src/backend/access/common/Makefile
> b/src/backend/access/common/Makefile
> index b9aff0ccfd..67b8cc6108 100644
> --- a/src/backend/access/common/Makefile
> +++ b/src/backend/access/common/Makefile
> @@ -27,6 +27,7 @@ OBJS = \
>   syncscan.o \
>   toast_compression.o \
>   toast_internals.o \
> + tidstore.o \
>   tupconvert.o \
>   tupdesc.o
>
> diff --git a/src/backend/access/common/meson.build
> b/src/backend/access/common/meson.build
> index 725041a4ce..a02397855e 100644
> --- a/src/backend/access/common/meson.build
> +++ b/src/backend/access/common/meson.build
> @@ -15,6 +15,7 @@ backend_sources += files(
>    'syncscan.c',
>    'toast_compression.c',
>    'toast_internals.c',
> +  'tidstore.c',
>    'tupconvert.c',
>    'tupdesc.c',
>  )
>
> These aren't in alphabetical order.

Good catch. I'll fix them before the push.

While reviewing the code again, the following two things caught my eye:

In the check_set_block_offset() function, we don't take a lock on the
tidstore while checking all possible TIDs. I'll add
TidStoreLockShare() and TidStoreUnlock() as follows:

+           TidStoreLockShare(tidstore);
            if (TidStoreIsMember(tidstore, &tid))
                ItemPointerSet(&items.lookup_tids[num_lookup_tids++],
blkno, offset);
+           TidStoreUnlock(tidstore);

---
Regarding TidStoreMemoryUsage(), IIUC the caller doesn't need to take
a lock on the shared tidstore since dsa_get_total_size() (called by
RT_MEMORY_USAGE()) does appropriate locking. I think we can mention it
in the comment as follows:

-/* Return the memory usage of TidStore */
+/*
+ * Return the memory usage of TidStore.
+ *
+ * In shared TidStore cases, since shared_ts_memory_usage() does appropriate
+ * locking, the caller doesn't need to take a lock.
+ */

What do you think?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Wed, Mar 20, 2024 at 8:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I forgot to report the results. Yes, I did some tests where I inserted
> many TIDs to make the tidstore use several GB of memory. I did two cases:
>
> 1. insert 100M blocks of TIDs with an offset of 100.
> 2. insert 10M blocks of TIDs with an offset of 2048.
>
> The tidstore used about 4.8GB and 5.2GB, respectively, and all lookup
> and iteration results were as expected.

Thanks for confirming!

> While reviewing the code again, the following two things caught my eye:
>
> In the check_set_block_offset() function, we don't take a lock on the
> tidstore while checking all possible TIDs. I'll add
> TidStoreLockShare() and TidStoreUnlock() as follows:
>
> +           TidStoreLockShare(tidstore);
>             if (TidStoreIsMember(tidstore, &tid))
>                 ItemPointerSet(&items.lookup_tids[num_lookup_tids++],
> blkno, offset);
> +           TidStoreUnlock(tidstore);

In one sense, all locking in the test module is useless since there is
only a single process. On the other hand, it seems good to at least
run what we have written, even if only trivially, and to serve as an
example of usage. We should probably be consistent, and document at the
top that the locks are pro forma only.

It's both a blessing and a curse that vacuum only has a single writer.
It makes development less of a hassle, but also means that tidstore
locking is done for API-completeness reasons, not (yet) as a practical
necessity. Even tidbitmap.c's hash table currently has a single
writer, and while using tidstore for that is still an engineering
challenge for other reasons, it wouldn't exercise locking
meaningfully, either, at least at first.

> Regarding TidStoreMemoryUsage(), IIUC the caller doesn't need to take
> a lock on the shared tidstore since dsa_get_total_size() (called by
> RT_MEMORY_USAGE()) does appropriate locking. I think we can mention it
> in the comment as follows:
>
> -/* Return the memory usage of TidStore */
> +/*
> + * Return the memory usage of TidStore.
> + *
> + * In shared TidStore cases, since shared_ts_memory_usage() does appropriate
> + * locking, the caller doesn't need to take a lock.
> + */
>
> What do you think?

That duplicates the underlying comment on the radix tree function that
this calls, so I'm inclined to leave it out. At this level it's
probably best to document when a caller _does_ need to take an action.

One thing I forgot to ask about earlier:

+-- Add tids in out of order.

Are they (the blocks to be precise) really out of order? The VALUES
statement is ordered, but after inserting it does not output that way.
I wondered if this is platform independent, but CI and our dev
machines haven't failed this test, and I haven't looked into what
determines the order. It's easy enough to hide the blocks if we ever
need to, as we do elsewhere...



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Mar 20, 2024 at 11:19 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Wed, Mar 20, 2024 at 8:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > I forgot to report the results. Yes, I did some tests where I inserted
> > many TIDs to make the tidstore use several GB of memory. I did two cases:
> >
> > 1. insert 100M blocks of TIDs with an offset of 100.
> > 2. insert 10M blocks of TIDs with an offset of 2048.
> >
> > The tidstore used about 4.8GB and 5.2GB, respectively, and all lookup
> > and iteration results were as expected.
>
> Thanks for confirming!
>
> > While reviewing the code again, the following two things caught my eye:
> >
> > In the check_set_block_offset() function, we don't take a lock on the
> > tidstore while checking all possible TIDs. I'll add
> > TidStoreLockShare() and TidStoreUnlock() as follows:
> >
> > +           TidStoreLockShare(tidstore);
> >             if (TidStoreIsMember(tidstore, &tid))
> >                 ItemPointerSet(&items.lookup_tids[num_lookup_tids++],
> > blkno, offset);
> > +           TidStoreUnlock(tidstore);
>
> In one sense, all locking in the test module is useless since there is
> only a single process. On the other hand, it seems good to at least
> run what we have written, even if only trivially, and to serve as an
> example of usage. We should probably be consistent, and document at the
> top that the locks are pro forma only.

Agreed.

>
> > Regarding TidStoreMemoryUsage(), IIUC the caller doesn't need to take
> > a lock on the shared tidstore since dsa_get_total_size() (called by
> > RT_MEMORY_USAGE()) does appropriate locking. I think we can mention it
> > in the comment as follows:
> >
> > -/* Return the memory usage of TidStore */
> > +/*
> > + * Return the memory usage of TidStore.
> > + *
> > + * In shared TidStore cases, since shared_ts_memory_usage() does appropriate
> > + * locking, the caller doesn't need to take a lock.
> > + */
> >
> > What do you think?
>
> That duplicates the underlying comment on the radix tree function that
> this calls, so I'm inclined to leave it out. At this level it's
> probably best to document when a caller _does_ need to take an action.

Okay, I didn't change it.

>
> One thing I forgot to ask about earlier:
>
> +-- Add tids in out of order.
>
> Are they (the blocks to be precise) really out of order? The VALUES
> statement is ordered, but after inserting it does not output that way.
> I wondered if this is platform independent, but CI and our dev
> machines haven't failed this test, and I haven't looked into what
> determines the order. It's easy enough to hide the blocks if we ever
> need to, as we do elsewhere...

It seems unnecessary, as such a test is already covered by
test_radixtree. I've changed the query to hide the output blocks.

I've pushed the tidstore patch after incorporating the above changes.
In addition to that, I've added the following changes before the push:

- Added src/test/modules/test_tidstore/.gitignore file.
- Removed unnecessary #include from tidstore.c.

The buildfarm has been all-green so far.

I've attached the latest vacuum improvement patch.

I just remembered that the tidstore still cannot be used for parallel
vacuum with the minimum maintenance_work_mem. Even when the shared
tidstore is empty, its reported memory usage is 1056768 bytes, a bit above
1MB (1048576 bytes). We need something discussed on another thread[1]
in order to make it work.

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoCVMw6DSmgZY9h%2BxfzKtzJeqWiwxaUD2T-FztVcV-XibQ%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Mar 21, 2024 at 9:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Mar 20, 2024 at 11:19 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > Are they (the blocks to be precise) really out of order? The VALUES
> > statement is ordered, but after inserting it does not output that way.
> > I wondered if this is platform independent, but CI and our dev
> > machines haven't failed this test, and I haven't looked into what
> > determines the order. It's easy enough to hide the blocks if we ever
> > need to, as we do elsewhere...
>
> It seems unnecessary, as such a test is already covered by
> test_radixtree. I've changed the query to hide the output blocks.

Okay.

> The buildfarm has been all-green so far.

Great!

> I've attached the latest vacuum improvement patch.
>
> I just remembered that the tidstore still cannot be used for parallel
> vacuum with the minimum maintenance_work_mem. Even when the shared
> tidstore is empty, its reported memory usage is 1056768 bytes, a bit above
> 1MB (1048576 bytes). We need something discussed on another thread[1]
> in order to make it work.

For exactly this reason, we used to have a clamp on max_bytes when it
was internal to tidstore, so that it never reported full when first
created; I guess that got thrown away when we got rid of the
control object in shared memory. Forcing callers to clamp their own
limits seems pretty unfriendly, though.

The proposals in that thread are pretty simple. If those don't move
forward soon, a hackish workaround would be to round down the number
we get from dsa_get_total_size to the nearest megabyte. Then
controlling min/max segment size would be a nice-to-have for PG17, not
a prerequisite.
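
That is, something like this when reporting memory usage (untested,
just to illustrate the rounding):

  /* hackish: hide the DSA segment overhead below 1MB granularity */
  size_t total = dsa_get_total_size(dsa);

  total -= total % (1024 * 1024);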



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Mar 21, 2024 at 12:40 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Mar 21, 2024 at 9:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Mar 20, 2024 at 11:19 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > > Are they (the blocks to be precise) really out of order? The VALUES
> > > statement is ordered, but after inserting it does not output that way.
> > > I wondered if this is platform independent, but CI and our dev
> > > machines haven't failed this test, and I haven't looked into what
> > > determines the order. It's easy enough to hide the blocks if we ever
> > > need to, as we do elsewhere...
> >
> > It seems unnecessary, as such a test is already covered by
> > test_radixtree. I've changed the query to hide the output blocks.
>
> Okay.
>
> > The buildfarm has been all-green so far.
>
> Great!
>
> > I've attached the latest vacuum improvement patch.
> >
> > I just remembered that the tidstore still cannot be used for parallel
> > vacuum with the minimum maintenance_work_mem. Even when the shared
> > tidstore is empty, its reported memory usage is 1056768 bytes, a bit above
> > 1MB (1048576 bytes). We need something discussed on another thread[1]
> > in order to make it work.
>
> For exactly this reason, we used to have a clamp on max_bytes when it
> was internal to tidstore, so that it never reported full when first
> created; I guess that got thrown away when we got rid of the
> control object in shared memory. Forcing callers to clamp their own
> limits seems pretty unfriendly, though.

Or we can have a new function in dsa.c to set the initial and max
segment sizes (or either one) on an existing DSA area so that
TidStoreCreate() can specify them at creation. In shared TidStore
cases, since all memory required by the shared radix tree is allocated in
the passed-in DSA area and the memory usage is the total segment size
allocated in the DSA area, the user will have to prepare a DSA area
only for the shared tidstore. So we might be able to expect that the
DSA area passed to TidStoreCreate() is empty and its segment sizes can
be adjusted.

>
> The proposals in that thread are pretty simple. If those don't move
> forward soon, a hackish workaround would be to round down the number
> we get from dsa_get_total_size to the nearest megabyte. Then
> controlling min/max segment size would be a nice-to-have for PG17, not
> a prerequisite.

Interesting idea.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Mar 21, 2024 at 3:10 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Mar 21, 2024 at 12:40 PM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > On Thu, Mar 21, 2024 at 9:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Wed, Mar 20, 2024 at 11:19 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > > > Are they (the blocks to be precise) really out of order? The VALUES
> > > > statement is ordered, but after inserting it does not output that way.
> > > > I wondered if this is platform independent, but CI and our dev
> > > > machines haven't failed this test, and I haven't looked into what
> > > > determines the order. It's easy enough to hide the blocks if we ever
> > > > need to, as we do elsewhere...
> > >
> > > It seems unnecessary, as such a test is already covered by
> > > test_radixtree. I've changed the query to hide the output blocks.
> >
> > Okay.
> >
> > > The buildfarm has been all-green so far.
> >
> > Great!
> >
> > > I've attached the latest vacuum improvement patch.
> > >
> > > I just remembered that the tidstore still cannot be used for parallel
> > > vacuum with the minimum maintenance_work_mem. Even when the shared
> > > tidstore is empty, its reported memory usage is 1056768 bytes, a bit above
> > > 1MB (1048576 bytes). We need something discussed on another thread[1]
> > > in order to make it work.
> >
> > For exactly this reason, we used to have a clamp on max_bytes when it
> > was internal to tidstore, so that it never reported full when first
> > created; I guess that got thrown away when we got rid of the
> > control object in shared memory. Forcing callers to clamp their own
> > limits seems pretty unfriendly, though.
>
> Or we can have a new function in dsa.c to set the initial and max
> segment sizes (or either one) on an existing DSA area so that
> TidStoreCreate() can specify them at creation. In shared TidStore
> cases, since all memory required by the shared radix tree is allocated in
> the passed-in DSA area and the memory usage is the total segment size
> allocated in the DSA area, the user will have to prepare a DSA area
> only for the shared tidstore. So we might be able to expect that the
> DSA area passed to TidStoreCreate() is empty and its segment sizes can
> be adjusted.

Yet another idea is that TidStore creates its own DSA area in
TidStoreCreate(). That is, in TidStoreCreate() we create a DSA area
(using dsa_create()) and pass it to RT_CREATE(). Also, we need a new
API to get the DSA area. The caller (e.g. parallel vacuum) gets the
dsa_handle of the DSA and stores it in the shared memory (e.g. in
PVShared). TidStoreAttach() will take two arguments: dsa_handle for
the DSA area and dsa_pointer for the shared radix tree. This idea
still requires controlling min/max segment sizes since dsa_create()
uses 1MB as the initial segment size. But TidStoreCreate()
would be more user-friendly.
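
The API shape I have in mind is roughly the following (names
provisional):

  /* tidstore creates and owns its DSA area */
  TidStore *TidStoreCreate(size_t max_bytes, int tranche_id);

  /* new API so the caller can fetch the area's handle to share it */
  dsa_area *TidStoreGetDSA(TidStore *ts);

  /* attaching takes the area's handle plus the stored radix tree */
  TidStore *TidStoreAttach(dsa_handle area_handle, dsa_pointer handle);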

I've attached a PoC patch for discussion.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Mar 21, 2024 at 1:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> Or we can have a new function in dsa.c to set the initial and max
> segment sizes (or either one) on an existing DSA area so that
> TidStoreCreate() can specify them at creation.

I didn't like this very much, because it's splitting an operation
across an API boundary. The caller already has all the information it
needs when it creates the DSA. Straw man proposal: it could do the
same for local memory, then they'd be more similar. But if we made
local contexts the responsibility of the caller, that would cause
duplication between creating and resetting.

> In shared TidStore
> cases, since all memory required by the shared radix tree is allocated in
> the passed-in DSA area and the memory usage is the total segment size
> allocated in the DSA area

...plus apparently some overhead, I just found out today, but that's
beside the point.

On Thu, Mar 21, 2024 at 2:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Yet another idea is that TidStore creates its own DSA area in
> TidStoreCreate(). That is, in TidStoreCreate() we create a DSA area
> (using dsa_create()) and pass it to RT_CREATE(). Also, we need a new
> API to get the DSA area. The caller (e.g. parallel vacuum) gets the
> dsa_handle of the DSA and stores it in the shared memory (e.g. in
> PVShared). TidStoreAttach() will take two arguments: dsa_handle for
> the DSA area and dsa_pointer for the shared radix tree. This idea
> still requires controlling min/max segment sizes since dsa_create()
> uses 1MB as the initial segment size. But TidStoreCreate()
> would be more user-friendly.

This seems like an overall simplification, aside from future size
configuration, so +1 to continue looking into this. If we go this
route, I'd like to avoid a boolean parameter and cleanly separate
TidStoreCreateLocal() and TidStoreCreateShared(). Every operation
after that can introspect, but it's a bit awkward to force these cases
into the same function. It always was a little bit, but this change
makes it more so.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Mar 21, 2024 at 4:35 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Mar 21, 2024 at 1:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > Or we can have a new function in dsa.c to set the initial and max
> > segment sizes (or either one) on an existing DSA area so that
> > TidStoreCreate() can specify them at creation.
>
> I didn't like this very much, because it's splitting an operation
> across an API boundary. The caller already has all the information it
> needs when it creates the DSA. Straw man proposal: it could do the
> same for local memory, then they'd be more similar. But if we made
> local contexts the responsibility of the caller, that would cause
> duplication between creating and resetting.

Fair point.

>
> > In shared TidStore
> > cases, since all memory required by the shared radix tree is allocated in
> > the passed-in DSA area and the memory usage is the total segment size
> > allocated in the DSA area
>
> ...plus apparently some overhead, I just found out today, but that's
> beside the point.
>
> On Thu, Mar 21, 2024 at 2:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Yet another idea is that TidStore creates its own DSA area in
> > TidStoreCreate(). That is, in TidStoreCreate() we create a DSA area
> > (using dsa_create()) and pass it to RT_CREATE(). Also, we need a new
> > API to get the DSA area. The caller (e.g. parallel vacuum) gets the
> > dsa_handle of the DSA and stores it in the shared memory (e.g. in
> > PVShared). TidStoreAttach() will take two arguments: dsa_handle for
> > the DSA area and dsa_pointer for the shared radix tree. This idea
> > still requires controlling min/max segment sizes since dsa_create()
> > uses 1MB as the initial segment size. But TidStoreCreate()
> > would be more user-friendly.
>
> This seems like an overall simplification, aside from future size
> configuration, so +1 to continue looking into this. If we go this
> route, I'd like to avoid a boolean parameter and cleanly separate
> TidStoreCreateLocal() and TidStoreCreateShared(). Every operation
> after that can introspect, but it's a bit awkward to force these cases
> into the same function. It always was a little bit, but this change
> makes it more so.

I've looked into this idea further. Overall, it looks clean and I
don't see any problem so far in terms of integration with lazy vacuum.
I've attached three patches for discussion and tests.

- 0001 patch makes lazy vacuum use tidstore.
- 0002 patch makes DSA init/max segment size configurable (borrowed
from another thread).
- 0003 patch makes TidStore create its own DSA area with init/max DSA
segment adjustment (PoC patch).

One thing unclear to me is whether this idea will be usable even when we
want to use the tidstore for parallel bitmap scan. Currently, we
create a shared tidbitmap on a DSA area in ParallelExecutorInfo. This
DSA area is used not only for tidbitmap but also for parallel hash
etc. If the tidstore created its own DSA area, parallel bitmap scan
would have to use the tidstore's DSA in addition to the DSA area in
ParallelExecutorInfo. I'm not sure if there are differences
between these usages in terms of resource management etc. It seems to
be no problem, but I might be missing something.

Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Mar 21, 2024 at 4:03 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I've looked into this idea further. Overall, it looks clean and I
> don't see any problem so far in terms of integration with lazy vacuum.
> I've attached three patches for discussion and tests.

Seems okay in the big picture; it's the details we need to be careful of.

v77-0001

- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
+ vacrel->dead_items = TidStoreCreate(vac_work_mem, NULL, 0);
+
+ dead_items_info = (VacDeadItemsInfo *) palloc(sizeof(VacDeadItemsInfo));
+ dead_items_info->max_bytes = vac_work_mem * 1024L;

This is confusing enough that it looks like a bug:

[inside TidStoreCreate()]
/* choose the maxBlockSize to be no larger than 1/16 of max_bytes */
while (16 * maxBlockSize > max_bytes * 1024L)
maxBlockSize >>= 1;

This was copied from CreateWorkExprContext, which operates directly on
work_mem -- if the parameter is actually bytes, we can't "* 1024"
here. If we're passing something measured in kilobytes, the parameter
is badly named. Let's convert once and use bytes everywhere.
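
That is, convert at the boundary and then treat max_bytes as bytes
everywhere, along these lines (sketch):

  /* in vacuum code: convert the GUC's kilobytes exactly once */
  vacrel->dead_items = TidStoreCreate(vac_work_mem * 1024L, NULL, 0);

  /* in TidStoreCreate(): max_bytes is now really bytes */
  while (16 * maxBlockSize > max_bytes)
      maxBlockSize >>= 1;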

Note: This was not another pass over the whole vacuum patch, just
looking at the issue at hand.
Also for later: Dilip Kumar reviewed an earlier version.

v77-0002:

+#define dsa_create(tranch_id) \
+ dsa_create_ext(tranch_id, DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE)

Since these macros are now referring to defaults, maybe their names
should reflect that. Something like DSA_DEFAULT_INIT_SEGMENT_SIZE
(*_MAX_*)

+/* The minimum size of a DSM segment. */
+#define DSA_MIN_SEGMENT_SIZE ((size_t) 1024)

That's a *lot* smaller than it is now. Maybe 256kB? We just want 1MB
m_w_m to work correctly.

v77-0003:

+/* Public APIs to create local or shared TidStore */
+
+TidStore *
+TidStoreCreateLocal(size_t max_bytes)
+{
+ return tidstore_create_internal(max_bytes, false, 0);
+}
+
+TidStore *
+TidStoreCreateShared(size_t max_bytes, int tranche_id)
+{
+ return tidstore_create_internal(max_bytes, true, tranche_id);
+}

I don't think these operations have enough in common to justify
sharing even an internal implementation. Choosing aset block size is
done for both memory types, but it's pointless to do it for shared
memory, because the local context is then only used for small
metadata.

+ /*
+ * Choose the DSA initial and max segment sizes to be no longer than
+ * 1/16 and 1/8 of max_bytes, respectively.
+ */

I'm guessing the 1/8 here is because the number of segments is limited? I
know these numbers are somewhat arbitrary, but readers will wonder why
one has 1/8 and the other has 1/16.

+ if (dsa_init_size < DSA_MIN_SEGMENT_SIZE)
+     dsa_init_size = DSA_MIN_SEGMENT_SIZE;
+ if (dsa_max_size < DSA_MAX_SEGMENT_SIZE)
+     dsa_max_size = DSA_MAX_SEGMENT_SIZE;

The second clamp seems against the whole point of this patch -- it
seems they should all be clamped bigger than the DSA_MIN_SEGMENT_SIZE?
Did you try it with 1MB m_w_m?
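
To be explicit, I'd expect both clamps to go the same direction,
roughly (sketch):

  if (dsa_init_size < DSA_MIN_SEGMENT_SIZE)
      dsa_init_size = DSA_MIN_SEGMENT_SIZE;
  if (dsa_max_size < DSA_MIN_SEGMENT_SIZE)
      dsa_max_size = DSA_MIN_SEGMENT_SIZE;

...so that a small m_w_m actually results in small segments.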



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Mar 21, 2024 at 7:48 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Mar 21, 2024 at 4:03 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I've looked into this idea further. Overall, it looks clean and I
> > don't see any problem so far in terms of integration with lazy vacuum.
> > I've attached three patches for discussion and tests.
>
> Seems okay in the big picture, it's the details we need to be careful of.
>
> v77-0001
>
> - dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
> - dead_items->max_items = max_items;
> - dead_items->num_items = 0;
> + vacrel->dead_items = TidStoreCreate(vac_work_mem, NULL, 0);
> +
> + dead_items_info = (VacDeadItemsInfo *) palloc(sizeof(VacDeadItemsInfo));
> + dead_items_info->max_bytes = vac_work_mem * 1024L;
>
> This is confusing enough that it looks like a bug:
>
> [inside TidStoreCreate()]
> /* choose the maxBlockSize to be no larger than 1/16 of max_bytes */
> while (16 * maxBlockSize > max_bytes * 1024L)
> maxBlockSize >>= 1;
>
> This was copied from CreateWorkExprContext, which operates directly on
> work_mem -- if the parameter is actually bytes, we can't "* 1024"
> here. If we're passing something measured in kilobytes, the parameter
> is badly named. Let's convert once and use bytes everywhere.

True. The attached 0001 patch fixes it.

>
> v77-0002:
>
> +#define dsa_create(tranch_id) \
> + dsa_create_ext(tranch_id, DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE)
>
> Since these macros are now referring to defaults, maybe their names
> should reflect that. Something like DSA_DEFAULT_INIT_SEGMENT_SIZE
> (*_MAX_*)

It makes sense to rename DSA_INITIAL_SEGMENT_SIZE, but since
DSA_MAX_SEGMENT_SIZE is the theoretical maximum size, the current
name also makes sense to me.

>
> +/* The minimum size of a DSM segment. */
> +#define DSA_MIN_SEGMENT_SIZE ((size_t) 1024)
>
> That's a *lot* smaller than it is now. Maybe 256kB? We just want 1MB
> m_w_m to work correctly.

Fixed.

>
> v77-0003:
>
> +/* Public APIs to create local or shared TidStore */
> +
> +TidStore *
> +TidStoreCreateLocal(size_t max_bytes)
> +{
> + return tidstore_create_internal(max_bytes, false, 0);
> +}
> +
> +TidStore *
> +TidStoreCreateShared(size_t max_bytes, int tranche_id)
> +{
> + return tidstore_create_internal(max_bytes, true, tranche_id);
> +}
>
> I don't think these operations have enough in common to justify
> sharing even an internal implementation. Choosing aset block size is
> done for both memory types, but it's pointless to do it for shared
> memory, because the local context is then only used for small
> metadata.
>
> + /*
> + * Choose the DSA initial and max segment sizes to be no longer than
> + * 1/16 and 1/8 of max_bytes, respectively.
> + */
>
> I'm guessing the 1/8 here is because the number of segments is limited? I
> know these numbers are somewhat arbitrary, but readers will wonder why
> one has 1/8 and the other has 1/16.
>
> + if (dsa_init_size < DSA_MIN_SEGMENT_SIZE)
> +     dsa_init_size = DSA_MIN_SEGMENT_SIZE;
> + if (dsa_max_size < DSA_MAX_SEGMENT_SIZE)
> +     dsa_max_size = DSA_MAX_SEGMENT_SIZE;
>
> The second clamp seems against the whole point of this patch -- it
> seems they should all be clamped bigger than the DSA_MIN_SEGMENT_SIZE?
> Did you try it with 1MB m_w_m?

I've incorporated the above comments, and the test results look good to me.

I've attached several patches:

- 0002 is a minor fix for tidstore I found.
- 0005 changes the create APIs of tidstore.
- 0006 updates the vacuum improvement patch to use the new
TidStoreCreateLocal/Shared() APIs.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Tom Lane
Date:
John Naylor <johncnaylorls@gmail.com> writes:
> Done. I pushed this with a few last-minute cosmetic adjustments. This
> has been a very long time coming, but we're finally in the home
> stretch!

I'm not sure why it took a couple weeks for Coverity to notice
ee1b30f12, but it saw it today, and it's not happy:

/srv/coverity/git/pgsql-git/postgresql/src/include/lib/radixtree.h: 1621 in local_ts_extend_down()
1615             node = child;
1616             shift -= RT_SPAN;
1617         }
1618
1619         /* Reserve slot for the value. */
1620         n4 = (RT_NODE_4 *) node.local;
>>>     CID 1594658:  Integer handling issues  (BAD_SHIFT)
>>>     In expression "key >> shift", shifting by a negative amount has undefined behavior.  The shift amount, "shift", is as little as -7.
1621         n4->chunks[0] = RT_GET_KEY_CHUNK(key, shift);
1622         n4->base.count = 1;
1623
1624         return &n4->children[0];
1625     }
1626

I think the point here is that if you start with an arbitrary
non-negative shift value, the preceding loop may in fact decrement it
down to something less than zero before exiting, in which case we
would indeed have trouble.  I suspect that the code is making
undocumented assumptions about the possible initial values of shift.
Maybe some Asserts would be good?  Also, if we're effectively assuming
that shift must be exactly zero here, why not let the compiler
hard-code that?

-         n4->chunks[0] = RT_GET_KEY_CHUNK(key, shift);
+         n4->chunks[0] = RT_GET_KEY_CHUNK(key, 0);

            regards, tom lane



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Mar 25, 2024 at 1:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> John Naylor <johncnaylorls@gmail.com> writes:
> > Done. I pushed this with a few last-minute cosmetic adjustments. This
> > has been a very long time coming, but we're finally in the home
> > stretch!

Thank you for the report.

>
> I'm not sure why it took a couple weeks for Coverity to notice
> ee1b30f12, but it saw it today, and it's not happy:

Hmm, I've also run Coverity Scan during development but I wasn't able to
see this one for some reason...

>
> /srv/coverity/git/pgsql-git/postgresql/src/include/lib/radixtree.h: 1621 in local_ts_extend_down()
> 1615                    node = child;
> 1616                    shift -= RT_SPAN;
> 1617            }
> 1618
> 1619            /* Reserve slot for the value. */
> 1620            n4 = (RT_NODE_4 *) node.local;
> >>>     CID 1594658:  Integer handling issues  (BAD_SHIFT)
> >>>     In expression "key >> shift", shifting by a negative amount has undefined behavior.  The shift amount, "shift", is as little as -7.
> 1621            n4->chunks[0] = RT_GET_KEY_CHUNK(key, shift);
> 1622            n4->base.count = 1;
> 1623
> 1624            return &n4->children[0];
> 1625     }
> 1626
>
> I think the point here is that if you start with an arbitrary
> non-negative shift value, the preceding loop may in fact decrement it
> down to something less than zero before exiting, in which case we
> would indeed have trouble.  I suspect that the code is making
> undocumented assumptions about the possible initial values of shift.
> Maybe some Asserts would be good?  Also, if we're effectively assuming
> that shift must be exactly zero here, why not let the compiler
> hard-code that?
>
> -       n4->chunks[0] = RT_GET_KEY_CHUNK(key, shift);
> +       n4->chunks[0] = RT_GET_KEY_CHUNK(key, 0);

Sounds like a good solution. I've attached the patch for that.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Tom Lane
Date:
Masahiko Sawada <sawada.mshk@gmail.com> writes:
> On Mon, Mar 25, 2024 at 1:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think the point here is that if you start with an arbitrary
>> non-negative shift value, the preceding loop may in fact decrement it
>> down to something less than zero before exiting, in which case we
>> would indeed have trouble.  I suspect that the code is making
>> undocumented assumptions about the possible initial values of shift.
>> Maybe some Asserts would be good?  Also, if we're effectively assuming
>> that shift must be exactly zero here, why not let the compiler
>> hard-code that?

> Sounds like a good solution. I've attached the patch for that.

Personally I'd put the Assert immediately after the loop, because
it's not related to the "Reserve slot for the value" comment.
Seems reasonable otherwise.
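
That is, roughly (a sketch assembled from the excerpt upthread, with
the descent loop elided):

    }                           /* end of descent loop */

    Assert(shift == 0);

    /* Reserve slot for the value. */
    n4 = (RT_NODE_4 *) node.local;
    n4->chunks[0] = RT_GET_KEY_CHUNK(key, 0);
    n4->base.count = 1;

    return &n4->children[0];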

            regards, tom lane



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Mar 25, 2024 at 8:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Mar 25, 2024 at 1:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > I'm not sure why it took a couple weeks for Coverity to notice
> > ee1b30f12, but it saw it today, and it's not happy:
>
> Hmm, I've also run Coverity Scan during development, but I wasn't able
> to see this one for some reason...

Hmm, before 30e144287 this code only ran in a test module; is it
possible Coverity would not find it there?



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Tom Lane
Date:
John Naylor <johncnaylorls@gmail.com> writes:
> Hmm, before 30e144287 this code only ran in a test module; is it
> possible Coverity would not find it there?

That could indeed explain why Coverity didn't see it.  I'm not
sure how our community run is set up, but it may not build the
test modules.

            regards, tom lane



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Mar 25, 2024 at 10:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Masahiko Sawada <sawada.mshk@gmail.com> writes:
> > On Mon, Mar 25, 2024 at 1:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I think the point here is that if you start with an arbitrary
> >> non-negative shift value, the preceding loop may in fact decrement it
> >> down to something less than zero before exiting, in which case we
> >> would indeed have trouble.  I suspect that the code is making
> >> undocumented assumptions about the possible initial values of shift.
> >> Maybe some Asserts would be good?  Also, if we're effectively assuming
> >> that shift must be exactly zero here, why not let the compiler
> >> hard-code that?
>
> > Sounds like a good solution. I've attached the patch for that.
>
> Personally I'd put the Assert immediately after the loop, because
> it's not related to the "Reserve slot for the value" comment.
> Seems reasonable otherwise.
>

Thanks. Pushed the fix after moving the Assert.


Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Mar 21, 2024 at 7:48 PM John Naylor <johncnaylorls@gmail.com> wrote:

> > v77-0001
> >
> > - dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
> > - dead_items->max_items = max_items;
> > - dead_items->num_items = 0;
> > + vacrel->dead_items = TidStoreCreate(vac_work_mem, NULL, 0);
> > +
> > + dead_items_info = (VacDeadItemsInfo *) palloc(sizeof(VacDeadItemsInfo));
> > + dead_items_info->max_bytes = vac_work_mem * 1024L;
> >
> > This is confusing enough that it looks like a bug:
> >
> > [inside TidStoreCreate()]
> > /* choose the maxBlockSize to be no larger than 1/16 of max_bytes */
> > while (16 * maxBlockSize > max_bytes * 1024L)
> > maxBlockSize >>= 1;
> >
> > This was copied from CreateWorkExprContext, which operates directly on
> > work_mem -- if the parameter is actually bytes, we can't "* 1024"
> > here. If we're passing something measured in kilobytes, the parameter
> > is badly named. Let's convert once and use bytes everywhere.
>
> True. The attached 0001 patch fixes it.

v78-0001 and 02 are fine, but for 0003 there is a consequence that I
didn't see mentioned: vac_work_mem now refers to bytes, where before
it referred to kilobytes. It seems pretty confusing to use a different
convention from elsewhere, especially if it has the same name but
different meaning across versions. Worse, this change is buried inside
a moving-stuff-around diff, making it hard to see. Maybe "convert only
once" is still possible, but I was actually thinking of

+ dead_items_info->max_bytes = vac_work_mem * 1024L;
+ vacrel->dead_items = TidStoreCreate(dead_items_info->max_bytes, NULL, 0);

That way it's pretty obvious that it's correct. That may require a bit
of duplication and moving around for shmem, but there is some of that
already.
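
For what it's worth, here is a sketch of the failure mode with assumed
numbers (the 8MB starting value is an assumption, borrowed from
ALLOCSET_DEFAULT_MAXSIZE):

    Size        max_bytes = 64 * 1024 * 1024;   /* 64MB, already in bytes */
    Size        maxBlockSize = 8 * 1024 * 1024; /* assumed starting point */

    /* the copied loop compares against max_bytes * 1024, i.e. 64GB here,
     * so it never halves and maxBlockSize stays at 8MB... */
    while (16 * maxBlockSize > max_bytes * 1024L)
        maxBlockSize >>= 1;

    /* ...whereas comparing against max_bytes directly halves once,
     * giving the intended 4MB (<= 64MB / 16) */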

More on 0003:

- * The major space usage for vacuuming is storage for the array of dead TIDs
+ * The major space usage for vacuuming is TidStore, a storage for dead TIDs

+ * autovacuum_work_mem) memory space to keep track of dead TIDs.  If the
+ * TidStore is full, we must call lazy_vacuum to vacuum indexes (and to vacuum

I wonder if the comments here should refer to it using a more natural
spelling, like "TID store".

- * items in the dead_items array for later vacuuming, count live and
+ * items in the dead_items for later vacuuming, count live and

Maybe "the dead_items area", or "the dead_items store" or "in dead_items"?

- * remaining LP_DEAD line pointers on the page in the dead_items
- * array. These dead items include those pruned by lazy_scan_prune()
- * as well we line pointers previously marked LP_DEAD.
+ * remaining LP_DEAD line pointers on the page in the dead_items.
+ * These dead items include those pruned by lazy_scan_prune() as well
+ * we line pointers previously marked LP_DEAD.

Here maybe "into dead_items".

Also, "we line pointers" seems to be a pre-existing typo.

- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers
in %u pages",
+ vacrel->relname, vacrel->dead_items_info->num_items, vacuumed_pages)));

This is a translated message, so let's keep the message the same.

/*
 * Allocate dead_items (either using palloc, or in dynamic shared memory).
 * Sets dead_items in vacrel for caller.
 *
 * Also handles parallel initialization as part of allocating dead_items in
 * DSM when required.
 */
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)

This comment didn't change at all. It's not wrong, but let's consider
updating the specifics.

v78-0004:

> > +#define dsa_create(tranch_id) \
> > + dsa_create_ext(tranch_id, DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE)
> >
> > Since these macros are now referring to defaults, maybe their name
> > should reflect that. Something like DSA_DEFAULT_INIT_SEGMENT_SIZE
> > (*_MAX_*)
>
> It makes sense to rename DSA_INITIAL_SEGMENT_SIZE, but since
> DSA_MAX_SEGMENT_SIZE is the theoretical maximum size, the current
> name also makes sense to me.

Right, that makes sense.

v78-0005:

"Although commit XXX
allowed specifying the initial and maximum DSA segment sizes, callers
still needed to clamp their own limits, which was not consistent and
user-friendly."

Perhaps s/still needed/would have needed/ ..., since we're preventing
that necessity.

> > Did you try it with 1MB m_w_m?
>
> I've incorporated the above comments and test results look good to me.

Could you be more specific about what the test was?
Does it work with 1MB m_w_m?

+ /*
+ * Choose the initial and maximum DSA segment sizes to be no longer
+ * than 1/16 and 1/8 of max_bytes, respectively. If the initial
+ * segment size is low, we end up having many segments, which risks
+ * exceeding the total number of segments the platform can have.

The second sentence is technically correct, but I'm not sure how it
relates to the code that follows.

+ while (16 * dsa_init_size > max_bytes)
+ dsa_init_size >>= 1;
+ while (8 * dsa_max_size > max_bytes)
+ dsa_max_size >>= 1;

I'm not sure we need a separate loop for "dsa_init_size". Can we just have :

while (8 * dsa_max_size > max_bytes)
    dsa_max_size >>= 1;

if (dsa_max_size < DSA_MIN_SEGMENT_SIZE)
    dsa_max_size = DSA_MIN_SEGMENT_SIZE;

if (dsa_init_size > dsa_max_size)
    dsa_init_size = dsa_max_size;

@@ -113,13 +113,10 @@ static void
tidstore_iter_extract_tids(TidStoreIter *iter, BlockNumber blkno,
  * CurrentMemoryContext at the time of this call. The TID storage, backed
  * by a radix tree, will live in its child memory context, rt_context. The
  * TidStore will be limited to (approximately) max_bytes total memory
- * consumption. If the 'area' is non-NULL, the radix tree is created in the
- * DSA area.
- *
- * The returned object is allocated in backend-local memory.
+ * consumption.

The existing comment slipped past my radar, but max_bytes is not a
limit, it's a hint. Come to think of it, it never was a limit in the
normal sense, but in earlier patches it was the criteria for reporting
"I'm full" when asked.

 void
 TidStoreDestroy(TidStore *ts)
 {
- /* Destroy underlying radix tree */
  if (TidStoreIsShared(ts))
+ {
+ /* Destroy underlying radix tree */
  shared_ts_free(ts->tree.shared);
+
+ dsa_detach(ts->area);
+ }
  else
  local_ts_free(ts->tree.local);

It's still destroyed in the local case, so not sure why this comment was moved?

v78-0006:

-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+/* 2 was PARALLEL_VACUUM_KEY_DEAD_ITEMS */

I don't see any use in core outside this module -- maybe it's possible
to renumber these?



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Mar 25, 2024 at 3:25 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Mar 21, 2024 at 7:48 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> > > v77-0001
> > >
> > > - dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
> > > - dead_items->max_items = max_items;
> > > - dead_items->num_items = 0;
> > > + vacrel->dead_items = TidStoreCreate(vac_work_mem, NULL, 0);
> > > +
> > > + dead_items_info = (VacDeadItemsInfo *) palloc(sizeof(VacDeadItemsInfo));
> > > + dead_items_info->max_bytes = vac_work_mem * 1024L;
> > >
> > > This is confusing enough that it looks like a bug:
> > >
> > > [inside TidStoreCreate()]
> > > /* choose the maxBlockSize to be no larger than 1/16 of max_bytes */
> > > while (16 * maxBlockSize > max_bytes * 1024L)
> > > maxBlockSize >>= 1;
> > >
> > > This was copied from CreateWorkExprContext, which operates directly on
> > > work_mem -- if the parameter is actually bytes, we can't "* 1024"
> > > here. If we're passing something measured in kilobytes, the parameter
> > > is badly named. Let's convert once and use bytes everywhere.
> >
> > True. The attached 0001 patch fixes it.
>
> v78-0001 and 02 are fine, but for 0003 there is a consequence that I
> didn't see mentioned:

I think that the fix done in the 0001 patch can be merged into the 0003 patch.

>  vac_work_mem now refers to bytes, where before
> it referred to kilobytes. It seems pretty confusing to use a different
> convention from elsewhere, especially if it has the same name but
> different meaning across versions. Worse, this change is buried inside
> a moving-stuff-around diff, making it hard to see. Maybe "convert only
> once" is still possible, but I was actually thinking of
>
> + dead_items_info->max_bytes = vac_work_mem * 1024L;
> + vacrel->dead_items = TidStoreCreate(dead_items_info->max_bytes, NULL, 0);
>
> That way it's pretty obvious that it's correct. That may require a bit
> of duplication and moving around for shmem, but there is some of that
> already.

Agreed.

>
> More on 0003:
>
> - * The major space usage for vacuuming is storage for the array of dead TIDs
> + * The major space usage for vacuuming is TidStore, a storage for dead TIDs
>
> + * autovacuum_work_mem) memory space to keep track of dead TIDs.  If the
> + * TidStore is full, we must call lazy_vacuum to vacuum indexes (and to vacuum
>
> I wonder if the comments here should refer to it using a more natural
> spelling, like "TID store".
>
> - * items in the dead_items array for later vacuuming, count live and
> + * items in the dead_items for later vacuuming, count live and
>
> Maybe "the dead_items area", or "the dead_items store" or "in dead_items"?
>
> - * remaining LP_DEAD line pointers on the page in the dead_items
> - * array. These dead items include those pruned by lazy_scan_prune()
> - * as well we line pointers previously marked LP_DEAD.
> + * remaining LP_DEAD line pointers on the page in the dead_items.
> + * These dead items include those pruned by lazy_scan_prune() as well
> + * we line pointers previously marked LP_DEAD.
>
> Here maybe "into dead_items".
>
> Also, "we line pointers" seems to be a pre-existing typo.
>
> - (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
> - vacrel->relname, (long long) index, vacuumed_pages)));
> + (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers
> in %u pages",
> + vacrel->relname, vacrel->dead_items_info->num_items, vacuumed_pages)));
>
> This is a translated message, so let's keep the message the same.
>
> /*
>  * Allocate dead_items (either using palloc, or in dynamic shared memory).
>  * Sets dead_items in vacrel for caller.
>  *
>  * Also handles parallel initialization as part of allocating dead_items in
>  * DSM when required.
>  */
> static void
> dead_items_alloc(LVRelState *vacrel, int nworkers)
>
> This comment didn't change at all. It's not wrong, but let's consider
> updating the specifics.

Fixed above comments.

> v78-0005:
>
> "Although commit XXX
> allowed specifying the initial and maximum DSA segment sizes, callers
> still needed to clamp their own limits, which was not consistent and
> user-friendly."
>
> Perhaps s/still needed/would have needed/ ..., since we're preventing
> that necessity.
>
> > > Did you try it with 1MB m_w_m?
> >
> > I've incorporated the above comments and test results look good to me.
>
> Could you be more specific about what the test was?
> Does it work with 1MB m_w_m?

If m_w_m is 1MB, both the initial and maximum segment sizes are 256kB.

FYI other test cases I tested were:

* m_w_m = 2199023254528 (maximum value)
initial: 1MB
max: 128GB

* m_w_m = 64MB (default)
initial: 1MB
max: 8MB
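
For reference, a standalone sketch of the clamping arithmetic
reproduces these numbers (the constants are assumptions: a 256kB
DSA_MIN_SEGMENT_SIZE floor, a 1MB default initial size, and a 1TB
DSA_MAX_SEGMENT_SIZE starting point):

    #include <stdio.h>
    #include <stdint.h>

    static void
    clamp(uint64_t max_bytes)
    {
        uint64_t    dsa_init_size = 1024 * 1024;        /* assumed 1MB */
        uint64_t    dsa_max_size = (uint64_t) 1 << 40;  /* assumed 1TB */
        uint64_t    dsa_min_size = 256 * 1024;          /* assumed 256kB */

        /* cap the max segment at roughly 1/8 of the budget */
        while (8 * dsa_max_size > max_bytes)
            dsa_max_size >>= 1;
        if (dsa_max_size < dsa_min_size)
            dsa_max_size = dsa_min_size;
        if (dsa_init_size > dsa_max_size)
            dsa_init_size = dsa_max_size;

        printf("max_bytes=%llu -> init=%llu max=%llu\n",
               (unsigned long long) max_bytes,
               (unsigned long long) dsa_init_size,
               (unsigned long long) dsa_max_size);
    }

    int
    main(void)
    {
        clamp(1024 * 1024);          /* 1MB   -> 256kB / 256kB */
        clamp(64ULL * 1024 * 1024);  /* 64MB  -> 1MB / 8MB     */
        clamp(2199023254528ULL);     /* max   -> 1MB / 128GB   */
        return 0;
    }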

>
> + /*
> + * Choose the initial and maximum DSA segment sizes to be no longer
> + * than 1/16 and 1/8 of max_bytes, respectively. If the initial
> + * segment size is low, we end up having many segments, which risks
> + * exceeding the total number of segments the platform can have.
>
> The second sentence is technically correct, but I'm not sure how it
> relates to the code that follows.
>
> + while (16 * dsa_init_size > max_bytes)
> + dsa_init_size >>= 1;
> + while (8 * dsa_max_size > max_bytes)
> + dsa_max_size >>= 1;
>
> I'm not sure we need a separate loop for "dsa_init_size". Can we just have :
>
> while (8 * dsa_max_size > max_bytes)
>     dsa_max_size >>= 1;
>
> if (dsa_max_size < DSA_MIN_SEGMENT_SIZE)
>     dsa_max_size = DSA_MIN_SEGMENT_SIZE;
>
> if (dsa_init_size > dsa_max_size)
>     dsa_init_size = dsa_max_size;

Agreed.

>
> @@ -113,13 +113,10 @@ static void
> tidstore_iter_extract_tids(TidStoreIter *iter, BlockNumber blkno,
>   * CurrentMemoryContext at the time of this call. The TID storage, backed
>   * by a radix tree, will live in its child memory context, rt_context. The
>   * TidStore will be limited to (approximately) max_bytes total memory
> - * consumption. If the 'area' is non-NULL, the radix tree is created in the
> - * DSA area.
> - *
> - * The returned object is allocated in backend-local memory.
> + * consumption.
>
> The existing comment slipped past my radar, but max_bytes is not a
> limit, it's a hint. Come to think of it, it never was a limit in the
> normal sense, but in earlier patches it was the criteria for reporting
> "I'm full" when asked.

Updated the comment.

>
>  void
>  TidStoreDestroy(TidStore *ts)
>  {
> - /* Destroy underlying radix tree */
>   if (TidStoreIsShared(ts))
> + {
> + /* Destroy underlying radix tree */
>   shared_ts_free(ts->tree.shared);
> +
> + dsa_detach(ts->area);
> + }
>   else
>   local_ts_free(ts->tree.local);
>
> It's still destroyed in the local case, so not sure why this comment was moved?
>
> v78-0006:
>
> -#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
> +/* 2 was PARALLEL_VACUUM_KEY_DEAD_ITEMS */
>
> I don't see any use in core outside this module -- maybe it's possible
> to renumber these?

Fixed the above points.

I've attached the latest patches. The 0004 and 0006 patches are
updates from the previous version.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Mar 25, 2024 at 8:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Mar 25, 2024 at 3:25 PM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> > - * remaining LP_DEAD line pointers on the page in the dead_items
> > - * array. These dead items include those pruned by lazy_scan_prune()
> > - * as well we line pointers previously marked LP_DEAD.
> > + * remaining LP_DEAD line pointers on the page in the dead_items.
> > + * These dead items include those pruned by lazy_scan_prune() as well
> > + * we line pointers previously marked LP_DEAD.
> >
> > Here maybe "into dead_items".

- * remaining LP_DEAD line pointers on the page in the dead_items.
+ * remaining LP_DEAD line pointers on the page into the dead_items.

Let me explain. It used to be "in the dead_items array." It is not an
array anymore, so it was changed to "in the dead_items". dead_items is
a variable name, and names don't take "the". "into dead_items" seems
most natural to me, but there are other possible phrasings.

> > > > Did you try it with 1MB m_w_m?
> > >
> > > I've incorporated the above comments and test results look good to me.
> >
> > Could you be more specific about what the test was?
> > Does it work with 1MB m_w_m?
>
> If m_w_m is 1MB, both the initial and maximum segment sizes are 256kB.
>
> FYI other test cases I tested were:
>
> * m_w_m = 2199023254528 (maximum value)
> initial: 1MB
> max: 128GB
>
> * m_w_m = 64MB (default)
> initial: 1MB
> max: 8MB

If the test was a vacuum, how big a table was needed to hit 128GB?

> > The existing comment slipped past my radar, but max_bytes is not a
> > limit, it's a hint. Come to think of it, it never was a limit in the
> > normal sense, but in earlier patches it was the criteria for reporting
> > "I'm full" when asked.
>
> Updated the comment.

+ * max_bytes is not a limit; it's used to choose the memory block sizes of
+ * a memory context for TID storage in order for the total memory consumption
+ * not to be overshot a lot. The caller can use the max_bytes as the criteria
+ * for reporting whether it's full or not.

This is good information. I suggest this edit:

"max_bytes" is not an internally-enforced limit; it is used only as a
hint to cap the memory block size of the memory context for TID
storage. This reduces space wastage due to over-allocation. If the
caller wants to monitor memory usage, it must compare its limit with
the value reported by TidStoreMemoryUsage().
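
In code form, the monitoring pattern being described is something like
this (a sketch; dead_items and max_bytes stand in for the caller's own
variables):

    /* check our own budget, since the store does not enforce one */
    if (TidStoreMemoryUsage(dead_items) > max_bytes)
    {
        /* treat the store as full: e.g. vacuum indexes, then reset */
    }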

Other comments:

v79-0002 looks good to me.

v79-0003:

"With this commit, when creating a shared TidStore, a dedicated DSA
area is created for TID storage instead of using the provided DSA
area."

This is very subtle, but "the provided..." implies there still is one.
-> "a provided..."

+ * Similar to TidStoreCreateLocal() but create a shared TidStore on a
+ * DSA area. The TID storage will live in the DSA area, and a memory
+ * context rt_context will have only meta data of the radix tree.

-> "the memory context"

I think you can go ahead and commit 0002 and 0003/4.

v79-0005:

- bypass = (vacrel->lpdead_item_pages < threshold &&
-   vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ TidStoreMemoryUsage(vacrel->dead_items) < (32L * 1024L * 1024L);

The parentheses look strange, and the first line shouldn't change
without a good reason.
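
That is, presumably keeping the original shape (a sketch):

 bypass = (vacrel->lpdead_item_pages < threshold &&
           TidStoreMemoryUsage(vacrel->dead_items) < 32L * 1024L * 1024L);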

- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ dead_items = TidStoreAttach(shared->dead_items_dsa_handle,
+ shared->dead_items_handle);

I feel ambivalent about this comment change. The original is not very
descriptive to begin with. If we need to change at all, maybe "find
dead_items in shared memory"?

v79-0005: As I said earlier, Dilip Kumar reviewed an earlier version.

v79-0006:

vac_work_mem should also go back to being an int.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Mar 27, 2024 at 9:25 AM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Mon, Mar 25, 2024 at 8:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Mar 25, 2024 at 3:25 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > >
> > > On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > > - * remaining LP_DEAD line pointers on the page in the dead_items
> > > - * array. These dead items include those pruned by lazy_scan_prune()
> > > - * as well we line pointers previously marked LP_DEAD.
> > > + * remaining LP_DEAD line pointers on the page in the dead_items.
> > > + * These dead items include those pruned by lazy_scan_prune() as well
> > > + * we line pointers previously marked LP_DEAD.
> > >
> > > Here maybe "into dead_items".
>
> - * remaining LP_DEAD line pointers on the page in the dead_items.
> + * remaining LP_DEAD line pointers on the page into the dead_items.
>
> Let me explain. It used to be "in the dead_items array." It is not an
> array anymore, so it was changed to "in the dead_items". dead_items is
> a variable name, and names don't take "the". "into dead_items" seems
> most natural to me, but there are other possible phrasings.

Thanks for the explanation. I was distracted. Fixed in the latest patch.

>
> > > > > Did you try it with 1MB m_w_m?
> > > >
> > > > I've incorporated the above comments and test results look good to me.
> > >
> > > Could you be more specific about what the test was?
> > > Does it work with 1MB m_w_m?
> >
> > If m_w_m is 1MB, both the initial and maximum segment sizes are 256kB.
> >
> > FYI other test cases I tested were:
> >
> > * m_w_m = 2199023254528 (maximum value)
> > initial: 1MB
> > max: 128GB
> >
> > * m_w_m = 64MB (default)
> > initial: 1MB
> > max: 8MB
>
> If the test was a vacuum, how big a table was needed to hit 128GB?

I just checked how TidStoreCreateLocal() calculated the initial and
max segment sizes while changing m_w_m, so I didn't check how big the
segments actually allocated were in the maximum value test case.

>
> > > The existing comment slipped past my radar, but max_bytes is not a
> > > limit, it's a hint. Come to think of it, it never was a limit in the
> > > normal sense, but in earlier patches it was the criteria for reporting
> > > "I'm full" when asked.
> >
> > Updated the comment.
>
> + * max_bytes is not a limit; it's used to choose the memory block sizes of
> + * a memory context for TID storage in order for the total memory consumption
> + * not to be overshot a lot. The caller can use the max_bytes as the criteria
> + * for reporting whether it's full or not.
>
> This is good information. I suggest this edit:
>
> "max_bytes" is not an internally-enforced limit; it is used only as a
> hint to cap the memory block size of the memory context for TID
> storage. This reduces space wastage due to over-allocation. If the
> caller wants to monitor memory usage, it must compare its limit with
> the value reported by TidStoreMemoryUsage().
>
> Other comments:

Thanks for the suggestion!

>
> v79-0002 looks good to me.
>
> v79-0003:
>
> "With this commit, when creating a shared TidStore, a dedicated DSA
> area is created for TID storage instead of using the provided DSA
> area."
>
> This is very subtle, but "the provided..." implies there still is one.
> -> "a provided..."
>
> + * Similar to TidStoreCreateLocal() but create a shared TidStore on a
> + * DSA area. The TID storage will live in the DSA area, and a memory
> + * context rt_context will have only meta data of the radix tree.
>
> -> "the memory context"

Fixed in the latest patch.

>
> I think you can go ahead and commit 0002 and 0003/4.

I've pushed the 0002 (dsa init and max segment size) patch, and will
push the attached 0001 patch next.

>
> v79-0005:
>
> - bypass = (vacrel->lpdead_item_pages < threshold &&
> -   vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
> + bypass = (vacrel->lpdead_item_pages < threshold) &&
> + TidStoreMemoryUsage(vacrel->dead_items) < (32L * 1024L * 1024L);
>
> The parentheses look strange, and the first line shouldn't change
> without a good reason.

Fixed.

>
> - /* Set dead_items space */
> - dead_items = (VacDeadItems *) shm_toc_lookup(toc,
> - PARALLEL_VACUUM_KEY_DEAD_ITEMS,
> - false);
> + /* Set dead items */
> + dead_items = TidStoreAttach(shared->dead_items_dsa_handle,
> + shared->dead_items_handle);
>
> I feel ambivalent about this comment change. The original is not very
> descriptive to begin with. If we need to change at all, maybe "find
> dead_items in shared memory"?

Agreed.

>
> v79-0005: As I said earlier, Dilip Kumar reviewed an earlier version.
>
> v79-0006:
>
> vac_work_mem should also go back to being an int.

Fixed.

I've attached the latest patches.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, Mar 27, 2024 at 5:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Mar 27, 2024 at 9:25 AM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > On Mon, Mar 25, 2024 at 8:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Mon, Mar 25, 2024 at 3:25 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > > >
> > > > On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > > > - * remaining LP_DEAD line pointers on the page in the dead_items
> > > > - * array. These dead items include those pruned by lazy_scan_prune()
> > > > - * as well we line pointers previously marked LP_DEAD.
> > > > + * remaining LP_DEAD line pointers on the page in the dead_items.
> > > > + * These dead items include those pruned by lazy_scan_prune() as well
> > > > + * we line pointers previously marked LP_DEAD.
> > > >
> > > > Here maybe "into dead_items".
> >
> > - * remaining LP_DEAD line pointers on the page in the dead_items.
> > + * remaining LP_DEAD line pointers on the page into the dead_items.
> >
> > Let me explain. It used to be "in the dead_items array." It is not an
> > array anymore, so it was changed to "in the dead_items". dead_items is
> > a variable name, and names don't take "the". "into dead_items" seems
> > most natural to me, but there are other possible phrasings.
>
> Thanks for the explanation. I was distracted. Fixed in the latest patch.
>
> >
> > > > > > Did you try it with 1MB m_w_m?
> > > > >
> > > > > I've incorporated the above comments and test results look good to me.
> > > >
> > > > Could you be more specific about what the test was?
> > > > Does it work with 1MB m_w_m?
> > >
> > > If m_w_m is 1MB, both the initial and maximum segment sizes are 256kB.
> > >
> > > FYI other test cases I tested were:
> > >
> > > * m_w_m = 2199023254528 (maximum value)
> > > initial: 1MB
> > > max: 128GB
> > >
> > > * m_w_m = 64MB (default)
> > > initial: 1MB
> > > max: 8MB
> >
> > If the test was a vacuum, how big a table was needed to hit 128GB?
>
> I just checked how TidStoreCreateLocal() calculated the initial and
> max segment sizes while changing m_w_m, so I didn't check how big the
> segments actually allocated were in the maximum value test case.
>
> >
> > > > The existing comment slipped past my radar, but max_bytes is not a
> > > > limit, it's a hint. Come to think of it, it never was a limit in the
> > > > normal sense, but in earlier patches it was the criteria for reporting
> > > > "I'm full" when asked.
> > >
> > > Updated the comment.
> >
> > + * max_bytes is not a limit; it's used to choose the memory block sizes of
> > + * a memory context for TID storage in order for the total memory consumption
> > + * not to be overshot a lot. The caller can use the max_bytes as the criteria
> > + * for reporting whether it's full or not.
> >
> > This is good information. I suggest this edit:
> >
> > "max_bytes" is not an internally-enforced limit; it is used only as a
> > hint to cap the memory block size of the memory context for TID
> > storage. This reduces space wastage due to over-allocation. If the
> > caller wants to monitor memory usage, it must compare its limit with
> > the value reported by TidStoreMemoryUsage().
> >
> > Other comments:
>
> Thanks for the suggestion!
>
> >
> > v79-0002 looks good to me.
> >
> > v79-0003:
> >
> > "With this commit, when creating a shared TidStore, a dedicated DSA
> > area is created for TID storage instead of using the provided DSA
> > area."
> >
> > This is very subtle, but "the provided..." implies there still is one.
> > -> "a provided..."
> >
> > + * Similar to TidStoreCreateLocal() but create a shared TidStore on a
> > + * DSA area. The TID storage will live in the DSA area, and a memory
> > + * context rt_context will have only meta data of the radix tree.
> >
> > -> "the memory context"
>
> Fixed in the latest patch.
>
> >
> > I think you can go ahead and commit 0002 and 0003/4.
>
> I've pushed the 0002 (dsa init and max segment size) patch, and will
> push the attached 0001 patch next.

Pushed the refactoring patch.

I've attached the rebased vacuum improvement patch for cfbot. I
mentioned in the commit message that this patch eliminates the 1GB
limitation.

I think the patch is in good shape. Do you have other comments or
suggestions, John?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Mar 28, 2024 at 12:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Pushed the refactoring patch.
>
> I've attached the rebased vacuum improvement patch for cfbot. I
> mentioned in the commit message that this patch eliminates the 1GB
> limitation.
>
> I think the patch is in good shape. Do you have other comments or
> suggestions, John?

I'll do another pass tomorrow, but first I wanted to get in another
slightly-challenging in-situ test. On my humble laptop, I can still
fit a table large enough to cause PG16 to choke on multiple rounds of
index cleanup:

drop table if exists test;
create unlogged table test (a int, b uuid) with (autovacuum_enabled=false);

insert into test (a,b) select i, gen_random_uuid() from
generate_series(1,1000*1000*1000) i;

create index on test (a);
create index on test (b);

delete from test;

vacuum (verbose, truncate off, parallel 2) test;

INFO:  vacuuming "john.public.test"
INFO:  launched 1 parallel vacuum worker for index vacuuming (planned: 1)
INFO:  finished vacuuming "john.public.test": index scans: 1
pages: 0 removed, 6369427 remain, 6369427 scanned (100.00% of total)
tuples: 999997174 removed, 2826 remain, 0 are dead but not yet removable
tuples missed: 2826 dead from 18 pages not removed due to cleanup lock contention
removable cutoff: 771, which was 0 XIDs old when operation ended
new relfrozenxid: 767, which is 4 XIDs ahead of previous value
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan needed: 6369409 pages from table (100.00% of total) had
999997174 dead item identifiers removed
index "test_a_idx": pages: 2741898 in total, 2741825 newly deleted,
2741825 currently deleted, 0 reusable
index "test_b_idx": pages: 3850387 in total, 3842056 newly deleted,
3842056 currently deleted, 0 reusable
avg read rate: 159.740 MB/s, avg write rate: 161.726 MB/s
buffer usage: 26367981 hits, 14958634 misses, 15144601 dirtied
WAL usage: 3 records, 1 full page images, 2050 bytes
system usage: CPU: user: 151.89 s, system: 193.54 s, elapsed: 731.59 s

Watching pg_stat_progress_vacuum, dead_tuple_bytes got up to 398458880.

About the "tuples missed" -- I didn't expect contention during this
test. I believe that's completely unrelated behavior, but wanted to
mention it anyway, since I found it confusing.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Mar 28, 2024 at 6:15 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Mar 28, 2024 at 12:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Pushed the refactoring patch.
> >
> > I've attached the rebased vacuum improvement patch for cfbot. I
> > mentioned in the commit message that this patch eliminates the 1GB
> > limitation.
> >
> > I think the patch is in good shape. Do you have other comments or
> > suggestions, John?
>
> I'll do another pass tomorrow, but first I wanted to get in another
> slightly-challenging in-situ test.

Thanks!

>
> About the "tuples missed" -- I didn't expect contention during this
> test. I believe that's completely unrelated behavior, but wanted to
> mention it anyway, since I found it confusing.

I haven't investigated it enough, but bgwriter might be related to the contention.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Mar 28, 2024 at 12:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I think the patch is in good shape. Do you have other comments or
> suggestions, John?

--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1918,11 +1918,6 @@ include_dir 'conf.d'
         too high.  It may be useful to control for this by separately
         setting <xref linkend="guc-autovacuum-work-mem"/>.
        </para>
-       <para>
-        Note that for the collection of dead tuple identifiers,
-        <command>VACUUM</command> is only able to utilize up to a maximum of
-        <literal>1GB</literal> of memory.
-       </para>
       </listitem>
      </varlistentry>

This is mentioned twice for two different GUCs -- need to remove the
other one, too. Other than that, I just have minor nits:

- * The major space usage for vacuuming is storage for the array of dead TIDs
+ * The major space usage for vacuuming is TID store, a storage for dead TIDs

I think I've helped edit this sentence before, but I still don't quite
like it. I'm thinking now "is storage for the dead tuple IDs".

- * set upper bounds on the number of TIDs we can keep track of at once.
+ * set upper bounds on the maximum memory that can be used for keeping track
+ * of dead TIDs at once.

I think "maximum" is redundant with "upper bounds".

I also feel the commit message needs more "meat" -- we need to clearly
narrate the features and benefits. I've attached how I would write it,
but feel free to use what you like to match your taste.

I've marked it Ready for Committer.

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Fri, Mar 29, 2024 at 4:21 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Mar 28, 2024 at 12:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > I think the patch is in good shape. Do you have other comments or
> > suggestions, John?
>
> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml
> @@ -1918,11 +1918,6 @@ include_dir 'conf.d'
>          too high.  It may be useful to control for this by separately
>          setting <xref linkend="guc-autovacuum-work-mem"/>.
>         </para>
> -       <para>
> -        Note that for the collection of dead tuple identifiers,
> -        <command>VACUUM</command> is only able to utilize up to a maximum of
> -        <literal>1GB</literal> of memory.
> -       </para>
>        </listitem>
>       </varlistentry>
>
> This is mentioned twice for two different GUCs -- need to remove the
> other one, too.

Good catch, removed.

> Other than that, I just have minor nits:
>
> - * The major space usage for vacuuming is storage for the array of dead TIDs
> + * The major space usage for vacuuming is TID store, a storage for dead TIDs
>
> I think I've helped edit this sentence before, but I still don't quite
> like it. I'm thinking now "is storage for the dead tuple IDs".
>
> - * set upper bounds on the number of TIDs we can keep track of at once.
> + * set upper bounds on the maximum memory that can be used for keeping track
> + * of dead TIDs at once.
>
> I think "maximum" is redundant with "upper bounds".

Fixed.

>
> I also feel the commit message needs more "meat" -- we need to clearly
> narrate the features and benefits. I've attached how I would write it,
> but feel free to use what you like to match your taste.

Well, that's much better than mine.

>
> I've marked it Ready for Committer.

Thank you! I've attached the patch that I'm going to push tomorrow.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Apr 1, 2024 at 9:54 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Thank you! I've attached the patch that I'm going to push tomorrow.

Excellent!

I've attached a mostly-polished update on runtime embeddable values,
storing up to 3 offsets in the child pointer (1 on 32-bit platforms).
As discussed, this includes a macro to cap max possible offset that
can be stored in the bitmap, which I believe only reduces the valid
offset range for 32kB pages on 32-bit platforms. Even there, it allows
for more line pointers than can possibly be useful. It also splits
into two parts for readability. It would be committed in two pieces as
well, since they are independently useful.

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Sun, Apr 7, 2024 at 9:08 AM John Naylor <johncnaylorls@gmail.com> wrote:
> I've attached a mostly-polished update on runtime embeddable values,
> storing up to 3 offsets in the child pointer (1 on 32-bit platforms).

And...since there's a new bump context patch, I wanted to anticipate
squeezing an update on top of that, if that gets committed. 0004/5 are
the v6 bump context, and 0006 uses it for vacuum. The rest are to show
it works -- the expected.out changes make possible problems in CI
easier to see. The allocation size is 16 bytes, so this difference is
entirely due to lack of chunk header:

aset: 6619136
bump: 5047296

(Note: assert builds still have the chunk header for sanity checking,
so this was done in a more optimized build)

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Andres Freund
Date:
Hi,

On 2024-04-01 11:53:28 +0900, Masahiko Sawada wrote:
> On Fri, Mar 29, 2024 at 4:21 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > I've marked it Ready for Committer.
>
> Thank you! I've attached the patch that I'm going to push tomorrow.

Locally I ran a 32-bit build with ubsan enabled (by accident, actually),
which complains:

performing post-bootstrap initialization ...
----------------------------------- stderr -----------------------------------
../../../../../home/andres/src/postgresql/src/backend/access/common/tidstore.c:341:24: runtime error: member access within misaligned address 0xffb6258e for type 'struct BlocktableEntry', which requires 4 byte alignment
0xffb6258e: note: pointer points here
 00 00 02 00 01 40  dc e9 83 0b 80 48 70 ee  00 00 00 00 00 00 00 01  17 00 00 00 f8 d4 a6 ee  e8 25
             ^
    #0 0x814097e in TidStoreSetBlockOffsets
../../../../../home/andres/src/postgresql/src/backend/access/common/tidstore.c:341
    #1 0x826560a in dead_items_add ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:2889
    #2 0x825f8da in lazy_scan_prune
../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:1502
    #3 0x825da71 in lazy_scan_heap ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:977
    #4 0x825ad8f in heap_vacuum_rel ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:499
    #5 0x8697e97 in table_relation_vacuum ../../../../../home/andres/src/postgresql/src/include/access/tableam.h:1725
    #6 0x869fca6 in vacuum_rel ../../../../../home/andres/src/postgresql/src/backend/commands/vacuum.c:2206
    #7 0x869a0fd in vacuum ../../../../../home/andres/src/postgresql/src/backend/commands/vacuum.c:622
    #8 0x869986b in ExecVacuum ../../../../../home/andres/src/postgresql/src/backend/commands/vacuum.c:449
    #9 0x8e5f832 in standard_ProcessUtility ../../../../../home/andres/src/postgresql/src/backend/tcop/utility.c:859
    #10 0x8e5e5f6 in ProcessUtility ../../../../../home/andres/src/postgresql/src/backend/tcop/utility.c:523
    #11 0x8e5b71a in PortalRunUtility ../../../../../home/andres/src/postgresql/src/backend/tcop/pquery.c:1158
    #12 0x8e5be80 in PortalRunMulti ../../../../../home/andres/src/postgresql/src/backend/tcop/pquery.c:1315
    #13 0x8e59f9b in PortalRun ../../../../../home/andres/src/postgresql/src/backend/tcop/pquery.c:791
    #14 0x8e4d5f3 in exec_simple_query ../../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:1274
    #15 0x8e55159 in PostgresMain ../../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:4680
    #16 0x8e54445 in PostgresSingleUserMain ../../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:4136
    #17 0x88bb55e in main ../../../../../home/andres/src/postgresql/src/backend/main/main.c:194
    #18 0xf76f47c4  (/lib/i386-linux-gnu/libc.so.6+0x237c4) (BuildId: fe79efe6681a919714a4e119da2baac3a4953fbf)
    #19 0xf76f4887 in __libc_start_main (/lib/i386-linux-gnu/libc.so.6+0x23887) (BuildId:
fe79efe6681a919714a4e119da2baac3a4953fbf)
    #20 0x80d40f7 in _start
(/srv/dev/build/postgres/m-dev-assert-32/tmp_install/srv/dev/install/postgres/m-dev-assert-32/bin/postgres+0x80d40f7)

SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ../../../../../home/andres/src/postgresql/src/backend/access/common/tidstore.c:341:24 in
Aborted (core dumped)
child process exited with exit code 134
initdb: data directory "/srv/dev/build/postgres/m-dev-assert-32/tmp_install/initdb-template" not removed at user's
request


At first I was confused why CI didn't find this. Turns out that, for me, this
is only triggered without compiler optimizations, and I had used -O0 while CI
uses some optimizations.

Backtrace:
#9  0x0814097f in TidStoreSetBlockOffsets (ts=0xb8dfde4, blkno=15, offsets=0xffb6275c, num_offsets=11)
    at ../../../../../home/andres/src/postgresql/src/backend/access/common/tidstore.c:341
#10 0x0826560b in dead_items_add (vacrel=0xb8df6d4, blkno=15, offsets=0xffb6275c, num_offsets=11)
    at ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:2889
#11 0x0825f8db in lazy_scan_prune (vacrel=0xb8df6d4, buf=24, blkno=15, page=0xeeb6c000 "", vmbuffer=729,
all_visible_according_to_vm=false,
    has_lpdead_items=0xffb62a1f) at
../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:1502
#12 0x0825da72 in lazy_scan_heap (vacrel=0xb8df6d4) at
../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:977
#13 0x0825ad90 in heap_vacuum_rel (rel=0xb872810, params=0xffb62e90, bstrategy=0xb99d5e0)
    at ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:499
#14 0x08697e98 in table_relation_vacuum (rel=0xb872810, params=0xffb62e90, bstrategy=0xb99d5e0)
    at ../../../../../home/andres/src/postgresql/src/include/access/tableam.h:1725
#15 0x0869fca7 in vacuum_rel (relid=1249, relation=0x0, params=0xffb62e90, bstrategy=0xb99d5e0)
    at ../../../../../home/andres/src/postgresql/src/backend/commands/vacuum.c:2206
#16 0x0869a0fe in vacuum (relations=0xb99de08, params=0xffb62e90, bstrategy=0xb99d5e0, vac_context=0xb99d550,
isTopLevel=true)

(gdb) p/x page
$1 = 0xffb6258e


I think compiler optimizations are only tangentially involved here; they
cause the stack frame layout to change, e.g. because some variable will just
exist in a register.


Looking at the code, the failure isn't surprising anymore:
    char        data[MaxBlocktableEntrySize];
    BlocktableEntry *page = (BlocktableEntry *) data;

'char' doesn't enforce any alignment, but you're storing a BlocktableEntry in
a char[]. You can't just do that.  Look at how we do that for
e.g. PGAlignedBlock.


With the attached minimal fix, the tests pass again.
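
For illustration, a fix in the PGAlignedBlock style would look roughly
like this (a sketch; the attached patch is authoritative, and the
member names here are made up):

    /* union member forces alignment suitable for BlocktableEntry */
    union
    {
        BlocktableEntry force_align;
        char        data[MaxBlocktableEntrySize];
    }           buf;
    BlocktableEntry *page = (BlocktableEntry *) buf.data;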

Greetings,

Andres Freund

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Apr 8, 2024 at 2:07 AM Andres Freund <andres@anarazel.de> wrote:
>
> Looking at the code, the failure isn't surprising anymore:
>         char            data[MaxBlocktableEntrySize];
>         BlocktableEntry *page = (BlocktableEntry *) data;
>
> 'char' doesn't enforce any alignment, but you're storing a BlocktableEntry in
> a char[]. You can't just do that.  Look at how we do that for
> e.g. PGAlignedBlock.
>
>
> With the attached minimal fix, the tests pass again.

Thanks, will push this shortly!



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Pavel Borisov
Date:
Hi, John!

On Mon, 8 Apr 2024 at 03:13, John Naylor <johncnaylorls@gmail.com> wrote:
On Mon, Apr 8, 2024 at 2:07 AM Andres Freund <andres@anarazel.de> wrote:
>
> Looking at the code, the failure isn't surprising anymore:
>         char            data[MaxBlocktableEntrySize];
>         BlocktableEntry *page = (BlocktableEntry *) data;
>
> 'char' doesn't enforce any alignment, but you're storing a BlocktableEntry in
> a char[]. You can't just do that.  Look at how we do that for
> e.g. PGAlignedBlock.
>
>
> With the attached minimal fix, the tests pass again.

Thanks, will push this shortly!
Buildfarm animal mylodon looks unhappy with this:
FAILED: src/backend/postgres_lib.a.p/access_common_tidstore.c.o 
ccache clang-14 -Isrc/backend/postgres_lib.a.p -Isrc/include -I../pgsql/src/include -I/usr/include/libxml2 -I/usr/include/security -fdiagnostics-color=never -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -O2 -g -fno-strict-aliasing -fwrapv -D_GNU_SOURCE -Wmissing-prototypes -Wpointer-arith -Werror=vla -Werror=unguarded-availability-new -Wendif-labels -Wmissing-format-attribute -Wcast-function-type -Wformat-security -Wdeclaration-after-statement -Wno-unused-command-line-argument -Wno-compound-token-split-by-macro -O1 -ggdb -g3 -fno-omit-frame-pointer -Wall -Wextra -Wno-unused-parameter -Wno-sign-compare -Wno-missing-field-initializers -Wno-array-bounds -std=c99 -Wc11-extensions -Werror=c11-extensions -fPIC -isystem /usr/include/mit-krb5 -pthread -DBUILDING_DLL -MD -MQ src/backend/postgres_lib.a.p/access_common_tidstore.c.o -MF src/backend/postgres_lib.a.p/access_common_tidstore.c.o.d -o src/backend/postgres_lib.a.p/access_common_tidstore.c.o -c ../pgsql/src/backend/access/common/tidstore.c
../pgsql/src/backend/access/common/tidstore.c:48:3: error: anonymous structs are a C11 extension [-Werror,-Wc11-extensions]
                struct
                ^
1 error generated.

Regards,
Pavel Borisov
Supabase 

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Sun, Apr 7, 2024 at 9:08 AM John Naylor <johncnaylorls@gmail.com> wrote:
>
> I've attached a mostly-polished update on runtime embeddable values,
> storing up to 3 offsets in the child pointer (1 on 32-bit platforms).
> As discussed, this includes a macro to cap max possible offset that
> can be stored in the bitmap, which I believe only reduces the valid
> offset range for 32kB pages on 32-bit platforms. Even there, it allows
> for more line pointers than can possibly be useful. It also splits
> into two parts for readability. It would be committed in two pieces as
> well, since they are independently useful.

I pushed both of these and see that mylodon complains that anonymous
unions are a C11 feature. I'm not actually sure that the union with
uintptr_t is actually needed, though, since that's not accessed as
such here. The simplest thing seems to be to get rid of the union and name
the inner struct "header", as in the attached.
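
In outline, the change is simply (a sketch, not the exact field layout):

    typedef struct BlocktableEntry
    {
        struct
        {
            /* ... flag and count bytes ... */
        }           header;     /* named member: plain C99, no extension */

        bitmapword  words[FLEXIBLE_ARRAY_MEMBER];
    } BlocktableEntry;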

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Pavel Borisov
Date:


On Mon, 8 Apr 2024 at 16:27, John Naylor <johncnaylorls@gmail.com> wrote:
On Sun, Apr 7, 2024 at 9:08 AM John Naylor <johncnaylorls@gmail.com> wrote:
>
> I've attached a mostly-polished update on runtime embeddable values,
> storing up to 3 offsets in the child pointer (1 on 32-bit platforms).
> As discussed, this includes a macro to cap max possible offset that
> can be stored in the bitmap, which I believe only reduces the valid
> offset range for 32kB pages on 32-bit platforms. Even there, it allows
> for more line pointers than can possibly be useful. It also splits
> into two parts for readability. It would be committed in two pieces as
> well, since they are independently useful.

I pushed both of these and see that mylodon complains that anonymous
unions are a C11 feature. I'm not actually sure that the union with
uintptr_t is actually needed, though, since that's not accessed as
such here. The simplest thing seems to be to get rid of the union and name
the inner struct "header", as in the attached.

Provided uintptr_t is not accessed, it might be good to get rid of it.

Maybe this patch also needs a correction here:
+#define NUM_FULL_OFFSETS ((sizeof(uintptr_t) - sizeof(uint8) - sizeof(int8)) / sizeof(OffsetNumber))

Regards,
Pavel 

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Apr 8, 2024 at 7:42 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
>
>> I pushed both of these and see that mylodon complains that anonymous
>> unions are a C11 feature. I'm not actually sure that the union with
>> uintptr_t is actually needed, though, since that's not accessed as
>> such here. The simplest thing seems to be to get rid of the union and name
>> the inner struct "header", as in the attached.
>
>
> Provided  uintptr_t is not accessed it might be good to get rid of it.
>
> Maybe this patch also need correction in this:
> +#define NUM_FULL_OFFSETS ((sizeof(uintptr_t) - sizeof(uint8) - sizeof(int8)) / sizeof(OffsetNumber))

For full context the diff was

-#define NUM_FULL_OFFSETS ((sizeof(bitmapword) - sizeof(uint16)) /
sizeof(OffsetNumber))
+#define NUM_FULL_OFFSETS ((sizeof(uintptr_t) - sizeof(uint8) -
sizeof(int8)) / sizeof(OffsetNumber))

I wanted the former, from f35bd9bf35, to be independently useful (in
case the commit in question had some unresolvable issue), and its
intent is to fill struct padding when the array of bitmapword happens
to have length zero. Changing to uintptr_t for the size calculation
reflects the intent to fit in a (local) pointer, regardless of the
size of a bitmapword. (If a DSA pointer happens to be a different size
for some odd platform, it should still work, BTW.)

My thinking with the union was, for big-endian, to force the 'flags'
member to where it can be set, but thinking again, it should still
work if by happenstance the header was smaller than the child pointer:
A different bit would get tagged, but I believe that's irrelevant. The
'flags' member makes sure a byte is reserved for the tag, but it may
not be where the tag is actually located, if that makes sense.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Mon, Apr 8, 2024 at 7:26 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> I pushed both of these and see that mylodon complains that anonymous
> unions are a C11 feature. I'm not actually sure that the union with
> uintptr_t is actually needed, though, since that's not accessed as
> such here. The simplest thing seems to be to get rid of the union and name
> the inner struct "header", as in the attached.

I pushed this with some comment adjustments.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
I took a look at the coverage report from [1] and it seems pretty
good, but there are a couple more tests we could do.

- RT_KEY_GET_SHIFT is not covered for key=0:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L803

That should be fairly simple to add to the tests.

- Some paths for single-value leaves are not covered:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L904
https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L954
https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2606

However, these paths do get regression test coverage on 32-bit
machines. 64-bit builds only have leaves in the TID store, which
doesn't (currently) delete entries, and doesn't instantiate the tree
with the debug option.

- In RT_SET "if (found)" is not covered:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768

That's because we don't yet have code that replaces an existing value
with a value of a different length.

- RT_FREE_RECURSE isn't well covered:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768

The TID store test is pretty simple as far as distribution of block
keys, and focuses more on the offset bitmaps. We could try to cover
all branches here, but it would make the test less readable, and it's
kind of the wrong place to do that anyway. test_radixtree.c does have
a commented-out option to use shared memory, but that's for local
testing and won't be reflected in the coverage report. Maybe it's
enough.

- RT_DELETE: "if (key > tree->ctl->max_val)" is not covered:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2644

That should be easy to add.

- RT_DUMP_NODE is not covered, and never called by default anyway:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2804

It seems we could just leave it alone since it's debug-only, but it's
also a lot of lines. One idea is to use elog with DEBUG5 instead of
commenting out the call sites, but that would cause a lot of noise.

- TidStoreCreate* has some memory clamps that are not covered:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L179
https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L234

Maybe we could experiment with using 1MB for shared, and something
smaller for local.

[1] https://www.postgresql.org/message-id/20240414223305.m3i5eju6zylabvln%40awork3.anarazel.de



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Noah Misch
Date:
On Mon, Apr 15, 2024 at 04:12:38PM +0700, John Naylor wrote:
> - Some paths for single-value leaves are not covered:
> 
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L904
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L954
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2606
> 
> However, these paths do get regression test coverage on 32-bit
> machines. 64-bit builds only have leaves in the TID store, which
> doesn't (currently) delete entries, and doesn't instantiate the tree
> with the debug option.
> 
> - In RT_SET "if (found)" is not covered:
> 
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768
> 
> That's because we don't yet have code that replaces an existing value
> with a value of a different length.

I saw a SIGSEGV there when using tidstore to write a fix for something else.
Patch attached.

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Mon, Apr 15, 2024 at 6:12 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> I took a look at the coverage report from [1] and it seems pretty
> good, but there are a couple more tests we could do.

Thank you for checking!

>
> - RT_KEY_GET_SHIFT is not covered for key=0:
>
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L803
>
> That should be fairly simple to add to the tests.

There are two paths to call RT_KEY_GET_SHIFT():

1. RT_SET() -> RT_KEY_GET_SHIFT()
2. RT_SET() -> RT_EXTEND_UP() -> RT_KEY_GET_SHIFT()

In both cases, it's called when key > tree->ctl->max_val. Since the
minimum value of max_val is 255, RT_KEY_GET_SHIFT() is never called
when key=0.
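
To illustrate, here's a self-contained sketch of the computation
(RT_SPAN is 8 in radixtree.h; the stand-in below replaces
pg_leftmost_one_pos64()):

#include <stdint.h>

#define RT_SPAN 8   /* bits of the key consumed per tree level */

/* stand-in for pg_leftmost_one_pos64(); undefined for key == 0 */
static inline int
leftmost_one_pos64(uint64_t key)
{
    return 63 - __builtin_clzll(key);
}

static inline int
key_get_shift(uint64_t key)
{
    if (key == 0)   /* dead: callers ensure key > max_val >= 255 */
        return 0;

    /* the smallest shift that allows storing the key */
    return (leftmost_one_pos64(key) / RT_SPAN) * RT_SPAN;
}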

>
> - Some paths for single-value leaves are not covered:
>
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L904
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L954
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2606
>
> However, these paths do get regression test coverage on 32-bit
> machines. 64-bit builds only have leaves in the TID store, which
> doesn't (currently) delete entries, and doesn't instantiate the tree
> with the debug option.

Right.

>
> - In RT_SET "if (found)" is not covered:
>
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768
>
> That's because we don't yet have code that replaces an existing value
> with a value of a different length.

Noah reported an issue around that. We should incorporate the patch
and cover this code path.

>
> - RT_FREE_RECURSE isn't well covered:
>
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768
>
> The TID store test is pretty simple as far as the distribution of block
> keys goes, and focuses more on the offset bitmaps. We could try to cover
> all branches here, but it would make the test less readable, and it's
> kind of the wrong place to do that anyway. test_radixtree.c does have
> a commented-out option to use shared memory, but that's for local
> testing and won't be reflected in the coverage report. Maybe it's
> enough.

Agreed.

>
> - RT_DELETE: "if (key > tree->ctl->max_val)" is not covered:
>
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2644
>
> That should be easy to add.

Agreed. The patch is attached.

>
> - RT_DUMP_NODE is not covered, and never called by default anyway:
>
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2804
>
> It seems we could just leave it alone since it's debug-only, but it's
> also a lot of lines. One idea is to use elog with DEBUG5 instead of
> commenting out the call sites, but that would cause a lot of noise.

I think we can leave it alone.

>
> - TidStoreCreate* has some memory clamps that are not covered:
>
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L179
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L234
>
> Maybe we could experiment with using 1MB for shared, and something
> smaller for local.

I've confirmed that local and shared tidstores work with small max
sizes such as 4kB and 1MB. Currently the max size is hard-coded in
test_tidstore.c, but if we use work_mem as the max size, we can pass
different max sizes for local and shared in the test script.
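
As a rough sketch of that idea (hypothetical code, not the committed
test module; SET work_mem = '64kB' or '1MB' in the script would then
control the size):

#include "postgres.h"
#include "access/tidstore.h"
#include "miscadmin.h"      /* work_mem, in kB */
#include "storage/lwlock.h" /* LWTRANCHE_FIRST_USER_DEFINED */

/* hypothetical helper for test_tidstore.c: size the store from work_mem */
static TidStore *
create_test_tidstore(bool shared)
{
    size_t      max_bytes = (size_t) work_mem * 1024;

    if (shared)
        return TidStoreCreateShared(max_bytes, LWTRANCHE_FIRST_USER_DEFINED);
    else
        return TidStoreCreateLocal(max_bytes, true);
}

Note that work_mem's floor is 64kB, so the 4kB case would still need to
be hard-coded.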

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Apr 25, 2024 at 6:03 AM Noah Misch <noah@leadboat.com> wrote:
>
> On Mon, Apr 15, 2024 at 04:12:38PM +0700, John Naylor wrote:
> > - Some paths for single-value leaves are not covered:
> >
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L904
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L954
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2606
> >
> > However, these paths do get regression test coverage on 32-bit
> > machines. 64-bit builds only have leaves in the TID store, which
> > doesn't (currently) delete entries, and doesn't instantiate the tree
> > with the debug option.
> >
> > - In RT_SET "if (found)" is not covered:
> >
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768
> >
> > That's because we don't yet have code that replaces an existing value
> > with a value of a different length.
>
> I saw a SIGSEGV there when using tidstore to write a fix for something else.
> Patch attached.

Great find, thank you for the patch!

The fix looks good to me. I think we can improve the regression tests
for better coverage. In TidStore on a 64-bit machine, we can store up
to 3 offsets in the header, and such values are embedded in the tree
itself. With more than 3 offsets, the value size becomes more than 16
bytes and the value is stored as a single-value leaf. Therefore, if we
add a test with array[1,2,3,4,100], we can cover the case of replacing
a single-value leaf with a new single-value leaf of a different size. Now we add 9 pairs
of do_set_block_offsets() and check_set_block_offsets(). If these are
annoying, we can remove the cases of array[1] and array[1,2].
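
To make the size math concrete, here's a sketch of the value layout as
I understand it (field names are illustrative; see BlocktableEntry in
tidstore.c for the real definition):

/*
 * On a 64-bit machine the header is one pointer (8 bytes) wide and can
 * hold up to 3 OffsetNumbers directly; such values are embedded in the
 * tree.  Larger values become single-value leaves whose size depends
 * on how many bitmap words the highest offset needs:
 *
 *   array[1,2,3]       - fits in the header (embedded)
 *   array[1,2,3,4]     - header + 1 bitmap word (single-value leaf)
 *   array[1,2,3,4,100] - header + 2 bitmap words (a leaf of another size)
 */
typedef struct BlocktableEntrySketch
{
    struct
    {
        uint8        flags;
        int8         nwords;          /* bitmap words that follow */
        OffsetNumber full_offsets[3]; /* used for embedded values */
    } header;
    bitmapword   words[FLEXIBLE_ARRAY_MEMBER];
} BlocktableEntrySketch;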

I've attached a new patch. In addition to the new test case I
mentioned, I've added some new comments and removed an unnecessarily
added line in test_tidstore.sql.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Apr 25, 2024 at 9:50 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > I saw a SIGSEGV there when using tidstore to write a fix for something else.
> > Patch attached.
>
> Great find, thank you for the patch!

+1

(This occurred to me a few days ago, but I was far from my computer.)

With the purge function that Noah proposed, I believe we can also get
rid of the comment at the top of the .sql test file warning of a
maintenance hazard:
..."To avoid adding duplicates,
-- each call to do_set_block_offsets() should use different block
-- numbers."

I found that it doesn't add any measurable time to run the test.

> The fix looks good to me. I think we can improve the regression tests
> for better coverage. In TidStore on a 64-bit machine, we can store up
> to 3 offsets in the header, and such values are embedded in the tree
> itself. With more than 3 offsets, the value size becomes more than 16
> bytes and the value is stored as a single-value leaf. Therefore, if we
> add a test with array[1,2,3,4,100], we can cover the case of replacing
> a single-value leaf with a new single-value leaf of a different size. Now we add 9 pairs

Good idea.

> of do_set_block_offsets() and check_set_block_offsets(). If these are
> annoying, we can remove the cases of array[1] and array[1,2].

Let's keep those -- 32-bit platforms should also exercise this path.



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Apr 25, 2024 at 12:17 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Apr 25, 2024 at 9:50 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > > I saw a SIGSEGV there when using tidstore to write a fix for something else.
> > > Patch attached.
> >
> > Great find, thank you for the patch!
>
> +1
>
> (This occurred to me a few days ago, but I was far from my computer.)
>
> With the purge function that Noah proposed, I believe we can also get
> rid of the comment at the top of the .sql test file warning of a
> maintenance hazard:
> ..."To avoid adding duplicates,
> -- each call to do_set_block_offsets() should use different block
> -- numbers."

Good point. Removed.

>
> > of do_set_block_offsets() and check_set_block_offsets(). If these are
> > annoying, we can remove the cases of array[1] and array[1,2].
>
> Let's keep those -- 32-bit platforms should also exercise this path.

Agreed.

I've attached a new patch. I'll push it tonight, if there is no further comment.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Thu, Apr 25, 2024 at 1:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Apr 25, 2024 at 12:17 PM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > On Thu, Apr 25, 2024 at 9:50 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > > I saw a SIGSEGV there when using tidstore to write a fix for something else.
> > > > Patch attached.
> > >
> > > Great find, thank you for the patch!
> >
> > +1
> >
> > (This occurred to me a few days ago, but I was far from my computer.)
> >
> > With the purge function that Noah proposed, I believe we can also get
> > rid of the comment at the top of the .sql test file warning of a
> > maintenance hazard:
> > ..."To avoid adding duplicates,
> > -- each call to do_set_block_offsets() should use different block
> > -- numbers."
>
> Good point. Removed.
>
> >
> > > of do_set_block_offsets() and check_set_block_offsets(). If these are
> > > annoying, we can remove the cases of array[1] and array[1,2].
> >
> > Let's keep those -- 32-bit platforms should also exercise this path.
>
> Agreed.
>
> I've attached a new patch. I'll push it tonight, if there is no further comment.
>

Pushed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: [PoC] Improve dead tuple storage for lazy vacuum

From
John Naylor
Date:
On Thu, Apr 25, 2024 at 8:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Apr 15, 2024 at 6:12 PM John Naylor <johncnaylorls@gmail.com> wrote:

> > - RT_KEY_GET_SHIFT is not covered for key=0:
> >
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L803
> >
> > That should be fairly simple to add to the tests.
>
> There are two paths to call RT_KEY_GET_SHIFT():
>
> 1. RT_SET() -> RT_KEY_GET_SHIFT()
> 2. RT_SET() -> RT_EXTEND_UP() -> RT_KEY_GET_SHIFT()
>
> In both cases, it's called when key > tree->ctl->max_val. Since the
> minimum value of max_val is 255, RT_KEY_GET_SHIFT() is never called
> when key=0.

Ah, right, so it is dead code. Nothing to worry about, but it does
point the way to some simplifications, which I've put together in the
attached.

> > - RT_DELETE: "if (key > tree->ctl->max_val)" is not covered:
> >
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2644
> >
> > That should be easy to add.
>
> Agreed. The patch is attached.

LGTM

> > - TidStoreCreate* has some memory clamps that are not covered:
> >
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L179
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L234
> >
> > Maybe we could experiment with using 1MB for shared, and something
> > smaller for local.
>
> I've confirmed that local and shared tidstores work with small max
> sizes such as 4kB and 1MB. Currently the max size is hard-coded in
> test_tidstore.c, but if we use work_mem as the max size, we can pass
> different max sizes for local and shared in the test script.

Seems okay, do you want to try that and see how it looks?

Attachment

Re: [PoC] Improve dead tuple storage for lazy vacuum

From
Masahiko Sawada
Date:
On Wed, May 1, 2024 at 4:29 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Apr 25, 2024 at 8:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Apr 15, 2024 at 6:12 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> > > - RT_KEY_GET_SHIFT is not covered for key=0:
> > >
> > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L803
> > >
> > > That should be fairly simple to add to the tests.
> >
> > There are two paths to call RT_KEY_GET_SHIFT():
> >
> > 1. RT_SET() -> RT_KEY_GET_SHIFT()
> > 2. RT_SET() -> RT_EXTEND_UP() -> RT_KEY_GET_SHIFT()
> >
> > In both cases, it's called when key > tree->ctl->max_val. Since the
> > minimum value of max_val is 255, RT_KEY_GET_SHIFT() is never called
> > when key=0.
>
> Ah, right, so it is dead code. Nothing to worry about, but it does
> point the way to some simplifications, which I've put together in the
> attached.

Thank you for the patch. It looks good to me.

+       /* compute the smallest shift that will allowing storing the key */
+       start_shift = pg_leftmost_one_pos64(key) / RT_SPAN * RT_SPAN;

The comment is moved from RT_KEY_GET_SHIFT() but I think s/will
allowing storing/will allow storing/.

>
> > > - RT_DELETE: "if (key > tree->ctl->max_val)" is not covered:
> > >
> > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2644
> > >
> > > That should be easy to add.
> >
> > Agreed. The patch is attached.
>
> LGTM
>
> > > - TidStoreCreate* has some memory clamps that are not covered:
> > >
> > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L179
> > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L234
> > >
> > > Maybe we could experiment with using 1MB for shared, and something
> > > smaller for local.
> >
> > I've confirmed that local and shared tidstores work with small max
> > sizes such as 4kB and 1MB. Currently the max size is hard-coded in
> > test_tidstore.c, but if we use work_mem as the max size, we can pass
> > different max sizes for local and shared in the test script.
>
> Seems okay, do you want to try that and see how it looks?

I've attached a simple patch for this. In test_tidstore.sql, we used
to create two local tidstores and one shared tidstore. I thought of
specifying small work_mem values for these three cases, but that would
lose the normal test cases, so I created a separate tidstore for this
test. Also, the new test only checks whether a tidstore can be created
with such a small size; it might be a good idea to add some TIDs to
check that it really works.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment