Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers
From: John Naylor
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date:
Msg-id: CAFBsxsG2SDMpmON_ovVY6S+oCT-Mn=9giMm4_YquHdkZfn7XBQ@mail.gmail.com
In response to: Re: [PoC] Improve dead tuple storage for lazy vacuum (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses:
  Re: [PoC] Improve dead tuple storage for lazy vacuum
  Re: [PoC] Improve dead tuple storage for lazy vacuum
  Re: [PoC] Improve dead tuple storage for lazy vacuum
List: pgsql-hackers
On Wed, Sep 28, 2022 at 10:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> BTW We need to consider not only aset/slab but also DSA since we
> allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
> the following size classes:
>
> static const uint16 dsa_size_classes[] = {
> [...]
Thanks for that info -- I wasn't familiar with the details of DSA. For the non-parallel case, I plan to at least benchmark using aset, since I gather it's the most heavily optimized; I'm thinking that will let other problem areas stand out more clearly. I'll also want to compare total context size against slab's, to see whether possibly lower fragmentation makes up for other wastage.
Along those lines, one thing I've been thinking about is the number of size classes. There is a tradeoff between memory efficiency and the number of branches needed when searching/inserting. My current thinking is that there is too much coupling between size class and data type: each size class currently uses a different data type and a different algorithm to search and set it, which in turn requires another branch. We've found that a larger number of size classes leads to poor branch prediction [1] and (I imagine) poor code density.
I'm thinking we can use "flexible array members" for the values/pointers, and keep the rest of the control data in the struct the same. That way, we never have more than 4 actual "kinds" to code and branch on. As a bonus, when migrating a node to a larger size class of the same kind, we can simply repalloc() to the next size. To show what I mean, consider this new table:
node2:    5 +   6      +(5)+   2*8 =   32 bytes
node6:    5 +   6      +(5)+   6*8 =   64
node12:   5 +  27          +  12*8 =  128
node27:   5 +  27          +  27*8 =  248 (-> 256)
node91:   5 + 256 + 28 +(7)+  91*8 = 1024
node219:  5 + 256 + 28 +(7)+ 219*8 = 2048
node256:  5 +  32      +(3)+ 256*8 = 2088 (-> 4096)
Seven size classes are grouped into the four kinds.
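To make the layout concrete, here is a minimal sketch of how one kind could share a single struct across its two size classes via a flexible array member. The field names and the exact contents of the 5-byte base are my placeholders, not the actual patch:

/* Hypothetical base -- all-uint8 fields keep it at exactly 5 bytes.
 * count fits in a uint8 here because no size class below node256
 * holds more than 219 entries, and node256 is special-cased anyway. */
typedef struct rt_node
{
    uint8   kind;       /* which of the four kinds this node is */
    uint8   shift;      /* bit position of this level's key chunk */
    uint8   chunk;      /* key byte this node covers in its parent */
    uint8   count;      /* children currently stored */
    uint8   capacity;   /* new: fanout of the current size class */
} rt_node;

/* The "linear array" kind, shared by node2 and node6.  The control
 * data, including the chunk array sized for the larger class, is
 * identical in both; only the allocated length of the flexible array
 * member differs, so node2 -> node6 is a single repalloc(). */
typedef struct rt_node_linear
{
    rt_node  base;
    uint8    chunks[6];     /* sorted key bytes; node2 uses only 2 */
    /* 5 bytes of alignment padding here, per the table above */
    rt_node *children[FLEXIBLE_ARRAY_MEMBER];   /* 2 or 6 pointers */
} rt_node_linear;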
The common base at the front is 5 bytes here because there is a new uint8 field for "capacity", which we can ignore for node256 since we assume we can always insert/update that node. The control data is the same within each pair, so the offset to the pointer/value array is the same. Thus, migration would look something like:
case FOO_KIND:
    if (unlikely(count == capacity))
    {
        if (capacity == XYZ)    /* for smaller size class of the pair */
        {
            <repalloc to next size class>;
            capacity = next-higher-capacity;
            goto do_insert;
        }
        else
            <migrate data to next node kind>;
    }
    else
    {
do_insert:
        <...>;
        break;
    }
    /* FALLTHROUGH */
...
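For concreteness, here is a hedged sketch of what the <repalloc to next size class> step could look like for the hypothetical linear kind sketched above (grow_linear_node and the constants are mine, not the patch's). Note this only works for backend-local memory contexts; the parallel case would need a DSA-specific allocate-and-copy path, since repalloc() does not operate on DSA memory:

/* Hypothetical: grow a node2 into a node6.  Both size classes share
 * one struct layout, so only the allocation size and the capacity
 * field change; repalloc() carries over the control data and the
 * existing children. */
static rt_node_linear *
grow_linear_node(rt_node_linear *node)
{
    rt_node_linear *newnode;

    newnode = (rt_node_linear *)
        repalloc(node, offsetof(rt_node_linear, children) +
                 sizeof(rt_node *) * 6);
    newnode->base.capacity = 6;
    return newnode;
}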
One disadvantage is that this wastes some space by reserving the full set of control data in the smaller size class of each pair, but it's usually small compared to the array size. Somewhat unrelated: we could still implement Andres' idea [1] to dispense with the isset array in inner nodes of the indirect-array kind (now node91 and node219), since we can just test whether the pointer is null.
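As a sketch of that idea (variable names assumed from above): in an inner node, a free slot can be found, and an occupied one detected, by testing the child pointer directly rather than consulting a bitmap:

/* Sketch: without an isset array, "slot unused" in an inner node is
 * encoded by a NULL child pointer.  Leaf nodes storing values can't
 * do this, since zero may be a legitimate value. */
for (slot = 0; slot < node->base.capacity; slot++)
{
    if (node->children[slot] == NULL)
        break;          /* found a free slot */
}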
[1] https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de
--
John Naylor
EDB: http://www.enterprisedb.com