Making the initial and maximum DSA segment sizes configurable - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Making the initial and maximum DSA segment sizes configurable
Msg-id CAD21AoAYGGC1ePjVX0H+pp9rH=9vuPK19nNOiu12NprdV5TVJA@mail.gmail.com
Hi all,

I started this new thread from another thread[1] where we're
discussing a new storage scheme for TIDs, TidStore, since we ran into a
difficulty with the memory usage limit for TidStores on DSA.

TidStore is a new data structure to efficiently store TIDs, backed by
a radix tree. In the patch series proposed on that thread, in addition
to the radix tree and TidStore, there is another patch for lazy (parallel)
vacuum to replace the array of dead tuple TIDs with a TidStore. To
support parallel vacuum, the radix tree (and TidStore) can be created in
local memory as well as on DSA. It also has a memory usage limit:
we can pass a memory limit (e.g., maintenance_work_mem) to the
TidStoreCreate() function. Once the total DSA segment size
(area->control->total_segment_size) exceeds the limit,
TidStoreIsFull() returns true. Lazy vacuum can continue scanning
heap blocks to collect dead tuple TIDs until TidStoreIsFull() returns
true. Currently lazy vacuum is the sole user of TidStore, but it could
also be used by other code such as tidbitmap.c, where memory usage would
be limited by work_mem.
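
To make the intended usage concrete, the heap-scan loop in lazy vacuum
would look roughly like the sketch below. This is only an illustration:
the exact signatures of TidStoreCreate() and TidStoreIsFull(), and the
helper collect_dead_tids(), are placeholders rather than the actual
patch code.

/* Sketch of the heap-scan loop (placeholder signatures, not the patch). */
TidStore   *ts;
BlockNumber blkno;

ts = TidStoreCreate(maintenance_work_mem, dead_items_dsa);  /* assumed signature */

for (blkno = 0; blkno < nblocks; blkno++)
{
    /* collect dead tuple TIDs from this heap block into the TidStore */
    collect_dead_tids(blkno, ts);               /* hypothetical helper */

    /* stop scanning once the DSA-backed store reaches the memory limit */
    if (TidStoreIsFull(ts))
        break;
}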

During development, we found that DSA grows its memory in large, coarse
steps, which makes enforcing the memory limit inefficient.

DSA is built on top of DSM segments: it manages a set of DSM
segments, adding new segments as required and detaching them when they
are no longer needed. The first DSA segment is 1MB in size, and each
new segment is at least big enough to follow a geometric series that
approximately doubles the total storage each time a new segment is
created. Because of this, simply comparing the memory limit to the
total segment size is not an effective way to enforce the limit. For
example, if maintenance_work_mem is 512MB, the total segment size
grows like this:

2 * (1 + 2 + 4 + 8 + 16 + 32 + 64 + 128) = 510MB -> less than the
limit, continue heap scan.

2 * (1 + 2 + 4 + 8 + 16 + 32 + 64 + 128) + 256 = 766MB -> stop (exceeds the limit by 254MB).
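
To make the arithmetic easy to reproduce, here is a small standalone
simulation of the growth pattern described above (two segments of each
size before the size doubles). This only approximates dsa.c's behavior;
it is not the actual allocation code:

#include <stdio.h>

/*
 * Approximate DSA segment growth: segments start at 1MB and two segments
 * are created at each size before the size doubles, so the total roughly
 * doubles with every pair of new segments.
 */
int
main(void)
{
    long    limit_mb = 512;     /* maintenance_work_mem */
    long    seg_mb = 1;         /* initial segment size */
    long    total_mb = 0;
    int     nsegs = 0;

    while (total_mb <= limit_mb)
    {
        total_mb += seg_mb;
        nsegs++;
        if (nsegs % 2 == 0)
            seg_mb *= 2;        /* double the size every two segments */
    }
    /* With a 512MB limit this prints: 17 segments, 766MB (254MB over). */
    printf("%d segments, %ldMB total\n", nsegs, total_mb);

    return 0;
}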

One might think we could use dsa_set_size_limit(), but that doesn't
work: lazy vacuum would end up with an error once the limit is hit. And
if we set DSA_ALLOC_NO_OOM, we might end up stopping an insertion
halfway through.

Besides over-allocating memory, since the initial DSM segment size is
fixed at 1MB, the memory usage of a shared TidStore starts at 1MB or
more. This is higher than the minimum values of both work_mem and
maintenance_work_mem, 64kB and 1MB respectively. Increasing the minimum
maintenance_work_mem to 2MB might be acceptable, but the same is not
true for work_mem.

While researching possible solutions, we found that aset.c has a
similar characteristic: it allocates an 8kB block (by default) upon the
first allocation in a context and doubles that size for each successive
block request, but callers can specify both the initial block size and
the maximum block size. This led to the idea of letting callers specify
both values for DSA as well, with both values calculated based on
maintenance_work_mem (m_w_m). I've attached a patch for this idea. The
changes to dsa.c are straightforward, since dsa.c already uses the
macros DSA_INITIAL_SEGMENT_SIZE and DSA_MAX_SEGMENT_SIZE; I just made
these values configurable.
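
Without presuming the exact API of the attached patch, the shape of the
change could look roughly like this (the function name and parameter
names here are illustrative):

/*
 * Illustrative sketch only -- the attached patch may spell this
 * differently.  The fixed macros become per-area defaults that a
 * caller can override when creating the area.
 */
dsa_area *dsa_create_ext(int tranche_id,
                         size_t init_segment_size,
                         size_t max_segment_size);

/* Existing callers could keep the old behavior via a wrapper: */
#define dsa_create(tranche_id) \
    dsa_create_ext(tranche_id, DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE)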

FYI, with this patch we can create a DSA area in parallel_vacuum_init()
with the initial and maximum segment sizes computed as follows:

initial segment size = min(m_w_m / 4, 1MB)
maximum segment size = max(m_w_m / 8, 8MB)
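
In code, the computation in parallel_vacuum_init() would be along these
lines (a sketch based on the formulas above, not necessarily the patch
verbatim; dead_items_area and the tranche id are placeholders):

/* Sketch: derive the DSA segment sizes from the memory limit.
 * maintenance_work_mem is in kilobytes; Min()/Max() are the usual c.h macros. */
size_t      limit = (size_t) maintenance_work_mem * 1024;
size_t      init_segment_size = Min(limit / 4, (size_t) 1024 * 1024);      /* min(m_w_m / 4, 1MB) */
size_t      max_segment_size = Max(limit / 8, (size_t) 8 * 1024 * 1024);   /* max(m_w_m / 8, 8MB) */

/* create the DSA area for the shared TidStore with these bounds,
 * using the dsa_create_ext() shape sketched earlier */
dead_items_area = dsa_create_ext(tranche_id, init_segment_size, max_segment_size);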

In most cases, we can start with a 1MB initial segment, the same as
before. With a larger memory limit, the heap scan stops after DSA has
allocated at most about 1.25 times m_w_m. For example, if m_w_m = 512MB,
the initial and maximum segment sizes are 1MB and 64MB respectively,
and DSA allocates segments as follows until heap scanning stops:

2 * (1 + 2 + 4 + 8 + 16 + 32 + 64) + (64 * 4) = 510MB -> less than the
limit, continue heap scan.

2 * (1 + 2 + 4 + 8 + 16 + 32 + 64) + (64 * 5) = 574MB -> stop
(an additional 62MB over the limit).

It also works with smaller memory limits: if the limit is 1MB, we start
with a 256kB initial segment, and heap scanning stops after DSA has
allocated 1.5MB (= 256kB + 256kB + 512kB + 512kB).
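
Repeating the earlier simulation with these configurable bounds
reproduces the numbers above (again, only an approximation of dsa.c's
behavior, not the patch itself):

#include <stdio.h>

/*
 * Approximate growth with configurable initial/maximum segment sizes:
 * two segments per size, doubling up to the maximum size, then the
 * maximum size repeatedly.  Sizes are in kB.
 */
static void
simulate(long limit_kb, long init_kb, long max_kb)
{
    long    seg_kb = init_kb;
    long    total_kb = 0;
    int     nsegs = 0;

    while (total_kb <= limit_kb)
    {
        total_kb += seg_kb;
        nsegs++;
        if (nsegs % 2 == 0 && seg_kb < max_kb)
            seg_kb *= 2;
    }
    printf("limit=%ldkB: stopped at %ldkB after %d segments\n",
           limit_kb, total_kb, nsegs);
}

int
main(void)
{
    simulate(512 * 1024, 1024, 64 * 1024);  /* m_w_m = 512MB -> 574MB, 19 segments */
    simulate(1024, 256, 8 * 1024);          /* m_w_m = 1MB   -> 1.5MB, 4 segments */

    return 0;
}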

There is room to consider better formulas for the initial and maximum
segment sizes, but making both values configurable seems like a
promising idea. Behaving analogously to aset.c should also be good for
readability and maintainability. There are further test results where I
used this idea on top of the radix tree[2].

We need to keep an eye on the total number of allocated DSA segments,
since the total number of DSM segments available on the system is
fixed[3]. But this does not seem problematic even with this patch, since
we allocate only a few additional segments (17 vs. 19 segments in the
512MB example above). There was also no significant difference in
performance[2].

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoDBmD5q%3DeO%2BK%3DgyuVt53XvwpJ2dgxPwrtZ-eVOjVmtJjg%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAD21AoDKr%3D4YHphy6cRojE5eyT6E2ao8xb44E309eTrUEOC6xw%40mail.gmail.com
[3] from dsm.c, the total number of DSM segments available on the
system is calculated by:
#define PG_DYNSHMEM_FIXED_SLOTS         64
#define PG_DYNSHMEM_SLOTS_PER_BACKEND   5
maxitems = PG_DYNSHMEM_FIXED_SLOTS
    + PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
