Re: Improve monitoring of shared memory allocations - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: Improve monitoring of shared memory allocations
Date:
Msg-id: k6f6ynjvb7lvebhygaiqsfrdohq672uughk3q4ve4q5jqljywz@7jbrz724epeq
In response to: Improve monitoring of shared memory allocations (Rahila Syed <rahilasyed90@gmail.com>)
List: pgsql-hackers
Hi,

Thanks for sending these, the issues addressed here have been bugging me
for a long while.

On 2025-03-01 10:19:01 +0530, Rahila Syed wrote:
> The 0001* patch improved the accounting for the shared memory allocated for
> a hash table during hash_create. pg_shmem_allocations tracks the memory
> allocated by ShmemInitStruct, which, for shared hash tables, only covers
> memory allocated for the hash directory and control structure via
> ShmemInitHash. The hash segments and buckets are allocated using
> ShmemAllocNoError, which does not attribute the allocation to the hash
> table and also does not add it to ShmemIndex.
>
> Therefore, these allocations are not tracked in pg_shmem_allocations. To
> improve this, include the allocation of segments and buckets in the initial
> allocation of the shared memory for the hash table, in ShmemInitHash.
>
> This will result in pg_shmem_allocations representing the total size of the
> initial hash table, including all the buckets and elements, instead of just
> the directory size.

I think this should also be more efficient. Less space wasted on padding
and fewer indirect function calls.

> Like ShmemAllocNoError, the shared memory allocated by ShmemAlloc is not
> tracked by pg_shmem_allocations. The 0002* patch replaces most of the calls
> to ShmemAlloc with ShmemInitStruct to associate a name with the allocations
> and ensure that they get tracked by pg_shmem_allocations.

In some of these cases it may be better to combine the allocation with a
prior ShmemInitStruct instead of doing a separate ShmemInitStruct() for
the allocations that are doing ShmemAlloc() right now.

cfbot found a few compiler warnings:
https://cirrus-ci.com/task/6526903542087680

[16:47:46.964] make -s -j${BUILD_JOBS} clean
[16:47:47.452] time make -s -j${BUILD_JOBS} world-bin
[16:49:10.496] lwlock.c: In function ‘CreateLWLocks’:
[16:49:10.496] lwlock.c:467:22: error: unused variable ‘found’ [-Werror=unused-variable]
[16:49:10.496]   467 |         bool            found;
[16:49:10.496]       |                         ^~~~~
[16:49:10.496] cc1: all warnings being treated as errors
[16:49:10.496] make[4]: *** [<builtin>: lwlock.o] Error 1
[16:49:10.496] make[3]: *** [../../../src/backend/common.mk:37: lmgr-recursive] Error 2
[16:49:10.496] make[3]: *** Waiting for unfinished jobs....
[16:49:11.881] make[2]: *** [common.mk:37: storage-recursive] Error 2
[16:49:11.881] make[2]: *** Waiting for unfinished jobs....
[16:49:20.195] dynahash.c: In function ‘hash_create’:
[16:49:20.195] dynahash.c:643:37: error: ‘curr_offset’ may be used uninitialized [-Werror=maybe-uninitialized]
[16:49:20.195]   643 |                 curr_offset = (((char *) curr_offset) + (temp * elementSize));
[16:49:20.195]       |                 ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[16:49:20.195] dynahash.c:588:23: note: ‘curr_offset’ was declared here
[16:49:20.195]   588 |                 void       *curr_offset;
[16:49:20.195]       |                            ^~~~~~~~~~~
[16:49:20.195] cc1: all warnings being treated as errors
[16:49:20.196] make[4]: *** [<builtin>: dynahash.o] Error 1
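Both look like trivial fixes; sketches only (assuming 'found' in
CreateLWLocks really is used by nothing but an assert, and that
curr_offset is only ever assigned on the shared-table path):

	/* lwlock.c: drop 'found' entirely, or keep it for assertions only */
	bool		found PG_USED_FOR_ASSERTS_ONLY;

	/* dynahash.c: initialize, so the compiler can prove it's always set */
	void	   *curr_offset = NULL;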
> From c13a7133ed455842b685426217a7b5079e6fc869 Mon Sep 17 00:00:00 2001
> From: Rahila Syed <rahilasyed.90@gmail.com>
> Date: Fri, 21 Feb 2025 15:08:12 +0530
> Subject: [PATCH 1/2] Account for initial shared memory allocated during
>  hash_create.
>
> pg_shmem_allocations tracks the memory allocated by ShmemInitStruct,
> which, in case of shared hash tables, only covers memory allocated
> to the hash directory and control structure. The hash segments and
> buckets are allocated using ShmemAllocNoError which does not attribute
> the allocations to the hash table name. Thus, these allocations are
> not tracked in pg_shmem_allocations.
>
> To improve this, include the allocation of segments and buckets or
> elements in the initial allocation of the shared hash directory. Since
> this adds numbers to existing hash table entries, the resulting tuples
> in pg_shmem_allocations represent the total size of the initial hash
> table including all the buckets and the elements they contain, instead
> of just the directory size.

> diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
> index cd5a00132f..5203f5b30b 100644
> --- a/src/backend/utils/hash/dynahash.c
> +++ b/src/backend/utils/hash/dynahash.c
> @@ -120,7 +120,6 @@
>   * a good idea of the maximum number of entries!).  For non-shared hash
>   * tables, the initial directory size can be left at the default.
>   */
> -#define DEF_SEGSIZE			   256
>  #define DEF_SEGSIZE_SHIFT	   8	/* must be log2(DEF_SEGSIZE) */
>  #define DEF_DIRSIZE			   256

Why did you move this to the header? Afaict it's just used in
hash_get_shared_size(), which is also in dynahash.c?

> @@ -265,7 +264,7 @@ static long hash_accesses,
>   */
>  static void *DynaHashAlloc(Size size);
>  static HASHSEGMENT seg_alloc(HTAB *hashp);
> -static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
> +static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx, HASHELEMENT *firstElement);
>  static bool dir_realloc(HTAB *hashp);
>  static bool expand_table(HTAB *hashp);
>  static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
> @@ -276,11 +275,10 @@ static void hash_corrupted(HTAB *hashp) pg_attribute_noreturn();
>  static uint32 hash_initial_lookup(HTAB *hashp, uint32 hashvalue,
>  								  HASHBUCKET **bucketptr);
>  static long next_pow2_long(long num);
> -static int	next_pow2_int(long num);
>  static void register_seq_scan(HTAB *hashp);
>  static void deregister_seq_scan(HTAB *hashp);
>  static bool has_seq_scans(HTAB *hashp);
> -
> +static int	find_num_of_segs(long nelem, int *nbuckets, long num_partitions, long ssize);
>
>  /*
>   * memory allocation support

You removed a newline here that probably shouldn't be removed.

> @@ -468,7 +466,11 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
>  	else
>  		hashp->keycopy = memcpy;
>
> -	/* And select the entry allocation function, too. */
> +	/*
> +	 * And select the entry allocation function, too. XXX should this also
> +	 * Assert that flags & HASH_SHARED_MEM is true, since HASH_ALLOC is
> +	 * currently only set with HASH_SHARED_MEM *
> +	 */
>  	if (flags & HASH_ALLOC)
>  		hashp->alloc = info->alloc;
>  	else
> @@ -518,6 +520,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
>
>  	hashp->frozen = false;
>
> +	/* Initializing the HASHHDR variables with default values */
>  	hdefault(hashp);
>
>  	hctl = hashp->hctl;

I assume these were just observations you made while looking into this?
They seem unrelated to the change itself?
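Re the XXX in the hash_create() hunk above: if HASH_ALLOC really is only
ever set together with HASH_SHARED_MEM, the assertion it asks about
would just be (a sketch, only valid if that premise holds):

	if (flags & HASH_ALLOC)
	{
		Assert(flags & HASH_SHARED_MEM);	/* hypothetical */
		hashp->alloc = info->alloc;
	}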
> @@ -582,7 +585,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
>  				freelist_partitions,
>  				nelem_alloc,
>  				nelem_alloc_first;
> -
> +	void	   *curr_offset;
> +
>  	/*
>  	 * If hash table is partitioned, give each freelist an equal share of
>  	 * the initial allocation.  Otherwise only freeList[0] is used.
> @@ -592,6 +596,20 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
>  	else
>  		freelist_partitions = 1;
>
> +	/*
> +	 * If table is shared, calculate the offset at which to find the
> +	 * first partition of elements
> +	 */
> +	if (hashp->isshared)
> +	{
> +		int			nsegs;
> +		int			nbuckets;
> +
> +		nsegs = find_num_of_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
> +
> +		curr_offset = (((char *) hashp->hctl) + sizeof(HASHHDR) +
> +					   (info->dsize * sizeof(HASHSEGMENT)) +
> +					   (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
> +	}
> +

Why only do this for shared hashtables? Couldn't we allocate the elements
together with the rest for non-shared hashtables too?

>  	nelem_alloc = nelem / freelist_partitions;
>  	if (nelem_alloc <= 0)
>  		nelem_alloc = 1;
> @@ -609,11 +627,20 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
>  		for (i = 0; i < freelist_partitions; i++)
>  		{
>  			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
> -
> -			if (!element_alloc(hashp, temp, i))
> +			HASHELEMENT *firstElement;
> +			Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
> +
> +			/* Memory is allocated as part of initial allocation in ShmemInitHash */
> +			if (hashp->isshared)
> +				firstElement = (HASHELEMENT *) curr_offset;
> +			else
> +				firstElement = NULL;
> +
> +			if (!element_alloc(hashp, temp, i, firstElement))
>  				ereport(ERROR,
>  						(errcode(ERRCODE_OUT_OF_MEMORY),
>  						 errmsg("out of memory")));
> +			curr_offset = (((char *) curr_offset) + (temp * elementSize));
>  		}
>  	}

Seems a bit ugly to go through element_alloc() when pre-allocating.
Perhaps it's the best thing we can do to avoid duplicating code, but it
seems worth checking if we can do better. Perhaps we could split
element_alloc() into element_alloc() and element_add() or such? With the
latter doing everything after hashp->alloc().
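Roughly what I have in mind, as a sketch only (names made up, the exact
split is debatable):

	/* Allocate memory for nelem elements; returns NULL on failure. */
	static HASHELEMENT *element_alloc(HTAB *hashp, int nelem);

	/* Link nelem already-allocated elements into freelist freelist_idx. */
	static void element_add(HTAB *hashp, HASHELEMENT *firstElement,
							int nelem, int freelist_idx);

hash_create() could then call element_add() directly on the
pre-allocated chunk, instead of routing a fake allocation through
element_alloc().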
> @@ -693,7 +720,7 @@ init_htab(HTAB *hashp, long nelem)
>  	int			nbuckets;
>  	int			nsegs;
>  	int			i;
> -
> +
>  	/*
>  	 * initialize mutexes if it's a partitioned table
>  	 */

Spurious change.

> @@ -701,30 +728,11 @@ init_htab(HTAB *hashp, long nelem)
>  		for (i = 0; i < NUM_FREELISTS; i++)
>  			SpinLockInit(&(hctl->freeList[i].mutex));
>
> -	/*
> -	 * Allocate space for the next greater power of two number of buckets,
> -	 * assuming a desired maximum load factor of 1.
> -	 */
> -	nbuckets = next_pow2_int(nelem);
> -
> -	/*
> -	 * In a partitioned table, nbuckets must be at least equal to
> -	 * num_partitions; were it less, keys with apparently different partition
> -	 * numbers would map to the same bucket, breaking partition independence.
> -	 * (Normally nbuckets will be much bigger; this is just a safety check.)
> -	 */
> -	while (nbuckets < hctl->num_partitions)
> -		nbuckets <<= 1;
> +	nsegs = find_num_of_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
>
>  	hctl->max_bucket = hctl->low_mask = nbuckets - 1;
>  	hctl->high_mask = (nbuckets << 1) - 1;
>
> -	/*
> -	 * Figure number of directory segments needed, round up to a power of 2
> -	 */
> -	nsegs = (nbuckets - 1) / hctl->ssize + 1;
> -	nsegs = next_pow2_int(nsegs);
> -
>  	/*
>  	 * Make sure directory is big enough.  If pre-allocated directory is too
>  	 * small, choke (caller screwed up).

A function called find_num_of_segs() that also sets nbuckets seems a bit
confusing. I also don't like "find_*", as that sounds like it's
searching some data structure, rather than just doing a bit of math.

> @@ -1689,6 +1726,7 @@ seg_alloc(HTAB *hashp)
>  	HASHSEGMENT segp;
>
>  	CurrentDynaHashCxt = hashp->hcxt;
> +
>  	segp = (HASHSEGMENT) hashp->alloc(sizeof(HASHBUCKET) * hashp->ssize);
>
>  	if (!segp)

Spurious change.

> @@ -1816,7 +1854,7 @@ next_pow2_long(long num)
>  }
>
>  /* calculate first power of 2 >= num, bounded to what will fit in an int */
> -static int
> +int
>  next_pow2_int(long num)
>  {
>  	if (num > INT_MAX / 2)
> @@ -1957,3 +1995,31 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
>  		}
>  	}
>  }

Why export this?

> diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
> index 932cc4f34d..5e16bd4183 100644
> --- a/src/include/utils/hsearch.h
> +++ b/src/include/utils/hsearch.h
> @@ -151,7 +151,7 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
>  extern void hash_freeze(HTAB *hashp);
>  extern Size hash_estimate_size(long num_entries, Size entrysize);
>  extern long hash_select_dirsize(long num_entries);
> -extern Size hash_get_shared_size(HASHCTL *info, int flags);
> +extern Size hash_get_shared_size(HASHCTL *info, int flags, long init_size);
>  extern void AtEOXact_HashTables(bool isCommit);
>  extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);

It's imo a bit weird that we have very related logic in
hash_estimate_size() and hash_get_shared_size(). Why do we need to
duplicate it?

> --- a/src/backend/storage/lmgr/predicate.c
> +++ b/src/backend/storage/lmgr/predicate.c
> @@ -1244,7 +1244,7 @@ PredicateLockShmemInit(void)
>  		PredXact->HavePartialClearedThrough = 0;
>  		requestSize = mul_size((Size) max_table_size,
>  							   sizeof(SERIALIZABLEXACT));
> -		PredXact->element = ShmemAlloc(requestSize);
> +		PredXact->element = ShmemInitStruct("SerializableXactList", requestSize, &found);
>  		/* Add all elements to available list, clean. */
>  		memset(PredXact->element, 0, requestSize);
>  		for (i = 0; i < max_table_size; i++)
> @@ -1299,9 +1299,11 @@ PredicateLockShmemInit(void)
>  	 * probably OK.
>  	 */
>  	max_table_size *= 5;
> +	requestSize = mul_size((Size) max_table_size,
> +						   RWConflictDataSize);
>
>  	RWConflictPool = ShmemInitStruct("RWConflictPool",
> -									 RWConflictPoolHeaderDataSize,
> +									 RWConflictPoolHeaderDataSize + requestSize,
>  									 &found);
>  	Assert(found == IsUnderPostmaster);
>  	if (!found)
> @@ -1309,9 +1311,7 @@ PredicateLockShmemInit(void)
>  		int			i;
>
>  		dlist_init(&RWConflictPool->availableList);
> -		requestSize = mul_size((Size) max_table_size,
> -							   RWConflictDataSize);
> -		RWConflictPool->element = ShmemAlloc(requestSize);
> +		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool + RWConflictPoolHeaderDataSize);
>  		/* Add all elements to available list, clean. */
>  		memset(RWConflictPool->element, 0, requestSize);
>  		for (i = 0; i < max_table_size; i++)

These I'd just combine with the ShmemInitStruct("PredXactList"), by
allocating the additional space. The pointer math is a bit annoying, but
it makes much more sense to have one entry in pg_shmem_allocations.
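I.e. something like this untested sketch, reusing the existing
PredXactListDataSize / requestSize math and mirroring what the patch
already does for RWConflictPool:

	requestSize = mul_size((Size) max_table_size,
						   sizeof(SERIALIZABLEXACT));
	PredXact = ShmemInitStruct("PredXactList",
							   PredXactListDataSize + requestSize,
							   &found);
	...
	/* the element array lives directly behind the header */
	PredXact->element =
		(SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);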
> diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
> index 49204f91a2..e112735d93 100644
> --- a/src/backend/storage/lmgr/proc.c
> +++ b/src/backend/storage/lmgr/proc.c
> @@ -218,11 +218,11 @@ InitProcGlobal(void)
>  	 * how hotly they are accessed.
>  	 */
>  	ProcGlobal->xids =
> -		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
> +		(TransactionId *) ShmemInitStruct("Proc Transaction Ids", TotalProcs * sizeof(*ProcGlobal->xids), &found);
>  	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
> -	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
> +	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemInitStruct("Proc Sub-transaction id states", TotalProcs * sizeof(*ProcGlobal->subxidStates), &found);
>  	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
> -	ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
> +	ProcGlobal->statusFlags = (uint8 *) ShmemInitStruct("Proc Status Flags", TotalProcs * sizeof(*ProcGlobal->statusFlags), &found);
>  	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
>
>  	/*

Same. Although here I'd say it's worth padding the size of each separate
"allocation" by PG_CACHE_LINE_SIZE.
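With CACHELINEALIGN() that could look like this (sketch only; "Proc
Arrays" is a made-up name, and it assumes combining the three arrays
into one ShmemInitStruct()):

	Size		sz = 0;
	char	   *ptr;

	/* size each array padded out to a cache line boundary */
	sz += CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->xids));
	sz += CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->subxidStates));
	sz += CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->statusFlags));

	ptr = ShmemInitStruct("Proc Arrays", sz, &found);

	ProcGlobal->xids = (TransactionId *) ptr;
	ptr += CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->xids));
	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
	ptr += CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->subxidStates));
	ProcGlobal->statusFlags = (uint8 *) ptr;

That way the arrays can't false-share a cache line.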
> @@ -233,7 +233,7 @@ InitProcGlobal(void)
>  	fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
>  	fpRelIdSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(Oid) * FP_LOCK_SLOTS_PER_GROUP);
>
> -	fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
> +	fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs * (fpLockBitsSize + fpRelIdSize), &found);
>  	MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
>
>  	/* For asserts checking we did not overflow. */

This one might actually make sense to keep separate, depending on the
configuration it can be reasonably big (max_connections = 1k,
max_locks_per_transaction = 1k results in ~5MB).

Greetings,

Andres Freund