Re: Improve monitoring of shared memory allocations - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: Improve monitoring of shared memory allocations
Date:
Msg-id: k6f6ynjvb7lvebhygaiqsfrdohq672uughk3q4ve4q5jqljywz@7jbrz724epeq
In response to: Improve monitoring of shared memory allocations (Rahila Syed <rahilasyed90@gmail.com>)
List: pgsql-hackers

Hi,

Thanks for sending these; the issues addressed here have been bugging me for a
long while.

On 2025-03-01 10:19:01 +0530, Rahila Syed wrote:
> The 0001* patch improved the accounting for the shared memory allocated for
> a hash table during hash_create.  pg_shmem_allocations tracks the memory
> allocated by ShmemInitStruct, which, for shared hash tables, only covers
> memory allocated for the hash directory and control structure via
> ShmemInitHash. The hash segments and buckets are allocated using
> ShmemAllocNoError, which does not attribute the allocation to the hash table
> and also does not add it to ShmemIndex.
> 
> Therefore, these allocations are not tracked in pg_shmem_allocations.  To
> improve this, include the allocation of segments and buckets in the initial
> allocation of the shared memory for the hash table, in ShmemInitHash.
>
> This will result in pg_shmem_allocations representing the total size of the
> initial hash table, including all the buckets and elements, instead of just
> the directory size.

I think this should also be more efficient. Less space wasted on padding and
fewer indirect function calls.


> Like ShmemAllocNoError, the shared memory allocated by ShmemAlloc is not
> tracked by pg_shmem_allocations.  The 0002* patch replaces most of the calls
> to ShmemAlloc with ShmemInitStruct to associate a name with the allocations
> and ensure that they get tracked by pg_shmem_allocations.

In some of these cases it may be better to combine the allocation with a prior
ShmemInitStruct() instead of doing a separate ShmemInitStruct() for each
allocation that currently goes through ShmemAlloc().
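
E.g. something along these lines (just a sketch; FooHeader/FooElement stand in
for whatever the caller actually uses):

    /* one tracked allocation covering the header and the trailing array */
    Size        hdrSize = MAXALIGN(sizeof(FooHeader));
    Size        size = add_size(hdrSize, mul_size(nelems, sizeof(FooElement)));
    FooHeader  *hdr = ShmemInitStruct("Foo Data", size, &found);

    if (!found)
    {
        memset(hdr, 0, size);
        /* the element array lives directly after the aligned header */
        hdr->element = (FooElement *) ((char *) hdr + hdrSize);
    }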


cfbot found a few compiler warnings:

https://cirrus-ci.com/task/6526903542087680
[16:47:46.964] make -s -j${BUILD_JOBS} clean
[16:47:47.452] time make -s -j${BUILD_JOBS} world-bin
[16:49:10.496] lwlock.c: In function ‘CreateLWLocks’:
[16:49:10.496] lwlock.c:467:22: error: unused variable ‘found’ [-Werror=unused-variable]
[16:49:10.496]   467 |                 bool found;
[16:49:10.496]       |                      ^~~~~
[16:49:10.496] cc1: all warnings being treated as errors
[16:49:10.496] make[4]: *** [<builtin>: lwlock.o] Error 1
[16:49:10.496] make[3]: *** [../../../src/backend/common.mk:37: lmgr-recursive] Error 2
[16:49:10.496] make[3]: *** Waiting for unfinished jobs....
[16:49:11.881] make[2]: *** [common.mk:37: storage-recursive] Error 2
[16:49:11.881] make[2]: *** Waiting for unfinished jobs....
[16:49:20.195] dynahash.c: In function ‘hash_create’:
[16:49:20.195] dynahash.c:643:37: error: ‘curr_offset’ may be used uninitialized [-Werror=maybe-uninitialized]
[16:49:20.195]   643 |                         curr_offset = (((char *)curr_offset) + (temp * elementSize));
[16:49:20.195]       |                         ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[16:49:20.195] dynahash.c:588:23: note: ‘curr_offset’ was declared here
[16:49:20.195]   588 |                 void *curr_offset;
[16:49:20.195]       |                       ^~~~~~~~~~~
[16:49:20.195] cc1: all warnings being treated as errors
[16:49:20.196] make[4]: *** [<builtin>: dynahash.o] Error 1


> From c13a7133ed455842b685426217a7b5079e6fc869 Mon Sep 17 00:00:00 2001
> From: Rahila Syed <rahilasyed.90@gmail.com>
> Date: Fri, 21 Feb 2025 15:08:12 +0530
> Subject: [PATCH 1/2] Account for initial shared memory allocated during
>  hash_create.
> 
> pg_shmem_allocations tracks the memory allocated by ShmemInitStruct,
> which, in case of shared hash tables, only covers memory allocated
> to the hash directory and control structure. The hash segments and
> buckets are allocated using ShmemAllocNoError which does not attribute
> the allocations to the hash table name. Thus, these allocations are
> not tracked in pg_shmem_allocations.
> 
> Improve this by including the allocation of segments and buckets or
> elements in the initial allocation of the shared hash directory. Since
> these sizes are added to the existing hash table entries, the resulting
> tuples in pg_shmem_allocations represent the total size of the initial
> hash table, including all the buckets and the elements they contain,
> instead of just the directory size.


> diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
> index cd5a00132f..5203f5b30b 100644
> --- a/src/backend/utils/hash/dynahash.c
> +++ b/src/backend/utils/hash/dynahash.c
> @@ -120,7 +120,6 @@
>   * a good idea of the maximum number of entries!).  For non-shared hash
>   * tables, the initial directory size can be left at the default.
>   */
> -#define DEF_SEGSIZE               256
>  #define DEF_SEGSIZE_SHIFT       8    /* must be log2(DEF_SEGSIZE) */
>  #define DEF_DIRSIZE               256

Why did you move this to the header? Afaict it's just used in
hash_get_shared_size(), which is also in dynahash.c?


> @@ -265,7 +264,7 @@ static long hash_accesses,
>   */
>  static void *DynaHashAlloc(Size size);
>  static HASHSEGMENT seg_alloc(HTAB *hashp);
> -static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
> +static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx, HASHELEMENT *firstElement);
>  static bool dir_realloc(HTAB *hashp);
>  static bool expand_table(HTAB *hashp);
>  static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
> @@ -276,11 +275,10 @@ static void hash_corrupted(HTAB *hashp) pg_attribute_noreturn();
>  static uint32 hash_initial_lookup(HTAB *hashp, uint32 hashvalue,
>                                    HASHBUCKET **bucketptr);
>  static long next_pow2_long(long num);
> -static int    next_pow2_int(long num);
>  static void register_seq_scan(HTAB *hashp);
>  static void deregister_seq_scan(HTAB *hashp);
>  static bool has_seq_scans(HTAB *hashp);
> -
> +static int find_num_of_segs(long nelem, int *nbuckets, long num_partitions, long ssize);
>  
>  /*
>   * memory allocation support

You removed a newline here that probably shouldn't be removed.



> @@ -468,7 +466,11 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
>      else
>          hashp->keycopy = memcpy;
>  
> -    /* And select the entry allocation function, too. */
> +    /*
> +     * And select the entry allocation function, too. XXX should this also
> +     * Assert that flags & HASH_SHARED_MEM is true, since HASH_ALLOC is
> +     * currently only set with HASH_SHARED_MEM *
> +     */
>      if (flags & HASH_ALLOC)
>          hashp->alloc = info->alloc;
>      else
> @@ -518,6 +520,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
>  
>      hashp->frozen = false;
>  
> +    /* Initializing the HASHHDR variables with default values */
>      hdefault(hashp);
>  
>      hctl = hashp->hctl;

I assume these were just observations you made while looking into this? They
seem unrelated to the change itself?


> @@ -582,7 +585,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
>                      freelist_partitions,
>                      nelem_alloc,
>                      nelem_alloc_first;
> -
> +        void *curr_offset;
> +    
>          /*
>           * If hash table is partitioned, give each freelist an equal share of
>           * the initial allocation.  Otherwise only freeList[0] is used.
> @@ -592,6 +596,20 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
>          else
>              freelist_partitions = 1;
>  
> +        /*
> +         * If table is shared, calculate the offset at which to find the
> +         * the first partition of elements
> +         */
> +        if (hashp->isshared)
> +        {
> +            int            nsegs;
> +            int            nbuckets;
> +            nsegs = find_num_of_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
> +            
> +            curr_offset =  (((char *) hashp->hctl) + sizeof(HASHHDR) + (info->dsize * sizeof(HASHSEGMENT)) +
> +                        + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
> +        }
> +

Why only do this for shared hashtables? Couldn't we allocate the elements
together with the rest for non-shared hashtables too?


>          nelem_alloc = nelem / freelist_partitions;
>          if (nelem_alloc <= 0)
>              nelem_alloc = 1;
> @@ -609,11 +627,20 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
>          for (i = 0; i < freelist_partitions; i++)
>          {
>              int            temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
> -
> -            if (!element_alloc(hashp, temp, i))
> +            HASHELEMENT *firstElement;
> +             Size elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
> +
> +            /* Memory is allocated as part of initial allocation in ShmemInitHash */
> +            if (hashp->isshared)
> +                firstElement = (HASHELEMENT *) curr_offset;
> +            else
> +                firstElement = NULL;        
> +                
> +            if (!element_alloc(hashp, temp, i, firstElement))
>                  ereport(ERROR,
>                          (errcode(ERRCODE_OUT_OF_MEMORY),
>                           errmsg("out of memory")));
> +            curr_offset = (((char *)curr_offset) + (temp * elementSize));
>          }
>      }

Seems a bit ugly to go through element_alloc() when pre-allocating.  Perhaps
it's the best thing we can do to avoid duplicating code, but it seems worth
checking if we can do better. Perhaps we could split element_alloc() into
element_alloc() and element_add() or such?  With the latter doing everything
after hashp->alloc().
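
I.e. something like (hypothetical split, signatures not thought through):

    /* just allocate space for nelem entries, no initialization */
    static HASHELEMENT *
    element_alloc(HTAB *hashp, int nelem)
    {
        Size        elementSize = MAXALIGN(sizeof(HASHELEMENT)) +
            MAXALIGN(hashp->hctl->entrysize);

        CurrentDynaHashCxt = hashp->hcxt;
        return (HASHELEMENT *) hashp->alloc(nelem * elementSize);
    }

    /* link an already-allocated chunk of nelem entries into freelist_idx */
    static void
    element_add(HTAB *hashp, HASHELEMENT *firstElement, int nelem,
                int freelist_idx)
    {
        /* everything element_alloc() currently does after hashp->alloc() */
    }

That way the pre-allocated case would just call element_add() on the right
offset.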


> @@ -693,7 +720,7 @@ init_htab(HTAB *hashp, long nelem)
>      int            nbuckets;
>      int            nsegs;
>      int            i;
> -
> +    
>      /*
>       * initialize mutexes if it's a partitioned table
>       */

Spurious change.


> @@ -701,30 +728,11 @@ init_htab(HTAB *hashp, long nelem)
>          for (i = 0; i < NUM_FREELISTS; i++)
>              SpinLockInit(&(hctl->freeList[i].mutex));
>  
> -    /*
> -     * Allocate space for the next greater power of two number of buckets,
> -     * assuming a desired maximum load factor of 1.
> -     */
> -    nbuckets = next_pow2_int(nelem);
> -
> -    /*
> -     * In a partitioned table, nbuckets must be at least equal to
> -     * num_partitions; were it less, keys with apparently different partition
> -     * numbers would map to the same bucket, breaking partition independence.
> -     * (Normally nbuckets will be much bigger; this is just a safety check.)
> -     */
> -    while (nbuckets < hctl->num_partitions)
> -        nbuckets <<= 1;
> +    nsegs = find_num_of_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
>  
>      hctl->max_bucket = hctl->low_mask = nbuckets - 1;
>      hctl->high_mask = (nbuckets << 1) - 1;
>  
> -    /*
> -     * Figure number of directory segments needed, round up to a power of 2
> -     */
> -    nsegs = (nbuckets - 1) / hctl->ssize + 1;
> -    nsegs = next_pow2_int(nsegs);
> -
>      /*
>       * Make sure directory is big enough. If pre-allocated directory is too
>       * small, choke (caller screwed up).

A function called find_num_of_segs() that also sets nbuckets seems a bit
confusing.  I also don't like "find_*", as that sounds like it's searching
some data structure, rather than just doing a bit of math.
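
Maybe something like this instead (name just a suggestion):

    /*
     * Compute the number of buckets and directory segments needed for
     * nelem elements; sets *nbuckets and returns the segment count.
     */
    static int
    compute_buckets_and_segs(long nelem, long num_partitions, long ssize,
                             int *nbuckets);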


> @@ -1689,6 +1726,7 @@ seg_alloc(HTAB *hashp)
>      HASHSEGMENT segp;
>  
>      CurrentDynaHashCxt = hashp->hcxt;
> +    
>      segp = (HASHSEGMENT) hashp->alloc(sizeof(HASHBUCKET) * hashp->ssize);
>  
>      if (!segp)

Spurious change.


> @@ -1816,7 +1854,7 @@ next_pow2_long(long num)
>  }
>  
>  /* calculate first power of 2 >= num, bounded to what will fit in an int */
> -static int
> +int
>  next_pow2_int(long num)
>  {
>      if (num > INT_MAX / 2)
> @@ -1957,3 +1995,31 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
>          }
>      }
>  }

Why export this?


> diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
> index 932cc4f34d..5e16bd4183 100644
> --- a/src/include/utils/hsearch.h
> +++ b/src/include/utils/hsearch.h
> @@ -151,7 +151,7 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
>  extern void hash_freeze(HTAB *hashp);
>  extern Size hash_estimate_size(long num_entries, Size entrysize);
>  extern long hash_select_dirsize(long num_entries);
> -extern Size hash_get_shared_size(HASHCTL *info, int flags);
> +extern Size hash_get_shared_size(HASHCTL *info, int flags, long init_size);
>  extern void AtEOXact_HashTables(bool isCommit);
>  extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);

It's imo a bit weird that we have very related logic in hash_estimate_size()
and hash_get_shared_size(). Why do we need to duplicate it?



> --- a/src/backend/storage/lmgr/predicate.c
> +++ b/src/backend/storage/lmgr/predicate.c
> @@ -1244,7 +1244,7 @@ PredicateLockShmemInit(void)
>          PredXact->HavePartialClearedThrough = 0;
>          requestSize = mul_size((Size) max_table_size,
>                                 sizeof(SERIALIZABLEXACT));
> -        PredXact->element = ShmemAlloc(requestSize);
> +        PredXact->element = ShmemInitStruct("SerializableXactList", requestSize, &found);
>          /* Add all elements to available list, clean. */
>          memset(PredXact->element, 0, requestSize);
>          for (i = 0; i < max_table_size; i++)
> @@ -1299,9 +1299,11 @@ PredicateLockShmemInit(void)
>       * probably OK.
>       */
>      max_table_size *= 5;
> +    requestSize = mul_size((Size) max_table_size,
> +                        RWConflictDataSize);
>  
>      RWConflictPool = ShmemInitStruct("RWConflictPool",
> -                                     RWConflictPoolHeaderDataSize,
> +                                     RWConflictPoolHeaderDataSize + requestSize,
>                                       &found);
>      Assert(found == IsUnderPostmaster);
>      if (!found)
> @@ -1309,9 +1311,7 @@ PredicateLockShmemInit(void)
>          int            i;
>  
>          dlist_init(&RWConflictPool->availableList);
> -        requestSize = mul_size((Size) max_table_size,
> -                               RWConflictDataSize);
> -        RWConflictPool->element = ShmemAlloc(requestSize);
> +        RWConflictPool->element = (RWConflict) ((char *)RWConflictPool + RWConflictPoolHeaderDataSize); 
>          /* Add all elements to available list, clean. */
>          memset(RWConflictPool->element, 0, requestSize);
>          for (i = 0; i < max_table_size; i++)

These I'd just combine with the ShmemInitStruct("PredXactList"), by allocating
the additional space. The pointer math is a bit annoying, but it makes much
more sense to have one entry in pg_shmem_allocations.
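
Roughly like this (untested):

    requestSize = mul_size((Size) max_table_size, sizeof(SERIALIZABLEXACT));
    PredXact = ShmemInitStruct("PredXactList",
                               add_size(PredXactListDataSize, requestSize),
                               &found);
    Assert(found == IsUnderPostmaster);
    if (!found)
    {
        /* the element array follows the header in the same allocation */
        PredXact->element = (SERIALIZABLEXACT *)
            ((char *) PredXact + PredXactListDataSize);
        memset(PredXact->element, 0, requestSize);
        /* add all elements to the available list, as now */
    }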


> diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
> index 49204f91a2..e112735d93 100644
> --- a/src/backend/storage/lmgr/proc.c
> +++ b/src/backend/storage/lmgr/proc.c
> @@ -218,11 +218,11 @@ InitProcGlobal(void)
>       * how hotly they are accessed.
>       */
>      ProcGlobal->xids =
> -        (TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
> +        (TransactionId *) ShmemInitStruct("Proc Transaction Ids", TotalProcs * sizeof(*ProcGlobal->xids), &found);
>      MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
> -    ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
> +    ProcGlobal->subxidStates = (XidCacheStatus *) ShmemInitStruct("Proc Sub-transaction id states", TotalProcs * sizeof(*ProcGlobal->subxidStates), &found);
>      MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
> -    ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
> +    ProcGlobal->statusFlags = (uint8 *) ShmemInitStruct("Proc Status Flags", TotalProcs * sizeof(*ProcGlobal->statusFlags), &found);
>      MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
>  
>      /*

Same.

Although here I'd say it's worth padding the size of each separate
"allocation" by PG_CACHE_LINE_SIZE.


> @@ -233,7 +233,7 @@ InitProcGlobal(void)
>      fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
>      fpRelIdSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(Oid) * FP_LOCK_SLOTS_PER_GROUP);
>  
> -    fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
> +    fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs * (fpLockBitsSize + fpRelIdSize), &found);
>      MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
>  
>      /* For asserts checking we did not overflow. */

This one might actually make sense to keep separate; depending on the
configuration it can be reasonably big (max_connections = 1k,
max_locks_per_transaction = 1k results in ~5MB).


Greetings,

Andres Freund


