Re: Improve monitoring of shared memory allocations - Mailing list pgsql-hackers

From Rahila Syed
Subject Re: Improve monitoring of shared memory allocations
Date
Msg-id CAH2L28uL-8EQTRSeyTpW1DqAXsDXRkXCkT1dus2u6p4HYDrxAg@mail.gmail.com
In response to Re: Improve monitoring of shared memory allocations  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hi,

Thank you for the review.


> cfbot found a few compiler warnings:
>
> https://cirrus-ci.com/task/6526903542087680
> [16:47:46.964] make -s -j${BUILD_JOBS} clean
> [16:47:47.452] time make -s -j${BUILD_JOBS} world-bin
> [16:49:10.496] lwlock.c: In function ‘CreateLWLocks’:
> [16:49:10.496] lwlock.c:467:22: error: unused variable ‘found’ [-Werror=unused-variable]
> [16:49:10.496]   467 |                 bool found;
> [16:49:10.496]       |                      ^~~~~
> [16:49:10.496] cc1: all warnings being treated as errors
> [16:49:10.496] make[4]: *** [<builtin>: lwlock.o] Error 1
> [16:49:10.496] make[3]: *** [../../../src/backend/common.mk:37: lmgr-recursive] Error 2
> [16:49:10.496] make[3]: *** Waiting for unfinished jobs....
> [16:49:11.881] make[2]: *** [common.mk:37: storage-recursive] Error 2
> [16:49:11.881] make[2]: *** Waiting for unfinished jobs....
> [16:49:20.195] dynahash.c: In function ‘hash_create’:
> [16:49:20.195] dynahash.c:643:37: error: ‘curr_offset’ may be used uninitialized [-Werror=maybe-uninitialized]
> [16:49:20.195]   643 |                         curr_offset = (((char *)curr_offset) + (temp * elementSize));
> [16:49:20.195]       |                         ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> [16:49:20.195] dynahash.c:588:23: note: ‘curr_offset’ was declared here
> [16:49:20.195]   588 |                 void *curr_offset;
> [16:49:20.195]       |                       ^~~~~~~~~~~
> [16:49:20.195] cc1: all warnings being treated as errors
> [16:49:20.196] make[4]: *** [<builtin>: dynahash.o] Error 1

 
Fixed these.
 

> diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
> index cd5a00132f..5203f5b30b 100644
> --- a/src/backend/utils/hash/dynahash.c
> +++ b/src/backend/utils/hash/dynahash.c
> @@ -120,7 +120,6 @@
>   * a good idea of the maximum number of entries!).  For non-shared hash
>   * tables, the initial directory size can be left at the default.
>   */
> -#define DEF_SEGSIZE                     256
>  #define DEF_SEGSIZE_SHIFT       8    /* must be log2(DEF_SEGSIZE) */
>  #define DEF_DIRSIZE                     256

> Why did you move this to the header? Afaict it's just used in
> hash_get_shared_size(), which is also in dynahash.c?

 
Yes. This was accidentally left behind by the previous version of the
patch, so I undid the change. 


>  static void register_seq_scan(HTAB *hashp);
>  static void deregister_seq_scan(HTAB *hashp);
>  static bool has_seq_scans(HTAB *hashp);
> -
> +static int find_num_of_segs(long nelem, int *nbuckets, long num_partitions, long ssize);

>  /*
>   * memory allocation support

> You removed a newline here that probably shouldn't be removed.
 
Fixed this. 

 

> @@ -468,7 +466,11 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
>       else
>               hashp->keycopy = memcpy;

> -     /* And select the entry allocation function, too. */
> +     /*
> +      * And select the entry allocation function, too. XXX should this also
> +      * Assert that flags & HASH_SHARED_MEM is true, since HASH_ALLOC is
> +      * currently only set with HASH_SHARED_MEM *
> +      */
>       if (flags & HASH_ALLOC)
>               hashp->alloc = info->alloc;
>       else
> @@ -518,6 +520,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)

>       hashp->frozen = false;

> +     /* Initializing the HASHHDR variables with default values */
>       hdefault(hashp);

>       hctl = hashp->hctl;

> I assume these were just observations you made while looking into this? They
> seem unrelated to the change itself?
 
Yes. I removed the first one and left the second one as a code comment. 
 

> @@ -582,7 +585,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
>                                       freelist_partitions,
>                                       nelem_alloc,
>                                       nelem_alloc_first;
> -
> +             void *curr_offset;
> +     
>               /*
>                * If hash table is partitioned, give each freelist an equal share of
>                * the initial allocation.  Otherwise only freeList[0] is used.
> @@ -592,6 +596,20 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
>               else
>                       freelist_partitions = 1;

> +             /*
> +              * If table is shared, calculate the offset at which to find the
> +              * the first partition of elements
> +              */
> +             if (hashp->isshared)
> +             {
> +                     int                     nsegs;
> +                     int                     nbuckets;
> +                     nsegs = find_num_of_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
> +                     
> +                     curr_offset =  (((char *) hashp->hctl) + sizeof(HASHHDR) + (info->dsize * sizeof(HASHSEGMENT)) +
> +                        + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
> +             }
> +

> Why only do this for shared hashtables? Couldn't we allocate the elements
> together with the rest for non-shared hashtables too?

I think it is possible to consolidate the allocations for non-shared hash
tables too. However, the initial allocations are much smaller for non-shared
hash tables, since they are easy to expand, so there is probably less benefit
in doing that for them. In addition, the proposed changes are targeted at
improving the monitoring in pg_shmem_allocations, which does not apply to
non-shared hash tables. So while I believe it is feasible, I am uncertain
about the utility of such a change.


> Seems a bit ugly to go through element_alloc() when pre-allocating.  Perhaps
> it's the best thing we can do to avoid duplicating code, but it seems worth
> checking if we can do better. Perhaps we could split element_alloc() into
> element_alloc() and element_add() or such?  With the latter doing everything
> after hashp->alloc().

 
Makes sense. I split element_alloc() into element_alloc() and element_add().
 

> -
> +     
>       /*
>        * initialize mutexes if it's a partitioned table
>        */

> Spurious change.

 
Fixed. 

> A function called find_num_of_segs() that also sets nbuckets seems a bit
> confusing.  I also don't like "find_*", as that sounds like it's searching
> some datastructure, rather than just doing a bit of math.
 
 I renamed it to compute_buckets_and_segs(). I am open to better suggestions. 

>       segp = (HASHSEGMENT) hashp->alloc(sizeof(HASHBUCKET) * hashp->ssize);

>       if (!segp)

> Spurious change.

Fixed. 


> -static int
> +int
>  next_pow2_int(long num)
>  {
>       if (num > INT_MAX / 2)
> @@ -1957,3 +1995,31 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
>               }
>       }
>  }

> Why export this?

It was a stale change; I have removed it now.

 
> diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
> index 932cc4f34d..5e16bd4183 100644
> --- a/src/include/utils/hsearch.h
> +++ b/src/include/utils/hsearch.h
> @@ -151,7 +151,7 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
>  extern void hash_freeze(HTAB *hashp);
>  extern Size hash_estimate_size(long num_entries, Size entrysize);
>  extern long hash_select_dirsize(long num_entries);
> -extern Size hash_get_shared_size(HASHCTL *info, int flags);
> +extern Size hash_get_shared_size(HASHCTL *info, int flags, long init_size);
>  extern void AtEOXact_HashTables(bool isCommit);
>  extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);

> It's imo a bit weird that we have very related logic in hash_estimate_size()
> and hash_get_shared_size(). Why do we need to duplicate it?

 
hash_estimate_size() estimates using default values, whereas
hash_get_shared_size() calculates using the specific values implied by the
flags associated with the hash table. For instance, the former uses
DEF_SEGSIZE as the segment size, while the latter uses info->ssize, which is
set when the HASH_SEGMENT flag is passed. Hence, they might return different
shared memory sizes.
 

> These I'd just combine with the ShmemInitStruct("PredXactList"), by allocating
> the additional space. The pointer math is a bit annoying, but it makes much
> more sense to have one entry in pg_shmem_allocations.

Fixed accordingly.
 

> -             (TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
> +             (TransactionId *) ShmemInitStruct("Proc Transaction Ids", TotalProcs * sizeof(*ProcGlobal->xids), &found);
>       MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
> -     ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
> +     ProcGlobal->subxidStates = (XidCacheStatus *) ShmemInitStruct("Proc Sub-transaction id states", TotalProcs * sizeof(*ProcGlobal->subxidStates), &found);
>       MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
> -     ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
> +     ProcGlobal->statusFlags = (uint8 *) ShmemInitStruct("Proc Status Flags", TotalProcs * sizeof(*ProcGlobal->statusFlags), &found);
>       MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));

>       /*

> Same.
>
> Although here I'd say it's worth padding the size of each separate
> "allocation" by PG_CACHE_LINE_SIZE.

Made this change.
 
> -     fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
> +     fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs * (fpLockBitsSize + fpRelIdSize), &found);
>       MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));

>       /* For asserts checking we did not overflow. */

> This one might actually make sense to keep separate, depending on the
> configuration it can be reasonably big (max_connection = 1k,
> max_locks_per_transaction=1k results in ~5MB).

OK 

PFA the rebased patches with the above changes.

Kindly let me know your views.

Thank you,
Rahila Syed
