Thread: add function for creating/attaching hash table in DSM registry
Libraries commonly use shared memory to store hash tables. While it's possible to set up a dshash table using the DSM registry today, doing so is complicated; you need to set up two LWLock tranches, a DSA, and finally the dshash table. The attached patch adds a new function called GetNamedDSMHash() that makes creating/attaching a hash table in the DSM registry much easier. -- nathan
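For readers skimming the thread, here is a caller-side sketch of what the proposed one-call setup might look like. This is illustrative only: the GetNamedDSMHash() signature is an assumption inferred from the discussion, and the entry struct and extension name are invented.

```c
/* Hypothetical caller-side sketch; assumes the usual PostgreSQL extension
 * headers (postgres.h, lib/dshash.h, storage/dsm_registry.h).  The
 * GetNamedDSMHash() signature is an assumption based on this thread. */
typedef struct MyEntry
{
	int32		key;
	int64		count;
} MyEntry;

static const dshash_parameters my_params = {
	.key_size = sizeof(int32),
	.entry_size = sizeof(MyEntry),
	.compare_function = dshash_memcmp,
	.hash_function = dshash_memhash,
	.copy_function = dshash_memcpy,
	/* tranche_id is ignored; the registry generates one */
};

static dshash_table *my_hash = NULL;

static void
my_shmem_attach(void)
{
	bool		found;

	/* One call replaces the manual tranche/DSA/dshash boilerplate. */
	my_hash = GetNamedDSMHash("my_extension", &my_params, &found);
}
```

Compare this with the status quo, where the extension must request two tranche IDs, register them, create and pin a DSA, and only then create the dshash table.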
Attachment
Re: add function for creating/attaching hash table in DSM registry
From
Dagfinn Ilmari Mannsåker
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:

> +typedef struct NamedDSMHashState
> +{
> +	dsa_handle	dsah;
> +	dshash_table_handle dshh;
> +	int			dsa_tranche;
> +	char		dsa_tranche_name[68];	/* name + "_dsa" */
> +	int			dsh_tranche;
> +	char		dsh_tranche_name[68];	/* name + "_dsh" */
> +} NamedDSMHashState;

I don't have enough knowledge to review the rest of the patch, but
shouldn't this use NAMEDATALEN, rather than hard-coding the default
length?

- ilmari
Thanks for this patch! I have implemented this pattern several times, so
this is really helpful. I have a few early comments, but I plan on trying
this out next.

1/

> > +typedef struct NamedDSMHashState
> > +{
> > +	dsa_handle	dsah;
> > +	dshash_table_handle dshh;
> > +	int			dsa_tranche;
> > +	char		dsa_tranche_name[68];	/* name + "_dsa" */
> > +	int			dsh_tranche;
> > +	char		dsh_tranche_name[68];	/* name + "_dsh" */
> > +} NamedDSMHashState;
>
> I don't have enough knowledge to review the rest of the patch, but
> shouldn't this use NAMEDATALEN, rather than hard-coding the default
> length?

NamedLWLockTrancheRequest uses NAMEDATALEN = 64 bytes for the tranche_name:

typedef struct NamedLWLockTrancheRequest
{
	char		tranche_name[NAMEDATALEN];
	int			num_lwlocks;
} NamedLWLockTrancheRequest;

but in the case here, "_dsa" or "_dsh" will occupy another 4 bytes. I think
instead of hardcoding, we should #define a length, i.e.

#define NAMEDDSMTRANCHELEN (NAMEDATALEN + 4)

2/ Can you group the dsa and dsh separately? I felt this was a bit
difficult to read:

+	/* Initialize LWLock tranches for the DSA and dshash table. */
+	state->dsa_tranche = LWLockNewTrancheId();
+	state->dsh_tranche = LWLockNewTrancheId();
+	sprintf(state->dsa_tranche_name, "%s_dsa", name);
+	sprintf(state->dsh_tranche_name, "%s_dsh", name);
+	LWLockRegisterTranche(state->dsa_tranche, state->dsa_tranche_name);
+	LWLockRegisterTranche(state->dsh_tranche, state->dsh_tranche_name);

3/ It will be good to "Assert(dsh)" before "return dsh;" for safety?

	MemoryContextSwitchTo(oldcontext);
	LWLockRelease(DSMRegistryLock);

	return dsh;
}

--
Sami Imseih
Amazon Web Services (AWS)
On Thu, Jun 05, 2025 at 01:38:25PM -0500, Sami Imseih wrote:
> I have a few early comments, but I plan on trying this out next.

Thanks for reviewing.

>> > +typedef struct NamedDSMHashState
>> > +{
>> > +	dsa_handle	dsah;
>> > +	dshash_table_handle dshh;
>> > +	int			dsa_tranche;
>> > +	char		dsa_tranche_name[68];	/* name + "_dsa" */
>> > +	int			dsh_tranche;
>> > +	char		dsh_tranche_name[68];	/* name + "_dsh" */
>> > +} NamedDSMHashState;
>>
>> I don't have enough knowledge to review the rest of the patch, but
>> shouldn't this use NAMEDATALEN, rather than hard-coding the default
>> length?

I straightened this out in v2. I've resisted using NAMEDATALEN because this
stuff is unrelated to the name type. But I have moved all the lengths and
suffixes to macros.

> NamedLWLockTrancheRequest uses NAMEDATALEN = 64 bytes for the
> tranche_name
>
> typedef struct NamedLWLockTrancheRequest
> {
>	char		tranche_name[NAMEDATALEN];
>	int			num_lwlocks;
> } NamedLWLockTrancheRequest;

I think the NAMEDATALEN limit only applies to tranches requested at startup
time. LWLockRegisterTranche() just saves whatever pointer you give it, so
AFAICT there's no real limit there.

> 2/ Can you group the dsa and dsh separately. I felt this was a bit
> difficult to read?
>
> +	/* Initialize LWLock tranches for the DSA and dshash table. */
> +	state->dsa_tranche = LWLockNewTrancheId();
> +	state->dsh_tranche = LWLockNewTrancheId();
> +	sprintf(state->dsa_tranche_name, "%s_dsa", name);
> +	sprintf(state->dsh_tranche_name, "%s_dsh", name);
> +	LWLockRegisterTranche(state->dsa_tranche, state->dsa_tranche_name);
> +	LWLockRegisterTranche(state->dsh_tranche, state->dsh_tranche_name);

Done.

> 3/ It will be good to "Assert(dsh)" before "return dsh;" for safety?
>
>	MemoryContextSwitchTo(oldcontext);
>	LWLockRelease(DSMRegistryLock);
>
>	return dsh;

Eh, I would expect the tests to start failing horribly if I managed to mess
that up.

--
nathan
Attachment
Thanks, I tested v2 a bit more and did a quick hack of pg_stat_statements
just to get a feel for what it would take to use the new API to move the
hash table from static to dynamic.

One thing I noticed, and I'm not too fond of, is that the wait_event name
shows up with the _dsh suffix:

```
postgres=# select query, wait_event, wait_event_type from pg_stat_activity;
   query   |       wait_event       | wait_event_type
-----------+------------------------+-----------------
 select 1; | pg_stat_statements_dsh | LWLock
```

So the suffix isn't just an internal name, it's user-facing now. If we
really need to keep this behavior, I think it's important to document it
clearly in the code.

A few nits also:

1/

+
+static dshash_table *tdr_hash;
+

Isn't it better to initialize this to NULL?

2/ The comment:

```
params is ignored; a new tranche ID will be generated if needed.
```

The "if needed" part isn't necessary here. A new tranche ID will always be
generated, right?

3/ GetNamedDSMSegment is called with "found" as the last argument:

```
state = GetNamedDSMSegment(name, sizeof(NamedDSMHashState), NULL, found);
```

I think it should use a different bool here instead of "found", since that
value isn't really needed from GetNamedDSMSegment. Also, it refers to
whether the dynamic hash was found, and is set later in the routine.

--
Sami
On Mon, Jun 09, 2025 at 03:10:30PM -0500, Sami Imseih wrote:
> One thing I noticed, and I'm not too fond of, is that the wait_event name
> shows up with the _dsh suffix:
>
> ```
> postgres=# select query, wait_event, wait_event_type from pg_stat_activity;
>    query   |       wait_event       | wait_event_type
> -----------+------------------------+-----------------
>  select 1; | pg_stat_statements_dsh | LWLock
> ```
>
> So the suffix isn't just an internal name, it's user-facing now. If we
> really need to keep this behavior, I think it's important to document it
> clearly in the code.

Okay. FWIW we do use suffixes like "DSA" and "Hash" for the built-in
tranche names (e.g., DSMRegistryDSA and DSMRegistryHash).

> +
> +static dshash_table *tdr_hash;
> +
>
> Isn't it better to initialize this to NULL?

It might be better notationally, but it'll be initialized to NULL either
way.

> ```
> params is ignored; a new tranche ID will be generated if needed.
> ```
>
> The "if needed" part isn't necessary here. A new tranche ID will always be
> generated, right?

Not if the dshash table already exists.

> 3/ GetNamedDSMSegment is called with "found" as the last argument:
>
> ```
> state = GetNamedDSMSegment(name, sizeof(NamedDSMHashState), NULL, found);
> ```
>
> I think it should use a different bool here instead of "found", since that
> value isn't really needed from GetNamedDSMSegment. Also, it refers to
> whether the dynamic hash was found, and is set later in the routine.

Yeah, I might as well add another boolean variable called "unused" or
something for clarity here.

--
nathan
> On Mon, Jun 09, 2025 at 03:10:30PM -0500, Sami Imseih wrote:
>> One thing I noticed, and I'm not too fond of, is that the wait_event name
>> shows up with the _dsh suffix:
>>
>> ```
>> postgres=# select query, wait_event, wait_event_type from pg_stat_activity;
>>    query   |       wait_event       | wait_event_type
>> -----------+------------------------+-----------------
>>  select 1; | pg_stat_statements_dsh | LWLock
>> ```
>>
>> So the suffix isn't just an internal name, it's user-facing now. If we
>> really need to keep this behavior, I think it's important to document it
>> clearly in the code.
>
> Okay. FWIW we do use suffixes like "DSA" and "Hash" for the built-in
> tranche names (e.g., DSMRegistryDSA and DSMRegistryHash).

I was surprised that I didn't get the "extension" wait event based on the
logic in GetLWTrancheName, specifically this portion:

	/*
	 * It's an extension tranche, so look in LWLockTrancheNames[].  However,
	 * it's possible that the tranche has never been registered in the
	 * current process, in which case give up and return "extension".
	 */

In my test setup, I had two backends with pg_stat_statements enabled and
actively counting queries, so they both registered a dynamic hash. The
backend running pg_stat_activity, however, had pg_stat_statements disabled
and did not register a dynamic hash table. After enabling
pg_stat_statements in the pg_stat_activity backend as well, thus
registering a dynamic hash, I then saw an "extension" wait event appear.

It is not expected behavior IMO, and I still need to debug this a bit more,
but it may be something outside the scope of this patch that the patch just
surfaced.

--
Sami
> It is not expected behavior IMO, and I still need to debug this a bit more,
> but it may be something outside the scope of this patch that the patch just
> surfaced.

It seems I got it backward. If the tranche is registered, then the wait
event name is the one of the tranche; in that case we will see the name of
the tranche suffixed with "_dsh". If the tranche is not registered, then
the wait event name is "extension".

We can end up instrumenting with 2 different wait event names, "extension"
or the tranche name, for the same code path. This looks broken to me, but I
am not sure what could be done yet. It should be taken up for discussion in
a separate thread, as it's not the fault of this patch.

Going back to the original point, DSMRegistryDSA and DSMRegistryHash are
built-in, and those names are well-defined and actually refer to waits
related to the mechanism of registering a DSA or a HASH. I think it will be
odd to append "_dsh", but we should at minimum add a comment in
GetNamedDSMHash explaining this.

--
Sami
On Mon, Jun 09, 2025 at 07:14:28PM -0500, Sami Imseih wrote:
> Going back to the original point, DSMRegistryDSA and DSMRegistryHash
> are built-in, and those names are well-defined and actually refer to
> waits related to the mechanism of registering a DSA or a HASH.
> I think it will be odd to append "_dsh", but we should at minimum add
> a comment in GetNamedDSMHash explaining this.

This should be addressed in v3.

I'm not quite following your uneasiness with the tranche names. For the
dshash table, we'll need a tranche for the DSA and one for the hash table,
so presumably any wait events for those locks should be named accordingly,
right?

--
nathan
Attachment
> On 10 Jun 2025, at 7:21 PM, Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Jun 10, 2025 at 10:38:29AM -0500, Nathan Bossart wrote:
>> On Mon, Jun 09, 2025 at 07:14:28PM -0500, Sami Imseih wrote:
>>> Going back to the original point, DSMRegistryDSA and DSMRegistryHash
>>> are built-in, and those names are well-defined and actually refer to
>>> waits related to the mechanism of registering a DSA or a HASH.
>>> I think it will be odd to append "_dsh", but we should at minimum add
>>> a comment in GetNamedDSMHash explaining this.
>>
>> This should be addressed in v3.
>>
>> I'm not quite following your uneasiness with the tranche names. For the
>> dshash table, we'll need a tranche for the DSA and one for the hash table,
>> so presumably any wait events for those locks should be named accordingly,
>> right?
>
> Unrelated, but it'd probably be a good idea to make sure the segment is
> initialized instead of assuming it'll be zeroed out (and further assuming
> that DSHASH_HANDLE_INVALID is 0)...
>
> --
> nathan
>
> <v4-0001-simplify-creating-hash-table-in-dsm-registry.patch>
Love this new API.
Two minor things:

A minor typo here:
+ * current backend. This function gurantees that only one backend
Since you made the first step towards decoupling DSMR_NAME_LEN from
NAMEDATALEN, is it worth considering increasing this to 128, maybe?
I’ve used DSMR extensively for namespacing keys etc, and I’ve come close to 50-60 chars at times.
I’m not too fixed on that though.
> I'm not quite following your uneasiness with the tranche names. For the
> dshash table, we'll need a tranche for the DSA and one for the hash table,
> so presumably any wait events for those locks should be named accordingly,
> right?

I may be alone in this opinion, but I prefer the suffixless tranche name
for the primary LWLock (the hash table), as this is the lock users will
encounter most frequently in wait events, like when adding or looking up
entries. Adding a suffix (e.g., "Hash") may be confusing to extension
users. In the tests that I did with pg_stat_statements, I'd rather see
"pg_stat_statements" remain as-is, rather than "pg_stat_statements Hash".

On the other hand, adding a suffix to the DSA tranche name (e.g.,
"pg_stat_statements DSA") is necessary to differentiate it from the Hash
tranche name, and that is OK because it's not likely to be a user-visible
wait event.

--
Sami
On Tue, Jun 10, 2025 at 02:05:16PM -0500, Sami Imseih wrote:
> There is also that dynamic tranche names are stored in a local backend
> look-up table, so if you have some backends that attached some dynamic
> hash table and others that did not, only the ones that registered would
> be able to resolve the tranche id to its name.
>
> This is the case which I encountered yesterday, in which I tested 2
> backends competing for a LWLock on the dshash table, but a third backend
> that did not attach the hashtable reported the wait_event as "extension"
> rather than the extension-specified tranche name.
>
> If that third backend attaches the hash table, then it's able to report
> the wait_event as the tranche name specified by the extension. So that
> could be confusing as 2 wait events could be reported for the same
> code path. right?

My initial reaction to this was "well yeah, that's how it's designed." But
after some more research, I see that LWLockRegisterTranche() (commit
ea9df81) predates both the removal of dynamic_shared_memory_type=none
(commit bcbd940) and the introduction of DSAs (commit 13df76a). lwlock.h
even still has this (arguably outdated) comment:

 * It may seem strange that each process using the tranche must register it
 * separately, but dynamic shared memory segments aren't guaranteed to be
 * mapped at the same address in all coordinating backends, so storing the
 * registration in the main shared memory segment wouldn't work for that case.

So, if we were adding named LWLocks today, I suspect we might do it
differently. The first thing that comes to mind is that we could store a
shared LWLockTrancheNames table and stop requiring each backend to register
them individually. For a concrete example, the autoprewarm shared memory
initialization code would become something like:

	static void
	apw_init_state(void *ptr)
	{
		AutoPrewarmSharedState *state = (AutoPrewarmSharedState *) ptr;

		LWLockInitialize(&state->lock, LWLockNewTrancheId("autoprewarm"));
		...
	}

	static bool
	apw_init_shmem(void)
	{
		bool		found;

		apw_state = GetNamedDSMSegment("autoprewarm",
									   sizeof(AutoPrewarmSharedState),
									   apw_init_state,
									   &found);
		return found;
	}

In short, LWLockNewTrancheId() would gain a new name argument, and
LWLockRegisterTranche() would disappear. We would probably need to be smart
to avoid contention on the name table, but that feels avoidable to me.

--
nathan
Hi,
Thank you for proposing this enhancement to the DSM registry. It will make it easier
to use dshash functionality.
While trying to port some existing DSMR code, I came across this limitation:
How can one dsa_allocate in the same area as the returned dshash_table?
In other words: shouldn't the state->dsa_handle be returned somehow?
+1. FWIW, having used the DSA APIs in my code, I think having the registry
return the mapped dsa address or dsa handle will benefit users who use
dsa_allocate to allocate smaller chunks within the dsa.
On Wed, Jun 11, 2025 at 07:15:56PM +0530, Rahila Syed wrote:
>> How can one dsa_allocate in the same area as the returned dshash_table?
>> In other words: shouldn't the state->dsa_handle be returned somehow?
>
> +1. FWIW, having used the DSA APIs in my code, I think having the registry
> return the mapped dsa address or dsa handle will benefit users who use
> dsa_allocate to allocate smaller chunks within the dsa.

I considered adding another function that would create/attach a DSA in the
DSM registry, since that's already an intermediate step of dshash creation.
We could then use that function to generate the DSA in GetNamedDSMHash().
Would that work for your use-cases, or do you really need to use the same
DSA as the dshash table for some reason?

--
nathan
> On 11 Jun 2025, at 4:57 PM, Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Wed, Jun 11, 2025 at 07:15:56PM +0530, Rahila Syed wrote:
>>> How can one dsa_allocate in the same area as the returned dshash_table?
>>> In other words: shouldn't the state->dsa_handle be returned somehow?
>>
>> +1. FWIW, having used the DSA APIs in my code, I think having the registry
>> return the mapped dsa address or dsa handle will benefit users who use
>> dsa_allocate to allocate smaller chunks within the dsa.
>
> I considered adding another function that would create/attach a DSA in the
> DSM registry, since that's already an intermediate step of dshash creation.
> We could then use that function to generate the DSA in GetNamedDSMHash().
> Would that work for your use-cases, or do you really need to use the same
> DSA as the dshash table for some reason?

In my case the hashtable itself stores dsa_pointers (obviously stuff
allocated in the dsa as the hash table itself), so I think I can't avoid
the necessity of having it. Unless you see a good reason not to expose it?
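The pattern Florents describes can be sketched concretely. This is illustrative code only (names invented), assuming the caller somehow has both the dshash table and its backing dsa_area in hand, which is exactly what is being debated here: entries hold dsa_pointers into the same area, so variable-length payloads live alongside the table.

```c
/* Illustrative sketch; assumes PostgreSQL extension headers
 * (postgres.h, lib/dshash.h, utils/dsa.h).  BlobEntry and store_blob
 * are invented names, not from the patch. */
typedef struct BlobEntry
{
	uint32		key;
	dsa_pointer payload;		/* points into the table's own DSA */
	Size		payload_size;
} BlobEntry;

static void
store_blob(dshash_table *hash, dsa_area *area, uint32 key,
		   const void *data, Size size)
{
	bool		found;
	BlobEntry  *entry;

	/* Returns the entry with its partition lock held exclusively. */
	entry = dshash_find_or_insert(hash, &key, &found);
	if (found)
		dsa_free(area, entry->payload);		/* replace existing payload */

	entry->payload = dsa_allocate(area, size);
	entry->payload_size = size;
	memcpy(dsa_get_address(area, entry->payload), data, size);

	dshash_release_lock(hash, entry);
}
```

Without access to the dsa_area (or its handle), the caller has nowhere to allocate the payloads, which is why the question of exposing `state->dsah` comes up at all.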
> On Wed, Jun 11, 2025 at 07:15:56PM +0530, Rahila Syed wrote:
>>> How can one dsa_allocate in the same area as the returned dshash_table?
>>> in other words: shouldn't the state->dsa_handle be returned somehow?
>>
>> +1. FWIW, having used the DSA APIs in my code, I think having the registry
>> return the mapped dsa address or dsa handle will benefit users who use
>> dsa_allocate to allocate smaller chunks within the dsa.
>
> I considered adding another function that would create/attach a DSA in the
> DSM registry, since that's already an intermediate step of dshash creation.
> We could then use that function to generate the DSA in GetNamedDSMHash().
> Would that work for your use-cases, or do you really need to use the same
> DSA as the dshash table for some reason?
This will work for me. Thank you for considering it.
On Wed, Jun 11, 2025 at 05:11:54PM +0300, Florents Tselai wrote:
>> On 11 Jun 2025, at 4:57 PM, Nathan Bossart <nathandbossart@gmail.com> wrote:
>> I considered adding another function that would create/attach a DSA in the
>> DSM registry, since that's already an intermediate step of dshash creation.
>> We could then use that function to generate the DSA in GetNamedDSMHash().
>> Would that work for your use-cases, or do you really need to use the same
>> DSA as the dshash table for some reason?
>
> In my case the hashtable itself stores dsa_pointers (obviously stuff
> allocated in the dsa as the hash table itself) so I think I can't avoid
> the necessity of having it.

Is there any reason these DSA pointers can't be for a separate DSA than
what the dshash table is using?

> Unless, you see a good reason not to expose it ?

I'm not sure there's any real technical reason, but I guess it feels
cleaner to me to keep the DSM-registry-managed dshash DSAs internal to the
implementation. Presumably messing with that DSA introduces some risk of
breakage, and it could make it more difficult to change implementation
details for the named dshash code in the future.

--
nathan
Here is a new patch with GetNamedDSA() added. A couple notes:

* I originally wanted to use GetNamedDSA() within GetNamedDSMHash(), but
  that would probably lead to two registry entries per dshash table, and it
  didn't really save all that much code, anyway. So, I didn't do that.

* Using a DSA from the registry is cumbersome. You essentially need
  another batch of shared memory to keep track of the pointers and do
  locking, so it might not be tremendously useful on its own. AFAICT the
  easiest thing to do is to store the DSA pointers in a dshash table, which
  is what I've done in the test.

--
nathan
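The "another batch of shared memory" caveat can be made concrete with a hedged sketch. It assumes GetNamedDSA() takes a name and a found flag (the exact signature isn't shown in this excerpt), and all other names are invented for illustration: a small fixed-size named segment holds the root dsa_pointer and a lock so every backend can find data allocated in the named DSA.

```c
/* Illustrative sketch; assumes PostgreSQL extension headers.  MyDsaRoot,
 * my_root_init, and my_dsa_attach are invented; the GetNamedDSA()
 * signature is an assumption based on this thread. */
typedef struct MyDsaRoot
{
	LWLock		lock;
	dsa_pointer root;			/* InvalidDsaPointer until first allocation */
} MyDsaRoot;

static dsa_area *my_area = NULL;
static MyDsaRoot *my_root = NULL;

static void
my_root_init(void *ptr)
{
	MyDsaRoot  *state = (MyDsaRoot *) ptr;

	/* Tranche name registration omitted for brevity. */
	LWLockInitialize(&state->lock, LWLockNewTrancheId());
	state->root = InvalidDsaPointer;
}

static void
my_dsa_attach(void)
{
	bool		found;

	my_area = GetNamedDSA("my_extension", &found);
	my_root = GetNamedDSMSegment("my_extension_root", sizeof(MyDsaRoot),
								 my_root_init, &found);
}
```

This is the bookkeeping Nathan alludes to: the DSA alone gives you an allocator, but you still need a well-known place to stash the root pointer, which is why storing DSA pointers in a dshash table ends up being the easier route.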
Attachment
I tested v6 and I think GetNamedDSA is a good addition. I did not find any
issues with the code.

However, I am still convinced that GetNamedDSMHash should not append
" Hash" to the tranche name of the dshash [0]. I am ok with " DSA" because
the DSA tranche is created implicitly by the API. Also, with GetNamedDSA
not appending any suffixes, it will be strange for extensions that use both
GetNamedDSA and GetNamedDSMHash to find that one API appends a suffix and
the other does not. But maybe that's only my view.

[0] https://www.postgresql.org/message-id/CAA5RZ0sRkH8PfbwFPpYiqQWmSYmbH%2BBd0Vro%2BYZFvwxzed_6eQ%40mail.gmail.com

--
Sami
On Wed, Jun 11, 2025 at 05:15:36PM -0500, Sami Imseih wrote:
> I tested v6 and I think GetNamedDSA is a good addition. I did
> not find any issues with the code. However, I am still convinced
> that GetNamedDSMHash should not append " Hash" to the tranche
> name of the dshash [0]. I am ok with " DSA" because the DSA tranche
> is created implicitly by the API.

Okay, I've done this in the attached patch.

--
nathan
Attachment
Hi,
> * Using a DSA from the registry is cumbersome. You essentially need
>   another batch of shared memory to keep track of the pointers and do
>   locking, so it might not be tremendously useful on its own. AFAICT the
>   easiest thing to do is to store the DSA pointers in a dshash table, which
>   is what I've done in the test.
I am considering whether it would be better to avoid creating another DSM segment to
track the DSA handle.
Would it make more sense to track the DSAs in a separate dshash registry similar
to DSM segments?
+ /* Attach to existing DSA. */
+ dsa = dsa_attach(state->dsah);
+ dsa_pin_mapping(dsa);
+
+ *found = true;
+ }
Should this also consider the case where dsa is already mapped,
to avoid the error on attaching to the DSA twice? IIUC,
that would require calling dsa equivalent of dsm_find_mapping().
Thank you,
Rahila Syed
On Fri, Jun 13, 2025 at 08:31:22PM +0530, Rahila Syed wrote:
> I am considering whether it would be better to avoid creating another DSM
> segment to track the DSA handle. Would it make more sense to track the
> DSAs in a separate dshash registry similar to DSM segments?

I don't know if it's better to manage 3 dshash tables than to manage 1 with
special entries for DSAs/dshash tables. There might be some small
trade-offs with each approach, but I haven't thought of anything too
worrisome...

> +	/* Attach to existing DSA. */
> +	dsa = dsa_attach(state->dsah);
> +	dsa_pin_mapping(dsa);
> +
> +	*found = true;
> + }
>
> Should this also consider the case where dsa is already mapped, to avoid
> the error on attaching to the DSA twice? IIUC, that would require calling
> the dsa equivalent of dsm_find_mapping().

I wanted to find a way to do this, but AFAICT you can't. DSAs and dshash
tables are returned in backend-local memory, so if you lose that pointer, I
don't think there's a totally safe way to recover it. For now, I've
documented that GetNamedDSA()/GetNamedDSMHash() should only be called for a
given DSA/dshash once in each backend.

One other thing we could do is add a dsa_is_attached() function and then
ERROR if you try to reattach an already-attached DSA/dshash. I've done this
in the attached patch.

--
nathan
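Given the once-per-backend rule Nathan describes, callers can enforce it themselves with a backend-local cache. A minimal sketch (the function name and signature are assumptions based on this thread; `get_my_area` is invented):

```c
/* Caller-side discipline for the once-per-backend rule: cache the
 * result in backend-local memory and only call into the registry on
 * first use.  Sketch only; assumes PostgreSQL extension headers. */
static dsa_area *cached_area = NULL;

static dsa_area *
get_my_area(void)
{
	if (cached_area == NULL)
	{
		bool		found;

		cached_area = GetNamedDSA("my_extension", &found);
	}
	return cached_area;
}
```

With this pattern the proposed ERROR-on-reattach check should never fire in a correctly written extension, since the registry is consulted at most once per backend.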
Attachment
Hi Nathan,
> I like this idea, but I took it one step further in the attached patch and
> made the registry entry struct flexible enough to store any type of entry.
> Specifically, I've added a new "type" enum followed by a union of the
> different structs used to store the entry data.  I was originally trying to
> avoid this kind of invasive change, but it's not nearly as complicated as I
> feared, and there are benefits such as fewer shared memory things to juggle
> and better sanity checking.  It should also be easy to extend in the
> future.  WDYT?
Thank you for implementing these changes.
The improvements look good and enhance the feature's utility. I have already started incorporating
GetNamedDSA into my code to display memory context statistics.
A potential future enhancement could be allowing GetNamedDSMHash to accept an existing DSA name.
This would enable the dshash table to reuse a DSA area instead of creating a new one each time.
I plan to use this registry to store DSA pointers that all belong to the same DSA area, and this enhancement
would be particularly beneficial. If you find this idea useful, I would be interested in working on it.
Thank you,
Rahila Syed