Re: shared-memory based stats collector - v66

From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Subject: Re: shared-memory based stats collector - v66
Msg-id: 20220325.172418.1055357550509939415.horikyota.ntt@gmail.com
In response to: Re: shared-memory based stats collector - v66 (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List: pgsql-hackers

At Fri, 25 Mar 2022 14:22:56 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> At Thu, 24 Mar 2022 13:21:33 -0400, Melanie Plageman <melanieplageman@gmail.com> wrote in
> > On Thu, Mar 17, 2022 at 3:36 AM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > The biggest todos are:
> > > - Address all the remaining AFIXMEs and XXXs
> >
> > Attached is a patch that addresses three of the existing AFIXMEs.

I'd like to dump out my humble thoughts about the other AFIXMEs.

> AFIXME: Isn't PGSTAT_MIN_INTERVAL way too long? What is the justification
> for increasing it?

It is 1000ms in the comment just above, but actually 10000ms. The number
came from a discussion that if we have 1000 clients and each backend
writes stats once per 0.5 seconds, in total we flush pending data to the
shared area 2000 times per second, which is too frequent. I raised it to
5000ms, then 10000ms, so the expected maximum flush frequency drops to
100 times per second (1000 clients / 10 s). Of course this assumes the
worst case, and 10000ms is apparently too long for the average case.

The current implementation of pgstat postpones flushing when a lock
collision happens, by at most 60s. That is a kind of auto-averaging
mechanism. It might be enough by itself, in which case we could reduce
PGSTAT_MIN_INTERVAL to 500ms or so.

> AFIXME: architecture explanation.

Mmm. Next, please :p

([PGSTAT_KIND_REPLSLOT] = {)
> * AFIXME: With a bit of extra work this could now be a !fixed_amount
> * stats kind.

Yeah. The most bothersome point is that the slot index is not persistent
at all, and the relationship between the index and the name (or identity)
is not stable even within a process's lifetime. This could be resolved by
allocating an object id to every replication slot. I faintly remember a
discussion like that, but I don't have a clear memory of it.

> static Size
> pgstat_dsa_init_size(void)
> {
> 	/*
> 	 * AFIXME: What should we choose as an initial size? Should we make this
> 	 * configurable? Maybe tune based on NBuffers?

> StatsShmemInit(void)
> 	 * AFIXME: we need to guarantee this can be allocated in plain shared
> 	 * memory, rather than allocating dsm segments.

I'm not sure that NBuffers is the ideal base for deciding the required
size, since it doesn't seem to be generally in proportion to the number
of database objects. If we make it manually tunable, we can emit a log
message when DSM segment allocation happens for this use, as a tuning
aid:

  WARNING:  dsa allocation happened for activity statistics
  HINT:  You might want to increase stat_dsa_initial_size if you see slowdown blah..

> * AFIXME: Should all the stats drop code be moved into pgstat_drop.c?

Or pgstat_xact.c?

> * AFIXME: comment
> * AFIXME: see notes about race conditions for functions in
> * pgstat_drop_function().
> */
> void
> pgstat_schedule_stat_drop(PgStatKind kind, Oid dboid, Oid objoid)

pgstat_drop_function() doesn't seem to have such a note. I suppose the
"race condition" means the case where a stats entry for an object is
created just after the same object is dropped in another backend. It
seems to me that such a race condition is eliminated by the transactional
drop mechanism. Are you intending to write an explanation of that?

> 	/*
> 	 * pgStatSharedRefAge increments quite slowly than the time the following
> 	 * loop takes so this is expected to iterate no more than twice.
> 	 *
> 	 * AFIXME: Why is this a good place to do this?
> 	 */
> 	while (pgstat_shared_refs_need_gc())
> 		pgstat_shared_refs_gc();

Is the reason for the AFIXME that you think the GC check happens too
frequently?

> pgstat_shared_ref_release(PgStatHashKey key, PgStatSharedRef *shared_ref)
> {
...
> 		 * AFIXME: this probably is racy. Another backend could look up the
> 		 * stat, bump the refcount, as we free it.
> 		if (pg_atomic_fetch_sub_u32(&shared_ref->shared_entry->refcount, 1) == 1)
> 		{
...
> 			/* only dropped entries can reach a 0 refcount */
> 			Assert(shared_ref->shared_entry->dropped);

I haven't examined it deeply, but is that race condition avoidable by
preventing pgstat_shared_ref_get from incrementing the refcount of
dropped entries? (The first sketch below illustrates that idea.)

> * AFIXME: This needs to be deduplicated with pgstat_shared_ref_release(). But
> * it's not entirely trivial, because we can't use plain dshash_delete_entry()
> * (but have to use dshash_delete_current()).
> */
> static bool
> pgstat_drop_stats_entry(dshash_seq_status *hstat)
...
> 	 * AFIXME: don't do this while holding the dshash lock.

Do these AFIXMEs mean that we should move the call to
pgstat_shared_ref_release() out of the dshash loop (in
pgstat_drop_database_and_contents) that calls this function? Would it be
sensible to store the (key, ref) pairs for the to-be-released shared_refs,
then clean them up after exiting the loop? (The second sketch below
illustrates this.)
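
To make the first idea concrete, here is a minimal sketch. The struct
layout and the function name are invented for illustration (only the
refcount and dropped fields from the quoted code are assumed); the
pg_atomic_*() calls are the stock atomics API:

#include "port/atomics.h"

typedef struct PgStatSharedEntry
{
	pg_atomic_uint32 refcount;
	bool		dropped;		/* assumed to be set before freeing */
	/* ... stats data ... */
} PgStatSharedEntry;

/*
 * Try to take a reference. Returns false for a dropped entry, in which
 * case the caller has to redo the dshash lookup.
 */
static bool
pgstat_shared_entry_pin(PgStatSharedEntry *entry)
{
	uint32		cur = pg_atomic_read_u32(&entry->refcount);

	while (!entry->dropped)
	{
		/*
		 * CAS instead of fetch-add, so that we only ever increment a
		 * refcount value we inspected while the entry was not dropped.
		 */
		if (pg_atomic_compare_exchange_u32(&entry->refcount, &cur, cur + 1))
			return true;
		/* cur was refreshed by the failed CAS; recheck dropped, retry */
	}
	return false;
}

This alone is not airtight -- dropped could still become true between the
check and the CAS -- so the flag would have to be rechecked under the lock
that sets it, or encoded into the refcount word itself (say, a sentinel
bit). The sketch only shows the direction.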
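
And here is a minimal sketch of the second idea, assuming the drop loop
looks roughly like the quoted code. The dshash_seq_*() calls are the real
dshash scan API; pgStatSharedHash, PgStatShmHashEntry and
pgstat_shared_ref_lookup_cached() are stand-ins for the patch's internals:

typedef struct PendingRelease
{
	PgStatHashKey key;
	PgStatSharedRef *ref;
} PendingRelease;

static void
pgstat_drop_database_and_contents(Oid dboid)
{
	dshash_seq_status hstat;
	PgStatShmHashEntry *ent;
	PendingRelease *pending = NULL;
	int			npending = 0;
	int			maxpending = 0;

	/* pass 1: unlink matching entries, remembering what to release */
	dshash_seq_init(&hstat, pgStatSharedHash, true);
	while ((ent = dshash_seq_next(&hstat)) != NULL)
	{
		if (ent->key.dboid != dboid)
			continue;

		if (npending >= maxpending)
		{
			maxpending = Max(maxpending * 2, 64);
			pending = pending == NULL ?
				palloc(maxpending * sizeof(PendingRelease)) :
				repalloc(pending, maxpending * sizeof(PendingRelease));
		}
		pending[npending].key = ent->key;
		/* hypothetical helper: find our local ref for this entry */
		pending[npending].ref = pgstat_shared_ref_lookup_cached(ent->key);
		npending++;

		/* unlink as pgstat_drop_stats_entry() does, but release later */
		ent->dropped = true;
		dshash_delete_current(&hstat);
	}
	dshash_seq_term(&hstat);

	/*
	 * pass 2: no dshash partition lock is held here, so releasing (and
	 * possibly freeing) the entries cannot self-deadlock on the dshash.
	 */
	for (int i = 0; i < npending; i++)
		pgstat_shared_ref_release(pending[i].key, pending[i].ref);

	if (pending != NULL)
		pfree(pending);
}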

> 		 * Database stats contain other stats. Drop those as well when
> 		 * dropping the database. AFIXME: Perhaps this should be done in a
> 		 * slightly more principled way?
> 		 */
> 		if (key.kind == PGSTAT_KIND_DB)
> 			pgstat_drop_database_and_contents(key.dboid);

I tend to agree with that, and one possibility is to give PgStatKindInfo
a drop_cascade_cb(PgStatShm_StatEntryHeader *header) callback. But it is
really needed only by PGSTAT_KIND_DB.. (A sketch follows below.)
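
A minimal sketch of that callback, with invented names throughout; the
callback here takes the hash key rather than the entry header, since the
existing special case only needs key.dboid:

typedef struct PgStatKindInfo
{
	/* ... existing fields (fixed_amount, snapshot_cb, etc.) ... */

	/*
	 * If set, invoked when an entry of this kind is dropped, so that
	 * entries depending on it can be dropped along with it.
	 */
	void		(*drop_cascade_cb) (PgStatHashKey key);
} PgStatKindInfo;

static void
pgstat_db_drop_cascade(PgStatHashKey key)
{
	/* a database's tables, functions etc. go away with the database */
	pgstat_drop_database_and_contents(key.dboid);
}

static const PgStatKindInfo pgstat_kind_infos[] = {
	[PGSTAT_KIND_DB] = {
		/* ... existing initializers ... */
		.drop_cascade_cb = pgstat_db_drop_cascade,
	},
	/* all other kinds leave drop_cascade_cb NULL */
};

The generic drop path then stops special-casing PGSTAT_KIND_DB:

	if (pgstat_kind_infos[key.kind].drop_cascade_cb != NULL)
		pgstat_kind_infos[key.kind].drop_cascade_cb(key);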

> * AFIXME: consistent naming
> * AFIXME: deduplicate some of this code with pgstat_fetch_snapshot_build().
> *
> * AFIXME: it'd be nicer if we passed .snapshot_cb() the target memory
> * location, instead of putting PgStatSnapshot into pgstat_internal.h
> */
> void
> pgstat_snapshot_global(PgStatKind kind)

Would having a PGSTAT_KIND_NONE in PgStatKind, or an InvalidPgStatKind,
work for the deduplication? But I'm afraid that would hurt in some way.
As for the memory location, it seems like a matter of taste, but if we
don't need multiple copies of the global snapshot, I don't think
.snapshot_cb() needs to take a target memory location.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center