
From: Andres Freund
Subject: Re: shared-memory based stats collector - v66
Msg-id: 20220329072417.er26q5pxc4pbldn7@alap3.anarazel.de
In response to: Re: shared-memory based stats collector - v66  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List: pgsql-hackers
Hi,

On 2022-03-25 17:24:18 +0900, Kyotaro Horiguchi wrote:
> I'd like to dump out my humble thoughts about other AFIXMEs..

Thanks!  Please have another look at the code in
https://github.com/anarazel/postgres/tree/shmstat I just pushed a revised
version with a lot of [a]fixmes removed.


Most importantly I did move replication slot stats into the hash table, and
just generally revised the replication slot stats code substantially. I
think it does look better now.

But also there's a new commit allowing DSM use in single-user mode. To be able
to rely on stats drops we need to perform them even in single-user mode. The
only reason this didn't previously fail was that we allocated enough "static"
shared memory for single-user mode to never need DSMs.

Thanks to Melanie's tests, and a few more additions of my own, the code is now
reasonably well covered. The big exception to that is recovery conflict stats,
and as Melanie noticed, those were broken (somehow pgstat_database_flush_cb()
didn't sum them up). I think she has some WIP tests...

Re the added tests: I did fix a few timing issues there. There's probably a
few more hiding somewhere.


I also found that, unfortunately, dshash_seq_next() as-is isn't correct. I
included a workaround commit, but it's not the real fix. What we need to do is
to just always lock partition 0 in the initialization branch. Before we call
ensure_valid_bucket_pointers(), status->hash_table->size_log2 isn't valid. And
ensure_valid_bucket_pointers() can only be called while holding a lock...
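
Roughly, the initialization branch would then look like this (just a sketch of
the idea against dshash.c as I read it, not the exact hunk; field and macro
names as in dshash.c):

    if (status->curpartition == -1)
    {
        Assert(status->curbucket == 0);

        /* always take the partition 0 lock before looking at table size */
        status->curpartition = 0;
        LWLockAcquire(PARTITION_LOCK(status->hash_table, status->curpartition),
                      status->exclusive ? LW_EXCLUSIVE : LW_SHARED);

        /* only now is hash_table->size_log2 guaranteed to be up to date */
        ensure_valid_bucket_pointers(status->hash_table);
    }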



Horiguchi-san, if you have time to look at the "XXX: The following could now be
generalized" comments in pgstat_read_statsfile(), pgstat_write_statsfile()... I
think that'd be nice to clean up.



> > AFIXME: Isn't PGSTAT_MIN_INTERVAL way too long? What is the justification
> > for increasing it?
> 
> It is 1000ms in the comment just above but actually 10000ms. The
> number came from a discussion that if we have 1000 clients and each
> backend writes stats once per 0.5 seconds, in total we flush pending
> data to the shared area 2000 times per second, which is too frequent.

Have you measured this (recently)? I tried to cause contention with a workload
targeted towards that, but couldn't see a problem with 1000ms. Of course
there's a problem with 1ms...

I think it's confusing to not report stats for 10s without a need.


> The current implementation of pgstat postpones flushing if a lock collision
> happens, postponing by at most 60s.  This is a kind of
> auto-averaging mechanism.  It might be enough and we could reduce the
> PGSTAT_MIN_INTERVAL to 500ms or so.

Yea, I think the 60s part under contention is fine. I'd expect that to be
rarely reached.


> 
> > AFIXME: architecture explanation.
> 
> Mmm. next, please:p

Working on it. There's one more AFIXME that I want to resolve first, so I
don't end up with old type names strewn around (the one in pgstat_internal.h).


> 
> (    [PGSTAT_KIND_REPLSLOT] = {)
> > * AFIXME: With a bit of extra work this could now be a !fixed_amount
> > * stats kind.
> 
> Yeah.  The most bothersome point is that the slot index is not persistent
> at all and the relationship between the index and name (or identity)
> is not stable even within a process's lifetime.  It could be resolved by
> allocating an object id to every replication slot.  I faintly remember
> a discussion like that, but I don't have a clear memory of it.

I think it's resolved now. pgstat_report_replslot* all get the ReplicationSlot
as a parameter. They use the new ReplicationSlotIndex() to get an index from
that. pgstat_report_replslot_(create|acquire) ensure that the relevant index
doesn't somehow contain old stats.

To deal with indexes changing / slots getting removed during restart, there's
now a new callback made during pgstat_read_statsfile() to build the key from
the serialized NameStr. It returns false if a slot of that name is no longer
known, or uses ReplicationSlotIndex() to get the index at which to store the
in-memory stats.
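
To illustrate the shape of that callback (a sketch only; the callback name and
the exact key layout are my guesses, not necessarily what the branch uses):

    /* build a hash key from a serialized slot name; false if slot is gone */
    static bool
    pgstat_replslot_from_serialized_name_cb(const NameData *name,
                                            PgStatHashKey *key)
    {
        ReplicationSlot *slot = SearchNamedReplicationSlot(NameStr(*name), true);

        if (slot == NULL)
            return false;

        key->kind = PGSTAT_KIND_REPLSLOT;
        key->dboid = InvalidOid;
        key->objoid = ReplicationSlotIndex(slot);

        return true;
    }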


> > static Size
> > pgstat_dsa_init_size(void)
> > {
> >     /*
> >      * AFIXME: What should we choose as an initial size? Should we make this
> >      * configurable? Maybe tune based on NBuffers?
> 
> > StatsShmemInit(void)
> >          * AFIXME: we need to guarantee this can be allocated in plain shared
> >          * memory, rather than allocating dsm segments.
> 
> I'm not sure that NBuffers is the ideal base for deciding the required
> size, since it doesn't seem to be generally proportional to the
> number of database objects.  If we made it manually tunable, we would
> be able to emit a log message when DSM segment allocation happens for
> this use, as a tuning aid..
> 
>    WARNING: dsa allocation happened for activity statistics
>    HINT: You might want to increase stat_dsa_initial_size if you see slow
>          down blah..

FWIW, I couldn't find any performance impact from using DSM. Because of the
"PgStatSharedRef" layer, there's not actually that much interaction with the
dsm code...

I reduced the initial allocation to 256kB. Unfortunately that's currently the
minimum that allows dshash_create() to succeed (due to dsa.c pre-allocating 16
of each allocation). I was a bit worried about that for a while, but memory
usage is still lower with the patch than before in the scenarios I tested. We
can probably improve upon that fairly easily in the future (move
dshash_table_control into static shared memory, call dsa_trim() when resizing
the dshash table).
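
For context, the initial carve-out is basically just an in-place DSA created
over a small chunk of static shared memory, roughly like this (identifier
names here are assumptions, not necessarily what the branch uses):

    /*
     * Inside StatsShmemInit(), roughly: create the initial DSA in place over
     * ~256kB of static shared memory, so no DSM segment is needed until the
     * stats data outgrows it.
     */
    dsa_area *area;

    area = dsa_create_in_place(ctl->raw_dsa_area,
                               256 * 1024,
                               LWTRANCHE_PGSTATS_DSA,
                               NULL);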


> > * AFIXME: Should all the stats drop code be moved into pgstat_drop.c?
> 
> Or pgstat_xact.c?

Maybe. Somehow it doesn't seem *great* either.


> >  * AFIXME: comment
> >  * AFIXME: see notes about race conditions for functions in
> >  *         pgstat_drop_function().
> >  */
> > void
> > pgstat_schedule_stat_drop(PgStatKind kind, Oid dboid, Oid objoid)
> 
> 
> pgstat_drop_function() doesn't seem to have such a note.

Yea, I fixed it in pgstat_init_function_usage(), forgetting about the note in
pgstat_schedule_stat_drop(). There's a decently long comment in
pgstat_init_function_usage() explaining the problem.


> I suppose the "race condition" means the case where a stats entry for an
> object is created just after the same object is dropped by another
> backend.  It seems to me such a race condition is eliminated by the
> transactional drop mechanism.  Are you intending to write an
> explanation of that?

Yes, I definitely plan to write a bit more about that.


> 
> >    /*
> >     * pgStatSharedRefAge increments quite slowly compared to the time the
> >     * following loop takes, so this is expected to iterate no more than twice.
> >     *
> >     * AFIXME: Why is this a good place to do this?
> >     */
> >    while (pgstat_shared_refs_need_gc())
> >        pgstat_shared_refs_gc();
> 
> Is the reason for the AFIXME that you think the GC check happens too
> frequently?

Well, the while () loop makes me "suspicious" when looking at the code. I've
now made it an if (); I can't see a reason why we'd need a while().

I just moved a bunch of that code around; there's probably a bit more polish
needed.
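
I.e. the check is now simply (sketch, using the function names quoted above):

    /* garbage-collect stale shared refs at most once per call */
    if (pgstat_shared_refs_need_gc())
        pgstat_shared_refs_gc();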


> > pgstat_shared_ref_release(PgStatHashKey key, PgStatSharedRef *shared_ref)
> > {
> ...
> >         * AFIXME: this probably is racy. Another backend could look up the
> >         * stat, bump the refcount, as we free it.
> >        if (pg_atomic_fetch_sub_u32(&shared_ref->shared_entry->refcount, 1) == 1)
> >        {
> ...
> >            /* only dropped entries can reach a 0 refcount */
> >            Assert(shared_ref->shared_entry->dropped);
> 
> I didn't examine this deeply, but is that race condition avoidable by
> preventing pgstat_shared_ref_get from incrementing the refcount of
> dropped entries?

I don't think the race exists anymore. I've now revised the relevant code.


> >  * AFIXME: This needs to be deduplicated with pgstat_shared_ref_release(). But
> >  * it's not entirely trivial, because we can't use plain dshash_delete_entry()
> >  * (but have to use dshash_delete_current()).
> >  */
> > static bool
> > pgstat_drop_stats_entry(dshash_seq_status *hstat)
> ...
> >      * AFIXME: don't do this while holding the dshash lock.
> 
> Do the AFIXMEs mean that we should move the call to
> pgstat_shared_ref_release() out of the dshash loop (in
> pgstat_drop_database_and_contents) that calls this function?  Would it be
> sensible to store the (key, ref) pairs for the to-be-released
> shared_refs, then clean them up after exiting the loop?

I think this is now resolved. The release now happens separately, without
nested locks. See pgstat_shared_refs_release_db() call in
pgstat_drop_database_and_contents().
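
The structure is roughly the following (a simplified sketch; the hash table
and entry type names here are placeholders, and the real function of course
does the actual dropping inside the loop):

    static void
    pgstat_drop_database_and_contents(Oid dboid)
    {
        dshash_seq_status hstat;
        PgStatShmHashEntry *entry;      /* placeholder entry type name */

        /* scan the shared hash table, dropping entries for this database */
        dshash_seq_init(&hstat, pgStatSharedHash, true);
        while ((entry = dshash_seq_next(&hstat)) != NULL)
        {
            /* ... pgstat_drop_stats_entry() for matching entries ... */
        }
        dshash_seq_term(&hstat);

        /* release local refs only now, without any dshash lock held */
        pgstat_shared_refs_release_db(dboid);
    }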


> 
> >         * Database stats contain other stats. Drop those as well when
> >         * dropping the database. AFIXME: Perhaps this should be done in a
> >         * slightly more principled way?
> >         */
> >        if (key.kind == PGSTAT_KIND_DB)
> >            pgstat_drop_database_and_contents(key.dboid);
> 
> I tend to agree with that, and it would be possible to have
> PgStatKindInfo.drop_cascade_cb(PgStatShm_StatEntryHeader *header). But
> it is really needed only by PGSTAT_KIND_DB..

Yea, I came to the same conclusion, namely that we don't need something better
for now.


> >  * AFIXME: consistent naming
> >  * AFIXME: deduplicate some of this code with pgstat_fetch_snapshot_build().
> >  *
> >  * AFIXME: it'd be nicer if we passed .snapshot_cb() the target memory
> >  * location, instead of putting PgStatSnapshot into pgstat_internal.h
> >  */
> > void
> > pgstat_snapshot_global(PgStatKind kind)
> 
> 
> Would having PGSTAT_KIND_NONE in PgStatKind or an InvalidPgStatKind work
> for deduplication? But I'm afraid that could hurt in some way.

I think I made it a bit nicer now, without needing either of those. I'd like
to remove "global" from those functions; it's not actually that obvious what
it means.


> For the memory location, it seems like a matter of taste, but if we
> don't need multiple copies of a global snapshot, I think
> .snapshot_cb() doesn't need to take the target memory location.

I think it's ok for now. It'd be a bit nicer if we didn't need PgStatSnapshot
/ stats_snapshot in pgstat_internal.h, but it's ok that way I think.

Greetings,

Andres Freund


