Re: shared-memory based stats collector - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: shared-memory based stats collector
Date:
Msg-id: 20200319200316.ugwnulgc57ichoo5@alap3.anarazel.de
In response to: Re: shared-memory based stats collector (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers
Hi,

On 2020-03-19 16:51:59 +1300, Thomas Munro wrote:
> On Fri, Mar 13, 2020 at 4:13 PM Andres Freund <andres@anarazel.de> wrote:
> > Thomas, could you look at the first two patches here, and my review
> > questions?
> 
> Ack.

Thanks!


> > >               dsa_pointer item_pointer = hash_table->buckets[i];
> > > @@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
> > >                                                               LW_EXCLUSIVE));
> > >
> > >       delete_item(hash_table, item);
> > > -     hash_table->find_locked = false;
> > > -     hash_table->find_exclusively_locked = false;
> > > -     LWLockRelease(PARTITION_LOCK(hash_table, partition));
> > > +
> > > +     /* We need to keep partition lock while sequential scan */
> > > +     if (!hash_table->seqscan_running)
> > > +     {
> > > +             hash_table->find_locked = false;
> > > +             hash_table->find_exclusively_locked = false;
> > > +             LWLockRelease(PARTITION_LOCK(hash_table, partition));
> > > +     }
> > >  }
> >
> > This seems like a failure prone API.
> 
> If I understand correctly, the only purpose of the seqscan_running
> variable is to control that behaviour ^^^.  That is, to make
> dshash_delete_entry() keep the partition lock if you delete an entry
> while doing a seq scan.  Why not get rid of that, and provide a
> separate interface for deleting while scanning?
> dshash_seq_delete(dshash_seq_status *scan, void *entry).  I suppose it
> would be most common to want to delete the "current" item in the seq
> scan, but it could allow you to delete anything in the same partition,
> or any entry if using the "consistent" mode.  Oh, I see that Andres
> said the same thing later.
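
Right - to spell it out, something very roughly like the below is what
that interface would look like. This is just a sketch for illustration:
the seq-scan function signatures are guessed rather than copied from the
patch, and dshash_seq_delete(), MyEntry and should_drop() are
placeholders.

extern void dshash_seq_delete(dshash_seq_status *status, void *entry);

static void
drop_matching_entries(dshash_table *hash_table)
{
    dshash_seq_status status;
    MyEntry    *entry;

    /* exclusive scan, so entries in the locked partition may be deleted */
    dshash_seq_init(&status, hash_table, /* exclusive */ true);
    while ((entry = dshash_seq_next(&status)) != NULL)
    {
        if (should_drop(entry))
            dshash_seq_delete(&status, entry);
    }
    dshash_seq_term(&status);
}

That way the lock lifetime is owned by the scan, instead of leaking out
of dshash_delete_entry() via a flag.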


> > [Andres complaining about comments and language stuff]
> 
> I would be happy to proof read and maybe extend the comments (writing
> new comments will also help me understand and review the code!), and
> maybe some code changes to move this forward.  Horiguchi-san, are you
> working on another version now?  If so I'll wait for it before I do
> that.

Cool! Being ESL myself and mildly dyslexic to boot, that'd be
helpful. But I'd hold off for a moment, because I think there'll need to
be some open-heart surgery on this patch (see the bottom of my last
email in this thread, from a few minutes ago; I don't have a message id
for it yet, sorry).


> > The fact that you're locking the per-database entry unconditionally once
> > for each table almost guarantees contention - and you're not using the
> > 'conditional lock' approach for that. I don't understand.
> 
> Right, I also noticed that:
> 
>     /*
>      * Local table stats should be applied to both dbentry and tabentry at
>      * once. Update dbentry only if we could update tabentry.
>      */
>     if (pgstat_update_tabentry(tabhash, entry, nowait))
>     {
>         pgstat_update_dbentry(dbent, entry);
>         updated = true;
>     }
> 
> So pgstat_update_tabentry() goes to great trouble to take locks
> conditionally, but then pgstat_update_dbentry() immediately does:
> 
>     LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
>     dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
>     dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
>     dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
>     dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
>     dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
>     dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
>     dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
>     LWLockRelease(&dbentry->lock);
> 
> Why can't we be "lazy" with the dbentry stats too?  Is it really
> important for the table stats and DB stats to agree with each other?

We *need* to be lazy here, I think.
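
I.e. something like the sketch below. The dbentry fields are the ones
from the code quoted above, but the function name and the nowait
plumbing are made up here purely to illustrate the idea, they're not
from the patch:

static bool
pgstat_update_dbentry_lazy(PgStat_StatDBEntry *dbentry,
                           PgStat_TableStatus *stat, bool nowait)
{
    if (nowait)
    {
        /* don't block; caller keeps the counts pending locally instead */
        if (!LWLockConditionalAcquire(&dbentry->lock, LW_EXCLUSIVE))
            return false;
    }
    else
        LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);

    dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
    dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
    /* ... the remaining counters, as in the existing code ... */

    LWLockRelease(&dbentry->lock);
    return true;
}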


> Hmm.  Even if you change the above code use a conditional lock, I am
> wondering (admittedly entirely without data) if this approach is still
> too clunky: even trying and failing to acquire the lock creates
> contention, just a bit less.  I wonder if it would make sense to make
> readers do more work, so that writers can avoid contention.  For
> example, maybe PgStat_StatDBEntry could hold an array of N sets of
> counters, and readers have to add them all up.  An advanced version of
> this idea would use a reasonably fresh copy of something like
> sched_getcpu() and numa_node_of_cpu() to select a partition to
> minimise contention and cross-node traffic, with a portable fallback
> based on PID or something.  CPU core/node awareness is something I
> haven't looked into too seriously, but it's been on my mind to solve
> some other problems.

I don't think we really need that for the per-object stats. The easier
way to address that is to instead reduce the rate of flushing to the
shared table. There's not really a problem with the shared state of the
stats lagging by a few hundred ms or so.
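
E.g. just enforcing a minimum interval between flushes of the pending
local stats would already take most of the pressure off. Rough sketch,
with the interval and the flush-function name made up for illustration:

#define PGSTAT_MIN_FLUSH_INTERVAL_MS 500    /* made-up number */

static TimestampTz last_flush = 0;

static void
pgstat_report_stat_throttled(bool force)
{
    TimestampTz now = GetCurrentTimestamp();

    /* keep accumulating locally until enough time has passed */
    if (!force &&
        !TimestampDifferenceExceeds(last_flush, now,
                                    PGSTAT_MIN_FLUSH_INTERVAL_MS))
        return;

    /* hypothetical: push all pending local stats to the shared table */
    pgstat_flush_pending_stats(!force /* nowait */);
    last_flush = now;
}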

The amount of code complexity a scheme like the one you describe would
add doesn't seem worth it to me without very clear evidence that it's
needed. If we didn't need to handle the case where the "static" slots
are insufficient to hold all the stats, it'd be different. But given the
number of tables etc. that can exist in a system, I don't think that's
achievable.


I think we should go for per-backend counters for other parts of the
system though. I think that should basically be the default for
cluster-wide stats like IO (even if we additionally flush them to
per-table stats). Currently we have more complicated schemes for those.
But that's imo a separate patch.
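
To sketch what I mean - each backend owns a slot in shared memory,
writers only ever touch their own slot without taking a lock, and
readers pay for that by summing all slots. The types and names below are
invented for illustration, they're not from the patch:

typedef struct PgStatIOCounters
{
    pg_atomic_uint64 blocks_read;
    pg_atomic_uint64 blocks_written;
} PgStatIOCounters;

/* in shared memory; MaxBackends + 1 slots, as MyBackendId is 1-based */
static PgStatIOCounters *io_counters;

static inline void
count_block_read(void)
{
    /* each backend updates only its own slot - no lock, no contention */
    pg_atomic_fetch_add_u64(&io_counters[MyBackendId].blocks_read, 1);
}

static uint64
sum_blocks_read(void)
{
    uint64      total = 0;

    /* readers do the extra work of summing over all slots */
    for (int i = 1; i <= MaxBackends; i++)
        total += pg_atomic_read_u64(&io_counters[i].blocks_read);
    return total;
}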


Thanks!

Andres


