Re: shared-memory based stats collector - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: shared-memory based stats collector
Date
Msg-id CA+hUKGJs95JaSC5G8JBQ73L=onqmJDF0tPvzxH07BC5vx8XqDQ@mail.gmail.com
In response to Re: shared-memory based stats collector  (Andres Freund <andres@anarazel.de>)
Responses Re: shared-memory based stats collector  (Andres Freund <andres@anarazel.de>)
Re: shared-memory based stats collector  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-hackers
Hi Horiguchi-san, Andres,

I tried to rebase this (see attached, no intentional changes beyond
rebasing).  Some feedback:

On Fri, Mar 13, 2020 at 4:13 PM Andres Freund <andres@anarazel.de> wrote:
> Thomas, could you look at the first two patches here, and my review
> questions?

Ack.

> >               dsa_pointer item_pointer = hash_table->buckets[i];
> > @@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
> >                                                               LW_EXCLUSIVE));
> >
> >       delete_item(hash_table, item);
> > -     hash_table->find_locked = false;
> > -     hash_table->find_exclusively_locked = false;
> > -     LWLockRelease(PARTITION_LOCK(hash_table, partition));
> > +
> > +     /* We need to keep partition lock while sequential scan */
> > +     if (!hash_table->seqscan_running)
> > +     {
> > +             hash_table->find_locked = false;
> > +             hash_table->find_exclusively_locked = false;
> > +             LWLockRelease(PARTITION_LOCK(hash_table, partition));
> > +     }
> >  }
>
> This seems like a failure prone API.

If I understand correctly, the only purpose of the seqscan_running
variable is to control that behaviour ^^^.  That is, to make
dshash_delete_entry() keep the partition lock if you delete an entry
while doing a seq scan.  Why not get rid of that and provide a
separate interface for deleting while scanning, say
dshash_seq_delete(dshash_seq_status *scan, void *entry)?  I suppose it
would be most common to want to delete the "current" item in the seq
scan, but it could allow you to delete anything in the same partition,
or any entry if using the "consistent" mode.  Oh, I see that Andres
said the same thing later.
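
Something like this, perhaps (untested sketch only; delete_item(),
ITEM_FROM_ENTRY() and PARTITION_FOR_HASH() are existing dshash.c
internals, but the dshash_seq_status member names are my guesses,
since that struct comes from your patch):

    void
    dshash_seq_delete(dshash_seq_status *status, void *entry)
    {
        dshash_table *hash_table = status->hash_table;
        dshash_table_item *item = ITEM_FROM_ENTRY(entry);
        size_t      partition = PARTITION_FOR_HASH(item->hash);

        /* The scan must already hold this partition's lock. */
        Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
                                    LW_EXCLUSIVE));

        delete_item(hash_table, item);

        /*
         * Keep holding the partition lock; the scan releases it when it
         * advances to the next partition or terminates.
         */
    }

That way dshash_delete_entry() keeps its simple lock-releasing
behaviour, and no seqscan_running flag is needed.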

> [Andres complaining about comments and language stuff]

I would be happy to proofread and maybe extend the comments (writing
new comments will also help me understand and review the code!), and
maybe make some code changes to move this forward.  Horiguchi-san, are
you working on another version now?  If so I'll wait for it before I
do that.

> > + */
> > +void
> > +dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
> > +                             bool consistent, bool exclusive)
> > +{
>
> Why does this patch add the consistent mode? There's no users currently?
> Without it's not clear that we need a seperate _term function, I think?

+1, let's not do that if we don't need it!

> The fact that you're locking the per-database entry unconditionally once
> for each table almost guarantees contention - and you're not using the
> 'conditional lock' approach for that. I don't understand.

Right, I also noticed that:

    /*
     * Local table stats should be applied to both dbentry and tabentry at
     * once. Update dbentry only if we could update tabentry.
     */
    if (pgstat_update_tabentry(tabhash, entry, nowait))
    {
        pgstat_update_dbentry(dbent, entry);
        updated = true;
    }

So pgstat_update_tabentry() goes to great trouble to take locks
conditionally, but then pgstat_update_dbentry() immediately does:

    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
    dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
    dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
    dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
    dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
    dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
    dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
    dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
    LWLockRelease(&dbentry->lock);

Why can't we be "lazy" with the dbentry stats too?  Is it really
important for the table stats and DB stats to agree with each other?
Even if it were, your current coding doesn't achieve that: the table
stats are updated before the DB stats, under different locks, so I'm
not sure why the DB update can't wait longer.
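
For illustration, a conditional version of that update could look
something like this (the function name and signature are invented; a
caller that gets "false" would have to keep the counters pending
locally and retry later):

    static bool
    pgstat_update_dbentry_cond(PgStat_StatDBEntry *dbentry,
                               PgStat_TableStatus *stat)
    {
        /* Don't wait: if somebody else holds the lock, give up for now. */
        if (!LWLockConditionalAcquire(&dbentry->lock, LW_EXCLUSIVE))
            return false;

        dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
        dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
        dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
        dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
        dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
        dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
        dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;

        LWLockRelease(&dbentry->lock);
        return true;
    }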

Hmm.  Even if you change the above code to use a conditional lock, I
am wondering (admittedly entirely without data) if this approach is still
too clunky: even trying and failing to acquire the lock creates
contention, just a bit less.  I wonder if it would make sense to make
readers do more work, so that writers can avoid contention.  For
example, maybe PgStat_StatDBEntry could hold an array of N sets of
counters, and readers have to add them all up.  An advanced version of
this idea would use a reasonably fresh cached result of something like
sched_getcpu() and numa_node_of_cpu() to select a partition to
minimise contention and cross-node traffic, with a portable fallback
based on PID or something.  CPU core/node awareness is something I
haven't looked into too seriously, but it's been on my mind to solve
some other problems.
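
To make that concrete, here is a rough sketch of the striped-counters
idea, with just one counter and the portable PID-based fallback (all
of these names are invented; PgStat_Counter, MyProcPid and the LWLock
calls are the existing ones):

    #define PGSTAT_DB_STRIPES 16

    typedef struct PgStatDBCounterStripe
    {
        LWLock      lock;       /* set up with LWLockInitialize() */
        PgStat_Counter n_tuples_returned;
        /* ... the other per-database counters ... */
    } PgStatDBCounterStripe;

    typedef struct PgStatDBEntryStriped
    {
        PgStatDBCounterStripe stripes[PGSTAT_DB_STRIPES];
    } PgStatDBEntryStriped;

    /* Writers touch only one stripe, so they rarely collide. */
    static void
    stripe_add_tuples_returned(PgStatDBEntryStriped *e, PgStat_Counter n)
    {
        PgStatDBCounterStripe *s = &e->stripes[MyProcPid % PGSTAT_DB_STRIPES];

        LWLockAcquire(&s->lock, LW_EXCLUSIVE);
        s->n_tuples_returned += n;
        LWLockRelease(&s->lock);
    }

    /* Readers pay instead: they have to sum every stripe. */
    static PgStat_Counter
    stripe_sum_tuples_returned(PgStatDBEntryStriped *e)
    {
        PgStat_Counter total = 0;

        for (int i = 0; i < PGSTAT_DB_STRIPES; i++)
        {
            LWLockAcquire(&e->stripes[i].lock, LW_SHARED);
            total += e->stripes[i].n_tuples_returned;
            LWLockRelease(&e->stripes[i].lock);
        }
        return total;
    }

The CPU/NUMA-aware version would just replace the MyProcPid expression
with something derived from sched_getcpu()/numa_node_of_cpu().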

