From: Andres Freund
Subject: Re: shared-memory based stats collector
Date: 2020-09-22
Msg-id: 20200922024704.yc3vwcckanpx557u@alap3.anarazel.de
In response to: Re: shared-memory based stats collector (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List: pgsql-hackers

Hi,

On 2020-09-08 17:55:57 +0900, Kyotaro Horiguchi wrote:
> Locks on the shared statistics is acquired by the units of such like
> tables, functions so the expected chance of collision are not so high.

I can't really parse that...


> Furthermore, until 1 second has elapsed since the last flushing to
> shared stats, lock failure postpones stats flushing so that lock
> contention doesn't slow down transactions.

I think I commented on that before, but to me 1s seems way too low to
switch to blocking lock acquisition. What's the reason for such a low
limit?

>
>      /*
> -     * Clean up any dead statistics collector entries for this DB. We always
> +     * Clean up any dead activity statistics entries for this DB. We always
>       * want to do this exactly once per DB-processing cycle, even if we find
>       * nothing worth vacuuming in the database.
>       */

What is "activity statistics"?


> @@ -2816,8 +2774,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
>      }
>
>      /* fetch the pgstat table entry */
> -    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
> -                                         shared, dbentry);
> +    tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
> +                                                   relid);

Why do all of these places deal with a snapshot? For most it seems to
make much more sense to just look up the entry and then copy that into
local memory?  There may be some places that need some sort of snapshot
behaviour that's stable until commit / pgstat_clear_snapshot(). But I
can't really see many?


> +#define PGSTAT_MIN_INTERVAL            1000    /* Minimum interval of stats data
> +                                             * updates */
> +#define PGSTAT_MAX_INTERVAL            10000    /* Longest interval of stats data
> +                                             * updates */

These don't really seem to be in line with the commit message...




>  /*
> - * Structures in which backends store per-table info that's waiting to be
> - * sent to the collector.
> + * Enums and types to define shared statistics structure.
> + *
> + * Statistics entries for each object is stored in individual DSA-allocated
> + * memory. Every entry is pointed from the dshash pgStatSharedHash via
> + * dsa_pointer. The structure makes object-stats entries not moved by dshash
> + * resizing, and allows the dshash can release lock sooner on stats
> + * updates. Also it reduces interfering among write-locks on each stat entry by
> + * not relying on partition lock of dshash. PgStatLocalHashEntry is the
> + * local-stats equivalent of PgStatHashEntry for shared stat entries.
> + *
> + * Each stat entry is enveloped by the type PgStatEnvelope, which stores common
> + * attribute of all kind of statistics and a LWLock lock object.
> + *
> + * Shared stats are stored as:
> + *
> + * dshash pgStatSharedHash
> + *    -> PgStatHashEntry                (dshash entry)
> + *      (dsa_pointer)-> PgStatEnvelope    (dsa memory block)

I don't like 'Envelope' that much. If I understand you correctly that's
a common prefix that's used for all types of stat objects, correct? If
so, how about just naming it PgStatEntryBase or such? I think it'd also
be useful to indicate in the "are stored as" part that PgStatEnvelope is
just the common prefix for an allocation.


>  /*
> - * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
> + * entry size lookup table of shared statistics entries corresponding to
> + * PgStatTypes
>   */
> -typedef struct TabStatHashEntry
> +static size_t pgstat_entsize[] =

> +/* Ditto for local statistics entries */
> +static size_t pgstat_localentsize[] =
> +{
> +    0,                            /* PGSTAT_TYPE_ALL: not an entry */
> +    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
> +    sizeof(PgStat_TableStatus), /* PGSTAT_TYPE_TABLE */
> +    sizeof(PgStat_BackendFunctionEntry) /* PGSTAT_TYPE_FUNCTION */
> +};

These probably should be const as well.


>  /*
> - * Backends store per-function info that's waiting to be sent to the collector
> - * in this hash table (indexed by function OID).
> + * Stats numbers that are waiting for flushing out to shared stats are held in
> + * pgStatLocalHash,
>   */
> -static HTAB *pgStatFunctions = NULL;
> +typedef struct PgStatHashEntry
> +{
> +    PgStatHashEntryKey key;        /* hash key */
> +    dsa_pointer env;            /* pointer to shared stats envelope in DSA */
> +}            PgStatHashEntry;
> +
> +/* struct for shared statistics entry pointed from shared hash entry. */
> +typedef struct PgStatEnvelope
> +{
> +    PgStatTypes type;            /* statistics entry type */
> +    Oid            databaseid;        /* databaseid */
> +    Oid            objectid;        /* objectid */

Do we need this information both here and in PgStatHashEntry? It's
possible that it's worthwhile, but I am not sure it is.


> +    size_t        len;            /* length of body, fixed per type. */

Why do we need this? Isn't that something that can easily be looked up
using the type?


> +    LWLock        lock;            /* lightweight lock to protect body */
> +    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
> +}            PgStatEnvelope;

What you're doing here with 'body' doesn't provide enough guarantees
about proper alignment. E.g. if one of the entry types wants to store a
double, this won't portably work, because there are platforms that have
4 byte alignment for ints, but 8 byte alignment for doubles.


Wouldn't it be better to instead embed PgStatEnvelope into the struct
that's actually stored? E.g. something like

struct PgStat_TableStatus
{
    PgStatEnvelope header; /* I'd rename the type */
    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
    ...
}

or if you don't want to do that because it'd require declaring
PgStatEnvelope in the header (not sure that'd really be worth avoiding),
you could just get rid of the body field and just do the calculation
using something like MAXALIGN((char *) envelope + sizeof(PgStatEnvelope))
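
I.e. something like this (just a sketch; pgstat_env_body is a made-up
helper, and the envelope allocation would then have to reserve
MAXALIGN(sizeof(PgStatEnvelope)) + bodylen):

static inline void *
pgstat_env_body(PgStatEnvelope *env)
{
    /* skip the common header, rounded up to the strictest alignment, so
     * that e.g. a double in the per-type struct is properly aligned */
    return (void *) MAXALIGN((char *) env + sizeof(PgStatEnvelope));
}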


> + * Snapshot is stats entry that is locally copied to offset stable values for a
> + * transaction.
>   */
> -static bool have_function_stats = false;
> +typedef struct PgStatSnapshot
> +{
> +    PgStatHashEntryKey key;
> +    bool        negative;
> +    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
> +}            PgStatSnapshot;
> +
> +#define PgStatSnapshotSize(bodylen)                \
> +    (offsetof(PgStatSnapshot, body) + (bodylen))
>
> -/*
> - * Info about current "snapshot" of stats file
> - */
> +/* Variables for backend status snapshot */
>  static MemoryContext pgStatLocalContext = NULL;
> -static HTAB *pgStatDBHash = NULL;
> +static MemoryContext pgStatSnapshotContext = NULL;
> +static HTAB *pgStatSnapshotHash = NULL;

>  /*
> - * Cluster wide statistics, kept in the stats collector.
> - * Contains statistics that are not collected per database
> - * or per table.
> + * Cluster wide statistics.
> + *
> + * Contains statistics that are collected not per database nor per table
> + * basis.  shared_* points to shared memory and snapshot_* are backend
> + * snapshots.
>   */
> -static PgStat_ArchiverStats archiverStats;
> -static PgStat_GlobalStats globalStats;
> -static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
> -
> -/*
> - * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
> - * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
> - * will write both that DB's data and the shared stats.
> - */
> -static List *pending_write_requests = NIL;
> +static bool global_snapshot_is_valid = false;
> +static PgStat_ArchiverStats *shared_archiverStats;
> +static PgStat_ArchiverStats snapshot_archiverStats;
> +static PgStat_GlobalStats *shared_globalStats;
> +static PgStat_GlobalStats snapshot_globalStats;
> +static PgStatSharedSLRUStats *shared_SLRUStats;
> +static PgStat_StatSLRUEntry snapshot_SLRUStats[SLRU_NUM_ELEMENTS];

The amount of code needed for this snapshot stuff seems unreasonable to
me, especially because I don't see why we really need it. Is this just
so that there's no skew between all the columns of pg_stat_all_tables()
etc?

I think this needs a lot more comments explaining what it's trying to
achieve.


> +/*
> + * Newly created shared stats entries needs to be initialized before the other
> + * processes get access it. get_stat_entry() calls it for the purpose.
> + */
> +typedef void (*entry_initializer) (PgStatEnvelope * env);

I think we should try to not need it, instead declaring that all fields
are zero initialized. That fits well together with my suggestion to
avoid duplicating the database / object ids.
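
I.e. entry creation inside get_stat_entry() could then be as simple as
this (sketch; relies on dsa_allocate0() handing back zeroed memory):

/* a zeroed body is defined to be a valid "no stats yet" entry */
dsa_pointer chunk = dsa_allocate0(area, envsize);
PgStatEnvelope *shenv = dsa_get_address(area, chunk);

shenv->type = key.type;         /* only the common header needs filling */
LWLockInitialize(&shenv->lock, LWTRANCHE_STATS);
shent->env = chunk;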


> +static void
> +attach_shared_stats(void)
> +{
...
> +        /* We're the first process to attach the shared stats memory */
> +        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
> +
> +        /* Initialize shared memory area */
> +        area = dsa_create(LWTRANCHE_STATS);
> +        pgStatSharedHash = dshash_create(area, &dsh_rootparams, 0);
> +
> +        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
> +        StatsShmem->global_stats =
> +            dsa_allocate0(area, sizeof(PgStat_GlobalStats));
> +        StatsShmem->archiver_stats =
> +            dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
> +        StatsShmem->slru_stats =
> +            dsa_allocate0(area, sizeof(PgStatSharedSLRUStats));
> +        StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
> +
> +        shared_globalStats = (PgStat_GlobalStats *)
> +            dsa_get_address(area, StatsShmem->global_stats);
> +        shared_archiverStats = (PgStat_ArchiverStats *)
> +            dsa_get_address(area, StatsShmem->archiver_stats);
> +
> +        shared_SLRUStats = (PgStatSharedSLRUStats *)
> +            dsa_get_address(area, StatsShmem->slru_stats);
> +        LWLockInitialize(&shared_SLRUStats->lock, LWTRANCHE_STATS);

I don't think it makes sense to use dsa allocations for any of the fixed
size stats (global_stats, archiver_stats, ...). They should just be
direct members of StatsShmem? Then we also don't need the shared_*
helper variables
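
I.e. roughly this (sketch, field layout guessed from the patch):

typedef struct StatsShmemStruct
{
    dsa_handle    stats_dsa_handle;   /* DSA for the variable-size entries */
    dshash_table_handle hash_handle;
    /* fixed-size stats embedded directly, no dsa_pointer indirection */
    PgStat_GlobalStats    global_stats;
    PgStat_ArchiverStats  archiver_stats;
    PgStatSharedSLRUStats slru_stats;
} StatsShmemStruct;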


> +        /* Load saved data if any. */
> +        pgstat_read_statsfiles();

Hm. Is it a good idea to do this as part of the shmem init function?
That's a lot more work than we normally do in these.


> +/* ----------
> + * detach_shared_stats() -
> + *
> + *    Detach shared stats. Write out to file if we're the last process and told
> + *    to do so.
> + * ----------
>   */
>  static void
> -pgstat_reset_remove_files(const char *directory)
> +detach_shared_stats(bool write_stats)

I think it'd be better to have an explicit call in the shutdown sequence
somewhere to write out the data, instead of munging detach and writing
stats out together.


>  /* ----------
>   * pgstat_report_stat() -
>   *
>   *    Must be called by processes that performs DML: tcop/postgres.c, logical
> - *    receiver processes, SPI worker, etc. to send the so far collected
> - *    per-table and function usage statistics to the collector.  Note that this
> - *    is called only when not within a transaction, so it is fair to use
> + *    receiver processes, SPI worker, etc. to apply the so far collected
> + *    per-table and function usage statistics to the shared statistics hashes.
> + *
> + *    Updates are applied not more frequent than the interval of
> + *    PGSTAT_MIN_INTERVAL milliseconds. They are also postponed on lock
> + *    failure if force is false and there's no pending updates longer than
> + *    PGSTAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
> + *    succeeding calls of this function.
> + *
> + *    Returns the time until the next timing when updates are applied in
> + *    milliseconds if there are no updates held for more than
> + *    PGSTAT_MIN_INTERVAL milliseconds.
> + *
> + *    Note that this is called only out of a transaction, so it is fine to use
>   *    transaction stop time as an approximation of current time.
> - * ----------
> + *    ----------
>   */
> -void
> +long
>  pgstat_report_stat(bool force)
>  {
> -    /* we assume this inits to all zeroes: */
> -    static const PgStat_TableCounts all_zeroes;
> -    static TimestampTz last_report = 0;
> -
> +    static TimestampTz next_flush = 0;
> +    static TimestampTz pending_since = 0;
> +    static long retry_interval = 0;
>      TimestampTz now;
> -    PgStat_MsgTabstat regular_msg;
> -    PgStat_MsgTabstat shared_msg;
> -    TabStatusArray *tsa;
> +    bool        nowait = !force;    /* Don't use force ever after */

> +    if (nowait)
> +    {
> +        /*
> +         * Don't flush stats too frequently.  Return the time to the next
> +         * flush.
> +         */

I think it's confusing to use nowait in the if when you actually mean
!force.


> -    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
> +    if (pgStatLocalHash)
>      {
> -        for (i = 0; i < tsa->tsa_used; i++)
> +        /* Step 1: flush out other than database stats */
> +        hash_seq_init(&scan, pgStatLocalHash);
> +        while ((lent = (PgStatLocalHashEntry *) hash_seq_search(&scan)) != NULL)
>          {
> -            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
> -            PgStat_MsgTabstat *this_msg;
> -            PgStat_TableEntry *this_ent;
> +            bool        remove = false;
>
> -            /* Shouldn't have any pending transaction-dependent counts */
> -            Assert(entry->trans == NULL);
> +            switch (lent->env->type)
> +            {
> +                case PGSTAT_TYPE_DB:
> +                    if (ndbentries >= dbentlistlen)
> +                    {
> +                        dbentlistlen *= 2;
> +                        dbentlist = repalloc(dbentlist,
> +                                             sizeof(PgStatLocalHashEntry *) *
> +                                             dbentlistlen);
> +                    }
> +                    dbentlist[ndbentries++] = lent;
> +                    break;

Why do we need this special behaviour for database statistics?

If we need it, it'd be better to just use List here rather than open
coding a replacement (List these days basically has the same complexity
as what you do here).
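
Something like (sketch):

List       *dbentries = NIL;
ListCell   *lc;
...
            case PGSTAT_TYPE_DB:
                /* defer database entries, no hand-rolled array growth */
                dbentries = lappend(dbentries, lent);
                break;
...
/* Step 2: flush out database stats */
foreach(lc, dbentries)
{
    PgStatLocalHashEntry *dbent = lfirst(lc);

    if (flush_dbstat(dbent->env, nowait))
        pfree(dbent->env);
}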


> +                case PGSTAT_TYPE_TABLE:
> +                    if (flush_tabstat(lent->env, nowait))
> +                        remove = true;
> +                    break;
> +                case PGSTAT_TYPE_FUNCTION:
> +                    if (flush_funcstat(lent->env, nowait))
> +                        remove = true;
> +                    break;
> +                default:
> +                    Assert(false);

Adding a default here prevents the compiler from issuing a warning when
new types of stats are added...
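
I.e. without the default the compiler can catch that for you (sketch):

switch (lent->env->type)
{
    case PGSTAT_TYPE_DB:
        dbentries = lappend(dbentries, lent);
        break;
    case PGSTAT_TYPE_TABLE:
        if (flush_tabstat(lent->env, nowait))
            remove = true;
        break;
    case PGSTAT_TYPE_FUNCTION:
        if (flush_funcstat(lent->env, nowait))
            remove = true;
        break;
    case PGSTAT_TYPE_ALL:
        Assert(false);          /* not a valid entry type */
        break;
        /* no default: -Wswitch warns when PgStatTypes grows a member */
}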


> +            /* Remove the successfully flushed entry */
> +            pfree(lent->env);

Probably worth zeroing the pointer here, to make debugging a little
easier.


> +    /* Publish the last flush time */
> +    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
> +    if (shared_globalStats->stats_timestamp < now)
> +        shared_globalStats->stats_timestamp = now;
> +    LWLockRelease(StatsLock);

Ugh, that seems like a fairly unnecessary global lock acquisition. What
do we need this timestamp for? Not clear to me that it's still
needed. If it is needed, it'd probably be worth making this an atomic
and doing a compare-exchange loop instead.
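
Something along these lines would avoid the lock entirely (sketch,
assuming stats_timestamp were changed to a pg_atomic_uint64):

uint64      cur = pg_atomic_read_u64(&shared_globalStats->stats_timestamp);

while (cur < (uint64) now)
{
    /* on failure, cur is updated to the concurrently written value */
    if (pg_atomic_compare_exchange_u64(&shared_globalStats->stats_timestamp,
                                       &cur, (uint64) now))
        break;
}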


>      /*
> -     * Send partial messages.  Make sure that any pending xact commit/abort
> -     * gets counted, even if there are no table stats to send.
> +     * If we have pending local stats, let the caller know the retry interval.
>       */
> -    if (regular_msg.m_nentries > 0 ||
> -        pgStatXactCommit > 0 || pgStatXactRollback > 0)
> -        pgstat_send_tabstat(®ular_msg);
> -    if (shared_msg.m_nentries > 0)
> -        pgstat_send_tabstat(&shared_msg);
> +    if (HAVE_ANY_PENDING_STATS())

I think this needs a comment explaining why we still may have pending
stats.



> + * flush_tabstat - flush out a local table stats entry
> + *
> + * Some of the stats numbers are copied to local database stats entry after
> + * successful flush-out.
> + *
> + * If nowait is true, this function returns false on lock failure. Otherwise
> + * this function always returns true.
> + *
> + * Returns true if the entry is successfully flushed out.
> + */
> +static bool
> +flush_tabstat(PgStatEnvelope * lenv, bool nowait)
> +{
> +    static const PgStat_TableCounts all_zeroes;
> +    Oid            dboid;            /* database OID of the table */
> +    PgStat_TableStatus *lstats; /* local stats entry  */
> +    PgStatEnvelope *shenv;        /* shared stats envelope */
> +    PgStat_StatTabEntry *shtabstats;    /* table entry of shared stats */
> +    PgStat_StatDBEntry *ldbstats;    /* local database entry */
> +    bool        found;
> +
> +    Assert(lenv->type == PGSTAT_TYPE_TABLE);
> +
> +    lstats = (PgStat_TableStatus *) &lenv->body;
> +    dboid = lstats->t_shared ? InvalidOid : MyDatabaseId;
> +
> +    /*
> +     * Ignore entries that didn't accumulate any actual counts, such as
> +     * indexes that were opened by the planner but not used.
> +     */
> +    if (memcmp(&lstats->t_counts, &all_zeroes,
> +               sizeof(PgStat_TableCounts)) == 0)
> +        return true;
> +
> +    /* find shared table stats entry corresponding to the local entry */
> +    shenv = get_stat_entry(PGSTAT_TYPE_TABLE, dboid, lstats->t_id,
> +                           nowait, init_tabentry, &found);
> +
> +    /* skip if dshash failed to acquire lock */
> +    if (shenv == NULL)
> +        return false;

Could we cache the address of the shared entry in the local entry for a
while? It seems we have a bunch of contention (that I think you're
trying to address in a prototype patch posted since) just because we
will over and over look up the same address in the shared hash table.

If we instead kept the local hashtable alive for longer and stored a
pointer to the shared entry in it, we could make this a lot
cheaper. There would be some somewhat nasty edge cases probably. Imagine
a table being dropped for which another backend still has pending
stats. But that could e.g. be addressed with a refcount.
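
Roughly (sketch; the 'shenv' field is invented):

typedef struct PgStatLocalHashEntry
{
    PgStatHashEntryKey key;     /* hash key */
    PgStatEnvelope *env;        /* local pending counts */
    PgStatEnvelope *shenv;      /* cached address of shared entry, or NULL */
} PgStatLocalHashEntry;

/* in flush_tabstat() et al: do the dshash lookup only once */
if (lent->shenv == NULL)
    lent->shenv = get_stat_entry(PGSTAT_TYPE_TABLE, dboid, lstats->t_id,
                                 nowait, init_tabentry, &found);
/* subsequent flushes only take the entry's own LWLock */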


> +    /* retrieve the shared table stats entry from the envelope */
> +    shtabstats = (PgStat_StatTabEntry *) &shenv->body;
> +
> +    /* lock the shared entry to protect the content, skip if failed */
> +    if (!nowait)
> +        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
> +    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
> +        return false;
> +
> +    /* add the values to the shared entry. */
> +    shtabstats->numscans += lstats->t_counts.t_numscans;
> +    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
> +    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
> +    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
> +    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
> +    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
> +    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
> +
> +    /*
> +     * If table was truncated or vacuum/analyze has ran, first reset the
> +     * live/dead counters.
> +     */
> +    if (lstats->t_counts.t_truncated ||
> +        lstats->t_counts.vacuum_count > 0 ||
> +        lstats->t_counts.analyze_count > 0 ||
> +        lstats->t_counts.autovac_vacuum_count > 0 ||
> +        lstats->t_counts.autovac_analyze_count > 0)
> +    {
> +        shtabstats->n_live_tuples = 0;
> +        shtabstats->n_dead_tuples = 0;
> +    }

> +    /* clear the change counter if requested */
> +    if (lstats->t_counts.reset_changed_tuples)
> +        shtabstats->changes_since_analyze = 0;

I know this is largely old code, but it's not obvious to me that there
are no race conditions here / that the race condition didn't get worse.
What prevents other backends from having since done a lot of inserts
into this table? Especially in case the flushes were delayed due to
lock contention.

> +    /*
> +     * Update vacuum/analyze timestamp and counters, so that the values won't
> +     * goes back.
> +     */
> +    if (shtabstats->vacuum_timestamp < lstats->vacuum_timestamp)
> +        shtabstats->vacuum_timestamp = lstats->vacuum_timestamp;

It seems to me that if these branches are indeed necessary, my concerns
above are well founded...


> +init_tabentry(PgStatEnvelope * env)
>  {
> -    int            n;
> -    int            len;
> +    PgStat_StatTabEntry *tabent = (PgStat_StatTabEntry *) &env->body;
> +
> +    /*
> +     * If it's a new table entry, initialize counters to the values we just
> +     * got.
> +     */
> +    Assert(env->type == PGSTAT_TYPE_TABLE);
> +    tabent->tableid = env->objectid;

It seems over the top to me to have the object id stored in yet another
place. It's now in the hash entry, in the envelope, and in the
type-specific part.


> +/*
> + * flush_funcstat - flush out a local function stats entry
> + *
> + * If nowait is true, this function returns false on lock failure. Otherwise
> + * this function always returns true.
> + *
> + * Returns true if the entry is successfully flushed out.
> + */
> +static bool
> +flush_funcstat(PgStatEnvelope * env, bool nowait)
> +{
> +    /* we assume this inits to all zeroes: */
> +    static const PgStat_FunctionCounts all_zeroes;
> +    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
> +    PgStatEnvelope *shenv;        /* shared stats envelope */
> +    PgStat_StatFuncEntry *sharedent = NULL; /* shared stats entry */
> +    bool        found;
> +
> +    Assert(env->type == PGSTAT_TYPE_FUNCTION);
> +    localent = (PgStat_BackendFunctionEntry *) &env->body;
> +
> +    /* Skip it if no counts accumulated for it so far */
> +    if (memcmp(&localent->f_counts, &all_zeroes,
> +               sizeof(PgStat_FunctionCounts)) == 0)
> +        return true;

Why would we have an entry in this case?


> +    /* find shared table stats entry corresponding to the local entry */
> +    shenv = get_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, localent->f_id,
> +                           nowait, init_funcentry, &found);
> +    /* skip if dshash failed to acquire lock */
> +    if (shenv == NULL)
> +        return false;            /* failed to acquire lock, skip */
> +
> +    /* retrieve the shared table stats entry from the envelope */
> +    sharedent = (PgStat_StatFuncEntry *) &shenv->body;
> +
> +    /* lock the shared entry to protect the content, skip if failed */
> +    if (!nowait)
> +        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
> +    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
> +        return false;            /* failed to acquire lock, skip */

It doesn't seem great that we have a separate copy of all of this logic
again. It seems to me that most of the code here is (or should be)
exactly the same as in the table case. I think only the function-specific
accumulation below should be in here, rather than in common code.
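
I.e. something like this could be shared (sketch; 'accumulate' is a
made-up per-type callback that folds local counts into the shared entry):

static bool
flush_stat_entry(PgStatEnvelope *lenv, bool nowait,
                 entry_initializer initfunc,
                 void (*accumulate) (PgStatEnvelope *shenv,
                                     PgStatEnvelope *lenv))
{
    PgStatEnvelope *shenv;
    bool        found;

    shenv = get_stat_entry(lenv->type, lenv->databaseid, lenv->objectid,
                           nowait, initfunc, &found);
    if (shenv == NULL)
        return false;           /* dshash lock not available */

    if (!nowait)
        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
        return false;

    accumulate(shenv, lenv);
    LWLockRelease(&shenv->lock);
    return true;
}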

> +/*
> + * flush_dbstat - flush out a local database stats entry
> + *
> + * If nowait is true, this function returns false on lock failure. Otherwise
> + * this function always returns true.
> + *
> + * Returns true if the entry is successfully flushed out.
> + */
> +static bool
> +flush_dbstat(PgStatEnvelope * env, bool nowait)
> +{
> +    PgStat_StatDBEntry *localent;
> +    PgStatEnvelope *shenv;
> +    PgStat_StatDBEntry *sharedent;
> +
> +    Assert(env->type == PGSTAT_TYPE_DB);
> +
> +    localent = (PgStat_StatDBEntry *) &env->body;
> +
> +    /* find shared database stats entry corresponding to the local entry */
> +    shenv = get_stat_entry(PGSTAT_TYPE_DB, localent->databaseid, InvalidOid,
> +                           nowait, init_dbentry, NULL);
> +
> +    /* skip if dshash failed to acquire lock */
> +    if (!shenv)
> +        return false;
> +
> +    /* retrieve the shared stats entry from the envelope */
> +    sharedent = (PgStat_StatDBEntry *) &shenv->body;
> +
> +    /* lock the shared entry to protect the content, skip if failed */
> +    if (!nowait)
> +        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
> +    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
> +        return false;

Ditto re duplicating all of this.



> +/*
> + * Create the filename for a DB stat file; filename is output parameter points
> + * to a character buffer of length len.
> + */
> +static void
> +get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
> +{
> +    int            printed;
> +
> +    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
> +    printed = snprintf(filename, len, "%s/db_%u.%s",
> +                       PGSTAT_STAT_PERMANENT_DIRECTORY,
> +                       databaseid,
> +                       tempname ? "tmp" : "stat");
> +    if (printed >= len)
> +        elog(ERROR, "overlength pgstat path");
>  }

Do we really want database specific storage after all of these changes?
Seems like there's no point anymore?


> +    dshash_seq_init(&hstat, pgStatSharedHash, false);
> +    while ((p = dshash_seq_next(&hstat)) != NULL)
>      {
> -        Oid            tabid = tabentry->tableid;
> -
> -        CHECK_FOR_INTERRUPTS();
> -

Given that this could take a while on a database with a lot of objects,
it might be worth keeping the CHECK_FOR_INTERRUPTS().


>
>  /* ----------
> - * pgstat_vacuum_stat() -
> + * collect_stat_entries() -
>   *
> - *    Will tell the collector about objects he can get rid of.
> + *    Collect the shared statistics entries specified by type and dbid. Returns a
> + *  list of pointer to shared statistics in palloc'ed memory. If type is
> + *  PGSTAT_TYPE_ALL, all types of statistics of the database is collected. If
> + *  type is PGSTAT_TYPE_DB, the parameter dbid is ignored and collect all
> + *  PGSTAT_TYPE_DB entries.
>   * ----------
>   */
> -void
> -pgstat_vacuum_stat(void)
> +static PgStatEnvelope * *collect_stat_entries(PgStatTypes type, Oid dbid)
>  {

> -        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
> +        if ((type != PGSTAT_TYPE_ALL && p->key.type != type) ||
> +            (type != PGSTAT_TYPE_DB && p->key.databaseid != dbid))
>              continue;

I don't like this interface much. Particularly not that it requires
adding a PGSTAT_TYPE_ALL that's otherwise not needed. And the thing
where PGSTAT_TYPE_DB doesn't actually work as one would expect isn't
nice either.

> -        /*
> -         * Not there, so add this table's Oid to the message
> -         */
> -        msg.m_tableid[msg.m_nentries++] = tabid;
> -
> -        /*
> -         * If the message is full, send it out and reinitialize to empty
> -         */
> -        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
> +        if (n >= listlen - 1)
>          {
> -            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
> -                + msg.m_nentries * sizeof(Oid);
> -
> -            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
> -            msg.m_databaseid = MyDatabaseId;
> -            pgstat_send(&msg, len);
> -
> -            msg.m_nentries = 0;
> +            listlen *= 2;
> +            envlist = repalloc(envlist, listlen * sizeof(PgStatEnvelope * *));
>          }
> +        envlist[n++] = dsa_get_address(area, p->env);
>      }

I'd use List here as well.


> +    dshash_seq_term(&hstat);

Hm, I didn't immediately see which locking makes this safe? Is it just
that nobody should be attached at this point?


> +void
> +pgstat_vacuum_stat(void)
> +{
> +    HTAB       *dbids;            /* database ids */
> +    HTAB       *relids;            /* relation ids in the current database */
> +    HTAB       *funcids;        /* function ids in the current database */
> +    PgStatEnvelope **victims;    /* victim entry list */
> +    int            arraylen = 0;    /* storage size of the above */
> +    int            nvictims = 0;    /* # of entries of the above */
> +    dshash_seq_status dshstat;
> +    PgStatHashEntry *ent;
> +    int            i;
> +
> +    /* we don't collect stats under standalone mode */
> +    if (!IsUnderPostmaster)
> +        return;
> +
> +    /* collect oids of existent objects */
> +    dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
> +    relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
> +    funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
> +
> +    /* collect victims from shared stats */
> +    arraylen = 16;
> +    victims = palloc(sizeof(PgStatEnvelope * *) * arraylen);
> +    nvictims = 0;

Same List comment as before.



>  void
>  pgstat_reset_counters(void)
>  {
> -    PgStat_MsgResetcounter msg;
> +    PgStatEnvelope **envlist;
> +    PgStatEnvelope **p;
>
> -    if (pgStatSock == PGINVALID_SOCKET)
> -        return;
> +    /* Lookup the entries of the current database in the stats hash. */
> +    envlist = collect_stat_entries(PGSTAT_TYPE_ALL, MyDatabaseId);
> +    for (p = envlist; *p != NULL; p++)
> +    {
> +        PgStatEnvelope *env = *p;
> +        PgStat_StatDBEntry *dbstat;
>
> -    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
> -    msg.m_databaseid = MyDatabaseId;
> -    pgstat_send(&msg, sizeof(msg));
> +        LWLockAcquire(&env->lock, LW_EXCLUSIVE);
> +

What locking prevents this entry from being freed between the
collect_stat_entries() and this LWLockAcquire?



>  /* ----------
> @@ -1440,48 +1684,63 @@ pgstat_reset_slru_counter(const char *name)
>  void
>  pgstat_report_autovac(Oid dboid)
>  {
> -    PgStat_MsgAutovacStart msg;
> +    PgStat_StatDBEntry *dbentry;
> +    TimestampTz ts;
>
> -    if (pgStatSock == PGINVALID_SOCKET)
> +    /* return if activity stats is not active */
> +    if (!area)
>          return;
>
> -    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
> -    msg.m_databaseid = dboid;
> -    msg.m_start_time = GetCurrentTimestamp();
> +    ts = GetCurrentTimestamp();
>
> -    pgstat_send(&msg, sizeof(msg));
> +    /*
> +     * Store the last autovacuum time in the database's hash table entry.
> +     */
> +    dbentry = get_local_dbstat_entry(dboid);
> +    dbentry->last_autovac_time = ts;
>  }

Why did you introduce the local ts variable here?


>  /* --------
>   * pgstat_report_analyze() -
>   *
> - *    Tell the collector about the table we just analyzed.
> + *    Report about the table we just analyzed.
>   *
>   * Caller must provide new live- and dead-tuples estimates, as well as a
>   * flag indicating whether to reset the changes_since_analyze counter.
> @@ -1492,9 +1751,10 @@ pgstat_report_analyze(Relation rel,
>                        PgStat_Counter livetuples, PgStat_Counter deadtuples,
>                        bool resetcounter)
>  {
>  }

It seems to me that the analyze / vacuum cases would be much better
dealt with by synchronously operating on the shared entry, instead of
going through the local hash table. ISTM that that'd make it a lot
easier to avoid most of the ordering issues.
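
E.g. for analyze, roughly (sketch, reusing the patch's helpers):

PgStatEnvelope *shenv;
PgStat_StatTabEntry *shtabstats;
bool        found;

shenv = get_stat_entry(PGSTAT_TYPE_TABLE,
                       rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId,
                       RelationGetRelid(rel),
                       false,   /* wait for the lock */
                       init_tabentry, &found);
shtabstats = (PgStat_StatTabEntry *) &shenv->body;

LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
shtabstats->n_live_tuples = livetuples;
shtabstats->n_dead_tuples = deadtuples;
if (resetcounter)
    shtabstats->changes_since_analyze = 0;
shtabstats->analyze_timestamp = GetCurrentTimestamp();
LWLockRelease(&shenv->lock);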




> +static PgStat_TableStatus *
> +get_local_tabstat_entry(Oid rel_id, bool isshared)
> +{
> +    PgStatEnvelope *env;
> +    PgStat_TableStatus *tabentry;
> +    bool        found;
>
> -    /*
> -     * Now we can fill the entry in pgStatTabHash.
> -     */
> -    hash_entry->tsa_entry = entry;
> +    env = get_local_stat_entry(PGSTAT_TYPE_TABLE,
> +                               isshared ? InvalidOid : MyDatabaseId,
> +                               rel_id, true, &found);
>
> -    return entry;
> +    tabentry = (PgStat_TableStatus *) &env->body;
> +
> +    if (!found)
> +    {
> +        tabentry->t_id = rel_id;
> +        tabentry->t_shared = isshared;
> +        tabentry->trans = NULL;
> +        MemSet(&tabentry->t_counts, 0, sizeof(PgStat_TableCounts));
> +        tabentry->vacuum_timestamp = 0;
> +        tabentry->autovac_vacuum_timestamp = 0;
> +        tabentry->analyze_timestamp = 0;
> +        tabentry->autovac_analyze_timestamp = 0;
> +    }
> +

As with shared entries, I think this should just be zero initialized
(and we should try to get rid of the duplication of t_id/t_shared).

> +    return tabentry;
>  }
>
> +
>  /*
>   * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
>   *
> - * If no entry, return NULL, don't create a new one
> + *  Find any existing PgStat_TableStatus entry for rel from the current
> + *  database then from shared tables.

What do you mean by "from the current database then from shared
tables"?


>  void
> -pgstat_send_archiver(const char *xlog, bool failed)
> +pgstat_report_archiver(const char *xlog, bool failed)
>  {
> -    PgStat_MsgArchiver msg;
> +    TimestampTz now = GetCurrentTimestamp();
>
> -    /*
> -     * Prepare and send the message
> -     */
> -    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
> -    msg.m_failed = failed;
> -    strlcpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
> -    msg.m_timestamp = GetCurrentTimestamp();
> -    pgstat_send(&msg, sizeof(msg));
> +    if (failed)
> +    {
> +        /* Failed archival attempt */
> +        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
> +        ++shared_archiverStats->failed_count;
> +        memcpy(shared_archiverStats->last_failed_wal, xlog,
> +               sizeof(shared_archiverStats->last_failed_wal));
> +        shared_archiverStats->last_failed_timestamp = now;
> +        LWLockRelease(StatsLock);
> +    }
> +    else
> +    {
> +        /* Successful archival operation */
> +        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
> +        ++shared_archiverStats->archived_count;
> +        memcpy(shared_archiverStats->last_archived_wal, xlog,
> +               sizeof(shared_archiverStats->last_archived_wal));
> +        shared_archiverStats->last_archived_timestamp = now;
> +        LWLockRelease(StatsLock);
> +    }
>  }

Huh, why is this duplicating near equivalent code?
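
A single branch selecting the fields would do (sketch):

void
pgstat_report_archiver(const char *xlog, bool failed)
{
    TimestampTz now = GetCurrentTimestamp();
    PgStat_Counter *count = failed ? &shared_archiverStats->failed_count
                                   : &shared_archiverStats->archived_count;
    char       *lastwal = failed ? shared_archiverStats->last_failed_wal
                                 : shared_archiverStats->last_archived_wal;
    TimestampTz *lasttime = failed
        ? &shared_archiverStats->last_failed_timestamp
        : &shared_archiverStats->last_archived_timestamp;

    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
    ++(*count);
    strlcpy(lastwal, xlog, MAX_XFN_CHARS + 1);
    *lasttime = now;
    LWLockRelease(StatsLock);
}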

>  /* ----------
>   * pgstat_write_statsfiles() -
> - *        Write the global statistics file, as well as requested DB files.
> - *
> - *    'permanent' specifies writing to the permanent files not temporary ones.
> - *    When true (happens only when the collector is shutting down), also remove
> - *    the temporary files so that backends starting up under a new postmaster
> - *    can't read old data before the new collector is ready.
> - *
> - *    When 'allDbs' is false, only the requested databases (listed in
> - *    pending_write_requests) will be written; otherwise, all databases
> - *    will be written.
> + *        Write the global statistics file, as well as DB files.
>   * ----------
>   */
> -static void
> -pgstat_write_statsfiles(bool permanent, bool allDbs)
> +void
> +pgstat_write_statsfiles(void)
>  {

What's the locking around this?


> -pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
> +pgstat_write_database_stats(PgStat_StatDBEntry *dbentry)
>  {
> -    HASH_SEQ_STATUS tstat;
> -    HASH_SEQ_STATUS fstat;
> -    PgStat_StatTabEntry *tabentry;
> -    PgStat_StatFuncEntry *funcentry;
> +    PgStatEnvelope **envlist;
> +    PgStatEnvelope **penv;
>      FILE       *fpout;
>      int32        format_id;
>      Oid            dbid = dbentry->databaseid;
> @@ -5048,8 +4974,8 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
>      char        tmpfile[MAXPGPATH];
>      char        statfile[MAXPGPATH];
>
> -    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
> -    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
> +    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
> +    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
>
>      elog(DEBUG2, "writing stats file \"%s\"", statfile);
>
> @@ -5076,24 +5002,31 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
>      /*
>       * Walk through the database's access stats per table.
>       */
> -    hash_seq_init(&tstat, dbentry->tables);
> -    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
> +    envlist = collect_stat_entries(PGSTAT_TYPE_TABLE, dbentry->databaseid);
> +    for (penv = envlist; *penv != NULL; penv++)

In several of these collect_stat_entries() callers it really bothers me
that we basically allocate an array as large as the number of objects
in the database (that's fine for databases, but for tables...), without
much need as far as I can see.


>      {
> +        PgStat_StatTabEntry *tabentry = (PgStat_StatTabEntry *) &(*penv)->body;
> +
>          fputc('T', fpout);
>          rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
>          (void) rc;                /* we'll check for error with ferror */
>      }
> +    pfree(envlist);
>
>      /*
>       * Walk through the database's function stats table.
>       */
> -    hash_seq_init(&fstat, dbentry->functions);
> -    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
> +    envlist = collect_stat_entries(PGSTAT_TYPE_FUNCTION, dbentry->databaseid);
> +    for (penv = envlist; *penv != NULL; penv++)
>      {
> +        PgStat_StatFuncEntry *funcentry =
> +        (PgStat_StatFuncEntry *) &(*penv)->body;
> +
>          fputc('F', fpout);
>          rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
>          (void) rc;                /* we'll check for error with ferror */
>      }
> +    pfree(envlist);

Why do we need separate loops for every type of object here?
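
FWIW, one scan could write all of a database's entries without
materializing any list (sketch):

dshash_seq_status hstat;
PgStatHashEntry *p;

dshash_seq_init(&hstat, pgStatSharedHash, false);
while ((p = dshash_seq_next(&hstat)) != NULL)
{
    PgStatEnvelope *env;

    if (p->key.databaseid != dbentry->databaseid)
        continue;

    env = dsa_get_address(area, p->env);
    switch (p->key.type)
    {
        case PGSTAT_TYPE_TABLE:
            fputc('T', fpout);
            rc = fwrite(&env->body, sizeof(PgStat_StatTabEntry), 1, fpout);
            break;
        case PGSTAT_TYPE_FUNCTION:
            fputc('F', fpout);
            rc = fwrite(&env->body, sizeof(PgStat_StatFuncEntry), 1, fpout);
            break;
        default:
            break;              /* database entries are written separately */
    }
    (void) rc;                  /* we'll check for error with ferror */
}
dshash_seq_term(&hstat);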


> +/* ----------
> + * create_missing_dbentries() -
> + *
> + *  There may be the case where database entry is missing for the database
> + *  where object stats are recorded. This function creates such missing
> + *  dbentries so that so that all stats entries can be written out to files.
> + * ----------
> + */
> +static void
> +create_missing_dbentries(void)
> +{

In which situation is this necessary?


> +static PgStatEnvelope *
> +get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
> +               bool nowait, entry_initializer initfunc, bool *found)
> +{

> +    bool        create = (initfunc != NULL);
> +    PgStatHashEntry *shent;
> +    PgStatEnvelope *shenv = NULL;
> +    PgStatHashEntryKey key;
> +    bool        myfound;
> +
> +    Assert(type != PGSTAT_TYPE_ALL);
> +
> +    key.type = type;
> +    key.databaseid = dbid;
> +    key.objectid = objid;
> +    shent = dshash_find_extended(pgStatSharedHash, &key,
> +                                 create, nowait, create, &myfound);
> +    if (shent)
>      {
> -        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
> +        if (create && !myfound)
> +        {
> +            /* Create new stats envelope. */
> +            size_t        envsize = PgStatEnvelopeSize(pgstat_entsize[type]);
> +            dsa_pointer chunk = dsa_allocate0(area, envsize);

> +            /*
> +             * The lock on dshsh is released just after. Call initializer
> +             * callback before it is exposed to other process.
> +             */
> +            if (initfunc)
> +                initfunc(shenv);
> +
> +            /* Link the new entry from the hash entry. */
> +            shent->env = chunk;
> +        }
> +        else
> +            shenv = dsa_get_address(area, shent->env);
> +
> +        dshash_release_lock(pgStatSharedHash, shent);

Doesn't this mean that by this time the entry could already have been
removed by a concurrent backend, and the dsa allocation freed?


> Subject: [PATCH v36 7/7] Remove the GUC stats_temp_directory
>
> The GUC used to specify the directory to store temporary statistics
> files. It is no longer needed by the stats collector but still used by
> the programs in bin and contrib, and maybe other extensions. Thus this
> patch removes the GUC but some backing variables and macro definitions
> are left alone for backward compatibility.

I don't see what this achieves? Which use of those variables / macros
would be safe? I think it'd be better to just remove them.

Greetings,

Andres Freund


