
From: Renan Alves Fonseca
Subject: Prototype: Implement dead tuples xid histograms
Msg-id: 87sem86jep.fsf@gmail.com
List: pgsql-hackers

Hi hackers,

In a recent hacking workshop organized by Robert Haas, we discussed
[1]. Among the autovacuum issues exposed there, the useless vacuum case
caught my attention. I've started studying the relevant code and came up
with a prototype to improve the statistics system regarding dead tuples.

The attached patch only partially implements the dead tuple histogram
mentioned in [1]. But, since I'm a beginner, I thought it would be nice
to get early feedback just to make sure I'm not doing anything very
wrong.

My initial idea was to implement a growing histogram with a linked list
of bins, exploiting the fact that most dead tuples are added to the
last bin. Then I realized that there are no other cases of dynamic data
structures in pgstats, and such a structure would be harder to
serialize. That's why I chose to implement the histogram as a static,
fixed-size structure inside one of the pgstats data structures. It
requires a little more logic to maintain the histogram, but it is well
integrated into the whole pgstats architecture.

As discussed in the hacking workshop, one of the problems is capturing
the exact xmin of each dead tuple. In my tests, I've observed that,
outside of an explicit transaction, xmin corresponds to
GetCurrentTransactionId(). But inside a transaction, xmin can receive
incremental XIDs across successive DML statements. Capturing XIDs for
every statement inside a transaction seems overkill, so I decided to
attribute the highest xmin/XID of a transaction to all dead tuples of
that transaction.
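
For illustration (a sketch, not one of my test sessions): explicit
savepoints are one way to get distinct xmin values within a single
transaction, since every subtransaction gets its own, higher XID; this
is also why the highest child XID is a safe upper bound:

create table demo (id int);

begin;
insert into demo values (1);   -- xmin = top-level xid
savepoint s1;
insert into demo values (2);   -- xmin = first child xid
savepoint s2;
insert into demo values (3);   -- xmin = second child xid (the highest)
commit;

select xmin, id from demo;     -- three increasing xmin values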

In order to see the statistics for a table t1, we run:
select pg_stat_get_dead_tuples_xid_freqs ('t1'::regclass),
       pg_stat_get_dead_tuples_xid_bounds('t1'::regclass);

Then, to verify that the bounds make sense, I've used:
select xmin from t1;

In this version, the removal of dead tuples is not yet implemented, so
these histograms only grow.
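
For reference, here is the kind of end-to-end session I have in mind (a
sketch; the two functions come from the attached patch, and the exact
frequencies and bounds depend on the XIDs the session happens to get):

create table t1 (id int, v int);
insert into t1 select g, 0 from generate_series(1, 1000) g;

update t1 set v = 1 where id <= 400;   -- 400 dead tuples from one xid
delete from t1 where id > 900;         -- 100 dead tuples from a later xid

select pg_stat_get_dead_tuples_xid_freqs ('t1'::regclass),
       pg_stat_get_dead_tuples_xid_bounds('t1'::regclass);

-- since removal is not handled yet, the frequencies only accumulate; as
-- long as nothing has pruned or vacuumed these tuples, they should sum
-- to n_dead_tup:
select n_dead_tup from pg_stat_user_tables where relname = 't1';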

I would really appreciate any kind of feedback.

Best regards,
Renan Fonseca

[1] How Autovacuum Goes Wrong: And Can We Please Make It Stop Doing
That? (PGConf.dev 2024)
From 396eb5ae24fc37af055551c81552f656ca628f57 Mon Sep 17 00:00:00 2001
From: "Renan A. Fonseca" <renanfonseca@gmail.com>
Date: Wed, 16 Apr 2025 10:52:16 +0200
Subject: [PATCH] Implement dead tuples xid histograms

This is one possible solution to provide finer statistics regarding
dead tuples. It is conceived to answer the following question: If I
vacuum this table now, how many dead tuples will be removed?

We implement it by adding a fixed-size data structure to
PgStat_StatTabEntry representing a histogram with variable bounds. This
means that the histogram is persisted by the pgstats machinery (bumping
PGSTAT_FILE_FORMAT_ID), and the new SQL functions require a catalog
version update.

It is out of the scope of this patch to modify the autovacuum
algorithm to use this new information. We provide SQL functions to
retrieve the dead tuple xid histogram of a given table so that users
can experiment with this refined information.

Some details on the implementation can be found in comments.

TODO: update the histogram when removing dead tuples
---
 src/backend/access/transam/xact.c            |  18 +++
 src/backend/utils/activity/pgstat_relation.c | 147 ++++++++++++++++++-
 src/backend/utils/adt/pgstatfuncs.c          |  44 ++++++
 src/include/access/xact.h                    |   1 +
 src/include/catalog/pg_proc.dat              |   8 +
 src/include/pgstat.h                         |  29 +++-
 6 files changed, 242 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b885513f765..87c9e6d9617 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -460,6 +460,24 @@ GetCurrentTransactionId(void)
     return XidFromFullTransactionId(s->fullTransactionId);
 }
 
+
+/*
+ *    GetCurrentTransactionMaxChildId
+ *
+ * This returns the highest XID among the current transaction's child XIDs.
+ * It returns InvalidTransactionId if the current transaction has no children.
+ */
+TransactionId
+GetCurrentTransactionMaxChildId(void)
+{
+    TransactionState s = CurrentTransactionState;
+
+    if (s->nChildXids == 0)
+        return InvalidTransactionId;
+    else
+        return s->childXids[s->nChildXids - 1];
+}
+
 /*
  *    GetCurrentTransactionIdIfAny
  *
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index eeb2d43cb10..08ce3d4ca85 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -448,6 +448,14 @@ pgstat_count_truncate(Relation rel)
  * recovery of "delta" dead tuples; so delta_dead_tuples decreases
  * rather than increasing, and the change goes straight into the per-table
  * counter, not into transactional state.
+ *
+ * TODO: we need to take this into account in the dead_tuples_xid
+ * histogram. Actually, this is one of the reasons for using a richer
+ * representation for XIDs in the intermediate data structure
+ * (PgStat_TableStatus.counts). Since this is called quite often in
+ * heap_page_prune_opt, we should discard the idea of keeping a single
+ * (delta_dead_tuple, xid) pair; we would lose too much information, leading
+ * to a fuzzy, perhaps useless, histogram.
  */
 void
 pgstat_update_heap_dead_tuples(Relation rel, int delta)
@@ -460,6 +468,101 @@ pgstat_update_heap_dead_tuples(Relation rel, int delta)
     }
 }
 
+
+/*
+ * Auxiliary function to update a generic xid histogram. The histogram
+ * represented by freqs and bounds is updated in place with the values
+ * delta_freq and xid.
+ */
+void
+pgstat_update_xid_histogram(PgStat_Counter *freqs, TransactionId *bounds, PgStat_Counter delta_freq, TransactionId xid)
+{
+    int            bin;
+
+    /* find the bin that should contain xid (bounds[] holds the per-bin upper bounds) */
+    for (bin = DEAD_TUPLES_HIST_SIZE - 1; bin > 0; bin--)
+        if (xid > bounds[bin - 1])
+            break;
+
+    freqs[bin] += delta_freq;
+    /* if it is the highest bin, update the upper bound accordingly */
+    if (bin == DEAD_TUPLES_HIST_SIZE - 1 && xid > bounds[bin])
+        bounds[bin] = xid;
+}
+
+/*
+ * This function updates the dead_tuples_xid histogram of the respective
+ * table in the shared PgStat_StatTabEntry using information from the local
+ * backend stats.  It is executed in the pgstat_report_stat flow.  Since it
+ * is not in a critical path, we allow ourselves to balance the histogram on
+ * every execution.
+ */
+void
+pgstat_update_dead_tuples_xid(PgStat_StatTabEntry *tabentry, const PgStat_TableStatus *lstats)
+{
+    PgStat_Counter *freqs = tabentry->dead_tuples_xid_freq;
+    TransactionId *bounds = tabentry->dead_tuples_xid_bounds;
+    int            bin;
+
+    for (bin = DEAD_TUPLES_HIST_SIZE - 1; bin >= 0; bin--)
+    {
+        PgStat_Counter delta_freq = lstats->counts.dead_tuples_xid_freqs[bin];
+        TransactionId xid = lstats->counts.dead_tuples_xid_bounds[bin];
+
+        if (xid == InvalidTransactionId && delta_freq == 0)
+            /* end of histogram */
+            break;
+        pgstat_update_xid_histogram(freqs, bounds, delta_freq, xid);
+    }
+
+    pgstat_balance_dead_tuples_xid(freqs, bounds);
+
+}
+
+/*
+ * If conditions are satisfied, merge two adjacent *closed* bins and shift
+ * data to create a new *open* bin. Then, previous *open* bin becomes a
+ * *closed* one. We perform the balance operation when the freqs of the *open*
+ * bin is higher than the combined freqs of a pair of adjacent *closed*
+ * bins. The merged bin has the sum of freqs and the higher upper bound of the
+ * pair. If a histogram is not balanced often enough, the open bin will have a
+ * larger width. In this case, while we do loose granularity, the semantics of
+ * the upper bounds is not affected. If the histogram is never balanced, it
+ * behaves like a pair of (delta_dead_tuples,xid), keeping the sum of
+ * delta_dead_tuples and the max xid.
+ */
+void
+pgstat_balance_dead_tuples_xid(PgStat_Counter *freqs, TransactionId *bounds)
+{
+    PgStat_Counter open_bin_freq = freqs[DEAD_TUPLES_HIST_SIZE - 1];
+    PgStat_Counter combined_bins_freq;
+    int            bin;
+
+    /*
+     * From higher to lower, find the first pair whose combined freqs are
+     * lower than those of the highest bin (the open bin).
+     *
+     * For a more balanced histogram, we should search all the way down to
+     * find the smallest pair, but we stop at the first match to speed up.
+     */
+    for (bin = DEAD_TUPLES_HIST_SIZE - 2; bin > 0; bin--)
+    {
+        combined_bins_freq = freqs[bin] + freqs[bin - 1];
+        if (open_bin_freq > combined_bins_freq)
+            break;
+    }
+    if (bin > 0)
+    {
+        /* merge the chosen pair and shift the higher bins down */
+        freqs[bin - 1] = combined_bins_freq;
+        bounds[bin - 1] = bounds[bin];
+        for (; bin < DEAD_TUPLES_HIST_SIZE - 1; bin++)
+        {
+            freqs[bin] = freqs[bin + 1];
+            bounds[bin] = bounds[bin + 1];
+        }
+        freqs[bin] = 0;
+    }
+}
+
+
 /*
  * Support function for the SQL-callable pgstat* functions. Returns
  * the collected statistics for one table or NULL. NULL doesn't mean
@@ -555,6 +658,7 @@ AtEOXact_PgStat_Relations(PgStat_SubXactStatus *xact_state, bool isCommit)
     for (trans = xact_state->first; trans != NULL; trans = trans->next)
     {
         PgStat_TableStatus *tabstat;
+        PgStat_Counter dead_tuples;
 
         Assert(trans->nest_level == 1);
         Assert(trans->upper == NULL);
@@ -580,8 +684,9 @@ AtEOXact_PgStat_Relations(PgStat_SubXactStatus *xact_state, bool isCommit)
             tabstat->counts.delta_live_tuples +=
                 trans->tuples_inserted - trans->tuples_deleted;
             /* update and delete each create a dead tuple */
-            tabstat->counts.delta_dead_tuples +=
-                trans->tuples_updated + trans->tuples_deleted;
+            dead_tuples = trans->tuples_updated + trans->tuples_deleted;
+            tabstat->counts.delta_dead_tuples += dead_tuples;
+
             /* insert, update, delete each count as one change event */
             tabstat->counts.changed_tuples +=
                 trans->tuples_inserted + trans->tuples_updated +
@@ -590,10 +695,41 @@ AtEOXact_PgStat_Relations(PgStat_SubXactStatus *xact_state, bool isCommit)
         else
         {
             /* inserted tuples are dead, deleted tuples are unaffected */
-            tabstat->counts.delta_dead_tuples +=
-                trans->tuples_inserted + trans->tuples_updated;
+            dead_tuples = trans->tuples_inserted + trans->tuples_updated;
+            tabstat->counts.delta_dead_tuples += dead_tuples;
+
             /* an aborted xact generates no changed_tuple events */
         }
+
+        /*
+         * Use the local dead_tuples instead of tabstat->counts.delta_dead_tuples
+         * to avoid noise from vacuum-related updates.
+         */
+        if (dead_tuples > 0)
+        {
+            /*
+             * Inside a transaction, dead tuples can have different xmin
+             * values, one for every command.  We do not support that much
+             * granularity; instead we use a single XID per transaction.
+             * However, if we used the transaction's own XID, the histogram
+             * would lose consistency because the xmin of dead tuples could
+             * be higher than their bin's upper bound.  Instead, we use the
+             * maximum child XID of the transaction, which is a proper upper
+             * bound on the xmin values of its dead tuples.
+             */
+
+            TransactionId xid = GetCurrentTransactionMaxChildId();
+            if (xid == InvalidTransactionId)
+                xid = GetCurrentTransactionIdIfAny();
+
+            pgstat_update_xid_histogram(tabstat->counts.dead_tuples_xid_freqs,
+                                        tabstat->counts.dead_tuples_xid_bounds,
+                                        dead_tuples,
+                                        xid);
+
+            /* can we avoid balancing the histogram once per transaction? */
+            pgstat_balance_dead_tuples_xid(tabstat->counts.dead_tuples_xid_freqs,
+                                           tabstat->counts.dead_tuples_xid_bounds);
+        }
+
         tabstat->trans = NULL;
     }
 }
@@ -863,10 +999,13 @@ pgstat_relation_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
         tabentry->live_tuples = 0;
         tabentry->dead_tuples = 0;
         tabentry->ins_since_vacuum = 0;
+        MemSet(tabentry->dead_tuples_xid_freq, 0, DEAD_TUPLES_HIST_SIZE * sizeof(PgStat_Counter));
+        MemSet(tabentry->dead_tuples_xid_bounds, 0, DEAD_TUPLES_HIST_SIZE * sizeof(TransactionId));
     }
 
     tabentry->live_tuples += lstats->counts.delta_live_tuples;
     tabentry->dead_tuples += lstats->counts.delta_dead_tuples;
+    pgstat_update_dead_tuples_xid(tabentry, lstats);
     tabentry->mod_since_analyze += lstats->counts.changed_tuples;
 
     /*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 97af7c6554f..f688085e273 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -29,6 +29,7 @@
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "utils/acl.h"
+#include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/timestamp.h"
 
@@ -70,6 +71,49 @@ PG_STAT_GET_RELENTRY_INT64(blocks_hit)
 /* pg_stat_get_dead_tuples */
 PG_STAT_GET_RELENTRY_INT64(dead_tuples)
 
+/* pg_stat_get_dead_tuples_xid_freqs */
+Datum
+pg_stat_get_dead_tuples_xid_freqs(PG_FUNCTION_ARGS)
+{
+    Oid            relid = PG_GETARG_OID(0);
+    Datum       *result;
+    PgStat_StatTabEntry *tabentry;
+    int            i;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+    if (tabentry == NULL)
+        /* OID not found */
+        PG_RETURN_ARRAYTYPE_P(construct_empty_array(INT8OID));
+
+    result = palloc(DEAD_TUPLES_HIST_SIZE * sizeof(Datum));
+    for (i = 0; i < DEAD_TUPLES_HIST_SIZE; i++)
+        result[i] = Int64GetDatum(tabentry->dead_tuples_xid_freq[i]);
+
+    PG_RETURN_ARRAYTYPE_P(construct_array_builtin(result, DEAD_TUPLES_HIST_SIZE, INT8OID));
+}
+
+/* pg_stat_get_dead_tuples_xid_bounds */
+Datum
+pg_stat_get_dead_tuples_xid_bounds(PG_FUNCTION_ARGS)
+{
+    Oid            relid = PG_GETARG_OID(0);
+    Datum       *result;
+    PgStat_StatTabEntry *tabentry;
+    int            i;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+    if (tabentry == NULL)
+        /* OID not found */
+        PG_RETURN_ARRAYTYPE_P(construct_empty_array(XIDOID));
+
+    result = palloc(DEAD_TUPLES_HIST_SIZE * sizeof(Datum));
+    for (i = 0; i < DEAD_TUPLES_HIST_SIZE; i++)
+        result[i] = TransactionIdGetDatum(tabentry->dead_tuples_xid_bounds[i]);
+
+    PG_RETURN_ARRAYTYPE_P(construct_array_builtin(result, DEAD_TUPLES_HIST_SIZE, XIDOID));
+}
+
+
 /* pg_stat_get_ins_since_vacuum */
 PG_STAT_GET_RELENTRY_INT64(ins_since_vacuum)
 
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index b2bc10ee041..1272475d3f6 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -442,6 +442,7 @@ extern TransactionId GetTopTransactionId(void);
 extern TransactionId GetTopTransactionIdIfAny(void);
 extern TransactionId GetCurrentTransactionId(void);
 extern TransactionId GetCurrentTransactionIdIfAny(void);
+extern TransactionId GetCurrentTransactionMaxChildId(void);
 extern TransactionId GetStableLatestTransactionId(void);
 extern SubTransactionId GetCurrentSubTransactionId(void);
 extern FullTransactionId GetTopFullTransactionId(void);
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 62beb71da28..be065248f98 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5561,6 +5561,14 @@
   proname => 'pg_stat_get_dead_tuples', provolatile => 's', proparallel => 'r',
   prorettype => 'int8', proargtypes => 'oid',
   prosrc => 'pg_stat_get_dead_tuples' },
+{ oid => '8410', descr => 'statistics: histogram of dead tuples xid',
+  proname => 'pg_stat_get_dead_tuples_xid_freqs', provolatile => 's', proparallel => 'r',
+  prorettype => 'anyarray', proargtypes => 'oid',
+  prosrc => 'pg_stat_get_dead_tuples_xid_freqs' },
+{ oid => '8411', descr => 'statistics: histogram of dead tuples xid',
+  proname => 'pg_stat_get_dead_tuples_xid_bounds', provolatile => 's', proparallel => 'r',
+  prorettype => 'anyarray', proargtypes => 'oid',
+  prosrc => 'pg_stat_get_dead_tuples_xid_bounds' },
 { oid => '3177',
   descr => 'statistics: number of tuples changed since last analyze',
   proname => 'pg_stat_get_mod_since_analyze', provolatile => 's',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 378f2f2c2ba..f20007fa484 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -111,6 +111,8 @@ typedef struct PgStat_BackendSubEntry
     PgStat_Counter conflict_count[CONFLICT_NUM_TYPES];
 } PgStat_BackendSubEntry;
 
+#define DEAD_TUPLES_HIST_SIZE 5
+
 /* ----------
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
@@ -131,6 +133,23 @@ typedef struct PgStat_BackendSubEntry
  * actions, regardless of whether the transaction committed.  delta_live_tuples,
  * delta_dead_tuples, and changed_tuples are set depending on commit or abort.
  * Note that delta_live_tuples and delta_dead_tuples can be negative!
+ *
+ * dead_tuples_xid_freqs and dead_tuples_xid_bounds together represent a
+ * histogram of the xmin values of a backend's dead tuples.
+ * dead_tuples_xid_bounds contains the upper bound of each bin.  On
+ * flush_pending_cb, this histogram is merged into a similar structure in
+ * PgStat_StatTabEntry.  While we lose some information in this merge, because
+ * the bounds are not the same, we can still guarantee the semantics of an
+ * upper bound.  In other words, if in the final structure the accumulated
+ * freqs up to a given upper bound amount to 200 tuples, the actual number of
+ * dead tuples is *at least* 200, and this behavior is only due to the merging
+ * process.  Of course, there can be other sources of mismatch, but this is
+ * what concerns us here.  In practice, if flush_pending_cb runs often enough
+ * that there are no more than DEAD_TUPLES_HIST_SIZE transactions in the
+ * current PgStat_TableCounts, the results will be accurate and without
+ * information loss.  Finally, we could consider a much simpler structure at
+ * this point, to the detriment of granularity.
+ *
  * ----------
  */
 typedef struct PgStat_TableCounts
@@ -151,6 +170,9 @@ typedef struct PgStat_TableCounts
     PgStat_Counter delta_dead_tuples;
     PgStat_Counter changed_tuples;
 
+    PgStat_Counter dead_tuples_xid_freqs[DEAD_TUPLES_HIST_SIZE];
+    TransactionId dead_tuples_xid_bounds[DEAD_TUPLES_HIST_SIZE];
+
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 } PgStat_TableCounts;
@@ -211,7 +233,7 @@ typedef struct PgStat_TableXactStatus
  * ------------------------------------------------------------
  */
 
-#define PGSTAT_FILE_FORMAT_ID    0x01A5BCB7
+#define PGSTAT_FILE_FORMAT_ID    0x01A5BCB8
 
 typedef struct PgStat_ArchiverStats
 {
@@ -437,6 +459,9 @@ typedef struct PgStat_StatTabEntry
     PgStat_Counter mod_since_analyze;
     PgStat_Counter ins_since_vacuum;
 
+    PgStat_Counter dead_tuples_xid_freq[DEAD_TUPLES_HIST_SIZE];
+    TransactionId dead_tuples_xid_bounds[DEAD_TUPLES_HIST_SIZE];
+
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 
@@ -717,6 +742,8 @@ extern void pgstat_count_heap_update(Relation rel, bool hot, bool newpage);
 extern void pgstat_count_heap_delete(Relation rel);
 extern void pgstat_count_truncate(Relation rel);
 extern void pgstat_update_heap_dead_tuples(Relation rel, int delta);
+extern void pgstat_update_dead_tuples_xid(PgStat_StatTabEntry *tabentry, const PgStat_TableStatus *lstats);
+extern void pgstat_balance_dead_tuples_xid(PgStat_Counter *freqs, TransactionId *bounds);
 
 extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
                                        void *recdata, uint32 len);
-- 
2.47.0

