Thread: Relstats after VACUUM FULL and CLUSTER
Hi all,
I noticed a potential issue with the heap cluster code used by VACUUM FULL and CLUSTER, but since I am not sure, I thought I'd post the question to the list.
The code in question counts the number of tuples it processes and uses that count to update reltuples in pg_class. However, the count sometimes seems off because it includes recently dead tuples (from updates and deletes). AFAICT, the wrong reltuples value sticks only on tables that don't have indexes, because the cluster code later also rebuilds indexes, which updates reltuples to the "correct" value.
Does this seem like a bug or is it intentional?
Best regards,
Erik
--
Database Architect, Timescale
> Does this seem like a bug or is it intentional?

pg_class.reltuples/relpages are only an estimate, as per the documentation.
However, I cannot reproduce the situation you are describing on HEAD.

In the example below, I create a table without indexes, insert and delete
some rows, run VACUUM to update pg_class.reltuples, and then run another
delete to generate some more "recent" dead tuples. The result shows
pg_class.reltuples with the expected value, but maybe I did not reproduce
it the same way you did?

(I am surprised that n_live_tup and n_dead_tup are off, and also that
VACUUM FULL does not appear to update the stats in pg_stat_all_tables.)

```
postgres=# drop table if exists t;
create table t ( id int );
alter table t set (autovacuum_enabled = off);
insert into t select n from generate_series(1, 1000000) as n;
delete from t where id between 1 and 5000;
vacuum t;
delete from t where id between 5001 and 10000;
select reltuples::int from pg_class where relname = 't';
-- might take a bit of time for n_dead_tup to be set
select n_dead_tup, n_live_tup from pg_stat_all_tables where relname = 't';
DROP TABLE
CREATE TABLE
ALTER TABLE
INSERT 0 1000000
DELETE 5000
VACUUM
DELETE 5000
 reltuples
-----------
    995000
(1 row)

 n_dead_tup | n_live_tup
------------+------------
      10000 |     985000
(1 row)

postgres=# VACUUM (verbose, full) t;
INFO:  vacuuming "public.t"
INFO:  "public.t": found 5000 removable, 990000 nonremovable row versions in 4425 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU: user: 0.79 s, system: 0.02 s, elapsed: 0.86 s.
VACUUM
postgres=# select reltuples::int from pg_class where relname = 't';
 reltuples
-----------
    990000
(1 row)

postgres=# select n_dead_tup, n_live_tup from pg_stat_all_tables where relname = 't';
 n_dead_tup | n_live_tup
------------+------------
      10000 |     985000
(1 row)
```

--
Sami Imseih
Amazon Web Services (AWS)
Hi Sami,
You need a concurrent transaction to recreate the situation. I am attaching an isolation test that shows the behavior, along with its output file. I ran it on PostgreSQL 17.4.
The test has two permutations: the first runs on a table without an index, the second with an index added. In the output file you can see that VACUUM FULL on the index-less table produces a reltuples count that includes all updated tuples plus the old/garbage versions; in other words, it counts every tuple visible to any currently ongoing transaction. If the table has an index, the behavior is different because the reindex that happens as the last step of VACUUM FULL overwrites the first reltuples count with the "correct" number (as visible to the transaction snapshot).
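For readers following along without the attachment, a minimal isolation spec along these lines (a sketch with illustrative names and sizes, not the attached file itself) could look like:

```
setup
{
  CREATE TABLE t (id int);
  INSERT INTO t SELECT generate_series(1, 100);
}

teardown
{
  DROP TABLE t;
}

session s1
step s1_begin  { BEGIN ISOLATION LEVEL REPEATABLE READ; }
step s1_peek   { SELECT count(*) FROM t; }
step s1_commit { COMMIT; }

session s2
step s2_index     { CREATE INDEX t_idx ON t (id); }
step s2_update    { UPDATE t SET id = id + 1; }
step s2_vacfull   { VACUUM FULL t; }
step s2_reltuples { SELECT reltuples::int FROM pg_class WHERE relname = 't'; }

# Without an index: s1's open snapshot keeps the old versions recently
# dead, and reltuples ends up counting old and new versions.
permutation s1_begin s1_peek s2_update s2_vacfull s2_reltuples s1_commit
# With an index: the reindex at the end of VACUUM FULL overwrites
# reltuples with the correct count.
permutation s1_begin s1_peek s2_index s2_update s2_vacfull s2_reltuples s1_commit
```

The key ingredient is the open repeatable-read transaction in s1, which holds back the horizon so the updated rows' old versions are HEAPTUPLE_RECENTLY_DEAD when VACUUM FULL runs.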
Best,
Erik
On Thu, May 22, 2025 at 5:04 PM Sami Imseih <samimseih@gmail.com> wrote:
> pg_class.reltuples/relpages are only an estimate as per documentation.
> However, I cannot reproduce the situation you are talking about on HEAD.
> [...]
> You need a concurrent transaction to recreate the situation. I am
> attaching an isolation test to show the behavior

Thanks! That helps. Indeed, heapam_relation_copy_for_cluster and
heapam_index_build_range_scan count HEAPTUPLE_RECENTLY_DEAD tuples
(tuples that are dead but cannot be removed yet) differently. In
heapam_relation_copy_for_cluster they are considered live, and in
heapam_index_build_range_scan they are considered dead.

In heapam_relation_copy_for_cluster:

```
case HEAPTUPLE_RECENTLY_DEAD:
    *tups_recently_dead += 1;
    /* fall through */
case HEAPTUPLE_LIVE:
    /* Live or recently dead, must copy it */
    isdead = false;
    break;
```

In both cases, the recently dead tuples must be copied to the table or
index, but they should not be counted towards reltuples. So, I think we
need to fix this in heapam_relation_copy_for_cluster, probably by
subtracting tups_recently_dead from num_tuples (which is the value set
in pg_class.reltuples) after we process all the tuples. That looks like
the best fix to me.

--
Sami Imseih
Amazon Web Services (AWS)
Hi Sami,
Your patch should correct the problem. However, given that this function is part of the table AM API, I wonder whether the fix shouldn't live outside heap's copy_for_cluster implementation. It depends on the semantics of num_tuples, but the cluster code seems to interpret num_tuples as the number of nonremovable tuples. If you subtract the recently dead tuples from that number within the heap implementation, it will no longer reflect the nonremovable tuples, and the log message in the cluster function, "found %.0f removable, %.0f nonremovable row versions", will no longer be correct.
Surprisingly, tableam.h does not document the num_tuples parameter of table_relation_copy_for_cluster(), although all the other output parameters are documented. So it is not clear what the intended semantics are. Maybe other hackers on the list have opinions on how num_tuples should be interpreted?
In any case, assuming num_tuples is supposed to return the number of nonremovable tuples, the fix should be to subtract the recently dead tuples when updating pg_class.reltuples. Other table AMs would then need to treat num_tuples as nonremovable tuples as well, and set tups_recently_dead if applicable.
I am attaching a patch with these changes, which also includes the isolation test.
Regards,
Erik
On Thu, May 22, 2025 at 10:42 PM Sami Imseih <samimseih@gmail.com> wrote:
> In both cases, the recently dead tuples must be copied to the table or index, but
> they should not be counted towards reltuples. So, I think we need to fix this in
> heapam_relation_copy_for_cluster by probably subtracting
> tups_recently_dead from num_tuples ( which is the value set in
> pg_class.reltuples )
> after we process all the tuples, which looks like the best fix to me.
something like the attached.
--
Sami Imseih
Amazon Web Services (AWS)