Estimating bloat for very large tables: what is the state of art? - Mailing list pgsql-hackers

From Dmitry Astapov
Subject Estimating bloat for very large tables: what is the state of art?
Date
Msg-id CAFQUnFih8i_Rb30hRyWGWJW0BCQdt9uYXccMSJ2KYzSwJMWyUw@mail.gmail.com
List pgsql-hackers
Hi!

I am trying to solve the problem of estimating the table bloat (and index bloat, though I am mostly focusing on tables at the moment).

After searching far and wide, it seems that the choice is to be made between two methods:
1. Slow, but very precise pgstattuple
2. Fast, but somewhat imprecise "bloat query", attributed to the check_postgres project, though numerous variations of it exist.

pgstattuple is beautiful and accurate, but rather slow. On large tables, pgstattuple_approx can easily take 5-10 minutes, and when that is the case, pgstattuple itself can easily take 30-60 minutes on the same table.
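(For reference, this is how the two are invoked; both come from the pgstattuple contrib extension, and 'mytable' is a placeholder:)

  CREATE EXTENSION IF NOT EXISTS pgstattuple;
  -- Exact figures, but has to read every page of the table:
  SELECT * FROM pgstattuple('mytable');
  -- Approximation that skips all-visible pages via the visibility map:
  SELECT * FROM pgstattuple_approx('mytable');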

"Bloat query", on the other hand, is wonderfully fast, but rather imprecise. It tries to estimate the table data size as pg_class.reltuples * row_width, where row_width is taken, roughly, to be (24 bytes for the header + size of NULL map + (sum( (1 - null_frac)*avg_width ) for all columns in the table, as reported by pg_statistics)).

This, of course, completely ignores alignment padding, and so on tables with a large number of columns the query tends to underestimate the size of live data by some 10-20% (unless the schema was explicitly created to minimize padding).
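(A simplified sketch of that estimate, for a hypothetical table 'mytable' in schema 'public'; the real query also handles header alignment details that this sketch glosses over:)

  -- Estimated average row width: 24-byte tuple header, one bit per
  -- column (as seen in pg_stats) for the NULL bitmap, plus the
  -- per-column widths from the statistics.
  SELECT 24
       + ceil(count(*) / 8.0)
       + sum((1 - null_frac) * avg_width) AS est_row_width
  FROM pg_stats
  WHERE schemaname = 'public'
    AND tablename  = 'mytable';

Multiplying est_row_width by pg_class.reltuples gives the estimated live data size, and the difference from pg_relation_size() is taken as bloat.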

I'd like to ask you:
1. Are these two approaches indeed the only options on the table, or am I missing something?

2. I am considering my own approach. After looking at pg_attribute and pg_stats, I construct "an example row from this table with no NULLs" (so, maximum amount of data + maximum amount of padding) and "an example row from the table with all NULLs" (so, as little padding as possible), call pg_column_size() on both rows (so that pg_column_size computes size + padding for me), and then take an average between them, perhaps weighted somehow by the null_frac of the table's columns; a sketch follows below. Quick experiments show that this yields a more accurate estimate of row size for tables with large numbers of columns than what the "bloat query" does. Question: can I do anything better/easier here without sacrificing speed?
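(A rough sketch of the idea, for a hypothetical table t(a int, b text, c timestamptz); in the real version each non-NULL value would be sized to match pg_stats.avg_width for its column:)

  -- "No NULLs" row: maximum data and maximum alignment padding.
  SELECT pg_column_size(ROW(0, repeat('x', 10), now())) AS width_no_nulls;
  -- "All NULLs" row: no data and as little padding as possible.
  SELECT pg_column_size(ROW(NULL::int, NULL::text, NULL::timestamptz)) AS width_all_nulls;

As far as I can tell, pg_column_size() on a ROW(...) includes the tuple header and NULL bitmap, so both numbers should be directly comparable to on-disk tuple sizes.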

--
D. Astapov
