I am trying to solve the problem of estimating table bloat (and index bloat, though I am mostly focusing on tables at the moment).
After searching far and wide, it seems that the choice comes down to two methods:
1. Slow, but very precise pgstattuple
pgstattuple is beautiful and accurate, but rather slow. If tables are large, pgstattuple_approx can easily take 5-10 minutes, and in that case you can expect pgstattuple itself to take 30-60 minutes on the same table.
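For reference, this is roughly how I am running both of them ('my_table' here is just a placeholder name):

  -- pgstattuple ships as a contrib extension
  CREATE EXTENSION IF NOT EXISTS pgstattuple;

  -- exact figures, but does a full scan of the table
  SELECT * FROM pgstattuple('my_table');

  -- approximate figures; uses the visibility map to skip pages, much faster
  SELECT * FROM pgstattuple_approx('my_table');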
"Bloat query", on the other hand, is wonderfully fast, but rather imprecise. It tries to estimate the table data size as pg_class.reltuples * row_width, where row_width is taken, roughly, to be (24 bytes for the header + size of NULL map + (sum( (1 - null_frac)*avg_width ) for all columns in the table, as reported by pg_statistics)).
This, of course, completely ignores alignment padding, and so on tables with a large number of columns the query tends to underestimate the size of live data by some 10-20% (unless the schema was explicitly designed to minimize padding).
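Just to make the above concrete, here is a rough sketch of that kind of estimate (the schema and table names are placeholders, and the NULL-bitmap term is only a worst-case guess):

  -- "bloat query"-style estimate of live data size for a single table;
  -- ignores alignment padding, TOAST and dropped columns
  SELECT c.reltuples
         * ( 24                                -- tuple header
             + ceil(c.relnatts / 8.0)          -- NULL bitmap, worst case
             + s.est_width )                   -- weighted sum of column widths
         AS estimated_live_bytes
  FROM pg_class c
  JOIN LATERAL (
      SELECT sum((1 - null_frac) * avg_width) AS est_width
      FROM pg_stats
      WHERE schemaname = 'public'
        AND tablename  = c.relname
  ) s ON true
  WHERE c.relname = 'my_table';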
I'd like to ask you:
1. Are these two approaches indeed the only options on the table, or am I missing something?
2. I am considering my own approach: after looking at pg_attribute and pg_stats, I construct "an example row from this table with no NULLs" (so, maximum amount of data + maximum amount of padding) and "an example row from the table with all NULLs" (so, as little padding as possible), run pg_column_size() on both rows (so that pg_column_size computes size + padding for me), and then take an average between the two, perhaps weighted somehow by the null_frac of the table's columns. Quick experiments show that this yields a more accurate estimate of row size for tables with large numbers of columns than what the "bloat query" does. Question: can I do anything better/easier here without sacrificing speed? (A sketch of what I mean follows below.)
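To illustrate, for a hypothetical table my_table(id int, created timestamptz, note text), where pg_stats says note has avg_width of about 20, the two measurements look like this (the composite header that pg_column_size reports is not byte-for-byte identical to the on-disk tuple header, but it is close enough for an estimate):

  -- "no NULLs" row: every column filled with a value of roughly avg_width
  SELECT pg_column_size(ROW(0::int,
                            now()::timestamptz,
                            repeat('x', 20)::text))  AS width_no_nulls,
  -- "all NULLs" row: only the header and the NULL bitmap are left
         pg_column_size(ROW(NULL::int,
                            NULL::timestamptz,
                            NULL::text))             AS width_all_nulls;

The per-row estimate would then be some null_frac-weighted mix of width_no_nulls and width_all_nulls, multiplied by pg_class.reltuples.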
--