Thread: index bloat estimation
Dear Colleagues,

What queries do you use to estimate index and table bloat?

I've researched some on the Net and found multiple scripts mentioned in
https://wiki.postgresql.org/wiki/Index_Maintenance#Index_Bloat, also
in https://github.com/pgexperts/pgx_scripts etc.

Most of the stuff I've looked at is pretty old, and much of it seems
unsupported. What is the current best practice?

I'd be grateful if you could share your personal favourite ways of
estimating bloat.

--
Victor Sudakov, VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/
Hi,
Le ven. 12 févr. 2021 à 09:26, Victor Sudakov <vas@sibptus.ru> a écrit :
Dear Colleagues,
What queries do you use to estimate index and table bloat?
I've researched some on the Net and found multiple scripts mentioned in
https://wiki.postgresql.org/wiki/Index_Maintenance#Index_Bloat, also
in https://github.com/pgexperts/pgx_scripts etc.
Most of the stuff I've looked at is pretty old, much seems unsupported.
What is the current best practice?
I'd be grateful if you could share your personal favourite ways of
estimating bloat.
I use https://github.com/ioguix/pgsql-bloat-estimation . It's pretty good at estimating bloat on tables and Btree indexes (though deduplication in v13 makes it harder).
check_pgactivity uses the queries from this repository.
--
Guillaume.
On Fri, Feb 12, 2021 at 3:26 AM Victor Sudakov <vas@sibptus.ru> wrote:
Dear Colleagues,
What queries do you use to estimate index and table bloat?
I've researched some on the Net and found multiple scripts mentioned in
https://wiki.postgresql.org/wiki/Index_Maintenance#Index_Bloat, also
in https://github.com/pgexperts/pgx_scripts etc.
Most of the stuff I've looked at is pretty old, much seems unsupported.
What is the current best practice?
I'd be grateful if you could share your personal favourite ways of
estimating bloat.
--
Victor Sudakov, VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/
Why estimate when you can get the exact amount? At least for b-tree indexes anyway.
https://github.com/keithf4/pg_bloat_check

This script uses the pgstattuple extension to get both table and b-tree index bloat information. Since it actually scans the table, it can take longer than queries that estimate bloat from statistics, but it gives you very accurate information. You can also use pgstattuple directly without this script, but then you have to run it individually on the table and on each of its indexes. The script scans the table and all its indexes in one step and gives you a full summary.
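(For anyone who hasn't used the extension directly: a minimal sketch of what that looks like, with placeholder table and index names.)

```sql
-- One-time setup, requires appropriate privileges
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Exact table-level figures: dead_tuple_percent and free_percent
-- indicate how much of the table is reclaimable space
SELECT * FROM pgstattuple('my_table');

-- Exact b-tree index figures: a low avg_leaf_density on a large
-- index suggests bloat
SELECT * FROM pgstatindex('my_table_pkey');
```

Both functions scan the relation in full, which is why the script above can take a while on big tables.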
On Fri, 12 Feb 2021, Victor Sudakov wrote:

> Most of the stuff I've looked at is pretty old, much seems unsupported.
> What is the current best practice?

None of the estimate queries ever really worked well. They just gave a
bit more information than zero in the days before PG's internal functions
became really reliable and useful for bloat measurement, something that
happened during version 9.5 development.

For any modern PG, it's worth the trouble to learn how to use pgstattuple
directly for this job:
https://www.postgresql.org/docs/current/pgstattuple.html
You run and interpret the output of pgstattuple(relation) and its faster
estimating version, pgstattuple_approx. It's not hard to run some
simulations with deleted rows to see what bloat looks like as it builds
up. There's still some need for manual estimates if you want to account
for fillfactor in all cases, but I see that as a niche topic, not where
people should start.

There are also wrapper scripts built on top of pgstattuple around, like
the already mentioned https://github.com/keithf4/pg_bloat_check
A good bit of the work done in that script is around handling multiple
versions of PG and building some long-term idea of the bloat state of all
tables. Workloads that have a bloat problem are sometimes fixed, but in a
lot of cases the best you can do is monitor them and rebuild things when
it gets bad. That's one context Keith's packaging of this feature aims at.

--
Greg Smith  greg.smith@crunchydata.com
Director of Open Source Strategy, Crunchy Data
https://www.crunchydata.com/
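(A sketch of the kind of deleted-rows simulation Greg describes; the table name and row counts are arbitrary.)

```sql
-- Create a throwaway table, delete half the rows, and watch
-- dead_tuple_percent rise before VACUUM reclaims the space.
CREATE TABLE bloat_demo AS
    SELECT g AS id, repeat('x', 100) AS payload
    FROM generate_series(1, 100000) g;

DELETE FROM bloat_demo WHERE id % 2 = 0;

-- Exact figures: dead_tuple_percent should now be roughly 50%
SELECT dead_tuple_percent, free_percent
FROM pgstattuple('bloat_demo');

-- The faster estimating version of the same check
SELECT dead_tuple_percent, approx_free_percent
FROM pgstattuple_approx('bloat_demo');
```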
Keith Fiske wrote:
> >
> > What queries do you use to estimate index and table bloat?

[dd]

> Why estimate when you can get the exact amount? At least for b-tree indexes
> anyway.
>
> https://github.com/keithf4/pg_bloat_check

Hello Keith,

Thanks for this script and for reminding me about pgstattuple.

The script has the drawback of requiring r/w access to the database, so
it cannot be run on a replica. But the idea is excellent. At least it
can show the top N bloated relations, which I can then explore further
with pgstattuple(), pgstatindex() etc.

--
Victor Sudakov, VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/
On Sun, Feb 14, 2021 at 11:43 PM Victor Sudakov <vas@sibptus.ru> wrote:
Keith Fiske wrote:
> >
> > What queries do you use to estimate index and table bloat?
[dd]
> Why estimate when you can get the exact amount? At least for b-tree indexes
> anyway.
>
> https://github.com/keithf4/pg_bloat_check
Hello Keith,
Thanks for this script and for reminding about pgstattuple.
The script has the drawback of requiring r/w access to the database, so
it cannot be run on a replica. But the idea is excellent. At least it
can show the top N bloated relations, and then I can further explore with
pgstattuple(), pgstatindex() etc.
--
Victor Sudakov, VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/
I could possibly see about letting this actually run against the replica; however, this can be a rather long-running transaction depending on the size of the tables involved. You can set "--commit_rate" to mitigate some of that, but if you have really large tables, it can still run quite long. So this sort of check is really best run against the primary, to avoid the issues that come with allowing long-running queries on the replica (delayed replication or, even worse, bloat buildup).
However, once you get bloat under control, you likely won't need to run this often, especially against the entire database. If a few tables turn out to be problematic, you can set up schedules just for them, and otherwise run a scan of the entire database at most once a month or so during off-peak hours.
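(A crontab sketch of that schedule. Only the options quoted elsewhere in this thread are used; connection settings and table filters are deliberately left out, so check the script's --help for those.)

```
# Monthly full-database bloat scan, off-peak (03:00 on the 1st).
# --recovery_mode_norun makes it a no-op if this node is a replica,
# so the same crontab is safe to install on every node of the cluster.
0 3 1 * *  /usr/local/bin/pg_bloat_check.py --recovery_mode_norun
```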
Keith Fiske wrote:

[dd]

> You cannot run the actual scan on the replica, no. But it can be set to run
> against the replica and just report the statistics so you can have a cron
> set up to always run on the given system in case of failover. The
> "--recovery_mode_norun" can be set so it will only run if the target system
> is actually a primary.
>
> I could possibly see about letting this actually run against the replica,
> however this can be a rather long running transaction depending on the size
> of the tables involved. You can set the "--commit_rate" to avoid some of
> that, but if you have really large tables, it can still run quite long. So
> this sort of check is really best run against the primary to avoid issues
> around having to allow long running queries on the replica (delayed
> replication or even worse bloat buildup).

If you have a dedicated replica for OLAP, even one running from a WAL
archive (not from a replication slot), this is not an issue. So running
pg_bloat_check against a replica would be very useful for some of us.

--
Victor Sudakov, VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/
On Tue, Feb 16, 2021 at 9:27 PM Victor Sudakov <vas@sibptus.ru> wrote:
Keith Fiske wrote:
[dd]
>
> You cannot run the actual scan on the replica, no. But it can be set to run
> against the replica and just report the statistics so you can have a cron
> set up to always run on the given system in case of failover. The
> "--recovery_mode_norun" can be set so it will only run if the target system
> is actually a primary.
>
> I could possibly see about letting this actually run against the replica,
> however this can be a rather long running transaction depending on the size
> of the tables involved. You can set the "--commit_rate" to avoid some of
> that, but if you have really large tables, it can still run quite long. So
> this sort of check is really best run against the primary to avoid issues
> around having to allow long running queries on the replica (delayed
> replication or even worse bloat buildup).
If you have a dedicated replica for OLAP, even one running from a WAL
archive (not from a replication slot), this is not an issue.
So running pg_bloat_check against a replica would be very useful for
some of us.
--
Victor Sudakov, VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/
The approximate/quick mode of pgstattuple could certainly help with this problem, but as the issue you opened on the github repo points out, it skips scanning TOAST tables (https://github.com/keithf4/pg_bloat_check/issues/22) and also does not work against indexes, which are more often the source of bloat and query performance problems. I have seen significant bloat build up in TOAST tables (hundreds of GB) when the regular table reported only minimal bloat. So I don't recommend relying entirely on the approximate check.
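(For completeness, an exact check of the two things the approximate mode misses, indexes and TOAST tables, looks roughly like this; 'my_table' and 'my_table_pkey' are placeholders.)

```sql
-- Indexes: pgstatindex() gives exact b-tree figures;
-- low avg_leaf_density on a large index suggests bloat
SELECT avg_leaf_density, leaf_fragmentation
FROM pgstatindex('my_table_pkey');

-- TOAST: find the TOAST relation behind a table via pg_class...
SELECT reltoastrelid::regclass AS toast_table
FROM pg_class
WHERE oid = 'my_table'::regclass;

-- ...then scan it with the exact (non-approximate) pgstattuple()
SELECT dead_tuple_percent, free_percent
FROM pgstattuple((SELECT reltoastrelid FROM pg_class
                  WHERE oid = 'my_table'::regclass));
```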