Thread: Very slow "bloat query"

Very slow "bloat query"

From
Marcin Gozdalik
Date:
Hi

I am trying to use `pgmetrics` on a big (10TB+), busy (1GB/s RW) database. It takes around 5 minutes for pgmetrics to run. I traced the problem to the "bloat query" (a version of https://wiki.postgresql.org/wiki/Show_database_bloat) spinning on CPU and doing no I/O.

I have narrowed it down to the bloated `pg_class` (the irony: `pgmetrics` does not collect bloat on `pg_catalog`):
`vacuum (full, analyze, verbose) pg_class;`
```
INFO:  vacuuming "pg_catalog.pg_class"
INFO:  "pg_class": found 1 removable, 7430805 nonremovable row versions in 158870 pages
DETAIL:  7429943 dead row versions cannot be removed yet.
CPU 1.36s/6.40u sec elapsed 9.85 sec.
INFO:  analyzing "pg_catalog.pg_class"
INFO:  "pg_class": scanned 60000 of 158869 pages, containing 295 live rows and 2806547 dead rows; 295 rows in sample, 781 estimated total rows
VACUUM
```

`pg_class` has so many dead rows because the workload is temp-table heavy (creating and destroying 1M+ temporary tables per day) and there are long-running analytics queries that run for 24h+.
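
A quick way to see how much of this lands in the catalogs is the standard statistics view for system tables (a rough sketch; the counters are approximate):
```
-- Live vs. dead tuples in the most affected catalog tables
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_sys_tables
WHERE relname IN ('pg_class', 'pg_attribute', 'pg_type', 'pg_statistic')
ORDER BY n_dead_tup DESC;
```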

The PG query planner assumes that an index scan on `pg_class` will be very quick, so it plans a nested loop with an index scan. However, the index scan has 7M dead tuples to filter out, and the query takes more than 200 seconds (https://explain.depesz.com/s/bw2G).
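
The mismatch is easy to see by comparing the planner's input (reltuples/relpages from the last ANALYZE) with the physical size of the index it scans (a minimal sketch):
```
-- Planner estimates for pg_class vs. the on-disk size of its OID index
SELECT relname, reltuples, relpages,
       pg_size_pretty(pg_relation_size('pg_class_oid_index')) AS oid_index_size
FROM pg_class
WHERE relname = 'pg_class';
```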

If I create a temp table from `pg_class` that contains only the live tuples:
```
CREATE TEMPORARY TABLE pg_class_alive AS SELECT oid,* from pg_class;
CREATE UNIQUE INDEX pg_class_alive_oid_index ON pg_class_alive(oid);
CREATE UNIQUE INDEX pg_class_alive_relname_nsp_index ON pg_class_alive(relname, relnamespace);
CREATE INDEX pg_class_tblspc_relfilenode_index ON pg_class_alive(reltablespace, relfilenode);
ANALYZE pg_class_alive;
```

and run the bloat query on `pg_class_alive` instead of `pg_class`:
```
SELECT
   nn.nspname AS schemaname,
   cc.relname AS tablename,
   COALESCE(cc.reltuples,0) AS reltuples,
   COALESCE(cc.relpages,0) AS relpages,
   COALESCE(CEIL((cc.reltuples*((datahdr+8-
     (CASE WHEN datahdr%8=0 THEN 8 ELSE datahdr%8 END))+nullhdr2+4))/(8192-20::float)),0) AS otta
 FROM
    pg_class_alive cc
 JOIN pg_namespace nn ON cc.relnamespace = nn.oid AND nn.nspname <> 'information_schema'
 LEFT JOIN
 (
   SELECT
     foo.nspname,foo.relname,
     (datawidth+32)::numeric AS datahdr,
     (maxfracsum*(nullhdr+8-(case when nullhdr%8=0 THEN 8 ELSE nullhdr%8 END))) AS nullhdr2
   FROM (
     SELECT
       ns.nspname, tbl.relname,
       SUM((1-coalesce(null_frac,0))*coalesce(avg_width, 2048)) AS datawidth,
       MAX(coalesce(null_frac,0)) AS maxfracsum,
       23+(
         SELECT 1+count(*)/8
         FROM pg_stats s2
         WHERE null_frac<>0 AND s2.schemaname = ns.nspname AND s2.tablename = tbl.relname
       ) AS nullhdr
     FROM pg_attribute att
     JOIN pg_class_alive tbl ON att.attrelid = tbl.oid
     JOIN pg_namespace ns ON ns.oid = tbl.relnamespace
     LEFT JOIN pg_stats s ON s.schemaname=ns.nspname
     AND s.tablename = tbl.relname
     AND s.inherited=false
     AND s.attname=att.attname
     WHERE att.attnum > 0 AND tbl.relkind='r'
     GROUP BY 1,2
   ) AS foo
 ) AS rs
 ON cc.relname = rs.relname AND nn.nspname = rs.nspname
 LEFT JOIN pg_index i ON indrelid = cc.oid
 LEFT JOIN pg_class_alive c2 ON c2.oid = i.indexrelid
```

it runs in 10 seconds, 20x faster (https://explain.depesz.com/s/K4SH).

The rabbit hole probably goes deeper (e.g. one should do the same for pg_statistic and pg_attribute and create a new pg_stats view on top).
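
For `pg_attribute` the equivalent would look roughly like this (an untested sketch, mirroring the approach above):
```
CREATE TEMPORARY TABLE pg_attribute_alive AS SELECT * FROM pg_attribute;
CREATE INDEX pg_attribute_alive_relid_attnum_index ON pg_attribute_alive(attrelid, attnum);
ANALYZE pg_attribute_alive;
```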

I am not able (at least not quickly) to change the number of temporary tables created or to make the analytics queries finish quicker. Apart from the above hack of copying the live tuples into a separate table, is there anything I could do?

Thank you,
Marcin Gozdalik

--
Marcin Gozdalik

Re: Very slow "bloat query"

From
Gilles Darold
Date:
On 14/05/2021 at 13:06, Marcin Gozdalik wrote:
> [...]
>
> I am not able (at least not quickly) to change the number of temporary
> tables created or to make the analytics queries finish quicker. Apart
> from the above hack of copying the live tuples into a separate table,
> is there anything I could do?


Hi,


To avoid bloating your catalog with temporary tables you can try using
https://github.com/darold/pgtt-rsl. I don't know whether it will meet your
performance requirements, but at least you will not bloat the catalog anymore.


Regarding your hack, I don't see any other solution except running VACUUM on
the catalog tables more often, but I guess that is already being done or is
not possible. Still, not bloating the catalog to such a level is the right solution.
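
For example, something along these lines (a rough sketch; it assumes superuser rights and that your version accepts reloptions on catalog tables, and it will only help once the dead rows actually become removable):

```
-- Make autovacuum look at pg_class much sooner than the default thresholds
ALTER TABLE pg_catalog.pg_class
  SET (autovacuum_vacuum_scale_factor = 0.01,
       autovacuum_vacuum_threshold = 10000);

-- Or simply schedule plain manual vacuums of the hot catalog tables
VACUUM (VERBOSE) pg_catalog.pg_class;
VACUUM (VERBOSE) pg_catalog.pg_attribute;
```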


--
Gilles Darold
http://www.darold.net/





Re: Very slow "bloat query"

From
Imre Samu
Date:
>  Apart from the above hack of filtering out live tuples to a separate table is there anything I could do?

Is this the latest PG 13.3 version?

IMHO: if not, it may be worth updating to the latest patch release as soon as possible.

From the PostgreSQL 13.3 release notes (release date: 2021-05-13):
"Disable the vacuum_cleanup_index_scale_factor parameter and storage option (Peter Geoghegan)
The notion of tracking “stale” index statistics proved to interact badly with the autovacuum_vacuum_insert_threshold parameter, resulting in unnecessary full-index scans and consequent degradation of autovacuum performance. The latter mechanism seems superior, so remove the stale-statistics logic. The control parameter for that, vacuum_cleanup_index_scale_factor, will be removed entirely in v14. In v13, it remains present to avoid breaking existing configuration files, but it no longer does anything."

best,
 Imre



Re: Very slow "bloat query"

From
Marcin Gozdalik
Date:
Unfortunately it's still 9.6. An upgrade to the latest 13 is planned for this year.

On Fri, 14 May 2021 at 12:08, Imre Samu <pella.samu@gmail.com> wrote:
> Is this the latest PG 13.3 version?
>
> IMHO: if not, it may be worth updating to the latest patch release as soon as possible.
>
> [...]


--
Marcin Gozdalik

Re: Very slow "bloat query"

From
Tom Lane
Date:
Marcin Gozdalik <gozdal@gmail.com> writes:
> I have traced the problem to the bloated `pg_class` (the irony: `pgmetrics`
> does not collect bloat on `pg_catalog`):
> `vacuum (full, analyze, verbose) pg_class;`
> ```
> INFO:  vacuuming "pg_catalog.pg_class"
> INFO:  "pg_class": found 1 removable, 7430805 nonremovable row versions in
> 158870 pages
> DETAIL:  7429943 dead row versions cannot be removed yet.

Ugh.  It's understandable that having a lot of temp-table traffic
would result in the creation of lots of dead rows in pg_class.
The question to be asking is why aren't they vacuumable?  You
must have a longstanding open transaction somewhere (perhaps
a forgotten prepared transaction?) that is holding back the
global xmin horizon.  Closing that out and then doing another
manual VACUUM FULL should help.
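
Something along these lines will show the oldest open transactions and any prepared transactions (a sketch; backend_xmin is available in 9.4 and later):

```
-- Oldest open transactions: the global xmin horizon cannot advance past them
SELECT pid, datname, usename, state, xact_start, backend_xmin
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;

-- Forgotten prepared transactions hold back the horizon the same way
SELECT gid, prepared, owner, database
FROM pg_prepared_xacts
ORDER BY prepared;
```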

            regards, tom lane



Re: Very slow "bloat query"

From
Marcin Gozdalik
Date:
There is a long-running analytics query (it usually runs for 30-40 hours). I agree that's not the best position to be in, but right now I can't do anything about it.

On Fri, 14 May 2021 at 15:04, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The question to be asking is why aren't they vacuumable?  You
> must have a longstanding open transaction somewhere (perhaps
> a forgotten prepared transaction?) that is holding back the
> global xmin horizon.  Closing that out and then doing another
> manual VACUUM FULL should help.


--
Marcin Gozdalik

Re: Very slow "bloat query"

From
Imre Samu
Date:
> Unfortunately it's still 9.6.

And what is your "version()"?


for example: 
postgres=# select version();
                                                             version                                                            
---------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.6.22 on x86_64-pc-linux-gnu (Debian 9.6.22-1.pgdg110+1), compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
(1 row)

Imre



Re: Very slow "bloat query"

From
Marcin Gozdalik
Date:
 PostgreSQL 9.6.21 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit

On Fri, 14 May 2021 at 15:45, Imre Samu <pella.samu@gmail.com> wrote:
> > Unfortunately it's still 9.6.
>
> And what is your "version()"?
>
> [...]


--
Marcin Gozdalik