
From Chris Travers
Subject Possible important data point on stats collection, wondering about possible improvement
Date
Msg-id CAN-RpxDpdup84Sz_rH_BdjPd9fZkKufGpsSshV6dVnby=uRZ_w@mail.gmail.com
List pgsql-hackers
Hi;

System is PostgreSQL 10.5, all partitioning done the old way (via inheritance).

Last month we had some performance issues caused by statistics being out of date and the planner choosing the wrong index for a large number of queries.  The proximal fix was to increase the stats target from 1000 to 10000 and re-analyze, which prevents the problem from continuing to manifest.  Looking back at the graphs, though, I notice a number of things which make me think there might be ways of improving the situation without increasing stats targets.
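
For concreteness, the fix was along these lines (table and column names are placeholders rather than the real ones; raising default_statistics_target would have had a similar effect):

  -- raise the per-column statistics target, then refresh the statistics
  ALTER TABLE analytics_events ALTER COLUMN created_at SET STATISTICS 10000;
  ANALYZE analytics_events;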

I wanted to bring up the question here and see if there were opportunities to work together on improving the situation.

What I measured was the difference between the maximum value in the statistics histogram and the maximum value in the table.  The relevant field is a timestamp field, so calculating the lag is straightforward and also gives us a percentage of the table that is outside the stats.
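
The comparison was roughly this (placeholder table and column names; cast to timestamp[] instead if the column has no time zone, and with inheritance partitioning it is worth looking at both the inherited and non-inherited rows in pg_stats):

  -- upper bound of the planner's histogram vs. the actual max in the table
  SELECT (s.histogram_bounds::text::timestamptz[])[
             array_length(s.histogram_bounds::text::timestamptz[], 1)
         ] AS max_in_stats,
         (SELECT max(created_at) FROM analytics_events) AS max_in_table
  FROM pg_stats s
  WHERE s.tablename = 'analytics_events'
    AND s.attname = 'created_at';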

What I noticed was that the major analytics tables seem to fall into two groups:
1.  One group has clear sampling issues, as evidenced by the fact that the difference in values swung around wildly.  One of these tables had 52 million rows spread between two partitions (1M and 51M respectively).  For this group I understand the need to raise the stats targets.

2.  A second group saw a milder increase in lag between the max value recorded in stats and the max value in the db.  However, what struck me about this was that the lag seemed fairly linear.  In other words, the fluctuations stayed within a relatively narrow range and the lag grew roughly linearly with time.  These tables were actually larger (one typical one was 60M rows split between a partition of 51M rows and one of 9M rows).

The workload for the database in question is heavily update driven so I would expect fewer sampling bias problems than might happen for insert-only workloads.

The second case puzzles me.  I have been looking carefully into how the stats collector works and I cannot find anything that could account for a near-linear increase in the statistics missing recent data.  A 60M row table that starts out with our default stats target (1000) ends up drifting about 10% off over time.
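
One obvious thing to check is the analyze frequency; something along these lines (placeholder table name) shows when the stats were last refreshed and how many rows have been modified since:

  -- when were the stats last refreshed, and how much has changed since?
  SELECT relname, last_analyze, last_autoanalyze, n_mod_since_analyze
  FROM pg_stat_user_tables
  WHERE relname = 'analytics_events';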

What I am wondering is whether it would make any sense whatsoever to expand the stats to include min and max values found in a scan, or whether it would make more sense to try to help the planner extrapolate from existing stats in a more efficient way.

--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin
