Merging statistics from children instead of re-sampling everything - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Merging statistics from children instead of re-sampling everything
Msg-id: c3162511-0e80-039e-765c-1d02d0fd0bab@enterprisedb.com
List: pgsql-hackers
Hi,

While reviewing the thread about issues with auto-analyze on partitioned tables [1] I remembered that the way we build statistics on the parent is somewhat expensive, because it samples rows from the children again.

It's true we sample a much smaller number of rows from each partition (proportional to its size), so it's not as expensive as simply running ANALYZE on all the partitions individually. But it's not free either, and in some cases (e.g. with partitioning by time) it may touch old data that regular queries don't, so it's likely not cached etc. That rather defeats the idea of using partitioning to "segregate" old data.

One reason why the ANALYZE costs are not a huge issue in practice is that we never actually trigger it - changes_since_analyze is not updated for the parent, so autoanalyze does not realize it needs to do anything, and we never rebuild the statistics :-( But as shown in [2], that may lead to poor estimates (and bad plans) in cases when we rely on the parent's stats (probably due to not expanding the list of children early enough).

(Note: I wonder if we might simply not rely on the parent stats at all and always consult the children's stats directly - but with many children that's likely problematic/expensive, and it has the same issues as the merging.)

The other thread [1] attempts to fix that by incrementing the counters for the parent too, but that would make this much worse. Firstly, with multi-level partitioning we'd have to sample the children repeatedly, essentially once for each level. Secondly, it's tricky to control when exactly the counters are propagated to the parent, making the parent analyzes more frequent. Not great. Even if we do a great job in [1] and come up with smart heuristics, there will always be cases where it does not work too well and we either analyze too late or too often.

(Note: Propagating changes_since_analyze is only part of the story, as it does nothing about stats after attach/detach of a partition.)

This patch attempts to approach the problem from the other end - instead of tightly controlling when to analyze the parent, it makes the analyze much cheaper. That means we don't need to worry so much about analyzing too often, and we can update the stats more aggressively.

So, how does it work? Well, it simply fetches the statistics for all the children and merges them together, which is fairly simple for most statistics types (a rough sketch of the arithmetic follows the list):

1) stawidth, stanullfrac - Those are trivial to merge.

2) stacorrelation - Not sure; I've used a weighted average for now. Perhaps we could store the "internal" counters (sumX, sumX2), which would allow us to calculate a regular estimate for the parent.

3) stadistinct - This is quite problematic. We only have the per-child estimates, and it's not clear if there's any overlap. For now I've just summed them up, because that's safer / similar to what we do for gather merge paths etc. Maybe we could improve this by estimating the overlap somehow (e.g. from MCV lists / histograms), but honestly, I doubt estimates based on a tiny sample of each child would be any better. I suppose we could introduce a column option determining how to combine ndistinct (similar to how we can override n_distinct itself).

4) MCV - It's trivial to build a new "parent" MCV list, although it may be too large (in which case we cut it at the statistics target, and copy the remaining bits into the histogram).

5) histograms - A bit more complicated, because it requires dealing with overlapping bins, so we may have to "cut" and combine them in some way. If we assume that cutting a bin into K parts leaves each part with 1/K of the bin's tuples (no matter where exactly we cut it), then it's trivial, and it works just fine in practice. That's because with N children each bin represents only 1.0/(target*N) of the tuples, so the errors are quite limited.
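As a rough sketch of the arithmetic behind 1) - 3), the merge can be written as a plain SQL query over pg_stats and pg_inherits. The patch does the equivalent in C on pg_statistic, so take this purely as an illustration of the formulas; it assumes the "trivial" merge means an average weighted by each child's row count. For the p/p0/p1 example below and column "j":

SELECT sum(c.reltuples * s.null_frac)   / sum(c.reltuples) AS null_frac,
       sum(c.reltuples * s.avg_width)   / sum(c.reltuples) AS avg_width,
       sum(c.reltuples * s.correlation) / sum(c.reltuples) AS correlation,
       -- negative n_distinct means "fraction of rows", so convert it to
       -- an absolute count before summing (the sum ignores any overlap
       -- between children, as discussed in 3 above)
       sum(CASE WHEN s.n_distinct >= 0 THEN s.n_distinct
                ELSE -s.n_distinct * c.reltuples END) AS n_distinct
  FROM pg_inherits i
  JOIN pg_class c     ON c.oid = i.inhrelid
  JOIN pg_namespace n ON n.oid = c.relnamespace
  JOIN pg_stats s     ON s.schemaname = n.nspname
                     AND s.tablename = c.relname
 WHERE i.inhparent = 'p'::regclass   -- the partitioned parent
   AND s.attname = 'j'               -- the column being merged
   AND NOT s.inherited;              -- per-child (leaf) stats only

There's no such shortcut for 4) and 5) - the MCV and histogram arrays have to be rebuilt element by element, which is what the patch does.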
The attached patch is a PoC - it should work, but there's plenty of room for improvement. It only deals with "regular" per-column statistics, not with extended stats (but I don't see why it wouldn't work for them too). It adds a new ANALYZE option "MERGE" which does not sample the children but instead just combines their existing statistics. So the example from [2] looks like this:

======================================================================
create table p (i integer, j integer) partition by list (i);
create table p0 partition of p for values in (0);
create table p1 partition of p for values in (1);

insert into p select 0,generate_series(1,1000);
insert into p select 1,generate_series(1,1000);

analyze p0;
analyze p1;

create table q (i integer);
insert into q values (0);
analyze q;

test=# explain select * from q join p using (i) where j between 1 and 500;
                             QUERY PLAN
---------------------------------------------------------------------
 Hash Join  (cost=1.02..49.82 rows=5 width=8)
   Hash Cond: (p.i = q.i)
   ->  Append  (cost=0.00..45.00 rows=1000 width=8)
         ->  Seq Scan on p0 p_1  (cost=0.00..20.00 rows=500 width=8)
               Filter: ((j >= 1) AND (j <= 500))
         ->  Seq Scan on p1 p_2  (cost=0.00..20.00 rows=500 width=8)
               Filter: ((j >= 1) AND (j <= 500))
   ->  Hash  (cost=1.01..1.01 rows=1 width=4)
         ->  Seq Scan on q  (cost=0.00..1.01 rows=1 width=4)
(9 rows)

test=# analyze (merge) p;
ANALYZE
test=# explain select * from q join p using (i) where j between 1 and 500;
                             QUERY PLAN
---------------------------------------------------------------------
 Hash Join  (cost=1.02..54.77 rows=500 width=8)
   Hash Cond: (p.i = q.i)
   ->  Append  (cost=0.00..45.00 rows=1000 width=8)
         ->  Seq Scan on p0 p_1  (cost=0.00..20.00 rows=500 width=8)
               Filter: ((j >= 1) AND (j <= 500))
         ->  Seq Scan on p1 p_2  (cost=0.00..20.00 rows=500 width=8)
               Filter: ((j >= 1) AND (j <= 500))
   ->  Hash  (cost=1.01..1.01 rows=1 width=4)
         ->  Seq Scan on q  (cost=0.00..1.01 rows=1 width=4)
(9 rows)
======================================================================

Note how the join estimate improves from rows=5 to rows=500 once the parent has (merged) statistics.

FWIW I'm not sure we need the separate MERGE mode, but it's an easy way to allow both the regular and the "merge" approach, so it's possible to experiment and compare the behavior.

One issue is that this would require some coordination between the parent/child analyzes. Essentially, any time a child is analyzed, the parent should rebuild its stats (to merge in the new child stats). This is similar to the problem of analyzing the parent too often because we don't know exactly when the counters get updated, but merging the stats is much cheaper, so it's much less problematic.

regards

[1] https://commitfest.postgresql.org/32/2492/
[2] https://www.postgresql.org/message-id/CAM-w4HO9hUHvJDVwQ8%3DFgm-znF9WNvQiWsfyBjCr-5FD7gWKGA%40mail.gmail.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company