Thread: index fragmentation on insert-only table with non-unique column
Summary: Non-unique btree indices are returning CTIDs for rows with the same value of the indexed column not in logical order, imposing a high performance penalty.

Running PG 9.5.3 now, we have time-based partitions of append-only tables with data loaded from other sources. The tables are partitioned by time, and the timestamp column has a non-unique, not-null btree index.

The child tables are each ~75GB and expected to keep growing. For a child table with a week's worth of data:
 relpages  | 11255802
 reltuples | 5.90502e+07

The data is loaded shortly after it's available, so we have high correlation in pg_statistic:

[pryzbyj@viaero ~]$ psql ts -c "SELECT tablename, correlation, n_distinct FROM pg_stats s JOIN pg_class c ON (c.relname=s.tablename) WHERE tablename LIKE 'cdrs_huawei_pgwrecord%' AND attname='recordopeningtime' ORDER BY 1" |head
            tablename             | correlation | n_distinct
----------------------------------+-------------+------------
 cdrs_huawei_pgwrecord            |    0.999997 |     102892
 cdrs_huawei_pgwrecord_2016_02_15 |    0.999658 |      96145
 cdrs_huawei_pgwrecord_2016_02_22 |    0.999943 |      91916
 cdrs_huawei_pgwrecord_2016_02_29 |    0.997219 |      50341
 cdrs_huawei_pgwrecord_2016_03_01 |    0.999947 |      97485

But note the non-uniqueness of the index column:

ts=# SELECT recordopeningtime, COUNT(1) FROM cdrs_huawei_pgwrecord WHERE recordopeningtime>='2016-05-21' AND recordopeningtime<'2016-05-22' GROUP BY 1 ORDER BY 2 DESC;
  recordopeningtime  | count
---------------------+-------
 2016-05-21 12:17:29 |   176
 2016-05-21 12:17:25 |   171
 2016-05-21 13:11:33 |   170
 2016-05-21 10:20:02 |   169
 2016-05-21 11:30:02 |   167
[...]

We have a daily analytic query which processes the previous day's data. For new child tables, with only 1 day's data loaded, this runs in ~30min; for child tables with an entire week's worth of data loaded, it takes several hours (even though both queries process the same amount of data).

First, I found I was able to get 30-50min query results on the full week's table by preferring a seq scan to an index scan. The row estimates seemed fine, and the only condition is the timestamp, so the planner's use of an index scan is as expected.

AFAICT what's happening is that the index scan was returning pages nonsequentially. strace-ing the backend showed alternating lseek()s and read()s, with the offsets not consistently increasing (nor consistently decreasing):

% sudo strace -p 25588 2>&1 |grep -m9 'lseek(773'
lseek(773, 1059766272, SEEK_SET) = 1059766272
lseek(773, 824926208, SEEK_SET) = 824926208
lseek(773, 990027776, SEEK_SET) = 990027776
lseek(773, 990330880, SEEK_SET) = 990330880
lseek(773, 1038942208, SEEK_SET) = 1038942208
lseek(773, 1059856384, SEEK_SET) = 1059856384
lseek(773, 977305600, SEEK_SET) = 977305600
lseek(773, 990347264, SEEK_SET) = 990347264
lseek(773, 871096320, SEEK_SET) = 871096320

.. and consecutive read()s being rare:

read(802, "g"..., 8192) = 8192
lseek(802, 918003712, SEEK_SET) = 918003712
read(802, "c"..., 8192) = 8192
lseek(802, 859136000, SEEK_SET) = 859136000
read(802, "a"..., 8192) = 8192
lseek(802, 919601152, SEEK_SET) = 919601152
read(802, "d"..., 8192) = 8192
lseek(802, 905101312, SEEK_SET) = 905101312
read(802, "c"..., 8192) = 8192
lseek(801, 507863040, SEEK_SET) = 507863040
read(801, "p"..., 8192) = 8192
lseek(802, 914235392, SEEK_SET) = 914235392
read(802, "c"..., 8192) = 8192

I was able to see great improvement without planner parameters by REINDEXing the timestamp index.
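The page-level ordering can be checked with the pageinspect extension; the query below is only an illustration (the index name and leaf page number are placeholders), listing the heap CTIDs referenced by one leaf page in index order:

-- on a freshly built index the referenced heap block numbers increase
-- nearly monotonically; on the fragmented index they jump around
SELECT itemoffset, ctid
FROM bt_page_items('cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx', 6);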
My theory is that the index/planner doesn't handle well the case of many tuples with the same column value, and returns pages out of logical order. Reindex fixes that, rewriting the index data with pages in order (confirmed with pageinspect), which causes index scans to fetch heap data more or less monotonically (if not consecutively). strace shows that consecutive read()s are common (without intervening seeks). I gather this allows the OS readahead to kick in.

Postgres seems to assume that the high degree of correlation of the table column seen in pg_stats is how it will get data from the index scan, which assumption turns out to be very poor on what is a highly fragmented index. Is there a way to help it to understand otherwise?

Maybe this is a well-understood problem/deficiency; but is there any reason why a btree scan can't sort results by ctid for index tuples with the same column value (_bt_steppage() or btgettuple())? Or maybe the problem could be mitigated by changing the behavior during INSERT?

In the meantime, I'll be implementing a reindex job.

Thanks,
Justin
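The reindex job amounts to something like the following, run nightly against the most recently loaded partition (the index name is only an example; note that plain REINDEX blocks writes to the table and blocks reads that would use the index, which is tolerable here because the partitions are append-only and the analysis runs afterwards):

-- nightly maintenance, after the day's load has finished
REINDEX INDEX cdrs_huawei_pgwrecord_2016_05_22_recordopeningtime_idx;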
On Tue, May 24, 2016 at 10:39 AM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> I was able to see great improvement without planner parameters by REINDEXing the
> timestamp index. My theory is that the index/planner doesn't handle well the
> case of many tuples with the same column value, and returns pages out of logical
> order. Reindex fixes that, rewriting the index data with pages in order
> (confirmed with pageinspect), which causes index scans to fetch heap data more
> or less monotonically (if not consecutively). strace shows that consecutive
> read()s are common (without intervening seeks). I gather this allows the OS
> readahead to kick in.

The basic problem is that the B-Tree code doesn't maintain this property. However, B-Tree index builds will create an index that initially has this property, because the tuplesort.c code happens to sort index tuples with a CTID tie-breaker.

> Postgres seems to assume that the high degree of correlation of the table
> column seen in pg_stats is how it will get data from the index scan, which
> assumption turns out to be very poor on what is a highly fragmented
> index. Is there a way to help it to understand otherwise?

Your complaint is vague. Are you complaining about the planner making a poor choice? I don't think that's the issue here, because you never made any firm statement about the planner making a choice that was worse than an alternative that it had available.

If you're arguing for the idea that B-Trees should reliably keep tuples in order by a tie-break condition, that seems difficult to implement, and likely not worth it in practice.

--
Peter Geoghegan
* Peter Geoghegan (pg@bowt.ie) wrote:
> On Tue, May 24, 2016 at 10:39 AM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> > I was able to see great improvement without planner parameters by REINDEXing the
> > timestamp index. My theory is that the index/planner doesn't handle well the
> > case of many tuples with the same column value, and returns pages out of logical
> > order. Reindex fixes that, rewriting the index data with pages in order
> > (confirmed with pageinspect), which causes index scans to fetch heap data more
> > or less monotonically (if not consecutively). strace shows that consecutive
> > read()s are common (without intervening seeks). I gather this allows the OS
> > readahead to kick in.
>
> The basic problem is that the B-Tree code doesn't maintain this
> property. However, B-Tree index builds will create an index that
> initially has this property, because the tuplesort.c code happens to
> sort index tuples with a CTID tie-breaker.
>
> > Postgres seems to assume that the high degree of correlation of the table
> > column seen in pg_stats is how it will get data from the index scan, which
> > assumption turns out to be very poor on what is a highly fragmented
> > index. Is there a way to help it to understand otherwise?
>
> Your complaint is vague. Are you complaining about the planner making
> a poor choice? I don't think that's the issue here, because you never
> made any firm statement about the planner making a choice that was
> worse than an alternative that it had available.
>
> If you're arguing for the idea that B-Trees should reliably keep
> tuples in order by a tie-break condition, that seems difficult to
> implement, and likely not worth it in practice.

The ongoing discussion around how to effectively handle duplicate values in B-Tree seems relevant to this. In particular, if we're able to store duplicate values efficiently and all the locations under a single key are easily available, then we could certainly sort those prior to going and visiting them. That's not quite the same as keeping the tuples in order in the heap, but it would more-or-less achieve the effect desired, I believe?

Thanks!

Stephen
Peter Geoghegan <pg@bowt.ie> writes:
> The basic problem is that the B-Tree code doesn't maintain this
> property. However, B-Tree index builds will create an index that
> initially has this property, because the tuplesort.c code happens to
> sort index tuples with a CTID tie-breaker.

Yeah. I wonder what would happen if we used the same rule for index insertions. It would likely make insertions more expensive, but maybe not by much. The existing "randomization" rule for where to insert new items in a run of identical index entries would go away, because the insertion point would become deterministic. I am not sure if that's good or bad for insertion performance, but it would likely help for scan performance.

			regards, tom lane
On Tue, May 24, 2016 at 9:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah. I wonder what would happen if we used the same rule for index
> insertions. It would likely make insertions more expensive, but maybe
> not by much. The existing "randomization" rule for where to insert new
> items in a run of identical index entries would go away, because the
> insertion point would become deterministic. I am not sure if that's
> good or bad for insertion performance, but it would likely help for
> scan performance.

I think that if somebody tacked on a tie-breaker in the same way that tuplesort.c does for B-Tree IndexTuples, there'd be significant negative consequences. The equal-to-high-key case gets a choice of which page to put the new IndexTuple on, and I imagine that that's quite useful when it comes up.

I'd also have concerns about the key space in the index. I think that it would seriously mess with the long term utility of values in internal pages, which currently can reasonably have little to do with the values currently stored in leaf pages. They're functionally only separators of the key space that guide index scans, so it doesn't matter if the actual values are completely absent from the leaf pages/the table itself (perhaps some of the values in the internal pages were inserted years ago, and have long since been deleted and vacuumed away). Provided the distribution of values at the leaf level is still well characterized at higher levels (e.g. many string values that start with vowels, very few that start with the letters 'x' or 'z'), there should be no significant bloat. That's very valuable.

Unique indexes are another problem for this naive approach. Maybe somebody could do better with a more sophisticated approach, but it's probably better to focus on duplicate storage or even leaf page compression, as Stephen mentioned.

--
Peter Geoghegan
On Tue, May 24, 2016 at 10:39 AM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> Summary: Non-unique btree indices are returning CTIDs for rows with the same
> value of the indexed column not in logical order, imposing a high performance
> penalty.
>
> Running PG 9.5.3 now, we have time-based partitions of append-only tables
> with data loaded from other sources. The tables are partitioned by time, and
> the timestamp column has a non-unique, not-null btree index.
>
> The child tables are each ~75GB and expected to keep growing. For a child
> table with a week's worth of data:
> relpages  | 11255802
> reltuples | 5.90502e+07
>
> The data is loaded shortly after it's available, so we have high correlation in
> pg_statistic:
> [pryzbyj@viaero ~]$ psql ts -c "SELECT tablename, correlation, n_distinct FROM pg_stats s JOIN pg_class c ON (c.relname=s.tablename) WHERE tablename LIKE 'cdrs_huawei_pgwrecord%' AND attname='recordopeningtime' ORDER BY 1" |head
>             tablename             | correlation | n_distinct
> ----------------------------------+-------------+------------
>  cdrs_huawei_pgwrecord            |    0.999997 |     102892
>  cdrs_huawei_pgwrecord_2016_02_15 |    0.999658 |      96145
>  cdrs_huawei_pgwrecord_2016_02_22 |    0.999943 |      91916
>  cdrs_huawei_pgwrecord_2016_02_29 |    0.997219 |      50341
>  cdrs_huawei_pgwrecord_2016_03_01 |    0.999947 |      97485
>
> But note the non-uniqueness of the index column:
> ts=# SELECT recordopeningtime, COUNT(1) FROM cdrs_huawei_pgwrecord WHERE recordopeningtime>='2016-05-21' AND recordopeningtime<'2016-05-22' GROUP BY 1 ORDER BY 2 DESC;
>   recordopeningtime  | count
> ---------------------+-------
>  2016-05-21 12:17:29 |   176
>  2016-05-21 12:17:25 |   171
>  2016-05-21 13:11:33 |   170
>  2016-05-21 10:20:02 |   169
>  2016-05-21 11:30:02 |   167
> [...]

That is not that much duplication. You aren't going to have dozens or hundreds of leaf pages all with equal values. (and you only showed the most highly duplicated ones, presumably the average is much less)

> We have a daily analytic query which processes the previous day's data. For
> new child tables, with only 1 day's data loaded, this runs in ~30min, and for
> child tables with an entire week's worth of data loaded, takes several hours
> (even though both queries process the same amount of data).

For an append-only table, why would the first day of a new partition be any less fragmented than that same day would be a week from now? Are you sure it isn't just that your week-old data has all been aged out of the cache?

> First, I found I was able to get 30-50min query results on the full week's table by
> preferring a seq scan to an index scan. The row estimates seemed fine, and the
> only condition is the timestamp, so the planner's use of an index scan is as
> expected.

Can you show us the query? I would expect a bitmap scan of the index (which would do what you want, but even more so), instead.

> AFAICT what's happening is that the index scan was returning pages
> nonsequentially.
> strace-ing the backend showed alternating lseek()s and
> read()s, with the offsets not consistently increasing (nor consistently
> decreasing):
> % sudo strace -p 25588 2>&1 |grep -m9 'lseek(773'
> lseek(773, 1059766272, SEEK_SET) = 1059766272
> lseek(773, 824926208, SEEK_SET) = 824926208
> lseek(773, 990027776, SEEK_SET) = 990027776
> lseek(773, 990330880, SEEK_SET) = 990330880
> lseek(773, 1038942208, SEEK_SET) = 1038942208
> lseek(773, 1059856384, SEEK_SET) = 1059856384
> lseek(773, 977305600, SEEK_SET) = 977305600
> lseek(773, 990347264, SEEK_SET) = 990347264
> lseek(773, 871096320, SEEK_SET) = 871096320
>
> .. and consecutive read()s being rare:
> read(802, "g"..., 8192) = 8192
> lseek(802, 918003712, SEEK_SET) = 918003712
> read(802, "c"..., 8192) = 8192
> lseek(802, 859136000, SEEK_SET) = 859136000
> read(802, "a"..., 8192) = 8192
> lseek(802, 919601152, SEEK_SET) = 919601152
> read(802, "d"..., 8192) = 8192
> lseek(802, 905101312, SEEK_SET) = 905101312
> read(802, "c"..., 8192) = 8192
> lseek(801, 507863040, SEEK_SET) = 507863040
> read(801, "p"..., 8192) = 8192
> lseek(802, 914235392, SEEK_SET) = 914235392
> read(802, "c"..., 8192) = 8192

Which of those are the table, and which the index?

Something doesn't add up here. How could an index of an append-only table possibly become that fragmented, when the highest amount of key duplication is about 170?

Cheers,

Jeff
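One way to tell which file descriptor is which (assuming the fds from strace can first be matched to file names via /proc/<pid>/fd) is to map the on-disk paths back to relations; the relation names below are just examples from this thread:

SELECT relname, pg_relation_filepath(oid)
FROM pg_class
WHERE relname IN ('cdrs_huawei_pgwrecord_2016_05_01',
                  'cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx');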
On Tue, May 24, 2016 at 09:16:20PM -0700, Peter Geoghegan wrote:
> On Tue, May 24, 2016 at 10:39 AM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> > Postgres seems to assume that the high degree of correlation of the table
> > column seen in pg_stats is how it will get data from the index scan, which
> > assumption turns out to be very poor on what is a highly fragmented
> > index. Is there a way to help it to understand otherwise?
>
> Your complaint is vague. Are you complaining about the planner making
> a poor choice? I don't think that's the issue here, because you never
> made any firm statement about the planner making a choice that was
> worse than an alternative that it had available.

I was thinking there are a few possible places to make improvements: the planner could have understood that scans of non-unique indices don't result in strictly sequential scans of the table, the degree of non-sequentialness being determined by the column statistics, and perhaps by properties of the index itself. Or the INSERT code or btree scan could improve on this, even if tuples aren't fully ordered.

> If you're arguing for the idea that B-Trees should reliably keep
> tuples in order by a tie-break condition, that seems difficult to
> implement, and likely not worth it in practice.

I had the awful idea to change the index to use (recordopeningtime,ctid). Maybe somebody will convince me otherwise, but it may actually work better than trying to reindex this table daily by 4am.

Thanks,
Justin
On Tue, May 24, 2016 at 11:23:48PM -0700, Jeff Janes wrote:
> > But note the non-uniqueness of the index column:
> > ts=# SELECT recordopeningtime, COUNT(1) FROM cdrs_huawei_pgwrecord WHERE recordopeningtime>='2016-05-21' AND recordopeningtime<'2016-05-22' GROUP BY 1 ORDER BY 2 DESC;
> >   recordopeningtime  | count
> > ---------------------+-------
> >  2016-05-21 12:17:29 |   176
> >  2016-05-21 12:17:25 |   171
> >  2016-05-21 13:11:33 |   170
> >  2016-05-21 10:20:02 |   169
> >  2016-05-21 11:30:02 |   167
> > [...]
>
> That is not that much duplication. You aren't going to have dozens or
> hundreds of leaf pages all with equal values. (and you only showed
> the most highly duplicated ones, presumably the average is much less)

Point taken, but it's not that great of a range either:

ts=# SELECT recordopeningtime, COUNT(1) FROM cdrs_huawei_pgwrecord WHERE recordopeningtime>='2016-05-21' AND recordopeningtime<'2016-05-22' GROUP BY 1 ORDER BY 2 LIMIT 19;
  recordopeningtime  | count
---------------------+-------
 2016-05-21 03:10:05 |    44
 2016-05-21 03:55:05 |    44
 2016-05-21 04:55:05 |    45

ts=# SELECT count(distinct recordopeningtime) FROM cdrs_huawei_pgwrecord WHERE recordopeningtime>='2016-05-21' AND recordopeningtime<'2016-05-22';
-[ RECORD 1 ]
count | 86400

ts=# SELECT count(recordopeningtime) FROM cdrs_huawei_pgwrecord WHERE recordopeningtime>='2016-05-21' AND recordopeningtime<'2016-05-22';
-[ RECORD 1 ]--
count | 8892865

> > We have a daily analytic query which processes the previous day's data. For
> > new child tables, with only 1 day's data loaded, this runs in ~30min, and for
> > child tables with an entire week's worth of data loaded, takes several hours
> > (even though both queries process the same amount of data).
>
> For an append-only table, why would the first day of a new partition
> be any less fragmented than that same day would be a week from now?
> Are you sure it isn't just that your week-old data has all been aged
> out of the cache?

I don't think it's a cache effect, since we're not using the source table for (maybe) anything else the entire rest of the day. The server has 72GB RAM, the same size as the largest of the tables being joined (beginning at 4am).

I didn't mean that a given day is more fragmented now than it was last week (but I don't know, either). I guess when we do a query on the table with ~32 hours of data in it, it might do a seq scan rather than an index scan, too.

Compare the end-of-month partition tables:

ts=# select * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_02_29_recordopeningtime_idx');
leaf_fragmentation | 48.6
ts=# select * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_03_29_recordopeningtime_idx');
leaf_fragmentation | 48.38
ts=# select * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_04_29_recordopeningtime_idx');
leaf_fragmentation | 48.6
ts=# SELECT * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_04_22_recordopeningtime_idx');
leaf_fragmentation | 48.66
ts=# SELECT * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_03_22_recordopeningtime_idx');
leaf_fragmentation | 48.27
ts=# SELECT * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_02_22_recordopeningtime_idx');
leaf_fragmentation | 48

This one I reindexed as a test:

ts=# SELECT * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx');
leaf_fragmentation | 0.01

.. and the query ran in ~30min (reran a 2nd time, with cache effects: 25min).

> > First, I found I was able to get 30-50min query results on the full week's table by
> > preferring a seq scan to an index scan.
> > The row estimates seemed fine, and the
> > only condition is the timestamp, so the planner's use of an index scan is as
> > expected.
>
> Can you show us the query? I would expect a bitmap scan of the index
> (which would do what you want, but even more so), instead.

See explain, also showing additional tables/views being joined. It's NOT doing a bitmap scan though, and I'd be interested to find out why; I'm sure that would've improved this query enough so it never would've been an issue.
https://explain.depesz.com/s/s8KP

 -> Index Scan using cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx on cdrs_huawei_pgwrecord_2016_05_01 (cost=0.56..1601734.57 rows=8943848 width=349)
    Index Cond: ((recordopeningtime >= '2016-05-07 00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-08 00:00:00'::timestamp without time zone))

> > AFAICT what's happening is that the index scan was returning pages
> > nonsequentially. strace-ing the backend showed alternating lseek()s and
> > read()s, with the offsets not consistently increasing (nor consistently
> > decreasing):
..
> Which of those are the table, and which the index?

Those weren't necessarily straces of the same process; I believe both of these were table data/heap, and didn't include any index access.

> Something doesn't add up here. How could an index of an append-only
> table possibly become that fragmented, when the highest amount of key
> duplication is about 170?

I'm certainly open to alternate interpretations / conclusions :)

Justin
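For what it's worth, the per-partition fragmentation checks above can be batched into one query, along these lines (assuming the pgstattuple extension is installed and the name pattern matches only btree indexes):

SELECT c.relname,
       (pgstatindex(c.relname)).leaf_fragmentation
FROM pg_class c
WHERE c.relname ~ 'cdrs_huawei_pgwrecord_.*_recordopeningtime_idx$'
ORDER BY 1;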
On Wed, May 25, 2016 at 11:00 AM, Justin Pryzby <pryzby@telsasoft.com> wrote:
>> > First, I found I was able to get 30-50min query results on the full week's table by
>> > preferring a seq scan to an index scan. The row estimates seemed fine, and the
>> > only condition is the timestamp, so the planner's use of an index scan is as
>> > expected.
>>
>> Can you show us the query? I would expect a bitmap scan of the index
>> (which would do what you want, but even more so), instead.
> See explain, also showing additional tables/views being joined. It's NOT doing
> a bitmap scan though, and I'd be interested to find out why; I'm sure that would've
> improved this query enough so it never would've been an issue.
> https://explain.depesz.com/s/s8KP
>
> -> Index Scan using cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx on cdrs_huawei_pgwrecord_2016_05_01 (cost=0.56..1601734.57 rows=8943848 width=349)
>    Index Cond: ((recordopeningtime >= '2016-05-07 00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-08 00:00:00'::timestamp without time zone))

Please show your guc settings (see https://wiki.postgresql.org/wiki/Server_Configuration).

A plan node like that, if it would result in I/O, with proper configuration should have selected a bitmap index/heap scan. If it didn't, it probably thinks it has more cache than it really does, and that would mean the wrong setting was set in effective_cache_size.
On Fri, Jun 03, 2016 at 06:26:33PM -0300, Claudio Freire wrote:
> On Wed, May 25, 2016 at 11:00 AM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> >> > First, I found I was able to get 30-50min query results on the full week's table by
> >> > preferring a seq scan to an index scan. The row estimates seemed fine, and the
> >> > only condition is the timestamp, so the planner's use of an index scan is as
> >> > expected.
> >>
> >> Can you show us the query? I would expect a bitmap scan of the index
> >> (which would do what you want, but even more so), instead.
> > See explain, also showing additional tables/views being joined. It's NOT doing
> > a bitmap scan though, and I'd be interested to find out why; I'm sure that would've
> > improved this query enough so it never would've been an issue.
> > https://explain.depesz.com/s/s8KP
> >
> > -> Index Scan using cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx on cdrs_huawei_pgwrecord_2016_05_01 (cost=0.56..1601734.57 rows=8943848 width=349)
> >    Index Cond: ((recordopeningtime >= '2016-05-07 00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-08 00:00:00'::timestamp without time zone))
>
> Please show your guc settings (see
> https://wiki.postgresql.org/wiki/Server_Configuration)
>
> A plan node like that, if it would result in I/O, with proper
> configuration should have selected a bitmap index/heap scan. If it
> didn't, it probably thinks it has more cache than it really does, and
> that would mean the wrong setting was set in effective_cache_size.

ts=# SELECT name, current_setting(name), SOURCE FROM pg_settings WHERE SOURCE='configuration file';
 dynamic_shared_memory_type  | posix | configuration file
 effective_cache_size        | 64GB  | configuration file
 effective_io_concurrency    | 8     | configuration file
 huge_pages                  | try   | configuration file
 log_autovacuum_min_duration | 0     | configuration file
 log_checkpoints             | on    | configuration file
 maintenance_work_mem        | 6GB   | configuration file
 max_connections             | 200   | configuration file
 max_wal_size                | 4GB   | configuration file
 min_wal_size                | 6GB   | configuration file
 shared_buffers              | 8GB   | configuration file
 wal_compression             | on    | configuration file
 work_mem                    | 1GB   | configuration file

I changed at least maintenance_work_mem since I originally wrote, to try to avoid tempfiles during REINDEX (though I'm not sure it matters, as the tempfiles are effectively cached and may never actually be written).

It's entirely possible those settings aren't ideal. The server has 72GB RAM. There are usually very few (typically n<3, but at most a handful) nontrivial queries running at once, if at all.
I wouldn't expect any data that's not recent (table data from the last 2 days, or indexes from this month) to be cached, and wouldn't expect that to be entirely cached, either:

ts=# SELECT sum(pg_table_size(oid))/1024^3 gb FROM pg_class WHERE relname~'_2016_05_..$';
gb | 425.783050537109
ts=# SELECT sum(pg_table_size(oid))/1024^3 gb FROM pg_class WHERE relname~'_2016_05_...*idx';
gb | 60.0909423828125
ts=# SELECT sum(pg_table_size(oid))/1024^3 gb FROM pg_class WHERE relname~'_201605.*idx';
gb | 4.85528564453125
ts=# SELECT sum(pg_table_size(oid))/1024^3 gb FROM pg_class WHERE relname~'_201605$';
gb | 86.8688049316406

As a test, I did SET effective_cache_size='1MB' before running explain, and it still does:

| -> Index Scan using cdrs_huawei_pgwrecord_2016_05_29_recordopeningtime_idx on cdrs_huawei_pgwrecord_2016_05_29 (cost=0.44..1526689.49 rows=8342796 width=355)
|    Index Cond: ((recordopeningtime >= '2016-05-29 00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-30 00:00:00'::timestamp without time zone))

I set enable_indexscan=0, and got:

| -> Bitmap Heap Scan on cdrs_huawei_pgwrecord_2016_05_29 (cost=168006.10..4087526.04 rows=8342796 width=355)
|    Recheck Cond: ((recordopeningtime >= '2016-05-29 00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-30 00:00:00'::timestamp without time zone))
|    -> Bitmap Index Scan on cdrs_huawei_pgwrecord_2016_05_29_recordopeningtime_idx (cost=0.00..165920.40 rows=8342796 width=0)
|       Index Cond: ((recordopeningtime >= '2016-05-29 00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-30 00:00:00'::timestamp without time zone))

Here's a minimal query which seems to isolate the symptom:

ts=# explain (analyze,buffers) SELECT sum(duration) FROM cdrs_huawei_pgwrecord_2016_05_22 WHERE recordopeningtime>='2016-05-22' AND recordopeningtime<'2016-05-23';
| Aggregate (cost=2888731.67..2888731.68 rows=1 width=8) (actual time=388661.892..388661.892 rows=1 loops=1)
|   Buffers: shared hit=4058501 read=1295147 written=35800
|   -> Index Scan using cdrs_huawei_pgwrecord_2016_05_22_recordopeningtime_idx on cdrs_huawei_pgwrecord_2016_05_22 (cost=0.56..2867075.33 rows=8662534 width=8) (actual time=0.036..379332.910 rows=8575673 loops=1)
|      Index Cond: ((recordopeningtime >= '2016-05-22 00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-23 00:00:00'::timestamp without time zone))
|      Buffers: shared hit=4058501 read=1295147 written=35800
| Planning time: 0.338 ms
| Execution time: 388661.947 ms

And here's an older one to avoid cache, with enable_indexscan=0:

|ts=# explain (analyze,buffers) SELECT sum(duration) FROM cdrs_huawei_pgwrecord_2016_05_08 WHERE recordopeningtime>='2016-05-08' AND recordopeningtime<'2016-05-09';
| Aggregate (cost=10006286.58..10006286.59 rows=1 width=8) (actual time=44219.156..44219.156 rows=1 loops=1)
|   Buffers: shared hit=118 read=1213887 written=50113
|   -> Bitmap Heap Scan on cdrs_huawei_pgwrecord_2016_05_08 (cost=85142.24..9985848.96 rows=8175048 width=8) (actual time=708.024..40106.062 rows=8179338 loops=1)
|      Recheck Cond: ((recordopeningtime >= '2016-05-08 00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-09 00:00:00'::timestamp without time zone))
|      Rows Removed by Index Recheck: 74909
|      Heap Blocks: lossy=1213568
|      Buffers: shared hit=118 read=1213887 written=50113
|      -> Bitmap Index Scan on cdrs_huawei_pgwrecord_2016_05_08_recordopeningtime_idx1 (cost=0.00..83098.48 rows=8175048 width=0) (actual time=706.557..706.557 rows=12135680 loops=1)
|         Index Cond: ((recordopeningtime >= '2016-05-08 00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-09 00:00:00'::timestamp without time zone))
|         Buffers: shared hit=117 read=320
| Planning time: 214.786 ms
| Execution time: 44228.874 ms
|(12 rows)

Thanks for your help.

Justin
On Fri, Jun 3, 2016 at 8:54 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> As a test, I did SET effective_cache_size='1MB' before running explain, and
> it still does:
>
> | -> Index Scan using cdrs_huawei_pgwrecord_2016_05_29_recordopeningtime_idx on cdrs_huawei_pgwrecord_2016_05_29 (cost=0.44..1526689.49 rows=8342796 width=355)
> |    Index Cond: ((recordopeningtime >= '2016-05-29 00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-30 00:00:00'::timestamp without time zone))
>
> I set enable_indexscan=0, and got:
>
> | -> Bitmap Heap Scan on cdrs_huawei_pgwrecord_2016_05_29 (cost=168006.10..4087526.04 rows=8342796 width=355)
> |    Recheck Cond: ((recordopeningtime >= '2016-05-29 00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-30 00:00:00'::timestamp without time zone))
> |    -> Bitmap Index Scan on cdrs_huawei_pgwrecord_2016_05_29_recordopeningtime_idx (cost=0.00..165920.40 rows=8342796 width=0)
> |       Index Cond: ((recordopeningtime >= '2016-05-29 00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-30 00:00:00'::timestamp without time zone))
>
> Here's a minimal query which seems to isolate the symptom:
>
> ts=# explain (analyze,buffers) SELECT sum(duration) FROM cdrs_huawei_pgwrecord_2016_05_22 WHERE recordopeningtime>='2016-05-22' AND recordopeningtime<'2016-05-23';
> | Aggregate (cost=2888731.67..2888731.68 rows=1 width=8) (actual time=388661.892..388661.892 rows=1 loops=1)
> |   Buffers: shared hit=4058501 read=1295147 written=35800
> |   -> Index Scan using cdrs_huawei_pgwrecord_2016_05_22_recordopeningtime_idx on cdrs_huawei_pgwrecord_2016_05_22 (cost=0.56..2867075.33 rows=8662534 width=8) (actual time=0.036..379332.910 rows=8575673 loops=1)
> |      Index Cond: ((recordopeningtime >= '2016-05-22 00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-23 00:00:00'::timestamp without time zone))
> |      Buffers: shared hit=4058501 read=1295147 written=35800
> | Planning time: 0.338 ms
> | Execution time: 388661.947 ms
>
> And here's an older one to avoid cache, with enable_indexscan=0:
>
> |ts=# explain (analyze,buffers) SELECT sum(duration) FROM cdrs_huawei_pgwrecord_2016_05_08 WHERE recordopeningtime>='2016-05-08' AND recordopeningtime<'2016-05-09';
> | Aggregate (cost=10006286.58..10006286.59 rows=1 width=8) (actual time=44219.156..44219.156 rows=1 loops=1)
> |   Buffers: shared hit=118 read=1213887 written=50113
> |   -> Bitmap Heap Scan on cdrs_huawei_pgwrecord_2016_05_08 (cost=85142.24..9985848.96 rows=8175048 width=8) (actual time=708.024..40106.062 rows=8179338 loops=1)
> |      Recheck Cond: ((recordopeningtime >= '2016-05-08 00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-09 00:00:00'::timestamp without time zone))
> |      Rows Removed by Index Recheck: 74909
> |      Heap Blocks: lossy=1213568
> |      Buffers: shared hit=118 read=1213887 written=50113
> |      -> Bitmap Index Scan on cdrs_huawei_pgwrecord_2016_05_08_recordopeningtime_idx1 (cost=0.00..83098.48 rows=8175048 width=0) (actual time=706.557..706.557 rows=12135680 loops=1)
> |         Index Cond: ((recordopeningtime >= '2016-05-08 00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-09 00:00:00'::timestamp without time zone))
> |         Buffers: shared hit=117 read=320
> | Planning time: 214.786 ms
> | Execution time: 44228.874 ms
> |(12 rows)

Correct me if I'm wrong, but this looks like the planner not accounting for correlation when using bitmap heap scans.

Checking the source, it really doesn't.
So correlated index scans look extra favourable vs bitmap index scans because bitmap heap scans consider random page costs sans correlation effects (even though correlation applies to bitmap heap scans as well). While that sounds desirable a priori, it seems it's hurting this case quite badly. I'm not sure there's any simple way of working around that.
On Fri, Jun 3, 2016 at 6:54 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> max_wal_size | 4GB | configuration file
> min_wal_size | 6GB | configuration file

Just a minor digression -- it generally doesn't make sense to configure the minimum for something greater than the maximum for that same thing. That should have no bearing on the performance issue raised in the thread, but you might want to fix it anyway.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Claudio Freire <klaussfreire@gmail.com> writes:
> So correlated index scans look extra favourable vs bitmap index scans
> because bitmap heap scans consider random page costs sans correlation
> effects (even though correlation applies to bitmap heap scans as
> well).

Really? How? The index ordering has nothing to do with the order in which heap tuples will be visited.

			regards, tom lane
On Sun, Jun 5, 2016 at 9:03 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Claudio Freire <klaussfreire@gmail.com> writes:
>> So correlated index scans look extra favourable vs bitmap index scans
>> because bitmap heap scans consider random page costs sans correlation
>> effects (even though correlation applies to bitmap heap scans as
>> well).
>
> Really? How? The index ordering has nothing to do with the order in
> which heap tuples will be visited.

It is not the order itself, but the density.

If the index is read in a range scan (as opposed to an =ANY scan), and the index lead column is correlated with the table ordering, then the parts of the table that need to be visited will be much denser than if there were no correlation. But Claudio is saying that this is not being accounted for.

Cheers,

Jeff
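For context, the correlation adjustment at issue (cost_index() in costsize.c) works out to roughly max_IO_cost + correlation^2 * (min_IO_cost - max_IO_cost), so a correlation near 1 prices the index scan's heap fetches as nearly sequential, while the bitmap path gets no such discount. A rough worked example, with made-up min/max I/O costs:

-- purely illustrative numbers: min_IO_cost=100, max_IO_cost=4400
SELECT 4400 + (0.999 * 0.999) * (100 - 4400) AS io_cost_at_corr_0_999,  -- ~108.6
       4400 + (0.5   * 0.5  ) * (100 - 4400) AS io_cost_at_corr_0_5;    -- 3325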
Regarding this earlier thread:
https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#20160524173914.GA11880@telsasoft.com

On Tue, May 24, 2016 at 10:39 AM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> Summary: Non-unique btree indices are returning CTIDs for rows with the same
> value of the indexed column not in logical order, imposing a high performance
> penalty.

I have to point out that by "logical" I clearly meant "physical"; hopefully nobody was too misled..

On Sun, Jun 05, 2016 at 12:28:47PM -0700, Jeff Janes wrote:
> On Sun, Jun 5, 2016 at 9:03 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Claudio Freire <klaussfreire@gmail.com> writes:
> >> So correlated index scans look extra favourable vs bitmap index scans
> >> because bitmap heap scans consider random page costs sans correlation
> >> effects (even though correlation applies to bitmap heap scans as
> >> well).
> >
> > Really? How? The index ordering has nothing to do with the order in
> > which heap tuples will be visited.
>
> It is not the order itself, but the density.
>
> If the index is read in a range scan (as opposed to an =ANY scan), and
> the index lead column is correlated with the table ordering, then the
> parts of the table that need to be visited will be much denser than if
> there were no correlation. But Claudio is saying that this is not
> being accounted for.

I didn't completely understand Claudio/Jeff here, and am not sure if we're on the same page. For queries on these tables, the index scan was very slow, due to a fragmented index on a non-unique column, and a seq scan would have been (was) faster (even if it means reading 70GB and filtering out 6 of 7 days' data). That was resolved by adding a nightly reindex job (.. which sometimes competes with other maintenance and has trouble running on every table every night).

But I did find that someone else had previously reported this problem (in a strikingly similar context and message, perhaps clearer than mine):
https://www.postgresql.org/message-id/flat/520D6610.8040907%40emulex.com#520D6610.8040907@emulex.com

I also found this older thread:
https://www.postgresql.org/message-id/flat/n6cmpug13b9rk1srebjvhphg0lm8dou1kn%404ax.com#n6cmpug13b9rk1srebjvhphg0lm8dou1kn@4ax.com

There was mention of a TODO item:

 * Compute index correlation on CREATE INDEX and ANALYZE, use it for index
 * scan cost estimation

.. but perhaps I misunderstand and that's long since been resolved?

Justin
On Sat, Aug 13, 2016 at 3:54 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Sun, Jun 05, 2016 at 12:28:47PM -0700, Jeff Janes wrote:
>> On Sun, Jun 5, 2016 at 9:03 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> > Claudio Freire <klaussfreire@gmail.com> writes:
>> >> So correlated index scans look extra favourable vs bitmap index scans
>> >> because bitmap heap scans consider random page costs sans correlation
>> >> effects (even though correlation applies to bitmap heap scans as
>> >> well).
>> >
>> > Really? How? The index ordering has nothing to do with the order in
>> > which heap tuples will be visited.
>>
>> It is not the order itself, but the density.
>>
>> If the index is read in a range scan (as opposed to an =ANY scan), and
>> the index lead column is correlated with the table ordering, then the
>> parts of the table that need to be visited will be much denser than if
>> there were no correlation. But Claudio is saying that this is not
>> being accounted for.
>
> I didn't completely understand Claudio/Jeff here, and am not sure if we're on the
> same page. For queries on these tables, the index scan was very slow, due to
> a fragmented index on a non-unique column, and a seq scan would have been (was)
> faster (even if it means reading 70GB and filtering out 6 of 7 days' data).
> That was resolved by adding a nightly reindex job (.. which sometimes competes
> with other maintenance and has trouble running on every table every night).

Yes, but a bitmap index scan should be faster than both; the planner is discarding it because it estimates it will be slower, because it doesn't account for correlation between index keys and physical location.

And, while what you clarify there would indeed affect the estimation for index scans, it would only make the issue worse: the planner thinks the index scan will be better than it really is, because it's expecting correlation, but the "fragmentation" of same-key runs destroys that correlation. A bitmap index scan would restore it, though, so the bitmap index scan would be that much better.
estimate correlation of index separately from table (Re: [PERFORM] index fragmentation on insert-only table with non-unique column)

From: Justin Pryzby
Months ago I reported an issue with very slow index scans of tables with high "correlation" of their indexed column, due to (we concluded at the time) duplicate/repeated values of that column causing many lseek()s. References:

https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#20160524173914.GA11880@telsasoft.com
https://www.postgresql.org/message-id/flat/520D6610.8040907%40emulex.com#520D6610.8040907@emulex.com
https://www.postgresql.org/message-id/flat/n6cmpug13b9rk1srebjvhphg0lm8dou1kn%404ax.com#n6cmpug13b9rk1srebjvhphg0lm8dou1kn@4ax.com

My interim workarounds were a reindex job and daily-granularity partitions (despite having an excessive number of child tables), to encourage seq scans for the daily analysis jobs rather than index scans. I've now cobbled together a patch to throw around and see if we can improve on that.

I tried several changes, hoping to discourage the index scan. The logic that seems to most accurately reflect costs has two changes:

Postgres' existing behavior stores table correlation (heap value vs. position) but doesn't look at index correlation, so it can't distinguish between a just-built index and a highly fragmented index, or one with highly-nonsequential TIDs. My patch causes ANALYZE to do a TID scan (sampled across the MCVs) to determine correlation of heap TID vs index tuple logical location (as opposed to the table correlation, computed as heap TID vs. heap value).

The second change averages separate correlation values of small lists of 300 consecutive TIDs, rather than the coarse-granularity/aggregate correlation of a small sample of pages across the entire index. Postgres' existing sampling is designed to give an even sample across all rows. An issue with this coarse-granularity correlation is that the index can have a broad correlation (to physical heap location), but with many small-scale deviations, which don't show up due to sampling a relatively small fraction of a large table; and/or the effect of the deviations is insignificant/noise and correlation is still computed near 1.

I believe the "large scale" correlation that postgres computes from the block sample fails to represent well the small-scale uncorrelated reads, which contribute a large number of random reads that are not included in the planner's cost.

Not only are the index reads highly random (which the planner already assumes), but the CTIDs referenced within a given btree leaf page are also substantially non-sequential. It seems to affect INSERTs which, over a short interval of time, have a small-to-moderate range of column values, but which still have a strong overall trend in that column WRT time (and heap location). Think of inserting a "normal" distribution of timestamps centered around now(). My original report shows lseek() for nearly every read on the *heap*.

The original problem was on a table of telecommunications CDRs, indexed by "record opening time" (start time) and with a high correlation value in pg_stats. We import data records from a file, which is probably more or less in order of "end time". That still displays broad correlation on "start time", but individual TIDs are nowhere near sequential. Two phone calls which end in the same 1-second interval are not unlikely to have started many minutes apart... but two calls which end within the same second are very likely to have started within an hour of each other, since the typical duration is <1h.
But insert volume is high, and there are substantial repeated keys, so the portion of an index scan returning CTIDs for some certain key value (a timestamp with second resolution in our case) ends up reading heap tuples for a non-negligible fraction of the table: maybe only 1%, but that's 300 seeks across 1% of a table which is 10s of GB ... which is still 300 seeks, and takes long enough that cache is inadequate to substantially mitigate the cost.

It's not clear to me that there's a good way to evenly *sample* a fraction of the index blocks in a manner which is agnostic to different AMs. Scanning the entirety of all indices on relations during (auto)analyze may be too expensive. So I implemented an index scan of the MCV list. I'm guessing this might cause the correlation to be under-estimated, and prefer bitmap scans somewhat more than justified, due to the btree insertion logic for repeated keys (which exists to reduce O(n^2) behavior). The MCV list isn't perfect, since the problem can happen with e.g. normally distributed floating point values (or timestamps with fractional seconds).

I ran pageinspect on a recent child of the table that triggered the original report:

ts=# SELECT itemoffset, ctid FROM bt_page_items('cdrs_huawei_pgwrecord_2017_07_07_recordopeningtime_idx',6) LIMIT 22 OFFSET 1;
 itemoffset |  ctid
------------+--------
          2 | (81,4)
          3 | (5,6)
          4 | (6,5)
          5 | (9,1)
          6 | (10,1)
          7 | (14,1)
          8 | (21,8)
          9 | (25,1)
         10 | (30,5)
         11 | (41,1)
         12 | (48,7)
         13 | (49,3)
         14 | (50,2)
         15 | (51,1)
         16 | (54,1)
         17 | (57,7)
         18 | (58,6)
         19 | (59,6)
         20 | (63,5)
         21 | (71,6)
         22 | (72,5)
         23 | (78,7)
(22 rows)

=> 22 adjacent index tuples referencing 22 different heap pages, only 6 of which are sequential => 16 lseek()s, aka random page cost.

To generate data demonstrating this problem you can do things like:

| CREATE TABLE t(i int, j int);
| CREATE INDEX ON t(i);
| \! time for a in `seq 99`; do psql -qc 'INSERT INTO t SELECT * FROM generate_series(1,99)'; done
-- or:
| INSERT INTO t SELECT (0.001*a+9*(random()-0.5))::int FROM generate_series(1,99999) a;
-- or this one, perhaps closest to our problem case:
| INSERT INTO t SELECT a FROM generate_series(1,999) a, generate_series(1,999) b ORDER BY a+b/9.9;

Note, I was able to create a case using floats without repeated keys:
| INSERT INTO w SELECT i/99999.0+pow(2,(-random())) FROM generate_series(1,999999) i ORDER BY i

| ANALYZE t;
-- note: the correlation is even higher for larger tables with the same number of
-- repeated keys, which is bad since the cost should go up linearly with the
-- size and associated number of lseek()s. That's one component of the problem,
-- and I think a necessary component of any solution.

postgres=# SELECT tablename, attname, correlation, array_length(most_common_vals,1) FROM pg_stats WHERE tablename LIKE 't%';
 tablename | attname | correlation | array_length
-----------+---------+-------------+--------------
 t         | i       |    0.995212 |           87
 t         | j       |             |
 t_i_idx   | i       |   -0.892874 |           87
                       ^^^^^^^

... this last line is added by my patch. That's a bad combination: high table correlation means the planner thinks only a fraction of the heap will be read, and sequentially, so it isn't discouraged from doing an index scan. But the index TIDs are much less well correlated (0.89 vs 0.99).

Note: the negative correlation at tuple-level seems to be related to this comment:
src/backend/access/nbtree/nbtinsert.c- * Once we have chosen the page to put the key on, we'll insert it before
src/backend/access/nbtree/nbtinsert.c: * any existing equal keys because of the way _bt_binsrch() works.
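Relatedly, the per-key spread across heap pages can be seen directly on that toy table; this is just an illustration (the key value 50 is arbitrary), pulling the block number out of each matching row's ctid:

| SELECT i,
|        count(*) AS n_rows,
|        count(DISTINCT (ctid::text::point)[0]) AS n_heap_pages
| FROM t
| WHERE i = 50
| GROUP BY i;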
Note also:

|postgres=# SELECT leaf_fragmentation FROM pgstatindex('t_i_idx');
|leaf_fragmentation | 67.63

.. but keep in mind that leaf_fragmentation only checks leaf page order, and not CTID order within index pages. The planner already assumes index pages are random reads. Maybe it's a good indicator (?), but it doesn't lend itself to an accurate cost estimation.

For testing purposes, with:

| shared_buffers | 128kB
| public | t | table | pryzbyj | 35 MB |

| SET enable_bitmapscan=off;
| SET enable_seqscan=off;
| SELECT pg_backend_pid();
| SELECT sum(j) FROM t WHERE i<99;
| Time: 3974.205 ms

% sudo strace -p 530 2>/tmp/strace-pg
% sed -r '/^\(read|lseek/!d; s/ .*//' /tmp/strace-pg |sort |uniq -c |sort -nr |head
  39474 lseek(41,
    359 lseek(44,
      8 lseek(18,

=> 40k seeks on a 35MB table, that's 10 seeks per heap page!

open("base/12411/16634", O_RDWR) = 41
open("base/12411/16636", O_RDWR) = 44

postgres=# SELECT relfilenode, relname FROM pg_class WHERE relfilenode>16630;
 16634 | t
 16636 | t_i_idx

2017-07-07 17:45:54.075 CDT [6360] WARNING: HERE 1222: csquared=0.797225 minIO/R-P-C=109.750000 maxIO/R-P-C=4416.000000 costsize.c cost_index 702

With my patch, the index scan is estimated to take ~4400 seeks, rather than the actual 40k, probably because max_IO_cost assumes that a given heap page will be visited only once. But in the worst case each of its tuples would require a separate lseek()... I'm not suggesting to change that, since making max_IO 100x higher will probably change query plans more dramatically than desired. But also note that, unpatched, with table correlation >0.99, postgres would've under-estimated the I/O cost not by a factor of 10x but by 400x.

| postgres=# REINDEX INDEX t_i_idx;
| postgres=# ANALYZE t;
| postgres=# SELECT tablename, attname, correlation, array_length(most_common_vals,1) FROM pg_stats WHERE tablename LIKE 't%';
|  tablename | attname | correlation | array_length
| -----------+---------+-------------+--------------
|  t         | i       |    0.995235 |           67
|  t_i_idx   | i       |           1 |           67

=> Correctly distinguishes the freshly reindexed table.

% sudo strace -p 6428 2>/tmp/strace-pg8
% sed -r '/^\(read|lseek/!d; s/ .*//' /tmp/strace-pg8 |sort |uniq -c |sort -nr
     99 lseek(37,

2017-07-07 17:49:47.745 CDT [6428] WARNING: HERE 1222: csquared=1.000000 minIO/R-P-C=108.500000 maxIO/R-P-C=4416.000000 costsize.c cost_index 702

=> csquared=1, so the min_IO cost is used instead of something close to max_IO (this patch doesn't change the computation of those values at all, just changes the correlation, whose squared value is used to interpolate between the correlated/min cost and the uncorrelated/max cost). The correlated estimate of 108 seeks vs 99 actual is essentially what unpatched postgres would've computed by using the table correlation of 0.9952, implicitly assuming the index to be similarly correlated.

I hope that demonstrates the need to distinguish index correlation from table correlation. I'm hoping for comments on the existing patch, specifically whether there's a better way to sample the index than "MCVs or partial index scan". I've left some fragments in place from an earlier implementation involving btree page-level fragmentation (like pgstatindex).

Also: does it make sense to keep MCV/histogram for a non-expression index which duplicates a table column? The stats lookup in selfuncs.c btcostestimate() would have to check for correlation from the index, and the rest of the stats from its table.

A bitmap layer adds overhead, which should be avoided if not needed.
But it shouldn't be a huge impact, and I think its relative effect is only high when returning a small number of rows.

I'm thinking of a few cases:

 - unique / low-cardinality index scans, or scans without duplicate keys: should have low index_pages_fetched(), so max_IO_cost should not be very different from min_IO_cost, and the new index-based correlation shouldn't have an effect much different from the table correlation;
 - unclustered/uncorrelated tables: tables whose heap has low correlation are already discouraged from index scans; this includes tables whose column is UPDATEd and not just INSERTed;
 - table with correlated heap AND index: csquared should still be ~0.99 and not change much;
 - correlated table with uncorrelated index: this is the target case, with the intended behavior change.

My apologies: I think that's a bit of a brain dump.

Justin
Re: estimate correlation of index separately from table (Re: [PERFORM] index fragmentation on insert-only table with non-unique column)

From: Peter Geoghegan
On Fri, Jul 7, 2017 at 4:41 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> The second change averages separate correlation values of small lists of 300
> consecutive TIDs, rather than the coarse-granularity/aggregate correlation of a
> small sample of pages across the entire index. Postgres' existing sampling is
> designed to give an even sample across all rows. An issue with this
> coarse-granularity correlation is that the index can have a broad correlation
> (to physical heap location), but with many small-scale deviations, which don't
> show up due to sampling a relatively small fraction of a large table; and/or
> the effect of the deviations is insignificant/noise and correlation is still
> computed near 1.
>
> I believe the "large scale" correlation that postgres computes from the block
> sample fails to represent well the small-scale uncorrelated reads, which
> contribute a large number of random reads that are not included in the
> planner's cost.

All of that may well be true, but I've actually come around to the view that we should treat TID as a first class part of the keyspace, that participates in comparisons as an implicit last column in all cases (not just when B-Trees are built within tuplesort.c). That would probably obviate the need for a patch like this entirely, because pg_stats.correlation would be unaffected by duplicates.

I think that this would have a number of advantages, some of which might be significant. For example, I think it could make LP_DEAD cleanup within nbtree more effective for some workloads, especially workloads where it is important for HOT pruning and LP_DEAD cleanup to work in concert -- locality of access probably matters with that. Also, with every entry in the index guaranteed to be unique, we can imagine VACUUM being much more selective with killing index entries, when the TID array it uses happens to be very small. With the freeze map stuff that's now in place, being able to do that matters more than before.

The original Lehman & Yao algorithm is supposed to have unique keys in all cases, but we don't follow that in order to make duplicates work, which is implemented by changing an invariant (see the nbtree README for details). So, this could simplify some aspects of how binary searches must work in the face of having to deal with non-unique keys. I think that there are cases where many non-HOT UPDATEs have to go through a bunch of duplicate leaf tuples and do visibility checking on old versions for no real benefit. With the TID as part of the keyspace, we could instead tell the B-Tree code to insert a new tuple as part of an UPDATE while using the TID as part of its insertion scan key, so rummaging through many duplicates is avoided.

That having been said, it would be hard to do this for all the reasons I went into in that thread you referenced [1]. If you're going to treat TID as a part of the keyspace, you have to totally embrace that, which means that the internal pages need to have heap TIDs too (not just pointers to the lower levels in the B-Tree, which the IndexTuple's t_tid pointer is used for there). Those are the places where you need to literally append this new, implicit heap TID column as if it was just another user-visible attribute, since that information isn't stored in the internal pages today at all. Doing that has a cost, which isn't going to be acceptable if we naively append a heap TID to every internal page IndexTuple. With a type like int4, you're going to completely bloat those pages, with big consequences for fan-in.
So, really, what needs to happen to make it work is someone needs to write a suffix truncation patch, which implies that only those cases that actually benefit from increasing the width of internal page IndexTuples (those with many "would-be duplicates") pay the cost. This is a classic technique, which I've actually already prototyped, though that's extremely far from being worth posting here. That was just to verify my understanding.

I think that I should write and put up for discussion a design document for various nbtree enhancements. These include internal page suffix truncation, prefix compression, and key normalization. I'm probably not going to take any of these projects on, but it would be useful if there was at least a little bit of clarity about how they could be implemented. Maybe we could reach some kind of weak consensus without going to great lengths. These optimizations are closely intertwined things, and the lack of clarity on how they all fit together is probably holding back an implementation of any one of them.

[1] https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#20160524173914.GA11880@telsasoft.com

--
Peter Geoghegan