Thread: index fragmentation on insert-only table with non-unique column

index fragmentation on insert-only table with non-unique column

From: Justin Pryzby
Summary: Non-unique btree indices are returning CTIDs for rows with the same
value of the indexed column not in logical order, imposing a high performance
penalty.

Running PG 9.5.3 now, we have time-based partitions of append-only tables
with data loaded from other sources.  The tables are partitioned by time, and
the timestamp column has a non-unique, not-null btree index.

The child tables are each ~75GB and expected to keep growing.  For a child
table with a week's worth of data:
relpages  | 11255802
reltuples | 5.90502e+07

The data is loaded shortly after it's available, so it has high correlation in
pg_statistic:
[pryzbyj@viaero ~]$ psql ts -c "SELECT tablename, correlation, n_distinct FROM pg_stats s JOIN pg_class c ON
(c.relname=s.tablename) WHERE tablename LIKE 'cdrs_huawei_pgwrecord%' AND attname='recordopeningtime' ORDER BY 1" |head
            tablename             | correlation | n_distinct
----------------------------------+-------------+------------
 cdrs_huawei_pgwrecord            |    0.999997 |     102892
 cdrs_huawei_pgwrecord_2016_02_15 |    0.999658 |      96145
 cdrs_huawei_pgwrecord_2016_02_22 |    0.999943 |      91916
 cdrs_huawei_pgwrecord_2016_02_29 |    0.997219 |      50341
 cdrs_huawei_pgwrecord_2016_03_01 |    0.999947 |      97485

But note the non-uniqueness of the index column:
ts=# SELECT recordopeningtime, COUNT(1) FROM cdrs_huawei_pgwrecord WHERE recordopeningtime>='2016-05-21' AND
recordopeningtime<'2016-05-22' GROUP BY 1 ORDER BY 2 DESC;
  recordopeningtime  | count
---------------------+-------
 2016-05-21 12:17:29 |   176
 2016-05-21 12:17:25 |   171
 2016-05-21 13:11:33 |   170
 2016-05-21 10:20:02 |   169
 2016-05-21 11:30:02 |   167
[...]

We have a daily analytic query which processes the previous day's data.  For
new child tables, with only one day's data loaded, this runs in ~30min; for
child tables with an entire week's worth of data loaded, it takes several hours
(even though both queries process the same amount of data).

First, I found I was able to get 30-50min query results on the full week's table by
preferring a seq scan to an index scan.  The row estimates seemed fine, and the
only condition is the timestamp, so the planner's use of an index scan is as
expected.

AFAICT what's happening is that the index scan was returning pages
nonsequentially.  strace-ing the backend showed alternating lseek()s and
read()s, with the offsets not consistently increasing (nor consistently
decreasing):
% sudo strace -p 25588 2>&1 |grep -m9 'lseek(773'
lseek(773, 1059766272, SEEK_SET)        = 1059766272
lseek(773, 824926208, SEEK_SET)         = 824926208
lseek(773, 990027776, SEEK_SET)         = 990027776
lseek(773, 990330880, SEEK_SET)         = 990330880
lseek(773, 1038942208, SEEK_SET)        = 1038942208
lseek(773, 1059856384, SEEK_SET)        = 1059856384
lseek(773, 977305600, SEEK_SET)         = 977305600
lseek(773, 990347264, SEEK_SET)         = 990347264
lseek(773, 871096320, SEEK_SET)         = 871096320

.. and consecutive read()s being rare:
read(802, "g"..., 8192)                 = 8192
lseek(802, 918003712, SEEK_SET)         = 918003712
read(802, "c"..., 8192)                 = 8192
lseek(802, 859136000, SEEK_SET)         = 859136000
read(802, "a"..., 8192)                 = 8192
lseek(802, 919601152, SEEK_SET)         = 919601152
read(802, "d"..., 8192)                 = 8192
lseek(802, 905101312, SEEK_SET)         = 905101312
read(802, "c"..., 8192)                 = 8192
lseek(801, 507863040, SEEK_SET)         = 507863040
read(801, "p"..., 8192)                 = 8192
lseek(802, 914235392, SEEK_SET)         = 914235392
read(802, "c"..., 8192)                 = 8192

I was able to see great improvement without planner parameters by REINDEXing
the timestamp index.  My theory is that the index/planner doesn't handle the
case of many tuples with the same column value well, and returns pages out of
logical order.  Reindex fixes that, rewriting the index data with pages in
order (confirmed with pageinspect), which causes index scans to fetch heap
data more or less monotonically (if not consecutively).  strace shows that
consecutive read()s are common (without intervening seeks).  I gather this
allows the OS readahead to kick in.
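
(A sketch of that pageinspect check; it assumes the pageinspect extension, an
arbitrary leaf block number, and one of the partition index names from this
thread:)

CREATE EXTENSION IF NOT EXISTS pageinspect;
-- list the heap CTIDs referenced by one btree leaf page: after REINDEX the
-- block numbers increase nearly monotonically; on a fragmented index they
-- jump around
SELECT itemoffset, ctid
FROM bt_page_items('cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx', 6);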

Postgres seems to assume that the high degree of correlation of the table
column seen in pg_stats reflects how it will fetch data from the index scan,
an assumption which turns out to be very poor on a highly fragmented
index.  Is there a way to help it understand otherwise?

Maybe this is a well-understood problem/deficiency; but is there any reason
why a btree scan can't sort its results by ctid for index tuples with the same
column value (_bt_steppage() or btgettuple())?  Or maybe the problem could be
mitigated by changing the behavior during INSERT?  In the meantime, I'll be
implementing a reindex job.
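
(A minimal sketch of such a job; on 9.5, REINDEX locks out writes to the
table, so it has to run in a quiet window, and the index name here is one of
the partition indexes:)

-- rebuild the fragmented timestamp index on one partition, then refresh stats
REINDEX INDEX cdrs_huawei_pgwrecord_2016_05_22_recordopeningtime_idx;
ANALYZE cdrs_huawei_pgwrecord_2016_05_22;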

Thanks,
Justin


Re: index fragmentation on insert-only table with non-unique column

From: Peter Geoghegan
On Tue, May 24, 2016 at 10:39 AM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> I was able to see great improvement without planner parameters by REINDEXing
> the timestamp index.  My theory is that the index/planner doesn't handle the
> case of many tuples with the same column value well, and returns pages out of
> logical order.  Reindex fixes that, rewriting the index data with pages in
> order (confirmed with pageinspect), which causes index scans to fetch heap
> data more or less monotonically (if not consecutively).  strace shows that
> consecutive read()s are common (without intervening seeks).  I gather this
> allows the OS readahead to kick in.

The basic problem is that the B-Tree code doesn't maintain this
property. However, B-Tree index builds will create an index that
initially has this property, because the tuplesort.c code happens to
sort index tuples with a CTID tie-breaker.

> Postgres seems to assume that the high degree of correlation of the table
> column seen in pg_stats reflects how it will fetch data from the index scan,
> an assumption which turns out to be very poor on a highly fragmented
> index.  Is there a way to help it understand otherwise?

Your complaint is vague. Are you complaining about the planner making
a poor choice? I don't think that's the issue here, because you never
made any firm statement about the planner making a choice that was
worse than an alternative that it had available.

If you're arguing for the idea that B-Trees should reliably keep
tuples in order by a tie-break condition, that seems difficult to
implement, and likely not worth it in practice.

--
Peter Geoghegan


Re: index fragmentation on insert-only table with non-unique column

From: Stephen Frost
* Peter Geoghegan (pg@bowt.ie) wrote:
> On Tue, May 24, 2016 at 10:39 AM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> > I was able to see great improvement without planner parameters by REINDEXing
> > the timestamp index.  My theory is that the index/planner doesn't handle the
> > case of many tuples with the same column value well, and returns pages out of
> > logical order.  Reindex fixes that, rewriting the index data with pages in
> > order (confirmed with pageinspect), which causes index scans to fetch heap
> > data more or less monotonically (if not consecutively).  strace shows that
> > consecutive read()s are common (without intervening seeks).  I gather this
> > allows the OS readahead to kick in.
>
> The basic problem is that the B-Tree code doesn't maintain this
> property. However, B-Tree index builds will create an index that
> initially has this property, because the tuplesort.c code happens to
> sort index tuples with a CTID tie-breaker.
>
> > Postgres seems to assume that the high degree of correlation of the table
> > column seen in pg_stats reflects how it will fetch data from the index scan,
> > an assumption which turns out to be very poor on a highly fragmented
> > index.  Is there a way to help it understand otherwise?
>
> Your complaint is vague. Are you complaining about the planner making
> a poor choice? I don't think that's the issue here, because you never
> made any firm statement about the planner making a choice that was
> worse than an alternative that it had available.
>
> If you're arguing for the idea that B-Trees should reliably keep
> tuples in order by a tie-break condition, that seems difficult to
> implement, and likely not worth it in practice.

The ongoing discussion around how to effectively handle duplicate values
in B-Trees seems relevant to this.  In particular, if we're able to store
duplicate values efficiently, and all the heap locations under a single key
are easily available, then we could certainly sort those prior to going
and visiting them.

That's not quite the same as keeping the tuples in order in the heap,
but it would more or less achieve the desired effect, I believe?

Thanks!

Stephen


Re: index fragmentation on insert-only table with non-unique column

From: Tom Lane
Peter Geoghegan <pg@bowt.ie> writes:
> The basic problem is that the B-Tree code doesn't maintain this
> property. However, B-Tree index builds will create an index that
> initially has this property, because the tuplesort.c code happens to
> sort index tuples with a CTID tie-breaker.

Yeah.  I wonder what would happen if we used the same rule for index
insertions.  It would likely make insertions more expensive, but maybe
not by much.  The existing "randomization" rule for where to insert new
items in a run of identical index entries would go away, because the
insertion point would become deterministic.  I am not sure if that's
good or bad for insertion performance, but it would likely help for
scan performance.

            regards, tom lane


Re: index fragmentation on insert-only table with non-unique column

From: Peter Geoghegan
On Tue, May 24, 2016 at 9:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah.  I wonder what would happen if we used the same rule for index
> insertions.  It would likely make insertions more expensive, but maybe
> not by much.  The existing "randomization" rule for where to insert new
> items in a run of identical index entries would go away, because the
> insertion point would become deterministic.  I am not sure if that's
> good or bad for insertion performance, but it would likely help for
> scan performance.

I think that if somebody tacked on a tie-breaker in the same way
tuplesort.c does for B-Tree IndexTuples, there'd be significant negative
consequences.

The equal-to-high-key case gets a choice of which page to put the new
IndexTuple on, and I imagine that that's quite useful when it comes
up. I'd also have concerns about the key space in the index. I think
that it would seriously mess with the long term utility of values in
internal pages, which currently can reasonably have little to do with
the values currently stored in leaf pages. They're functionally only
separators of the key space that guide index scans, so it doesn't
matter if the actual values are completely absent from the leaf
pages/the table itself (perhaps some of the values in the internal
pages were inserted years ago, and have long since been deleted and
vacuumed away). Provided the distribution of values at the leaf level
is still well characterized at higher levels (e.g. many string values
that start with vowels, very few that start with the letters 'x' or
'z'), there should be no significant bloat. That's very valuable.
Unique indexes are another problem for this naive approach.

Maybe somebody could do better with a more sophisticated approach, but
it's probably better to focus on duplicate storage or even leaf page
compression, as Stephen mentioned.

--
Peter Geoghegan


Re: index fragmentation on insert-only table with non-unique column

From: Jeff Janes
On Tue, May 24, 2016 at 10:39 AM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> Summary: Non-unique btree indices are returning CTIDs for rows with the same
> value of the indexed column not in logical order, imposing a high performance
> penalty.
>
> Running PG 9.5.3 now, we have time-based partitions of append-only tables
> with data loaded from other sources.  The tables are partitioned by time, and
> the timestamp column has a non-unique, not-null btree index.
>
> The child tables are each ~75GB and expected to keep growing.  For a child
> table with a week's worth of data:
> relpages  | 11255802
> reltuples | 5.90502e+07
>
> The data is loaded shortly after it's available, so it has high correlation in
> pg_statistic:
> [pryzbyj@viaero ~]$ psql ts -c "SELECT tablename, correlation, n_distinct FROM pg_stats s JOIN pg_class c ON
> (c.relname=s.tablename) WHERE tablename LIKE 'cdrs_huawei_pgwrecord%' AND attname='recordopeningtime' ORDER BY 1" |head
>             tablename             | correlation | n_distinct
> ----------------------------------+-------------+------------
>  cdrs_huawei_pgwrecord            |    0.999997 |     102892
>  cdrs_huawei_pgwrecord_2016_02_15 |    0.999658 |      96145
>  cdrs_huawei_pgwrecord_2016_02_22 |    0.999943 |      91916
>  cdrs_huawei_pgwrecord_2016_02_29 |    0.997219 |      50341
>  cdrs_huawei_pgwrecord_2016_03_01 |    0.999947 |      97485
>
> But note the non-uniqueness of the index column:
> ts=# SELECT recordopeningtime, COUNT(1) FROM cdrs_huawei_pgwrecord WHERE recordopeningtime>='2016-05-21' AND
> recordopeningtime<'2016-05-22' GROUP BY 1 ORDER BY 2 DESC;
>   recordopeningtime  | count
> ---------------------+-------
>  2016-05-21 12:17:29 |   176
>  2016-05-21 12:17:25 |   171
>  2016-05-21 13:11:33 |   170
>  2016-05-21 10:20:02 |   169
>  2016-05-21 11:30:02 |   167
> [...]

That is not that much duplication.  You aren't going to have dozens or
hundreds of leaf pages all with equal values.  (And you only showed
the most highly duplicated ones; presumably the average is much less.)


> We have a daily analytic query which processes the previous day's data.  For
> new child tables, with only one day's data loaded, this runs in ~30min; for
> child tables with an entire week's worth of data loaded, it takes several hours
> (even though both queries process the same amount of data).

For an append only table, why would the first day of a new partition
be any less fragmented than that same day would be a week from now?
Are you sure it isn't just that your week-old data has all been aged
out of the cache?


> First, I found I was able to get 30-50min query results on the full week's table by
> preferring a seq scan to an index scan.  The row estimates seemed fine, and the
> only condition is the timestamp, so the planner's use of an index scan is as
> expected.

Can you show us the query?  I would expect a bitmap scan of the index
(which would do what you want, but even more so), instead.
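
(A sketch of how to check, as a session-local experiment; the table, column,
and date range are the ones that appear later in this thread:)

SET enable_indexscan = off;  -- planner falls back to bitmap or seq scan
EXPLAIN (ANALYZE, BUFFERS)
SELECT sum(duration) FROM cdrs_huawei_pgwrecord_2016_05_22
WHERE recordopeningtime >= '2016-05-22' AND recordopeningtime < '2016-05-23';
RESET enable_indexscan;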

>
> AFAICT what's happening is that the index scan was returning pages
> nonsequentially.  strace-ing the backend showed alternating lseek()s and
> read()s, with the offsets not consistently increasing (nor consistently
> decreasing):
> % sudo strace -p 25588 2>&1 |grep -m9 'lseek(773'
> lseek(773, 1059766272, SEEK_SET)        = 1059766272
> lseek(773, 824926208, SEEK_SET)         = 824926208
> lseek(773, 990027776, SEEK_SET)         = 990027776
> lseek(773, 990330880, SEEK_SET)         = 990330880
> lseek(773, 1038942208, SEEK_SET)        = 1038942208
> lseek(773, 1059856384, SEEK_SET)        = 1059856384
> lseek(773, 977305600, SEEK_SET)         = 977305600
> lseek(773, 990347264, SEEK_SET)         = 990347264
> lseek(773, 871096320, SEEK_SET)         = 871096320
>
> .. and consecutive read()s being rare:
> read(802, "g"..., 8192)                 = 8192
> lseek(802, 918003712, SEEK_SET)         = 918003712
> read(802, "c"..., 8192)                 = 8192
> lseek(802, 859136000, SEEK_SET)         = 859136000
> read(802, "a"..., 8192)                 = 8192
> lseek(802, 919601152, SEEK_SET)         = 919601152
> read(802, "d"..., 8192)                 = 8192
> lseek(802, 905101312, SEEK_SET)         = 905101312
> read(802, "c"..., 8192)                 = 8192
> lseek(801, 507863040, SEEK_SET)         = 507863040
> read(801, "p"..., 8192)                 = 8192
> lseek(802, 914235392, SEEK_SET)         = 914235392
> read(802, "c"..., 8192)                 = 8192


Which of those are the table, and which the index?

Something doesn't add up here.  How could an index of an append-only
table possibly become that fragmented, when the highest amount of key
duplication is about 170?

Cheers,

Jeff


Re: index fragmentation on insert-only table with non-unique column

From: Justin Pryzby
On Tue, May 24, 2016 at 09:16:20PM -0700, Peter Geoghegan wrote:
> On Tue, May 24, 2016 at 10:39 AM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> > Postgres seems to assume that the high degree of correlation of the table
> > column seen in pg_stats reflects how it will fetch data from the index scan,
> > an assumption which turns out to be very poor on a highly fragmented
> > index.  Is there a way to help it understand otherwise?
>
> Your complaint is vague. Are you complaining about the planner making
> a poor choice? I don't think that's the issue here, because you never
> made any firm statement about the planner making a choice that was
> worse than an alternative that it had available.

I was thinking there are a few possible places to make improvements: the planner
could have understood that scans of non-unique indices don't result in strictly
sequential scans of the table, the degree of non-sequentialness being
determined by the column statistics, and perhaps by properties of the index
itself.

Or the INSERT code or btree scan could improve on this, even if tuples aren't
fully ordered.

> If you're arguing for the idea that B-Trees should reliably keep
> tuples in order by a tie-break condition, that seems difficult to
> implement, and likely not worth it in practice.

I had the awful idea to change the index to use (recordopeningtime,ctid).
Maybe somebody will convince me otherwise, but it may actually work better than
trying to reindex this table daily by 4am.

Thanks,
Justin


Re: index fragmentation on insert-only table with non-unique column

From: Justin Pryzby
On Tue, May 24, 2016 at 11:23:48PM -0700, Jeff Janes wrote:
> > But note the non-uniqueness of the index column:
> > ts=# SELECT recordopeningtime, COUNT(1) FROM cdrs_huawei_pgwrecord WHERE recordopeningtime>='2016-05-21' AND
> > recordopeningtime<'2016-05-22' GROUP BY 1 ORDER BY 2 DESC;
> >   recordopeningtime  | count
> > ---------------------+-------
> >  2016-05-21 12:17:29 |   176
> >  2016-05-21 12:17:25 |   171
> >  2016-05-21 13:11:33 |   170
> >  2016-05-21 10:20:02 |   169
> >  2016-05-21 11:30:02 |   167
> > [...]
>
> That is not that much duplication.  You aren't going to have dozens or
> hundreds of leaf pages all with equal values.  (And you only showed
> the most highly duplicated ones; presumably the average is much less.)

Point taken, but it's not that great of a range either:

ts=# SELECT recordopeningtime, COUNT(1) FROM cdrs_huawei_pgwrecord WHERE recordopeningtime>='2016-05-21' AND
recordopeningtime<'2016-05-22' GROUP BY 1 ORDER BY 2 LIMIT 19;
  recordopeningtime  | count
---------------------+-------
 2016-05-21 03:10:05 |    44
 2016-05-21 03:55:05 |    44
 2016-05-21 04:55:05 |    45

ts=# SELECT count(distinct recordopeningtime) FROM cdrs_huawei_pgwrecord WHERE recordopeningtime>='2016-05-21' AND
recordopeningtime<'2016-05-22';
-[ RECORD 1 ]
count | 86400

ts=# SELECT count(recordopeningtime) FROM cdrs_huawei_pgwrecord WHERE recordopeningtime>='2016-05-21' AND
recordopeningtime<'2016-05-22';
-[ RECORD 1 ]--
count | 8892865
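
That works out to an average of ~103 rows per distinct timestamp:

SELECT 8892865 / 86400.0 AS avg_rows_per_key;  -- ≈ 102.9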

> > We have a daily analytic query which processes the previous day's data.  For
> > new child tables, with only one day's data loaded, this runs in ~30min; for
> > child tables with an entire week's worth of data loaded, it takes several hours
> > (even though both queries process the same amount of data).
>
> For an append only table, why would the first day of a new partition
> be any less fragmented than that same day would be a week from now?
> Are you sure it isn't just that your week-old data has all been aged
> out of the cache?
I don't think it's a cache effect, since we're not using the source table for
much (maybe anything) else the entire rest of the day.  The server has 72GB
RAM, the same size as the largest of the tables being joined (beginning at 4am).

I didn't mean that a given day is more fragmented now than it was last week
(but I don't know, either).  I guess when we do a query on the table with ~32
hours of data in it, it might do a seq scan rather than an index scan, too.

Compare the end of month partition tables:
ts=# select * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_02_29_recordopeningtime_idx');
leaf_fragmentation | 48.6
ts=# select * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_03_29_recordopeningtime_idx');
leaf_fragmentation | 48.38
ts=# select * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_04_29_recordopeningtime_idx');
leaf_fragmentation | 48.6
ts=# SELECT * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_04_22_recordopeningtime_idx');
leaf_fragmentation | 48.66
ts=# SELECT * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_03_22_recordopeningtime_idx');
leaf_fragmentation | 48.27
ts=# SELECT * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_02_22_recordopeningtime_idx');
leaf_fragmentation | 48

This one I reindexed as a test:
ts=# SELECT * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx');
leaf_fragmentation | 0.01

.. and query ran in ~30min (reran a 2nd time, with cache effects: 25min).

> > First, I found I was able to get 30-50min query results on the full week's table by
> > preferring a seq scan to an index scan.  The row estimates seemed fine, and the
> > only condition is the timestamp, so the planner's use of an index scan is as
> > expected.
>
> Can you show us the query?  I would expect a bitmap scan of the index
> (which would do what you want, but even more so), instead.
See explain, also showing additional tables/views being joined.  It's NOT doing
a bitmap scan though, and I'd be interested to find why; I'm sure that would've
improved this query enough so it never would've been an issue.
https://explain.depesz.com/s/s8KP

 ->  Index Scan using cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx on cdrs_huawei_pgwrecord_2016_05_01
(cost=0.56..1601734.57 rows=8943848 width=349)
       Index Cond: ((recordopeningtime >= '2016-05-07 00:00:00'::timestamp without time zone) AND (recordopeningtime <
'2016-05-08 00:00:00'::timestamp without time zone))

> > AFAICT what's happening is that the index scan was returning pages
> > nonsequentially.  strace-ing the backend showed alternating lseek()s and
> > read()s, with the offsets not consistently increasing (nor consistently
> > decreasing):
..
>
> Which of those are the table, and which the index?
Those weren't necessarily straces of the same process; I believe both of these
were table data/heap, and didn't include any index access.

> Something doesn't add up here.  How could an index of an append-only
> table possibly become that fragmented, when the highest amount of key
> duplication is about 170?

I'm certainly opened to alternate interpretations / conclusions :)

Justin


Re: index fragmentation on insert-only table with non-unique column

From: Claudio Freire
On Wed, May 25, 2016 at 11:00 AM, Justin Pryzby <pryzby@telsasoft.com> wrote:
>> > First, I found I was able to get 30-50min query results on the full week's table by
>> > preferring a seq scan to an index scan.  The row estimates seemed fine, and the
>> > only condition is the timestamp, so the planner's use of an index scan is as
>> > expected.
>>
>> Can you show us the query?  I would expect a bitmap scan of the index
>> (which would do what you want, but even more so), instead.
> See explain, also showing additional tables/views being joined.  It's NOT doing
> a bitmap scan though, and I'd be interested to find why; I'm sure that would've
> improved this query enough so it never would've been an issue.
> https://explain.depesz.com/s/s8KP
>
>  ->  Index Scan using cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx on cdrs_huawei_pgwrecord_2016_05_01
> (cost=0.56..1601734.57 rows=8943848 width=349)
>        Index Cond: ((recordopeningtime >= '2016-05-07 00:00:00'::timestamp without time zone) AND (recordopeningtime
> < '2016-05-08 00:00:00'::timestamp without time zone))

Please show your guc settings ( see
https://wiki.postgresql.org/wiki/Server_Configuration )

A plan node like that, if it would result in I/O, with proper
configuration should have selected a bitmap index/heap scan. If it
didn't, it probably thinks it has more cache than it really does, and
that would mean the wrong setting was set in effective_cache_size.
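
(The query that wiki page suggests for listing non-default settings is
roughly:)

SELECT name, current_setting(name), source
FROM pg_settings
WHERE source NOT IN ('default', 'override');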


Re: index fragmentation on insert-only table with non-unique column

From: Justin Pryzby
On Fri, Jun 03, 2016 at 06:26:33PM -0300, Claudio Freire wrote:
> On Wed, May 25, 2016 at 11:00 AM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> >> > First, I found I was able to get 30-50min query results on the full week's table by
> >> > preferring a seq scan to an index scan.  The row estimates seemed fine, and the
> >> > only condition is the timestamp, so the planner's use of an index scan is as
> >> > expected.
> >>
> >> Can you show us the query?  I would expect a bitmap scan of the index
> >> (which would do what you want, but even more so), instead.
> > See explain, also showing additional tables/views being joined.  It's NOT doing
> > a bitmap scan though, and I'd be interested to find why; I'm sure that would've
> > improved this query enough so it never would've been an issue.
> > https://explain.depesz.com/s/s8KP
> >
> >  ->  Index Scan using cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx on cdrs_huawei_pgwrecord_2016_05_01
> > (cost=0.56..1601734.57 rows=8943848 width=349)
> >        Index Cond: ((recordopeningtime >= '2016-05-07 00:00:00'::timestamp without time zone) AND
> > (recordopeningtime < '2016-05-08 00:00:00'::timestamp without time zone))
>
> Please show your guc settings ( see
> https://wiki.postgresql.org/wiki/Server_Configuration )
>
> A plan node like that, if it would result in I/O, with proper
> configuration should have selected a bitmap index/heap scan. If it
> didn't, it probably thinks it has more cache than it really does, and
> that would mean the wrong setting was set in effective_cache_size.

ts=# SELECT name, current_setting(name), SOURCE FROM pg_settings WHERE SOURCE='configuration file';
 dynamic_shared_memory_type  | posix               | configuration file
 effective_cache_size        | 64GB                | configuration file
 effective_io_concurrency    | 8                   | configuration file
 huge_pages                  | try                 | configuration file
 log_autovacuum_min_duration | 0                   | configuration file
 log_checkpoints             | on                  | configuration file
 maintenance_work_mem        | 6GB                 | configuration file
 max_connections             | 200                 | configuration file
 max_wal_size                | 4GB                 | configuration file
 min_wal_size                | 6GB                 | configuration file
 shared_buffers              | 8GB                 | configuration file
 wal_compression             | on                  | configuration file
 work_mem                    | 1GB                 | configuration file

I changed at least maintenance_work_mem since I originally wrote, to try to
avoid tempfiles during REINDEX (though I'm not sure it matters, as the
tempfiles are effectively cached and may never actually be written).

It's entirely possible those settings aren't ideal.  The server has 72GB RAM.
There are usually very few (typically n<3 but at most a handful) nontrivial
queries running at once, if at all.

I wouldn't expect any data that's not recent (table data from the last 2 days
or indexes from this month) to be cached, and wouldn't expect that to be
entirely cached, either:

ts=# SELECT sum(pg_table_size(oid))/1024^3 gb FROM pg_class WHERE relname~'_2016_05_..$';
gb | 425.783050537109

ts=# SELECT sum(pg_table_size(oid))/1024^3 gb FROM pg_class WHERE relname~'_2016_05_...*idx';
gb | 60.0909423828125

ts=# SELECT sum(pg_table_size(oid))/1024^3 gb FROM pg_class WHERE relname~'_201605.*idx';
gb | 4.85528564453125

ts=# SELECT sum(pg_table_size(oid))/1024^3 gb FROM pg_class WHERE relname~'_201605$';
gb | 86.8688049316406

As a test, I did SET effective_cache_size='1MB' before running explain, and it
still chooses the index scan:

|     ->  Index Scan using cdrs_huawei_pgwrecord_2016_05_29_recordopeningtime_idx on cdrs_huawei_pgwrecord_2016_05_29
(cost=0.44..1526689.49 rows=8342796 width=355)
|           Index Cond: ((recordopeningtime >= '2016-05-29 00:00:00'::timestamp without time zone) AND
(recordopeningtime < '2016-05-30 00:00:00'::timestamp without time zone))

I set enable_indexscan=0, and got:

|     ->  Bitmap Heap Scan on cdrs_huawei_pgwrecord_2016_05_29  (cost=168006.10..4087526.04 rows=8342796 width=355)
|           Recheck Cond: ((recordopeningtime >= '2016-05-29 00:00:00'::timestamp without time zone) AND
(recordopeningtime < '2016-05-30 00:00:00'::timestamp without time zone))
|           ->  Bitmap Index Scan on cdrs_huawei_pgwrecord_2016_05_29_recordopeningtime_idx  (cost=0.00..165920.40
rows=8342796 width=0)
|             Index Cond: ((recordopeningtime >= '2016-05-29 00:00:00'::timestamp without time zone) AND
(recordopeningtime < '2016-05-30 00:00:00'::timestamp without time zone))

Here's a minimal query which seems to isolate the symptom:

ts=# explain (analyze,buffers) SELECT sum(duration) FROM cdrs_huawei_pgwrecord_2016_05_22 WHERE
recordopeningtime>='2016-05-22' AND recordopeningtime<'2016-05-23';
| Aggregate  (cost=2888731.67..2888731.68 rows=1 width=8) (actual time=388661.892..388661.892 rows=1 loops=1)
|   Buffers: shared hit=4058501 read=1295147 written=35800
|   ->  Index Scan using cdrs_huawei_pgwrecord_2016_05_22_recordopeningtime_idx on cdrs_huawei_pgwrecord_2016_05_22
(cost=0.56..2867075.33 rows=8662534 width=8) (actual time=0.036..379332.910 rows=8575673 loops=1)
|         Index Cond: ((recordopeningtime >= '2016-05-22 00:00:00'::timestamp without time zone) AND (recordopeningtime
< '2016-05-23 00:00:00'::timestamp without time zone))
|         Buffers: shared hit=4058501 read=1295147 written=35800
| Planning time: 0.338 ms
| Execution time: 388661.947 ms

And here's an older one, to avoid cache effects, with enable_indexscan=0:
|ts=# explain (analyze,buffers)  SELECT sum(duration) FROM cdrs_huawei_pgwrecord_2016_05_08 WHERE
recordopeningtime>='2016-05-08' AND recordopeningtime<'2016-05-09';
| Aggregate  (cost=10006286.58..10006286.59 rows=1 width=8) (actual time=44219.156..44219.156 rows=1 loops=1)
|   Buffers: shared hit=118 read=1213887 written=50113
|   ->  Bitmap Heap Scan on cdrs_huawei_pgwrecord_2016_05_08  (cost=85142.24..9985848.96 rows=8175048 width=8) (actual
time=708.024..40106.062 rows=8179338 loops=1)
|         Recheck Cond: ((recordopeningtime >= '2016-05-08 00:00:00'::timestamp without time zone) AND
(recordopeningtime < '2016-05-09 00:00:00'::timestamp without time zone))
|         Rows Removed by Index Recheck: 74909
|         Heap Blocks: lossy=1213568
|         Buffers: shared hit=118 read=1213887 written=50113
|         ->  Bitmap Index Scan on cdrs_huawei_pgwrecord_2016_05_08_recordopeningtime_idx1  (cost=0.00..83098.48
rows=8175048 width=0) (actual time=706.557..706.557 rows=12135680 loops=1)
|               Index Cond: ((recordopeningtime >= '2016-05-08 00:00:00'::timestamp without time zone) AND
(recordopeningtime < '2016-05-09 00:00:00'::timestamp without time zone))
|               Buffers: shared hit=117 read=320
| Planning time: 214.786 ms
| Execution time: 44228.874 ms
|(12 rows)

Thanks for your help.

Justin


Re: index fragmentation on insert-only table with non-unique column

From: Claudio Freire
On Fri, Jun 3, 2016 at 8:54 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> As a test, I did SET effective_cache_size='1MB' before running explain, and it
> still chooses the index scan:
>
> |        ->  Index Scan using cdrs_huawei_pgwrecord_2016_05_29_recordopeningtime_idx on
cdrs_huawei_pgwrecord_2016_05_29 (cost=0.44..1526689.49 rows=8342796 width=355)
> |              Index Cond: ((recordopeningtime >= '2016-05-29 00:00:00'::timestamp without time zone) AND
(recordopeningtime < '2016-05-30 00:00:00'::timestamp without time zone))
>
> I set enable_indexscan=0, and got:
>
> |        ->  Bitmap Heap Scan on cdrs_huawei_pgwrecord_2016_05_29  (cost=168006.10..4087526.04 rows=8342796
width=355)
> |              Recheck Cond: ((recordopeningtime >= '2016-05-29 00:00:00'::timestamp without time zone) AND
(recordopeningtime < '2016-05-30 00:00:00'::timestamp without time zone))
> |              ->  Bitmap Index Scan on cdrs_huawei_pgwrecord_2016_05_29_recordopeningtime_idx  (cost=0.00..165920.40
rows=8342796 width=0)
> |                    Index Cond: ((recordopeningtime >= '2016-05-29 00:00:00'::timestamp without time zone) AND
(recordopeningtime < '2016-05-30 00:00:00'::timestamp without time zone))
>
> Here's a minimal query which seems to isolate the symptom:
>
> ts=# explain (analyze,buffers) SELECT sum(duration) FROM cdrs_huawei_pgwrecord_2016_05_22 WHERE
recordopeningtime>='2016-05-22' AND recordopeningtime<'2016-05-23';
> | Aggregate  (cost=2888731.67..2888731.68 rows=1 width=8) (actual time=388661.892..388661.892 rows=1 loops=1)
> |   Buffers: shared hit=4058501 read=1295147 written=35800
> |   ->  Index Scan using cdrs_huawei_pgwrecord_2016_05_22_recordopeningtime_idx on cdrs_huawei_pgwrecord_2016_05_22
(cost=0.56..2867075.33 rows=8662534 width=8) (actual time=0.036..379332.910 rows=8575673 loops=1)
> |         Index Cond: ((recordopeningtime >= '2016-05-22 00:00:00'::timestamp without time zone) AND
(recordopeningtime < '2016-05-23 00:00:00'::timestamp without time zone))
> |         Buffers: shared hit=4058501 read=1295147 written=35800
> | Planning time: 0.338 ms
> | Execution time: 388661.947 ms
>
> And here's an older one, to avoid cache effects, with enable_indexscan=0:
> |ts=# explain (analyze,buffers)  SELECT sum(duration) FROM cdrs_huawei_pgwrecord_2016_05_08 WHERE
recordopeningtime>='2016-05-08' AND recordopeningtime<'2016-05-09';
> | Aggregate  (cost=10006286.58..10006286.59 rows=1 width=8) (actual time=44219.156..44219.156 rows=1 loops=1)
> |   Buffers: shared hit=118 read=1213887 written=50113
> |   ->  Bitmap Heap Scan on cdrs_huawei_pgwrecord_2016_05_08  (cost=85142.24..9985848.96 rows=8175048 width=8)
(actual time=708.024..40106.062 rows=8179338 loops=1)
> |         Recheck Cond: ((recordopeningtime >= '2016-05-08 00:00:00'::timestamp without time zone) AND
(recordopeningtime < '2016-05-09 00:00:00'::timestamp without time zone))
> |         Rows Removed by Index Recheck: 74909
> |         Heap Blocks: lossy=1213568
> |         Buffers: shared hit=118 read=1213887 written=50113
> |         ->  Bitmap Index Scan on cdrs_huawei_pgwrecord_2016_05_08_recordopeningtime_idx1  (cost=0.00..83098.48
rows=8175048 width=0) (actual time=706.557..706.557 rows=12135680 loops=1)
> |               Index Cond: ((recordopeningtime >= '2016-05-08 00:00:00'::timestamp without time zone) AND
(recordopeningtime < '2016-05-09 00:00:00'::timestamp without time zone))
> |               Buffers: shared hit=117 read=320
> | Planning time: 214.786 ms
> | Execution time: 44228.874 ms
> |(12 rows)


Correct me if I'm wrong, but this looks like the planner not
accounting for correlation when using bitmap heap scans.

Checking the source, it really doesn't.

So correlated index scans look extra favourable vs bitmap index scans
because bitmap heap scans consider random page costs sans correlation
effects (even though correlation applies to bitmap heap scans as
well). While that sounds desirable a priori, it seems it's hurting
this case quite badly.

I'm not sure there's any simple way of working around that.


Re: index fragmentation on insert-only table with non-unique column

From: Kevin Grittner
On Fri, Jun 3, 2016 at 6:54 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:

>  max_wal_size                | 4GB                 | configuration file
>  min_wal_size                | 6GB                 | configuration file

Just a minor digression -- it generally doesn't make sense to
configure the minimum for something greater than the maximum for
that same thing.  That should have no bearing on the performance issue
raised in this thread, but you might want to fix it anyway.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: index fragmentation on insert-only table with non-unique column

From: Tom Lane
Claudio Freire <klaussfreire@gmail.com> writes:
> So correlated index scans look extra favourable vs bitmap index scans
> because bitmap heap scans consider random page costs sans correlation
> effects (even though correlation applies to bitmap heap scans as
> well).

Really?  How?  The index ordering has nothing to do with the order in
which heap tuples will be visited.

            regards, tom lane


Re: index fragmentation on insert-only table with non-unique column

From: Jeff Janes
On Sun, Jun 5, 2016 at 9:03 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Claudio Freire <klaussfreire@gmail.com> writes:
>> So correlated index scans look extra favourable vs bitmap index scans
>> because bitmap heap scans consider random page costs sans correlation
>> effects (even though correlation applies to bitmap heap scans as
>> well).
>
> Really?  How?  The index ordering has nothing to do with the order in
> which heap tuples will be visited.


It is not the order itself, but the density.

If the index is read in a range scan (as opposed to =ANY scan), and
the index lead column is correlated with the table ordering, then the
parts of the table that need to be visited will be much denser than if
there were no correlation.  But Claudio is saying that this is not
being accounted for.
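
(A toy illustration of the density effect, as a sketch with throwaway tables;
the ctid-to-point cast just extracts the heap block number:)

CREATE TABLE t_corr   AS SELECT g AS i FROM generate_series(1,100000) g;                    -- correlated heap
CREATE TABLE t_uncorr AS SELECT g AS i FROM generate_series(1,100000) g ORDER BY random();  -- shuffled heap
-- count the distinct heap pages a scan of the same key range must visit
SELECT count(DISTINCT (ctid::text::point)[0]) FROM t_corr   WHERE i BETWEEN 1000 AND 2000;  -- a few dense pages
SELECT count(DISTINCT (ctid::text::point)[0]) FROM t_uncorr WHERE i BETWEEN 1000 AND 2000;  -- many scattered pages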


Cheers,

Jeff


Re: index fragmentation on insert-only table with non-unique column

From: Justin Pryzby
Regarding this earlier thread:
https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#20160524173914.GA11880@telsasoft.com

On Tue, May 24, 2016 at 10:39 AM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> Summary: Non-unique btree indices are returning CTIDs for rows with the same
> value of the indexed column not in logical order, imposing a high performance
> penalty.

I have to point out that by "logical" I clearly meant "physical"; hopefully
nobody was too misled.

On Sun, Jun 05, 2016 at 12:28:47PM -0700, Jeff Janes wrote:
> On Sun, Jun 5, 2016 at 9:03 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Claudio Freire <klaussfreire@gmail.com> writes:
> >> So correlated index scans look extra favourable vs bitmap index scans
> >> because bitmap heap scans consider random page costs sans correlation
> >> effects (even though correlation applies to bitmap heap scans as
> >> well).
> >
> > Really?  How?  The index ordering has nothing to do with the order in
> > which heap tuples will be visited.
>
> It is not the order itself, but the density.
>
> If the index is read in a range scan (as opposed to =ANY scan), and
> the index lead column is correlated with the table ordering, then the
> parts of the table that need to be visited will be much denser than if
> there were no correlation.  But Claudio is saying that this is not
> being accounted for.

I didn't completely understand Claudio/Jeff here, and I'm not sure if we're on
the same page.  For queries on these tables, the index scan was very slow, due
to the fragmented index on a non-unique column, and a seq scan would have been
(was) faster (even if it means reading 70GB and filtering out 6 of 7 days'
data).  That was resolved by adding a nightly reindex job (.. which sometimes
competes with other maintenance and has trouble covering every table every
night).

But I did find that someone else had previously reported this problem (in a
strikingly similar context and message, perhaps clearer than mine):
https://www.postgresql.org/message-id/flat/520D6610.8040907%40emulex.com#520D6610.8040907@emulex.com

I also found this older thread:

https://www.postgresql.org/message-id/flat/n6cmpug13b9rk1srebjvhphg0lm8dou1kn%404ax.com#n6cmpug13b9rk1srebjvhphg0lm8dou1kn@4ax.com

There was mention of a TODO item:

    * Compute index correlation on CREATE INDEX and ANALYZE, use it for index
    * scan cost estimation

.. but perhaps I misunderstand and that's long since been resolved?

Justin


Re: index fragmentation on insert-only table with non-unique column

From: Claudio Freire
On Sat, Aug 13, 2016 at 3:54 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Sun, Jun 05, 2016 at 12:28:47PM -0700, Jeff Janes wrote:
>> On Sun, Jun 5, 2016 at 9:03 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> > Claudio Freire <klaussfreire@gmail.com> writes:
>> >> So correlated index scans look extra favourable vs bitmap index scans
>> >> because bitmap heap scans consider random page costs sans correlation
>> >> effects (even though correlation applies to bitmap heap scans as
>> >> well).
>> >
>> > Really?  How?  The index ordering has nothing to do with the order in
>> > which heap tuples will be visited.
>>
>> It is not the order itself, but the density.
>>
>> If the index is read in a range scan (as opposed to =ANY scan), and
>> the index lead column is correlated with the table ordering, then the
>> parts of the table that need to be visited will be much denser than if
>> there were no correlation.  But Claudio is saying that this is not
>> being accounted for.
>
> I didn't completely understand Claudio/Jeff here, and I'm not sure if we're on
> the same page.  For queries on these tables, the index scan was very slow, due
> to the fragmented index on a non-unique column, and a seq scan would have been
> (was) faster (even if it means reading 70GB and filtering out 6 of 7 days'
> data).  That was resolved by adding a nightly reindex job (.. which sometimes
> competes with other maintenance and has trouble covering every table every
> night).

Yes, but a bitmap index scan should be faster than both; the planner is
discarding it because it estimates it will be slower, because it doesn't
account for correlation between index keys and physical location.

And, while what you clarify there would indeed affect the estimation
for index scans, it would only make the issue worse: the planner
thinks the index scan will be better than it really is, because it's
expecting correlation, but the "fragmentation" of same-key runs
destroys that correlation. A bitmap index scan would restore it,
though, so the bitmap index scan would be that much better.


From: Justin Pryzby

Months ago I reported an issue with very slow index scans of tables with high
"correlation" of their indexed column, due to (we concluded at the time)
duplicate/repeated values of that column causing many lseek()s.  References:

https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#20160524173914.GA11880@telsasoft.com
https://www.postgresql.org/message-id/flat/520D6610.8040907%40emulex.com#520D6610.8040907@emulex.com

https://www.postgresql.org/message-id/flat/n6cmpug13b9rk1srebjvhphg0lm8dou1kn%404ax.com#n6cmpug13b9rk1srebjvhphg0lm8dou1kn@4ax.com

My interim workarounds were a reindex job and daily-granularity partitions
(despite the excessive number of child tables), to encourage seq scans for the
daily analysis jobs rather than index scans.  I've now cobbled together a patch
to throw around and see if we can improve on that.  I tried several changes,
hoping to discourage index scans.  The logic that seems to most accurately
reflect costs has two changes:

Postgres' existing behavior stores table correlation (heap value vs. position),
but doesn't look at index correlation, so it can't distinguish between a
just-built index and a highly fragmented index, or one with highly
non-sequential TIDs.  My patch causes ANALYZE to do a TID scan (sampled across
the MCVs) to determine the correlation of heap TID vs. index tuple logical
location (as opposed to the table correlation, computed as heap TID vs. heap
value).
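
(A rough by-hand analogue of what this measures, for a single leaf page: a
sketch using pageinspect against the toy index t_i_idx created further down,
with plain Pearson corr():)

-- correlation of index-tuple position vs. referenced heap block in one leaf page
SELECT corr(itemoffset, (ctid::text::point)[0]) AS leaf_tid_correlation
FROM bt_page_items('t_i_idx', 1);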

The second change averages separate correlation values of small lists of 300
consecutive TIDs, rather than the coarse-granularity/aggregate correlation of a
small sample of pages across the entire index.  Postgres' existing sampling is
designed to give an even sample across all rows.  An issue with this
coarse-granularity correlation is that the index can have a broad correlation
(to physical heap location), but with many small-scale deviations, which don't
show up due to sampling a relatively small fraction of a large table; and/or
the effect of the deviations is insignificant/noise and correlation is still
computed near 1.

I believe the "large scale" correlation that postgres computes from block
sample fails to well represent small-scale uncorrelated reads which contribute
large number of random reads, but not included in planner cost.
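
(A sketch of the chunked variant, degenerate for a single page but showing the
shape of the computation:)

-- average the correlations of consecutive 300-tuple chunks instead of
-- computing one aggregate figure
SELECT avg(chunk_corr) AS small_scale_correlation
FROM (SELECT corr(itemoffset, (ctid::text::point)[0]) AS chunk_corr
      FROM bt_page_items('t_i_idx', 1)
      GROUP BY (itemoffset - 1) / 300) chunks;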

Not only are the index reads highly random (which the planner already assumes),
but the CTIDs referenced within a given btree leaf page are also substantially
non-sequential.  It seems to affect INSERTs which, over a short interval of
time, have a small-to-moderate range of column values, but which still have a
strong overall trend in that column WRT time (and heap location).  Think of
inserting a "normal" distribution of timestamps centered around now().

My original report shows lseek() for nearly every read on the *heap*.  The
original problem was on a table of telecommunications CDRs, indexed by "record
opening time" (start time) and with high correlation value in pg_stats.  We
import data records from a file, which is probably more or less in order of
"end time".

That still displays broad correlation on "start time", but individual TIDs are
nowhere near sequential.  Two phone calls which end in the same 1-second
interval are not unlikely to have started many minutes apart... but two calls
which end within the same second are very likely to have started within an hour
of each other, since the typical duration is <1h.  But insert volume is high,
and there are substantial repeated keys, so the portion of an index scan
returning CTIDs for some certain key value (a timestamp with second resolution
in our case) ends up reading heap tuples for a non-negligible fraction of the
table: maybe only 1%, but that's 300 seeks across 1% of a table which is 10s of
GB ... which is still 300 seeks, and takes long enough that cache is inadequate
to substantially mitigate the cost.

It's not clear to me that there's a good way to evenly *sample* a fraction of
the index blocks in a manner which is agnostic to different AMs.  Scanning the
entirety of all indices on relations during (auto)analyze may be too
expensive.  So I implemented an index scan of the MCV list.  I'm guessing this
might cause the correlation to be under-estimated, and prefer bitmap scans
somewhat more than justified, due to the btree insertion logic for repeated
keys (to reduce O(n^2) behavior).  The MCV list isn't perfect, since the same
thing can happen with e.g. normally distributed floating point values (or
timestamps with fractional seconds).

I ran pageinspect on a recent child of the table that triggered the original
report:

ts=# SELECT itemoffset, ctid FROM bt_page_items('cdrs_huawei_pgwrecord_2017_07_07_recordopeningtime_idx',6) LIMIT 22 OFFSET 1;
 itemoffset |  ctid
------------+--------
          2 | (81,4)
          3 | (5,6)
          4 | (6,5)
          5 | (9,1)
          6 | (10,1)
          7 | (14,1)
          8 | (21,8)
          9 | (25,1)
         10 | (30,5)
         11 | (41,1)
         12 | (48,7)
         13 | (49,3)
         14 | (50,2)
         15 | (51,1)
         16 | (54,1)
         17 | (57,7)
         18 | (58,6)
         19 | (59,6)
         20 | (63,5)
         21 | (71,6)
         22 | (72,5)
         23 | (78,7)
(22 rows)

=> 22 adjacent index tuples referencing 22 different heap pages, only 6 of
which are sequential => 16 lseek()s, aka random page cost.

To generate data demonstrating this problem you can do things like:
| CREATE TABLE t(i int, j int);
| CREATE INDEX ON t(i);
\! time for a in `seq 99`; do psql -qc 'INSERT INTO t SELECT * FROM generate_series(1,99)'; done
or:
| INSERT INTO t SELECT (0.001*a+9*(random()-0.5))::int FROM generate_series(1,99999) a;

or this one, perhaps closest to our problem case:
| INSERT INTO t SELECT a FROM generate_series(1,999) a, generate_series(1,999) b ORDER BY a+b/9.9;

Note, I was able to create a case using floats without repeated keys:
| INSERT INTO w SELECT i/99999.0+pow(2,(-random())) FROM generate_series(1,999999) i ORDER BY i

| ANALYZE t;
-- note: the correlation is even higher for larger tables with the same number
-- of repeated keys, which is bad since the cost should go up linearly with the
-- size and associated number of lseek()s.  That's one component of the problem,
-- and I think a necessary component of any solution.
postgres=# SELECT tablename, attname, correlation, array_length(most_common_vals,1) FROM pg_stats WHERE tablename LIKE 't%';
 tablename | attname | correlation | array_length
-----------+---------+-------------+--------------
 t         | i       |    0.995212 |           87
 t         | j       |             |
 t_i_idx   | i       |   -0.892874 |           87
 ^^^^^^^
... this last line is added by my patch.

That's a bad combination: high table correlation means the planner thinks only
a fraction of the heap will be read, and sequentially, so it isn't discouraged
from doing an index scan.  But the index TIDs are much less well correlated
(0.89 vs 0.99).

Note: the negative correlation at the tuple level seems to be related to this comment:
src/backend/access/nbtree/nbtinsert.c- *    Once we have chosen the page to put the key on, we'll insert it before
src/backend/access/nbtree/nbtinsert.c: *    any existing equal keys because of the way _bt_binsrch() works.

Note also:
|postgres=# SELECT leaf_fragmentation FROM pgstatindex('t_i_idx');
|leaf_fragmentation | 67.63
.. but keep in mind that leaf_fragmentation only checks leaf page order, not
CTID order within index pages.  The planner already assumes index pages are
random reads.  Maybe it's a good indicator (?), but it doesn't lend itself to
an accurate cost estimation.

For testing purposes, with:
| shared_buffers | 128kB
|  public | t    | table | pryzbyj | 35 MB |
| SET enable_bitmapscan=off;
| SET enable_seqscan=off;
| SELECT pg_backend_pid();
| SELECT sum(j) FROM t WHERE i<99;
| Time: 3974.205 ms

% sudo strace -p 530 2>/tmp/strace-pg
% sed -r '/^(read|lseek)/!d; s/ .*//' /tmp/strace-pg |sort |uniq -c |sort -nr |head
  39474 lseek(41,
    359 lseek(44,
      8 lseek(18,
=> 40k seeks on a 35MB table, that's 10 seeks per heap page!

open("base/12411/16634", O_RDWR)        = 41
open("base/12411/16636", O_RDWR)        = 44

postgres=# SELECT relfilenode, relname FROM pg_class WHERE relfilenode>16630;
       16634 | t
       16636 | t_i_idx

2017-07-07 17:45:54.075 CDT [6360] WARNING:  HERE 1222: csquared=0.797225 minIO/R-P-C=109.750000 maxIO/R-P-C=4416.000000 costsize.c cost_index 702

With my patch, the index scan is estimated to take ~4400 seeks, rather than the
actual 40k, probably because max_IO_cost assumes that a given heap page will be
visited only once.  But in the worst case each of its tuples would require a
separate lseek()....  I'm not suggesting to change that, since making max_IO
100x higher would probably change query plans more dramatically than desired.
But also note that, unpatched, with table correlation >0.99, postgres would've
under-estimated min_IO_cost not by a factor of 10x but by 400x.

| postgres=# REINDEX INDEX t_i_idx;
| postgres=# ANALYZE t;
| postgres=# SELECT tablename, attname, correlation, array_length(most_common_vals,1) FROM pg_stats WHERE tablename LIKE 't%';
|  tablename | attname | correlation | array_length
| -----------+---------+-------------+--------------
|  t         | i       |    0.995235 |           67
|  t_i_idx   | i       |           1 |           67

=> Correctly distinguishes freshly reindexed table.

% sudo strace -p 6428 2>/tmp/strace-pg8
% sed -r '/^\(read|lseek/!d; s/ .*//' /tmp/strace-pg8 |sort |uniq -c |sort -nr
     99 lseek(37,

2017-07-07 17:49:47.745 CDT [6428] WARNING:  HERE 1222: csquared=1.000000 minIO/R-P-C=108.500000 maxIO/R-P-C=4416.000000 costsize.c cost_index 702
=> csquared=1, so min_IO cost is used instead of something close to max_IO.
(This patch doesn't change the computation of those values at all; it just
changes the correlation, whose squared value is used to interpolate between the
correlated/min cost and the uncorrelated/max cost.)
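
(For reference, cost_index() in costsize.c interpolates as
run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);
plugging in the logged values, in units of random_page_cost:)

SELECT 4416.0 + 0.797225 * (109.75 - 4416.0) AS io_cost_fragmented,  -- ~983
       4416.0 + 1.0      * (108.5  - 4416.0) AS io_cost_reindexed;   -- 108.5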

The correlated estimate of 108 seeks vs 99 actual is essentially what unpatched
postgres would've computed by using the table correlation of 0.9952, implicitly
assuming the index to be similarly correlated.

I hope that demonstrates the need to distinguish index correlation from table
correlation.  I'm hoping for comments on the existing patch, specifically on
whether there's a better way to sample the index than "MCVs or partial index
scan".  I've left some fragments in place from an earlier implementation
involving btree page-level fragmentation (like pgstatindex).  Also: does it
make sense to keep the MCV/histogram for a non-expression index which
duplicates a table column?  The stats lookup in selfuncs.c btcostestimate()
would have to check for correlation from the index, and the rest of the stats
from its table.

A bitmap layer adds overhead, which should be avoided if not needed.  But it
shouldn't be a huge impact, and I think its relative effect is only high when
returning a small number of rows.  I'm thinking of a few cases:

 - unique / low-cardinality index scans, or scans without duplicate keys:
   should have low index_pages_fetched(), so max_IO_cost should not be very
   different from min_IO_cost, and the new index-based correlation shouldn't
   have an effect much different from the table correlation;
 - unclustered/uncorrelated tables: tables whose heap has low correlation are
   already discouraged from index scans; this includes tables whose column is
   UPDATEd and not just INSERTed;
 - table with correlated heap AND index: csquared should still be ~0.99 and not
   change much;
 - correlated table with uncorrelated index: this is the target case with the
   intended behavior change.

My apologies: I think that's a bit of a brain dump.

Justin

From: Peter Geoghegan

On Fri, Jul 7, 2017 at 4:41 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> The second change averages separate correlation values of small lists of 300
> consecutive TIDs, rather than the coarse-granularity/aggregate correlation of a
> small sample of pages across the entire index.  Postgres' existing sampling is
> designed to give an even sample across all rows.  An issue with this
> coarse-granularity correlation is that the index can have a broad correlation
> (to physical heap location), but with many small-scale deviations, which don't
> show up due to sampling a relatively small fraction of a large table; and/or
> the effect of the deviations is insignificant/noise and correlation is still
> computed near 1.
>
> I believe the "large scale" correlation that postgres computes from the block
> sample fails to represent small-scale uncorrelated reads, which contribute a
> large number of random reads but are not included in the planner cost.

All of that may well be true, but I've actually come around to the
view that we should treat TID as a first class part of the keyspace,
that participates in comparisons as an implicit last column in all
cases (not just when B-Trees are built within tuplesort.c). That would
probably obviate the need for a patch like this entirely, because
pg_stats.correlation would be unaffected by duplicates. I think that
this would have a number of advantages, some of which might be
significant. For example, I think it could make LP_DEAD cleanup within
nbtree more effective for some workloads, especially workloads where
it is important for HOT pruning and LP_DEAD cleanup to work in concert
-- locality of access probably matters with that. Also, with every
entry in the index guaranteed to be unique, we can imagine VACUUM
being much more selective with killing index entries, when the TID
array it uses happens to be very small. With the freeze map stuff
that's now in place, being able to do that matters more than before.

The original Lehman & Yao algorithm is supposed to have unique keys in
all cases, but we don't follow that in order to make duplicates work,
which is implemented by changing an invariant (see nbtree README for
details). So, this could simplify some aspects of how binary searches
must work in the face of having to deal with non-unique keys. I think
that there are cases where many non-HOT UPDATEs have to go through a
bunch of duplicate leaf tuples and do visibility checking on old
versions for no real benefit. With the TID as part of the keyspace, we
could instead tell the B-Tree code to insert a new tuple as part of an
UPDATE while using the TID as part of its insertion scan key, so
rummaging through many duplicates is avoided.

That having been said, it would be hard to do this for all the reasons
I went into in that thread you referenced [1]. If you're going to
treat TID as a part of the keyspace, you have to totally embrace that,
which means that the internal pages need to have heap TIDs too (not
just pointers to the lower levels in the B-Tree, which the
IndexTuple's t_tid pointer is used for there). Those are the place
where you need to literally append this new, implicit heap TID column
as if it was just another user-visible attribute, since that
information isn't stored in the internal pages today at all. Doing
that has a cost, which isn't going to be acceptable if we naively
append a heap TID to every internal page IndexTuple. With a type like
int4, you're going to completely bloat those pages, with big
consequences for fan-in.

So, really, what needs to happen to make it work is someone needs to
write a suffix truncation patch, which implies that only those cases
that actually benefit from increasing the width of internal page
IndexTuples (those with many "would-be duplicates") pay the cost. This
is a classic technique, that I've actually already prototyped, though
that's extremely far from being worth posting here. That was just to
verify my understanding.

I think that I should write and put up for discussion a design
document for various nbtree enhancements. These include internal page
suffix truncation, prefix compression, and key normalization. I'm
probably not going to take any of these projects on, but it would be
useful if there was at least a little bit of clarity about how they
could be implemented. Maybe we could reach some kind of weak consensus
without going to great lengths. These optimizations are closely
intertwined things, and the lack of clarity on how they all fit
together is probably holding back an implementation of any one of
them.

[1]
https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#20160524173914.GA11880@telsasoft.com
--
Peter Geoghegan