Re: index fragmentation on insert-only table with non-unique column - Mailing list pgsql-performance

From Justin Pryzby
Subject Re: index fragmentation on insert-only table with non-unique column
Date
Msg-id 20160525140034.GB21220@telsasoft.com
Whole thread Raw
In response to Re: index fragmentation on insert-only table with non-unique column  (Jeff Janes <jeff.janes@gmail.com>)
Responses Re: index fragmentation on insert-only table with non-unique column
List pgsql-performance
On Tue, May 24, 2016 at 11:23:48PM -0700, Jeff Janes wrote:
> > But note the non-uniqueness of the index column:
> > ts=# SELECT recordopeningtime, COUNT(1) FROM cdrs_huawei_pgwrecord WHERE recordopeningtime>='2016-05-21' AND
recordopeningtime<'2016-05-22'GROUP BY 1 ORDER BY 2 DESC; 
> >   recordopeningtime  | count
> > ---------------------+-------
> >  2016-05-21 12:17:29 |   176
> >  2016-05-21 12:17:25 |   171
> >  2016-05-21 13:11:33 |   170
> >  2016-05-21 10:20:02 |   169
> >  2016-05-21 11:30:02 |   167
> > [...]
>
> That is not that much duplication.  You aren't going to have dozens or
> hundreds of leaf pages all with equal values.   (and you only showed
> the most highly duplicated ones, presumably the average is much less)

Point taken, but it's not that great of a range either:

ts=# SELECT recordopeningtime, COUNT(1) FROM cdrs_huawei_pgwrecord WHERE recordopeningtime>='2016-05-21' AND
recordopeningtime<'2016-05-22'GROUP BY 1 ORDER BY 2 LIMIT 19; 
  recordopeningtime  | count
---------------------+-------
 2016-05-21 03:10:05 |    44
 2016-05-21 03:55:05 |    44
 2016-05-21 04:55:05 |    45

ts=# SELECT count(distinct recordopeningtime) FROM cdrs_huawei_pgwrecord WHERE recordopeningtime>='2016-05-21' AND
recordopeningtime<'2016-05-22';
-[ RECORD 1 ]
count | 86400

ts=# SELECT count(recordopeningtime) FROM cdrs_huawei_pgwrecord WHERE recordopeningtime>='2016-05-21' AND
recordopeningtime<'2016-05-22';
-[ RECORD 1 ]--
count | 8892865

> > We have an daily analytic query which processes the previous day's data.  For
> > new child tables, with only 1 days data loaded, this runs in ~30min, and for
> > child tables with an entire week's worth of data loaded, takes several hours
> > (even though both queries process the same amount of data).
>
> For an append only table, why would the first day of a new partition
> be any less fragmented than that same day would be a week from now?
> Are you sure it isn't just that your week-old data has all been aged
> out of the cache?
I don't think it's cache effect, since we're not using the source table for
(maybe anything) else the entire rest of the day.  Server has 72GB RAM, same
size one the largest of the tables being joined (beginning) at 4am.

I didn't mean that a given day is more fragmented now than it was last week
(but I don't know, either).  I guess when we do a query on the table with ~32
hours of data in, it might do a seq scan rather than index scan, too.

Compare the end of month partition tables:
ts=# select * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_02_29_recordopeningtime_idx');
leaf_fragmentation | 48.6
ts=# select * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_03_29_recordopeningtime_idx');
leaf_fragmentation | 48.38
ts=# select * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_04_29_recordopeningtime_idx');
leaf_fragmentation | 48.6
ts=# SELECT * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_04_22_recordopeningtime_idx');
leaf_fragmentation | 48.66
ts=# SELECT * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_03_22_recordopeningtime_idx');
leaf_fragmentation | 48.27
ts=# SELECT * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_02_22_recordopeningtime_idx');
leaf_fragmentation | 48

This one I reindexed as a test:
ts=# SELECT * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx');
leaf_fragmentation | 0.01

.. and query ran in ~30min (reran a 2nd time, with cache effects: 25min).

> > First, I found I was able to get 30-50min query results on full week's table by
> > prefering a seq scan to an index scan.  The row estimates seemed fine, and the
> > only condition is the timestamp, so the planner's use of index scan is as
> > expected.
>
> Can you show us the query?  I would expect a bitmap scan of the index
> (which would do what you want, but even more so), instead.
See explain, also showing additional tables/views being joined.  It's NOT doing
a bitmap scan though, and I'd be interested to find why; I'm sure that would've
improved this query enough so it never would've been an issue.
https://explain.depesz.com/s/s8KP

 ->  Index Scan using cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx on cdrs_huawei_pgwrecord_2016_05_01
(cost=0.56..1601734.57rows=8943848 width=349) 
       Index Cond: ((recordopeningtime >= '2016-05-07 00:00:00'::timestamp without time zone) AND (recordopeningtime <
'2016-05-0800:00:00'::timestamp without time zone)) 

> > AFAICT what's happening is that the index scan was returning pages
> > nonsequentially.  strace-ing the backend showed alternating lseek()s and
> > read()s, with the offsets not consistently increasing (nor consistently
> > decreasing):
..
>
> Which of those are the table, and which the index?
Those weren't necessarily strace of the same process; I believe both of these
were table data/heap, and didn't include any index access.

> Something doesn't add up here.  How could an index of an append-only
> table possibly become that fragmented, when the highest amount of key
> duplication is about 170?

I'm certainly opened to alternate interpretations / conclusions :)

Justin


pgsql-performance by date:

Previous
From: Justin Pryzby
Date:
Subject: Re: index fragmentation on insert-only table with non-unique column
Next
From: Vladimir Borodin
Date:
Subject: 9.4 -> 9.5 regression with queries through pgbouncer on RHEL 6