Thread: Re: Adding additional index causes 20,000x slowdown for certain select queries - postgres 9.0.3

Forgot to include our non-default config settings and server info, though it probably doesn't make a difference here.

from pg_settings:
 name                         | current_setting
 version                      | PostgreSQL 9.0.3 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit
 bytea_output                 | escape
 checkpoint_completion_target | 0.9
 checkpoint_segments          | 24
 effective_cache_size         | 24GB
 effective_io_concurrency     | 2
 lc_collate                   | en_US.utf8
 lc_ctype                     | en_US.utf8
 listen_addresses             | *
 log_checkpoints              | on
 log_connections              | on
 log_disconnections           | on
 log_hostname                 | on
 log_line_prefix              | %t
 logging_collector            | on
 maintenance_work_mem         | 256MB
 max_connections              | 120
 max_stack_depth              | 2MB
 port                         | 5432
 server_encoding              | UTF8
 shared_buffers               | 4GB
 synchronous_commit           | off
 tcp_keepalives_idle          | 180
 TimeZone                     | US/Eastern
 track_activity_query_size    | 8192
 wal_buffers                  | 16MB
 wal_writer_delay             | 330ms
 work_mem                     | 512MB

This is a dual dual-core 64-bit Intel machine (hyperthreaded, so 8 logical CPUs) with 24GB of memory, running essentially just the database against a RAID 5 disk array.

Tim

On Tue, Mar 15, 2011 at 2:23 PM, Timothy Garnett <tgarnett@panjiva.com> wrote:
Hi all,

We added an index to a table (to support some new functionality) and then ran into cases where the new index (on (month, bl_number) in the schema below) made some existing queries ~20,000 times slower.  While we do have a workaround (using a CTE to force the proper index to be used) that gets us back to within ~2x of the original performance (i.e. without the new index), I'm wondering if there's a better workaround that gets closer to the original performance. It also seems like a strange enough case that something may be off in the optimizer.  The row-count estimates are quite accurate, so the planner is guessing that about right, but it seems to put far too much weight on walking an index that matches the sort order versus looking rows up directly. This is all after an ANALYZE.

As near as I can guess, the planner is weighing a backward scan of the month index — expecting to touch ~100k rows before finding 100 matches, given an estimated selectivity of ~48k rows out of 45 million (in practice it has to walk about 35 million rows, because the assumption of independence between the columns is wrong here) — against scanning ~48k rows via the buyer_id index and doing a top-N 100 sort on them (actual row count is 43k, so that estimate is close).  Even granting the optimizer the assumption of column independence, I don't see how the first plan could possibly come out ahead.  It would really help if EXPLAIN printed the number of rows it expects to scan, and if ANALYZE printed the number of rows it actually scanned (instead of just the number that matched the filter/limit); see the expensive query's EXPLAIN ANALYZE output below.
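(Back-of-the-envelope on that expectation: under the independence assumption, matching rows should be uniformly mixed through the month index, so finding 100 of them should take about 100 * (45,000,000 / 48,623) ≈ 92,500 index entries scanned — the ~100k figure above.)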

At the bottom there's some info on the table contents and value distributions.


## The original Query:

explain analyze
SELECT customs_records.* FROM "customs_records" WHERE (((buyer_id IS NOT NULL AND buyer_id IN (1172672,1570888,1336461,1336464,1336471,1189285,1336463,1336460,1654709,1000155646,1191114,1336480,1336479,1000384928,1161787,1811495,1507188,1159339,1980765,1200258,1980770,1980768,1980767,1980769,1980766,1980772,1000350850,1000265917,1980764,1980761,1170019,1980762,1184356,1985585))))  ORDER BY month DESC LIMIT 100 OFFSET 0;
-----------------------------------
 Limit  (cost=184626.64..184626.89 rows=100 width=908) (actual time=102.630..102.777 rows=100 loops=1)
   ->  Sort  (cost=184626.64..184748.19 rows=48623 width=908) (actual time=102.628..102.683 rows=100 loops=1)
         Sort Key: month
         Sort Method:  top-N heapsort  Memory: 132kB
         ->  Bitmap Heap Scan on customs_records  (cost=1054.22..182768.30 rows=48623 width=908) (actual time=5.809..44.832 rows=43352 loops=1)
               Recheck Cond: ((buyer_id IS NOT NULL) AND (buyer_id = ANY ('{1172672,1570888,1336461,1336464,1336471,1189285,1336463,1336460,1654709,1000155646,1191114,1336480,1336479,1000384928,1161787,1811495,1507188,1159339,1980765,1200258,1980770,1980768,1980767,1980769,1980766,1980772,1000350850,1000265917,1980764,1980761,1170019,1980762,1184356,1985585}'::integer[])))
               ->  Bitmap Index Scan on index_customs_records_on_buyer_id  (cost=0.00..1042.07 rows=48623 width=0) (actual time=4.588..4.588 rows=43352 loops=1)
                     Index Cond: ((buyer_id IS NOT NULL) AND (buyer_id = ANY ('{1172672,1570888,1336461,1336464,1336471,1189285,1336463,1336460,1654709,1000155646,1191114,1336480,1336479,1000384928,1161787,1811495,1507188,1159339,1980765,1200258,1980770,1980768,1980767,1980769,1980766,1980772,1000350850,1000265917,1980764,1980761,1170019,1980762,1184356,1985585}'::integer[])))
 Total runtime: 102.919 ms


## Same query after adding the new index
### NOTE - it would be very useful here if EXPLAIN printed the number of rows it expects to scan in the index, and if ANALYZE reported the number of rows actually scanned.  Instead, ANALYZE prints the rows actually output, and EXPLAIN appears to print the number of rows expected to match the filter, ignoring the limit... (it exactly matches the row count in the query above)
##
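### A rough proxy might be EXPLAIN's BUFFERS option (new in 9.0), which reports the pages each node
### touched and so hints at how far the index scan walked even when the row counts don't.  Untried
### on this query, but along the lines of:
###   explain (analyze, buffers)
###   SELECT customs_records.* FROM "customs_records" WHERE buyer_id IS NOT NULL AND buyer_id IN (... same id list as above ...) ORDER BY month DESC LIMIT 100 OFFSET 0;
##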

explain analyze
SELECT customs_records.* FROM "customs_records" WHERE (((buyer_id IS NOT NULL AND buyer_id IN (1172672,1570888,1336461,1336464,1336471,1189285,1336463,1336460,1654709,1000155646,1191114,1336480,1336479,1000384928,1161787,1811495,1507188,1159339,1980765,1200258,1980770,1980768,1980767,1980769,1980766,1980772,1000350850,1000265917,1980764,1980761,1170019,1980762,1184356,1985585))))  ORDER BY month DESC LIMIT 100 OFFSET 0;
--------------------------------------------
 Limit  (cost=0.00..161295.58 rows=100 width=908) (actual time=171344.185..3858893.743 rows=100 loops=1)
   ->  Index Scan Backward using index_customs_records_on_month_and_bl_number on customs_records  (cost=0.00..78426750.74 rows=48623 width=908) (actual time=171344.182..3858893.588 rows=100 loops=1)
         Filter: ((buyer_id IS NOT NULL) AND (buyer_id = ANY ('{1172672,1570888,1336461,1336464,1336471,1189285,1336463,1336460,1654709,1000155646,1191114,1336480,1336479,1000384928,1161787,1811495,1507188,1159339,1980765,1200258,1980770,1980768,1980767,1980769,1980766,1980772,1000350850,1000265917,1980764,1980761,1170019,1980762,1184356,1985585}'::integer[])))
 Total runtime: 3858893.908 ms


############################################################
My workaround is to use a CTE to keep the planner from using the month index for the sort (a plain subselect is not enough, since the planner is smart enough to flatten it). However, this is still about twice as slow as the original query... (an alternative sketch follows the plan output below)
############################################################

explain analyze
with foo as (select customs_records.* FROM "customs_records" WHERE (((buyer_id IS NOT NULL AND buyer_id IN (1172672,1570888,1336461,1336464,1336471,1189285,1336463,1336460,1654709,1000155646,1191114,1336480,1336479,1000384928,1161787,1811495,1507188,1159339,1980765,1200258,1980770,1980768,1980767,1980769,1980766,1980772,1000350850,1000265917,1980764,1980761,1170019,1980762,1184356,1985585))))) select * from foo order by month desc limit 100 ;
-----------------------------------------------------------
 Limit  (cost=185599.10..185599.35 rows=100 width=5325) (actual time=196.968..197.105 rows=100 loops=1)
   CTE foo
     ->  Bitmap Heap Scan on customs_records  (cost=1054.22..182768.30 rows=48623 width=908) (actual time=5.765..43.489 rows=43352 loops=1)
           Recheck Cond: ((buyer_id IS NOT NULL) AND (buyer_id = ANY ('{1172672,1570888,1336461,1336464,1336471,1189285,1336463,1336460,1654709,1000155646,1191114,1336480,1336479,1000384928,1161787,1811495,1507188,1159339,1980765,1200258,1980770,1980768,1980767,1980769,1980766,1980772,1000350850,1000265917,1980764,1980761,1170019,1980762,1184356,1985585}'::integer[])))
           ->  Bitmap Index Scan on index_customs_records_on_buyer_id  (cost=0.00..1042.07 rows=48623 width=0) (actual time=4.544..4.544 rows=43352 loops=1)
                 Index Cond: ((buyer_id IS NOT NULL) AND (buyer_id = ANY ('{1172672,1570888,1336461,1336464,1336471,1189285,1336463,1336460,1654709,1000155646,1191114,1336480,1336479,1000384928,1161787,1811495,1507188,1159339,1980765,1200258,1980770,1980768,1980767,1980769,1980766,1980772,1000350850,1000265917,1980764,1980761,1170019,1980762,1184356,1985585}'::integer[])))
   ->  Sort  (cost=2830.80..2952.35 rows=48623 width=5325) (actual time=196.966..197.029 rows=100 loops=1)
         Sort Key: foo.month
         Sort Method:  top-N heapsort  Memory: 132kB
         ->  CTE Scan on foo  (cost=0.00..972.46 rows=48623 width=5325) (actual time=5.770..153.322 rows=43352 loops=1)
 Total runtime: 207.282 ms
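
## An alternative worth trying (a sketch, untested on this data) that might avoid the CTE: wrap the
## sort key in an expression so it no longer matches the (month, bl_number) index ordering.  In
## PostgreSQL, date + integer is still a date, so "month + 0" sorts identically to "month" but
## cannot be satisfied by a backward scan of the new index:

explain analyze
SELECT customs_records.* FROM "customs_records" WHERE buyer_id IS NOT NULL AND buyer_id IN (1172672,1570888 /* ... same id list as above ... */) ORDER BY month + 0 DESC LIMIT 100 OFFSET 0;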


#### Table information

Table information - the schema is the table below (with some columns removed for succinctness).  There are ~45 million rows, and the rows are fairly wide, about 80 columns total. buyer_id is null ~30% of the time (as is supplier_id). A given buyer_id maps to between 1 and ~100,000 records (in a decreasing distribution; there are about 1 million unique buyer_id values).  supplier_id is similar.  Note that the buyer_id and month columns are not always independent (for some buyer_ids there is a strong correlation, as in this case where the buyer_ids are associated only with older months, though for others there isn't); even so, I'm still not clear on why it would pick the plan that it does. These tables are effectively never updated or inserted into (updates are done in a new db offline that is periodically swapped in).
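
As a quick check of what the planner believes about these columns (null fraction, distinct count, physical correlation), the standard pg_stats view can be queried; output omitted here:

select attname, null_frac, n_distinct, correlation
from pg_stats
where tablename = 'customs_records'
  and attname in ('buyer_id', 'month');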

                                          Table "public.customs_records"

          Column          |          Type          |                          Modifiers                          
--------------------------+------------------------+--------------------------------------------------------------
 id                       | integer                | not null default nextval('customs_records_id_seq'::regclass)
....
 bl_number                | character varying(16)  |
....
 month                    | date                   |
....
 buyer_id                 | integer                |
...
 supplier_id              | integer                |
...
Indexes:
    "customs_records_pkey" PRIMARY KEY, btree (id) WITH (fillfactor=100)
    "index_customs_records_on_month_and_bl_number" UNIQUE, btree (month, bl_number) WITH (fillfactor=100)
    "index_customs_records_on_buyer_id" btree (buyer_id) WITH (fillfactor=100) WHERE buyer_id IS NOT NULL
    "index_customs_records_on_supplier_id_and_buyer_id" btree (supplier_id, buyer_id) WITH (fillfactor=100) CLUSTER


db version =>
 PostgreSQL 9.0.3 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit
 (EnterpriseDB build)
The host is Ubuntu 8.04 LTS.


Sorry, I meant 32GB of memory.

Tim




Timothy Garnett <tgarnett@panjiva.com> wrote:

>>>    ->  Index Scan Backward using
>>> index_customs_records_on_month_and_bl_number on customs_records
>>> (cost=0.00..78426750.74 rows=48623 width=908) (actual
>>> time=171344.182..3858893.588 rows=100 loops=1)

We've seen a lot of those lately -- Index Scan Backward performing
far worse than alternatives.  One part of it is that disk sectors
are arranged for optimal performance on forward scans; but I don't
think we've properly accounted for the higher cost of moving
backward through our btree indexes, either.  To quote from the
README for the btree AM:

| A backwards scan has one additional bit of complexity: after
| following the left-link we must account for the possibility that
| the left sibling page got split before we could read it.  So, we
| have to move right until we find a page whose right-link matches
| the page we came from.  (Actually, it's even harder than that; see
| deletion discussion below.)

I'm wondering whether the planner should have some multiplier or
other adjustment to attempt to approximate the known higher cost of
backward scans.
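
A crude way to measure that differential on a given index (hypothetical, not something we've run
here) would be to time the same traversal in both directions, assuming the planner picks the index
scan for both:

explain analyze select month from customs_records order by month asc limit 1000000;
explain analyze select month from customs_records order by month desc limit 1000000;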

-Kevin

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Timothy Garnett <tgarnett@panjiva.com> wrote:
>> ->  Index Scan Backward using
>> index_customs_records_on_month_and_bl_number on customs_records
>> (cost=0.00..78426750.74 rows=48623 width=908) (actual
>> time=171344.182..3858893.588 rows=100 loops=1)

> We've seen a lot of those lately -- Index Scan Backward performing
> far worse than alternatives.

It's not clear to me that that has anything to do with Tim's problem.
It certainly wouldn't be 20000x faster if it were a forward scan.

> One part of it is that disk sectors
> are arranged for optimal performance on forward scans; but I don't
> think we've properly accounted for the higher cost of moving
> backward through our btree indexes, either.  To quote from the
> README for the btree AM:

> | A backwards scan has one additional bit of complexity: after
> | following the left-link we must account for the possibility that
> | the left sibling page got split before we could read it.  So, we
> | have to move right until we find a page whose right-link matches
> | the page we came from.  (Actually, it's even harder than that; see
> | deletion discussion below.)

That's complicated, but it's not slow, except in the extremely
infrequent case where there actually was an index page split while your
scan was in flight to the page.  The normal code path will only spend
one extra comparison to verify that no such split happened, and then it
goes on about its business.

The point about disk page layout is valid, so I could believe that in
a recently-built index there might be a significant difference in
forward vs backward scan speed, if none of the index were in memory.
The differential would degrade pretty rapidly due to page splits though
...

            regards, tom lane