
From Alexander Staubo
Subject Bad query plan with high-cardinality column
Date
Msg-id 19B2B4ECC4724A438200C54257BA8B20@gmail.com
List pgsql-performance
I have a problem with a query that is planned wrong. This is my schema:

   create table comments (
     id serial primary key,
     conversation_id integer,
     created_at timestamp
   );
   create index comments_conversation_id_index on comments (conversation_id);
   create index comments_created_at_index on comments (created_at);

The table has 3.5M rows, and 650k unique values for "conversation_id", where the histogram goes up to 54000 rows for
the most frequent ID, with a long tail. There are only 20 values with a frequency of 1000 or higher. The "created_at"
column has 3.5M distinct values.
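
For reference, the distribution figures above can be reproduced with a query along these lines (this exact statement is illustrative, not a verbatim transcript of my session):

   -- Illustrative: per-value row counts for conversation_id, most frequent first.
   select conversation_id, count(*) as row_count
   from comments
   group by conversation_id
   order by count(*) desc
   limit 20;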

Now, I have this query:

   select comments.id from comments where
     conversation_id = 3975979 order by created_at limit 13

This filters about 5000 rows and returns the oldest 13 rows. But the query is consistently planned wrong:

   Limit  (cost=0.00..830.53 rows=13 width=12) (actual time=3174.862..3179.525 rows=13 loops=1)
     Buffers: shared hit=2400709 read=338923 written=21
     ->  Index Scan using comments_created_at_index on comments  (cost=0.00..359938.52 rows=5634 width=12) (actual time=3174.860..3179.518 rows=13 loops=1)
           Filter: (conversation_id = 3975979)
           Rows Removed by Filter: 2817751
           Buffers: shared hit=2400709 read=338923 written=21
   Total runtime: 3179.553 ms

It takes anywhere between 3 seconds and several minutes to run, depending on how warm the disk cache is. This is the
correct plan and index usage:

   Limit  (cost=6214.34..6214.38 rows=13 width=12) (actual time=25.471..25.473 rows=13 loops=1)
     Buffers: shared hit=197 read=4510
     ->  Sort  (cost=6214.34..6228.02 rows=5471 width=12) (actual time=25.469..25.470 rows=13 loops=1)
           Sort Key: created_at
           Sort Method: top-N heapsort  Memory: 25kB
           Buffers: shared hit=197 read=4510
           ->  Index Scan using comments_conversation_id_index on comments  (cost=0.00..6085.76 rows=5471 width=12) (actual time=1.163..23.955 rows=5834 loops=1)
                 Index Cond: (conversation_id = 3975979)
                 Buffers: shared hit=197 read=4510
   Total runtime: 25.500 ms

The problem for Postgres is obviously to estimate how many rows have a given "conversation_id" value, but I have
confirmed that the value is correctly tracked in the histogram.
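
The check itself is a standard pg_stats lookup, roughly like this (illustrative):

   -- Illustrative: distinct-value estimate and most-common-values histogram
   -- for the filter column.
   select n_distinct, most_common_vals, most_common_freqs
   from pg_stats
   where tablename = 'comments' and attname = 'conversation_id';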

I'm at a loss how to explain why the planner thinks scanning a huge index that covers the entire table will ever beat
scanning a small index that has 17% of the table's values. Even if the entire database were in RAM, this would hit far
too many buffers unnecessarily. (I have determined that the planner will consistently use the bad plan for
higher-frequency values, and the good plan for lower-frequency values.) It will *always* be better to branch off the
"comments_conversation_id_index" index.

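As an untested aside, a composite index covering both columns would let the oldest 13 matching rows be read directly in index order, which is one way of seeing why branching off "conversation_id" should always win here (sketch only, not something I have benchmarked):

   -- Untested sketch: serves both the equality filter and the ORDER BY.
   create index comments_conversation_id_created_at_index
     on comments (conversation_id, created_at);
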
Another curious thing: if I run ANALYZE repeatedly, the planner will sometimes, oddly enough, choose the correct plan.
This behaviour actually seems related to effective_cache_size; if it's set small (128MB), the planner will sometimes
favour the good plan, but if large (>= 2GB) it will consistently use the bad plan. Not sure if ANALYZE is changing
anything or if it's just bad timing.
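
To reproduce that, the setting can be toggled per session before re-running the plan, roughly like so (commands illustrative):

   -- Illustrative per-session experiment; repeat with '2GB' to see the plan flip.
   set effective_cache_size = '128MB';
   explain (analyze, buffers)
     select comments.id from comments
     where conversation_id = 3975979
     order by created_at limit 13;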

Things I have tried: I have bumped the statistics target up to 10000, but it does not help. I have also tried setting
n_distinct for the column manually, since Postgres guesses it's 285k instead of 650k, but that does not help.
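
Those adjustments were roughly of this form (commands illustrative, not a verbatim log):

   -- Illustrative: raise the statistics target and pin n_distinct, then re-analyze.
   alter table comments alter column conversation_id set statistics 10000;
   alter table comments alter column conversation_id set (n_distinct = 650000);
   analyze comments;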

Our environment: PostgreSQL 9.2.2, with tweaked memory parameters (work_mem, sort_mem, shared_buffers, effective_cache_size)
and untouched planner/cost parameters. The problem also exists on 9.2.3.



