Thread: NestedLoops over BitmapScan question

NestedLoops over BitmapScan question

From
Виктор Егоров
Date:
Greetings.

I have a small monitoring query on the following tables:
select relname,relpages,reltuples::numeric(12) from pg_class where relname in ('meta_version','account') order by 1;
   relname    | relpages | reltuples 
--------------+----------+-----------
 account      |     3235 |    197723
 meta_version |   710068 |  32561200
(2 rows)

The logical “body” of the query is:
select count(*) from meta_version where account_id in (select account_id from account where customer_id = 8608064);

I know that due to the data distribution (above customer's accounts are used in 45% of the meta_version table) I
cannot expect fast results. But I have another question.

With default default_statistics_target I get the following plan:

In order to get better estimates, I've increased statistics targets to 200 for account.customer_id and meta_version.account_id.
Now I have the following plan:

Second query takes twice more time.
My questions are:
- why with better statistics planner chooses to do a SeqScan in favor of BitmapIndexScan inside the NestedLoops?
- is it possible to adjust this decision by changing other GUCs, perhaps costs?
- would it be correct to adjust seq_page_cost and random_page_cost based on the IOPS of the underlying disks?
  any other metrics should be considered?

I'm running on a:
            name            |                                                current_setting                                                
----------------------------+---------------------------------------------------------------------------------------------------------------
 version                    | PostgreSQL 9.1.6 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-52), 64-bit
 archive_command            | test ! -f $PG_WAL/%f && cp %p $PG_WAL/%f
 archive_mode               | on
 bgwriter_delay             | 50ms
 bgwriter_lru_maxpages      | 200
 checkpoint_segments        | 25
 checkpoint_timeout         | 30min
 client_encoding            | UTF8
 effective_cache_size       | 8GB
 hot_standby                | on
 lc_collate                 | en_US.UTF-8
 lc_ctype                   | en_US.UTF-8
 listen_addresses           | *
 log_checkpoints            | on
 log_connections            | on
 log_destination            | csvlog
 log_directory              | ../../log/CLUSTER
 log_disconnections         | on
 log_file_mode              | 0640
 log_filename               | pg-%Y%m%d_%H%M%S.log
 log_line_prefix            | %u:%d:%a:%h:%c:%x:%t> 
 log_lock_waits             | on
 log_min_duration_statement | 300ms
 log_rotation_age           | 1d
 log_rotation_size          | 0
 log_temp_files             | 20MB
 logging_collector          | on
 maintenance_work_mem       | 512MB
 max_connections            | 200
 max_prepared_transactions  | 0
 max_stack_depth            | 2MB
 max_wal_senders            | 2
 port                       | 9120
 server_encoding            | UTF8
 shared_buffers             | 5GB
 silent_mode                | on
 ssl                        | on
 ssl_renegotiation_limit    | 0
 tcp_keepalives_idle        | 0
 temp_buffers               | 256MB
 TimeZone                   | US/Eastern
 wal_buffers                | 512kB
 wal_keep_segments          | 0
 wal_level                  | hot_standby
 wal_sender_delay           | 1s
 work_mem                   | 32MB

Regards.

--
Victor Y. Yegorov

Re: NestedLoops over BitmapScan question

From
Виктор Егоров
Date:
Well, I've managed to track down the cause of improper plans.

Due to the data distribution n_distinct had been estimated way too low.
I've manually set it to be 195300 instead of 15500 (with stats_target=200):
select tablename,attname,null_frac,avg_width,n_distinct,correlation
  from pg_stats
 where (tablename,attname) IN
 (VALUES ('meta_version','account_id'),('account','customer_id'));
  tablename   |   attname   | null_frac | avg_width | n_distinct | correlation
--------------+-------------+-----------+-----------+------------+-------------
 account      | customer_id |         0 |         4 |         57 |    0.998553
 meta_version | account_id  |         0 |         4 |     195300 |   0.0262315
(2 rows)

Still, optimizer underestimates rows returned by the IndexScan heavily:
http://explain.depesz.com/s/pDw

Is it possible to get correct estimates for the IndexScan on the right side
of the NestedLoops? I assume estimation is done by the B-tree AM and
it is seems to be not affected by the STATISTICS parameter of the
column.


2012/9/29 Виктор Егоров <vyegorov@gmail.com>:
> Now I have the following plan:
> http://explain.depesz.com/s/YZJ
>
> Second query takes twice more time.


--
Victor Y. Yegorov