Thread: Query plan on identical tables differs . Why ?

Query plan on identical tables differs . Why ?

From
"Fabio Panizzutti"
Date:
Hello I'm tuning a postgresql (7.4.2) server for best performance .
I have a question about the planner .
I have two identical tables : one stores short data (about 2.000.000
record now) and
the other historycal data ( about 8.000.000 record now and growing ...)


A simple test query :   select tag_id,valore_tag,data_tag from
storico_misure where (data_tag>'2004-05-03' and data_tag <'2004-05-12')
and tag_id=37423 ;

Takes 57,637 ms on the short table and 1321,448 ms (!!) on the
historycal table .Tables are vacuumed and reindexed .



Tables and query plans :

\d storico_misure
                   Table "tenore.storico_misure"
         Column          |            Type             | Modifiers
-------------------------+-----------------------------+-----------
 data_tag                | timestamp without time zone | not null
 tag_id                  | integer                     | not null
 unita_misura            | character varying(6)        | not null
 valore_tag              | numeric(20,3)               | not null
 qualita                 | integer                     | not null
 numero_campioni         | numeric(5,0)                |
 frequenza_campionamento | numeric(3,0)                |
Indexes:
    "pk_storico_misure_2" primary key, btree (data_tag, tag_id)
    "pk_anagtstorico_misuree_idx_2" btree (tag_id)
    "storico_misure_data_tag_idx_2" btree (data_tag)

storico=# \d storico_misure_short
                Table "tenore.storico_misure_short"
         Column          |            Type             | Modifiers
-------------------------+-----------------------------+-----------
 data_tag                | timestamp without time zone | not null
 tag_id                  | integer                     | not null
 unita_misura            | character varying(6)        | not null
 valore_tag              | numeric(20,3)               | not null
 qualita                 | integer                     | not null
 numero_campioni         | numeric(5,0)                |
 frequenza_campionamento | numeric(3,0)                |
Indexes:
    "storico_misure_short_pkey_2" primary key, btree (data_tag, tag_id)
    "pk_anagtstorico_misuree_short_idx_2" btree (tag_id)
    "storico_misure_short_data_tag_idx_2" btree (data_tag)

storico=#
storico=#
storico=# explain select tag_id,valore_tag,data_tag from storico_misure
where (data_tag>'2004-05-03' and data_tag <'2004-05-12') and
tag_id=37423 ;

QUERY PLAN

------------------------------------------------------------------------
------------------------------------------------------------------------
--------------------------
 Index Scan using pk_storico_misure_2 on storico_misure
(cost=0.00..1984.64 rows=658 width=21)
   Index Cond: ((data_tag > '2004-05-03 00:00:00'::timestamp without
time zone) AND (data_tag < '2004-05-12 00:00:00'::timestamp without time
zone) AND (tag_id = 37423))
(2 rows)

Time: 1,667 ms
storico=# explain select tag_id,valore_tag,data_tag from
storico_misure_short where (data_tag>'2004-05-03' and data_tag
<'2004-05-12') and tag_id=37423 ;
                                                                   QUERY
PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
-
 Index Scan using pk_anagtstorico_misuree_short_idx_2 on
storico_misure_short  (cost=0.00..1784.04 rows=629 width=20)
   Index Cond: (tag_id = 37423)
   Filter: ((data_tag > '2004-05-03 00:00:00'::timestamp without time
zone) AND (data_tag < '2004-05-12 00:00:00'::timestamp without time
zone))


How can i force the planner to use the same query plan ? I'd like to
test if using the same query plan i've better performace .

Thanks in advance




this is my posgresql.conf

#-----------------------------------------------------------------------
----
# CONNECTIONS AND AUTHENTICATION
#-----------------------------------------------------------------------
----

# - Connection Settings -

tcpip_socket = true
max_connections = 100
    # note: increasing max_connections costs about 500 bytes of
shared
    # memory per connection slot, in addition to costs from
shared_buffers
        # and max_locks_per_transaction.
#superuser_reserved_connections = 2
port = 5432
#unix_socket_directory = ''
#unix_socket_group = ''
#unix_socket_permissions = 0777    # octal
#virtual_host = ''        # what interface to listen on; defaults
to any
#rendezvous_name = ''        # defaults to the computer name

# - Security & Authentication -

#authentication_timeout = 60    # 1-600, in seconds
#ssl = false
#password_encryption = true
#krb_server_keyfile = ''
#db_user_namespace = false


#-----------------------------------------------------------------------
----
# RESOURCE USAGE (except WAL)
#-----------------------------------------------------------------------
----

# - Memory -

shared_buffers = 3000        # min 16, at least max_connections*2,
8KB each
sort_mem = 4096         # min 64, size in KB
vacuum_mem = 32768        # min 1024, size in KB

# - Free Space Map -

#max_fsm_pages = 20000        # min max_fsm_relations*16, 6 bytes each
#max_fsm_relations = 1000    # min 100, ~50 bytes each

# - Kernel Resource Usage -

#max_files_per_process = 1000    # min 25
#preload_libraries = ''


#-----------------------------------------------------------------------
----
# WRITE AHEAD LOG
#-----------------------------------------------------------------------
----

# - Settings -

fsync = false            # turns forced synchronization on or off
#wal_sync_method = fsync    # the default varies across platforms:
                # fsync, fdatasync, open_sync, or
open_datasync
#wal_buffers = 8        # min 4, 8KB each

# - Checkpoints -

checkpoint_segments = 12    # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 300    # range 30-3600, in seconds
#checkpoint_warning = 30    # 0 is off, in seconds
#commit_delay = 0        # range 0-100000, in microseconds
#commit_siblings = 5        # range 1-1000


#-----------------------------------------------------------------------
----
# QUERY TUNING
#-----------------------------------------------------------------------
----

# - Planner Method Enabling -

enable_hashagg = false
enable_hashjoin = false
enable_indexscan = true
enable_mergejoin = true
enable_nestloop = false
enable_seqscan = true
enable_sort = false
enable_tidscan = false

# - Planner Cost Constants -

#effective_cache_size = 1000    # typically 8KB each
#random_page_cost = 4        # units are one sequential page fetch
cost
#cpu_tuple_cost = 0.01        # (same)
#cpu_index_tuple_cost = 0.001    # (same)
#cpu_operator_cost = 0.0025    # (same)

# - Genetic Query Optimizer -

#geqo = true
#geqo_threshold = 11
#geqo_effort = 1
#geqo_generations = 0
#geqo_pool_size = 0        # default based on tables in statement,
                # range 128-1024
#geqo_selection_bias = 2.0    # range 1.5-2.0

# - Other Planner Options -

#default_statistics_target = 10    # range 1-1000
#from_collapse_limit = 8
#join_collapse_limit = 8    # 1 disables collapsing of explicit
JOINs


#-----------------------------------------------------------------------
----
# ERROR REPORTING AND LOGGING
#-----------------------------------------------------------------------
----

# - Syslog -

#syslog = 0            # range 0-2; 0=stdout; 1=both; 2=syslog
#syslog_facility = 'LOCAL0'
#syslog_ident = 'postgres'

# - When to Log -

#client_min_messages = notice    # Values, in order of decreasing detail:
                #   debug5, debug4, debug3, debug2,
debug1,
                #   log, info, notice, warning, error

#log_min_messages = notice    # Values, in order of decreasing detail:
                #   debug5, debug4, debug3, debug2,
debug1,
                #   info, notice, warning, error, log,
fatal,
                #   panic

#log_error_verbosity = default   # terse, default, or verbose messages

#log_min_error_statement = panic # Values in order of increasing
severity:
                 #   debug5, debug4, debug3, debug2,
debug1,
                 #   info, notice, warning, error,
panic(off)

#log_min_duration_statement = -1 # Log all statements whose
                 # execution time exceeds the value, in
                 # milliseconds.  Zero prints all
queries.
                 # Minus-one disables.

#silent_mode = false         # DO NOT USE without Syslog!

# - What to Log -

#debug_print_parse = false
#debug_print_rewritten = false
#debug_print_plan = false
#debug_pretty_print = false
#log_connections = false
#log_duration = false
#log_pid = false
#log_statement = false
#log_timestamp = false
#log_hostname = false
#log_source_port = false


#-----------------------------------------------------------------------
----
# RUNTIME STATISTICS
#-----------------------------------------------------------------------
----

# - Statistics Monitoring -

#log_parser_stats = false
#log_planner_stats = false
#log_executor_stats = false
#log_statement_stats = false

# - Query/Index Statistics Collector -

stats_start_collector = true
stats_command_string = true
#stats_block_level = false
stats_row_level = true
#stats_reset_on_server_start = true


#-----------------------------------------------------------------------
----
# CLIENT CONNECTION DEFAULTS
#-----------------------------------------------------------------------
----

# - Statement Behavior -

#search_path = '$user,public'    # schema names
#check_function_bodies = true
#default_transaction_isolation = 'read committed'
#default_transaction_read_only = false
statement_timeout = 360000        # 0 is disabled, in milliseconds

# - Locale and Formatting -

#datestyle = 'iso, mdy'
#timezone = unknown        # actually, defaults to TZ environment
setting
#australian_timezones = false
#extra_float_digits = 0        # min -15, max 2
#client_encoding = sql_ascii    # actually, defaults to database
encoding

# These settings are initialized by initdb -- they may be changed
lc_messages = 'it_IT.UTF-8'        # locale for system error
message strings
lc_monetary = 'it_IT.UTF-8'        # locale for monetary formatting
lc_numeric = 'it_IT.UTF-8'        # locale for number formatting
lc_time = 'it_IT.UTF-8'            # locale for time formatting

# - Other Defaults -

#explain_pretty_print = true
#dynamic_library_path = '$libdir'
#max_expr_depth = 10000        # min 10


#-----------------------------------------------------------------------
----
# LOCK MANAGEMENT
#-----------------------------------------------------------------------
----

#deadlock_timeout = 1000    # in milliseconds
#max_locks_per_transaction = 64    # min 10, ~260*max_connections bytes
each


#-----------------------------------------------------------------------
----
# VERSION/PLATFORM COMPATIBILITY
#-----------------------------------------------------------------------
----

# - Previous Postgres Versions -

#add_missing_from = true
#regex_flavor = advanced    # advanced, extended, or basic
#sql_inheritance = true

# - Other Platforms & Clients -

#transform_null_equals = false









Re: Query plan on identical tables differs . Why ?

From
Shridhar Daithankar
Date:
Fabio Panizzutti wrote:
> storico=# explain select tag_id,valore_tag,data_tag from storico_misure
> where (data_tag>'2004-05-03' and data_tag <'2004-05-12') and
> tag_id=37423 ;

Can you please post explain analyze? That includes actual timings.

Looking at the schema, can you try "and tag_id=37423::integer" instead?

> enable_hashagg = false
> enable_hashjoin = false
> enable_indexscan = true
> enable_mergejoin = true
> enable_nestloop = false
> enable_seqscan = true
> enable_sort = false
> enable_tidscan = false

Why do you have these off? AFAIK, 7.4 improved hash aggregates a lot. So you
might miss on these in this case.

> # - Planner Cost Constants -
>
> #effective_cache_size = 1000    # typically 8KB each

You might set it to something realistic.

And what is your hardware setup? Disks/CPU/RAM?

Just to be sure, you went thr.
http://www.varlena.com/varlena/GeneralBits/Tidbits/annotated_conf_e.html and
http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html?

HTH

  Regards
    Shridhar

R: Query plan on identical tables differs . Why ?

From
"Fabio Panizzutti"
Date:

>>>-----Messaggio originale-----
>>>Da: pgsql-performance-owner@postgresql.org
>>>[mailto:pgsql-performance-owner@postgresql.org] Per conto di
>>>Shridhar Daithankar
>>>Inviato: giovedì 13 maggio 2004 15.05
>>>A: Fabio Panizzutti
>>>Cc: pgsql-performance@postgresql.org
>>>Oggetto: Re: [PERFORM] Query plan on identical tables differs . Why ?
>>>
>>>
>>>Fabio Panizzutti wrote:
>>>> storico=# explain select tag_id,valore_tag,data_tag from
>>>> storico_misure where (data_tag>'2004-05-03' and data_tag
>>>> <'2004-05-12') and tag_id=37423 ;
>>>
>>>Can you please post explain analyze? That includes actual timings.

storico=# explain analyze select tag_id,valore_tag,data_tag from
storico_misure where (data_tag>'2004-05-03' and data_tag <'2004-05-12')
and tag_id=37423 ;

QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
--------------------------
 Index Scan using pk_storico_misure_2 on storico_misure
(cost=0.00..1984.64 rows=658 width=21) (actual time=723.441..1858.107
rows=835 loops=1)
   Index Cond: ((data_tag > '2004-05-03 00:00:00'::timestamp without
time zone) AND (data_tag < '2004-05-12 00:00:00'::timestamp without time
zone) AND (tag_id = 37423))
 Total runtime: 1860.641 ms
(3 rows)

storico=# explain analyze select tag_id,valore_tag,data_tag from
storico_misure_short where (data_tag>'2004-05-03' and data_tag
<'2004-05-12') and tag_id=37423 ;

QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
-------------------
 Index Scan using pk_anagtstorico_misuree_short_idx_2 on
storico_misure_short  (cost=0.00..1783.04 rows=629 width=20) (actual
time=0.323..42.186 rows=864 loops=1)
   Index Cond: (tag_id = 37423)
   Filter: ((data_tag > '2004-05-03 00:00:00'::timestamp without time
zone) AND (data_tag < '2004-05-12 00:00:00'::timestamp without time
zone))
 Total runtime: 43.166 ms




>>>Looking at the schema, can you try "and
>>>tag_id=37423::integer" instead?
>>>

I try :
explain analyze select tag_id,valore_tag,data_tag from storico_misure
where (data_tag>'2004-05-03' and data_tag <'2004-05-12') and
tag_id=37423::integer;
Index Scan using pk_storico_misure_2 on storico_misure
(cost=0.00..1984.64 rows=658 width=21) (actual time=393.337..1303.998
rows=835 loops=1)
   Index Cond: ((data_tag > '2004-05-03 00:00:00'::timestamp without
time zone) AND (data_tag < '2004-05-12 00:00:00'::timestamp without time
zone) AND (tag_id = 37423))
 Total runtime: 1306.484 ms

>>>> enable_hashagg = false
>>>> enable_hashjoin = false
>>>> enable_indexscan = true
>>>> enable_mergejoin = true
>>>> enable_nestloop = false
>>>> enable_seqscan = true
>>>> enable_sort = false
>>>> enable_tidscan = false
>>>Why do you have these off? AFAIK, 7.4 improved hash
>>>aggregates a lot. So you
>>>might miss on these in this case.

I try for debug purpose , now i reset all 'enable' to default :

select * from pg_settings where name like 'enable%';
       name       | setting | context | vartype |       source       |
min_val | max_val
------------------+---------+---------+---------+--------------------+--
-------+---------
 enable_hashagg   | on      | user    | bool    | configuration file |
|
 enable_hashjoin  | on      | user    | bool    | configuration file |
|
 enable_indexscan | on      | user    | bool    | configuration file |
|
 enable_mergejoin | on      | user    | bool    | configuration file |
|
 enable_nestloop  | on      | user    | bool    | configuration file |
|
 enable_seqscan   | on      | user    | bool    | configuration file |
|
 enable_sort      | on      | user    | bool    | configuration file |
|
 enable_tidscan   | on      | user    | bool    | configuration file |
|
(8 rows)

The query plan are the same ....

>>>> # - Planner Cost Constants -
>>>>
>>>> #effective_cache_size = 1000    # typically 8KB each
>>>
>>>You might set it to something realistic.
>>>

I try 10000 and 100000 but nothing change .



>>>And what is your hardware setup? Disks/CPU/RAM?

32GB SCSI/DUAL Intel(R) Pentium(R) III CPU family      1133MHz/  1GB RAM
and linux red-hat 9


I don't understand why the planner chose a different query plan on
identical tables with same indexes .

Thanks a lot for help!.

Fabio


Re: R: Query plan on identical tables differs . Why ?

From
Tom Lane
Date:
"Fabio Panizzutti" <panizzutti@interlogica.net> writes:
> I don't understand why the planner chose a different query plan on
> identical tables with same indexes .

Different data statistics; not to mention different table sizes
(the cost equations are not linear).

Have you ANALYZEd (or VACUUM ANALYZEd) both tables recently?

If the stats are up to date but still not doing the right thing,
you might try increasing the statistics target for the larger
table's tag_id column.  See ALTER TABLE SET STATISTICS.

            regards, tom lane

Re: R: Query plan on identical tables differs . Why ?

From
Stephan Szabo
Date:
On Thu, 13 May 2004, Fabio Panizzutti wrote:


> I don't understand why the planner chose a different query plan on
> identical tables with same indexes .

Because it's more than table structure that affects the choice made by the
planner.  In addition the statistics about the values that are there as
well as the estimated size of the table have effects.  One way to see is
to see what it thinks is best is to remove the indexes it is using and see
what plan it gives then, how long it takes and the estimated costs for
those plans.

In other suggestions, I think having a (tag_id, data_tag) index rather
than (data_tag, tag_id) may be a win for queries like this. Also, unless
you're doing many select queries by only the first field of the composite
index and you're not doing very many insert/update/deletes, you may want
to drop the other index on just that field.

R: R: Query plan on identical tables differs . Why ?

From
"Fabio Panizzutti"
Date:

>>>-----Messaggio originale-----
>>>Da: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>>>Inviato: giovedì 13 maggio 2004 17.01
>>>A: Fabio Panizzutti
>>>Cc: 'Shridhar Daithankar'; pgsql-performance@postgresql.org
>>>Oggetto: Re: R: [PERFORM] Query plan on identical tables
>>>differs . Why ?
>>>
>>>
>>>"Fabio Panizzutti" <panizzutti@interlogica.net> writes:
>>>> I don't understand why the planner chose a different query plan on
>>>> identical tables with same indexes .
>>>
>>>Different data statistics; not to mention different table
>>>sizes (the cost equations are not linear).
>>>
>>>Have you ANALYZEd (or VACUUM ANALYZEd) both tables recently?
>>>
>>>If the stats are up to date but still not doing the right
>>>thing, you might try increasing the statistics target for
>>>the larger table's tag_id column.  See ALTER TABLE SET STATISTICS.
>>>
>>>            regards, tom lane
>>>

All tables are vacumed and analyzed .
I try so set statistics to 1000 to tag_id columns with ALTER TABLE SET
STATISTIC, revacuum analyze ,  but the planner choose the same query
plan .
I'm trying now to change the indexes .

Thanks




R: R: Query plan on identical tables differs . Why ?

From
"Fabio Panizzutti"
Date:

>>>-----Messaggio originale-----
>>>Da: Stephan Szabo [mailto:sszabo@megazone.bigpanda.com]
>>>Inviato: giovedì 13 maggio 2004 17.17
>>>A: Fabio Panizzutti
>>>Cc: 'Shridhar Daithankar'; pgsql-performance@postgresql.org
>>>Oggetto: Re: R: [PERFORM] Query plan on identical tables
>>>differs . Why ?
>>>
>>>
>>>On Thu, 13 May 2004, Fabio Panizzutti wrote:
>>>
>>>
>>>> I don't understand why the planner chose a different query plan on
>>>> identical tables with same indexes .
>>>
>>>Because it's more than table structure that affects the
>>>choice made by the planner.  In addition the statistics
>>>about the values that are there as well as the estimated
>>>size of the table have effects.  One way to see is to see
>>>what it thinks is best is to remove the indexes it is using
>>>and see what plan it gives then, how long it takes and the
>>>estimated costs for those plans.
>>>
>>>In other suggestions, I think having a (tag_id, data_tag)
>>>index rather than (data_tag, tag_id) may be a win for
>>>queries like this. Also, unless you're doing many select
>>>queries by only the first field of the composite index and
>>>you're not doing very many insert/update/deletes, you may
>>>want to drop the other index on just that field.
>>>

Thanks for your attention , i change the indexes on the tables as you
suggested  :

 storico=# \d storico_misure_short
                Table "tenore.storico_misure_short"
         Column          |            Type             | Modifiers
-------------------------+-----------------------------+-----------
 data_tag                | timestamp without time zone | not null
 tag_id                  | integer                     | not null
 unita_misura            | character varying(6)        | not null
 valore_tag              | numeric(20,3)               | not null
 qualita                 | integer                     | not null
 numero_campioni         | numeric(5,0)                |
 frequenza_campionamento | numeric(3,0)                |
Indexes:
    "storico_misure_short_idx" primary key, btree (tag_id, data_tag)
    "storico_misure_short_data_tag_idx2" btree (data_tag)

storico=# \d storico_misure
                   Table "tenore.storico_misure"
         Column          |            Type             | Modifiers
-------------------------+-----------------------------+-----------
 data_tag                | timestamp without time zone | not null
 tag_id                  | integer                     | not null
 unita_misura            | character varying(6)        | not null
 valore_tag              | numeric(20,3)               | not null
 qualita                 | integer                     | not null
 numero_campioni         | numeric(5,0)                |
 frequenza_campionamento | numeric(3,0)                |
Indexes:
    "storico_misure_idx" primary key, btree (tag_id, data_tag)
    "storico_misure_data_tag_idx2" btree (data_tag)

And now performance are similar and the planner works correctly :

storico=# \d storico_misure_short
                Table "tenore.storico_misure_short"
         Column          |            Type             | Modifiers
-------------------------+-----------------------------+-----------
 data_tag                | timestamp without time zone | not null
 tag_id                  | integer                     | not null
 unita_misura            | character varying(6)        | not null
 valore_tag              | numeric(20,3)               | not null
 qualita                 | integer                     | not null
 numero_campioni         | numeric(5,0)                |
 frequenza_campionamento | numeric(3,0)                |
Indexes:
    "storico_misure_short_idx" primary key, btree (tag_id, data_tag)
    "storico_misure_short_data_tag_idx2" btree (data_tag)

storico=# \d storico_misure
                   Table "tenore.storico_misure"
         Column          |            Type             | Modifiers
-------------------------+-----------------------------+-----------
 data_tag                | timestamp without time zone | not null
 tag_id                  | integer                     | not null
 unita_misura            | character varying(6)        | not null
 valore_tag              | numeric(20,3)               | not null
 qualita                 | integer                     | not null
 numero_campioni         | numeric(5,0)                |
 frequenza_campionamento | numeric(3,0)                |
Indexes:
    "storico_misure_idx" primary key, btree (tag_id, data_tag)
    "storico_misure_data_tag_idx2" btree (data_tag)

storico=# explain analyze select tag_id,valore_tag,data_tag from
storico_misure_short where (data_tag>'2004-05-03' and data_tag
<'2004-05-12') and tag_id=37423 ;

QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
--------------------------
 Index Scan using storico_misure_short_idx on storico_misure_short
(cost=0.00..2104.47 rows=584 width=20) (actual time=0.232..39.932
rows=864 loops=1)
   Index Cond: ((tag_id = 37423) AND (data_tag > '2004-05-03
00:00:00'::timestamp without time zone) AND (data_tag < '2004-05-12
00:00:00'::timestamp without time zone))
 Total runtime: 40.912 ms
(3 rows)

Time: 43,233 ms
storico=# explain analyze select tag_id,valore_tag,data_tag from
storico_misure where (data_tag>'2004-05-03' and data_tag <'2004-05-12')
and tag_id=37423 ;

QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
--------------------------
 Index Scan using storico_misure_idx on storico_misure
(cost=0.00..2097.56 rows=547 width=21) (actual time=0.518..92.067
rows=835 loops=1)
   Index Cond: ((tag_id = 37423) AND (data_tag > '2004-05-03
00:00:00'::timestamp without time zone) AND (data_tag < '2004-05-12
00:00:00'::timestamp without time zone))
 Total runtime: 93.459 ms
(3 rows)


I need the index on data_tag for other query ( last values on the last
date ) .


Regards

Fabio