Thread: Index Scan Costs versus Sort

Index Scan Costs versus Sort

From: Charlie Savage
This is related to my post the other day about sort performance.

Part of my problem seems to be that PostgreSQL is greatly overestimating
the cost of index scans.  As a result, it prefers query plans that
involve seq scans and sorts over query plans that use index scans.
Here is an example query:

SELECT tlid, count(tlid)
FROM completechain
GROUP BY tlid;
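
(For reference, both plans below come from EXPLAIN ANALYZE; approach #2 is
obtained by disabling seq scans and sorts for the session, along these lines:)

-- approach #1: let the planner choose (seq scan + sort)
EXPLAIN ANALYZE
SELECT tlid, count(tlid)
FROM completechain
GROUP BY tlid;

-- approach #2: force the index scan
SET enable_seqscan = off;
SET enable_sort = off;
EXPLAIN ANALYZE
SELECT tlid, count(tlid)
FROM completechain
GROUP BY tlid;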

Approach #1 - seq scan with sort:

"GroupAggregate  (cost=10177594.61..11141577.89 rows=48199164 width=4)
(actual time=7439085.877..8429628.234 rows=47599910 loops=1)"
"  ->  Sort  (cost=10177594.61..10298092.52 rows=48199164 width=4)
(actual time=7439085.835..8082452.721 rows=48199165 loops=1)"
"        Sort Key: tlid"
"        ->  Seq Scan on completechain  (cost=0.00..2229858.64
rows=48199164 width=4) (actual time=10.788..768403.874 rows=48199165
loops=1)"
"Total runtime: 8596987.505 ms"


Approach #2 - index scan (done by setting enable_seqscan to false and
enable_sort to false):

"GroupAggregate  (cost=0.00..113713861.43 rows=48199164 width=4) (actual
time=53.211..2652227.201 rows=47599910 loops=1)"
"  ->  Index Scan using idx_completechain_tlid on completechain
(cost=0.00..112870376.06 rows=48199164 width=4) (actual
time=53.168..2312426.321 rows=48199165 loops=1)"
"Total runtime: 2795420.933 ms"

Approach #1 is estimated to cost about one tenth as much, yet takes 3 times
longer to execute.


My questions:

1.  PostgreSQL estimates the index scan will be 50 times more costly
than the seq scan (112870376 vs 2229858), yet in fact it only takes 3
times longer to execute (2,312,426 ms vs. 768,403 ms).  My understanding
is that PostgreSQL assumes, via the random_page_cost parameter, that an
index scan will take 4 times longer than a sequential scan.  So why is
the planner estimating it is 50 times slower?

2.  In approach #1, the planner thinks the sort will take roughly 4
times longer [(10,298,092 - 2,229,858) / 2,229,858] than the sequential
scan, yet it really takes almost ten times longer.  It seems as if the
planner is greatly underestimating the sort cost.
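
(Working from the actual times above: the sort step accounts for roughly
8,082,452 - 768,403 = 7,314,049 ms on top of the 768,403 ms seq scan, i.e.
about 9.5 times the scan itself, versus the ~3.6x implied by the cost
estimates.)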

Due to these two apparent miscalculations, PostgreSQL is choosing the
wrong query plan to execute this query.  I've included my
postgresql.conf file below just in case this is due to some
misconfiguration on my part.

Some setup notes:
* All tables are vacuumed and analyzed
* Running PostgreSQL 8.1 on SUSE 10
* Low-end hardware - Dell Dimension 3000, 1GB RAM, 1 built-in 80GB IDE
drive, 1 SATA Seagate 400GB drive.  The IDE drive has the OS and the WAL
files, the SATA drive the database.  From hdparm, the max I/O for the IDE
drive is about 50 MB/s and for the SATA drive about 65 MB/s.


-------------------------------

#---------------------------------------------------------------------------
# RESOURCE USAGE (except WAL)
#---------------------------------------------------------------------------

shared_buffers = 40000                  # 40000 buffers * 8192 bytes/buffer = 327,680,000 bytes
#shared_buffers = 1000            # min 16 or max_connections*2, 8KB each

temp_buffers = 5000
#temp_buffers = 1000            # min 100, 8KB each
#max_prepared_transactions = 5        # can be 0 or more
# note: increasing max_prepared_transactions costs ~600 bytes of shared memory
# per transaction slot, plus lock space (see max_locks_per_transaction).

work_mem = 16384                        # in KB
#work_mem = 1024            # min 64, size in KB

maintenance_work_mem = 262144            # in KB
#maintenance_work_mem = 16384        # min 1024, size in KB
#max_stack_depth = 2048            # min 100, size in KB

# - Free Space Map -

max_fsm_pages = 60000
#max_fsm_pages = 20000            # min max_fsm_relations*16, 6 bytes each

#max_fsm_relations = 1000        # min 100, ~70 bytes each

# - Kernel Resource Usage -

#max_files_per_process = 1000        # min 25
#preload_libraries = ''

# - Cost-Based Vacuum Delay -

#vacuum_cost_delay = 0            # 0-1000 milliseconds
#vacuum_cost_page_hit = 1        # 0-10000 credits
#vacuum_cost_page_miss = 10        # 0-10000 credits
#vacuum_cost_page_dirty = 20        # 0-10000 credits
#vacuum_cost_limit = 200        # 0-10000 credits

# - Background writer -

#bgwriter_delay = 200            # 10-10000 milliseconds between rounds
#bgwriter_lru_percent = 1.0        # 0-100% of LRU buffers scanned/round
#bgwriter_lru_maxpages = 5        # 0-1000 buffers max written/round
#bgwriter_all_percent = 0.333        # 0-100% of all buffers scanned/round
#bgwriter_all_maxpages = 5        # 0-1000 buffers max written/round


#---------------------------------------------------------------------------
# WRITE AHEAD LOG
#---------------------------------------------------------------------------

# - Settings -

fsync = on                # turns forced synchronization on or off
#wal_sync_method = fsync        # the default is the first option
                     # supported by the operating system:
                     #   open_datasync
                     #   fdatasync
                     #   fsync
                     #   fsync_writethrough
                     #   open_sync
#full_page_writes = on            # recover from partial page writes

wal_buffers = 128
#wal_buffers = 8            # min 4, 8KB each

#commit_delay = 0            # range 0-100000, in microseconds
#commit_siblings = 5            # range 1-1000

# - Checkpoints -

checkpoint_segments = 256               # 256 * 16MB = 4,294,967,296 bytes
checkpoint_timeout = 1200        # 1200 seconds (20 minutes)
checkpoint_warning = 30            # in seconds, 0 is off

#checkpoint_segments = 3        # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 300        # range 30-3600, in seconds
#checkpoint_warning = 30        # in seconds, 0 is off

# - Archiving -

#archive_command = ''            # command to use to archive a logfile
                     # segment


#---------------------------------------------------------------------------
# QUERY TUNING
#---------------------------------------------------------------------------

# - Planner Method Configuration -

#enable_bitmapscan = on
#enable_hashagg = on
#enable_hashjoin = on
#enable_indexscan = on
#enable_mergejoin = on
#enable_nestloop = on
#enable_seqscan = on
#enable_sort = on
#enable_tidscan = on

# - Planner Cost Constants -

effective_cache_size = 80000        # 80000 * 8192 = 655,360,000 bytes
#effective_cache_size = 1000        # typically 8KB each

random_page_cost = 2.5            # units are one sequential page fetch
#random_page_cost = 4            # units are one sequential page fetch
                     # cost
#cpu_tuple_cost = 0.01            # (same)
#cpu_index_tuple_cost = 0.001        # (same)
#cpu_operator_cost = 0.0025        # (same)

# - Genetic Query Optimizer -

#geqo = on
#geqo_threshold = 12
#geqo_effort = 5            # range 1-10
#geqo_pool_size = 0            # selects default based on effort
#geqo_generations = 0            # selects default based on effort
#geqo_selection_bias = 2.0        # range 1.5-2.0

# - Other Planner Options -

default_statistics_target = 100        # range 1-1000
#default_statistics_target = 10        # range 1-1000
#constraint_exclusion = off
#from_collapse_limit = 8
#join_collapse_limit = 8        # 1 disables collapsing of explicit
                     # JOINs


#---------------------------------------------------------------------------
# RUNTIME STATISTICS
#---------------------------------------------------------------------------

# - Statistics Monitoring -

#log_parser_stats = off
#log_planner_stats = off
#log_executor_stats = off
#log_statement_stats = off

# - Query/Index Statistics Collector -

stats_start_collector = on
stats_command_string = on
stats_block_level = on
stats_row_level = on

#stats_start_collector = on
#stats_command_string = off
#stats_block_level = off
#stats_row_level = off
#stats_reset_on_server_start = off


#---------------------------------------------------------------------------
# AUTOVACUUM PARAMETERS
#---------------------------------------------------------------------------

autovacuum = true
autovacuum_naptime = 600

#autovacuum = false            # enable autovacuum subprocess?
#autovacuum_naptime = 60        # time between autovacuum runs, in secs
#autovacuum_vacuum_threshold = 1000    # min # of tuple updates before
                     # vacuum
#autovacuum_analyze_threshold = 500    # min # of tuple updates before
                     # analyze
#autovacuum_vacuum_scale_factor = 0.4    # fraction of rel size before
                     # vacuum
#autovacuum_analyze_scale_factor = 0.2    # fraction of rel size before
                     # analyze
#autovacuum_vacuum_cost_delay = -1    # default vacuum cost delay for
                     # autovac, -1 means use
                     # vacuum_cost_delay
#autovacuum_vacuum_cost_limit = -1    # default vacuum cost limit for
                     # autovac, -1 means use
                     # vacuum_cost_limit


----------------------
-- Table: tiger.completechain

-- DROP TABLE tiger.completechain;

CREATE TABLE tiger.completechain
(
   ogc_fid int4 NOT NULL DEFAULT nextval('completechain_ogc_fid_seq'::regclass),
   module varchar(8) NOT NULL,
   tlid int4 NOT NULL,
   side1 int4,
   source varchar(1) NOT NULL,
   fedirp varchar(2),
   fename varchar(30),
   fetype varchar(4),
   fedirs varchar(2),
   cfcc varchar(3) NOT NULL,
   fraddl varchar(11),
   toaddl varchar(11),
   fraddr varchar(11),
   toaddr varchar(11),
   friaddl varchar(1),
   toiaddl varchar(1),
   friaddr varchar(1),
   toiaddr varchar(1),
   zipl int4,
   zipr int4,
   aianhhfpl int4,
   aianhhfpr int4,
   aihhtlil varchar(1),
   aihhtlir varchar(1),
   census1 varchar(1),
   census2 varchar(1),
   statel int4,
   stater int4,
   countyl int4,
   countyr int4,
   cousubl int4,
   cousubr int4,
   submcdl int4,
   submcdr int4,
   placel int4,
   placer int4,
   tractl int4,
   tractr int4,
   blockl int4,
   blockr int4,
   wkb_geometry public.geometry NOT NULL,
   CONSTRAINT enforce_dims_wkb_geometry CHECK (ndims(wkb_geometry) = 2),
   CONSTRAINT enforce_geotype_wkb_geometry CHECK (geometrytype(wkb_geometry) = 'LINESTRING'::text OR wkb_geometry IS NULL),
   CONSTRAINT enforce_srid_wkb_geometry CHECK (srid(wkb_geometry) = 4269)
)
WITHOUT OIDS;
ALTER TABLE tiger.completechain OWNER TO postgres;


-- Index: tiger.idx_completechain_tlid

-- DROP INDEX tiger.idx_completechain_tlid;

CREATE INDEX idx_completechain_tlid
   ON tiger.completechain
   USING btree
   (tlid);




Re: Index Scan Costs versus Sort

From: Tom Lane
Charlie Savage <cfis@interserv.com> writes:
> 1.  PostgreSQL estimates the index scan will be 50 times more costly
> than the seq scan (112870376 vs 2229858), yet in fact it only takes 3
> times longer to execute (2,312,426 ms vs. 768,403 ms).  My understanding
> is that PostgreSQL assumes, via the random_page_cost parameter, that an
> index scan will take 4 times longer than a sequential scan.  So why is
> the planner estimating it is 50 times slower?

The other factors that are likely to affect this are index correlation
and effective cache size.  It's fairly remarkable that a full-table
index scan only takes 3 times longer than a seqscan; you must have both
a high correlation and a reasonably large cache.  You showed us your
effective_cache_size setting, but what's the pg_stats entry for
completechain.tlid contain?  Can you quantify what the physical
ordering of tlid values is likely to be?
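
(Those stats can be pulled with a query along these lines; pg_stats is a
standard system view, so the column names below are not specific to this
setup:)

SELECT null_frac, avg_width, n_distinct,
       most_common_vals, most_common_freqs,
       correlation, histogram_bounds
FROM pg_stats
WHERE schemaname = 'tiger'
  AND tablename = 'completechain'
  AND attname = 'tlid';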

            regards, tom lane

Re: Index Scan Costs versus Sort

From: Charlie Savage
Hi Tom,

From pg_stats:

schema = "tiger";
tablename = "completechain";
attname = "tlid";
null_frac = 0;
avg_width = 4;
n_distinct = -1;
most_common_vals = ;
most_common_freqs = ;
correlation = 0.155914;

Note that I have default_statistics_target set to 100.  Here are the
first few values from histogram_bounds:


"{102450,2202250,4571797,6365754,8444936,10541593,12485818,14545727,16745594,18421868,20300549,22498643,24114709,26301001,28280632,30370123,32253657,33943046,35898115,37499478,39469054,41868498,43992143,45907830,47826340,49843926,52051798,54409298,56447416,



The tlid column is a US Census Bureau ID assigned to each chain in the
US - where a chain is a road segment, river segment, railroad segment,
etc.  The data is loaded on a state-by-state basis, and then on a
county-by-county basis.  There is no overall ordering to TLIDs, although
perhaps there is some local ordering at the county level (but from a
quick look at the data I don't see any, and the correlation factor
indicates there isn't any, if I am interpreting it correctly).
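
One quick way to eyeball the physical ordering, if useful, is to compare
on-disk position with the key via the ctid system column, e.g.:

SELECT ctid, tlid
FROM completechain
LIMIT 20;

A plain seq scan normally returns rows in on-disk order, so steadily
increasing tlid values here would suggest high correlation.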

Any other info that would be helpful to see?


Charlie


Tom Lane wrote:
> Charlie Savage <cfis@interserv.com> writes:
>> 1.  PostgreSQL estimates the index scan will be 50 times more costly
>> than the seq scan (112870376 vs 2229858), yet in fact it only takes 3
>> times longer to execute (2,312,426 ms vs. 768,403 ms).  My understanding
>> is that PostgreSQL assumes, via the random_page_cost parameter, that an
>> index scan will take 4 times longer than a sequential scan.  So why is
>> the planner estimating it is 50 times slower?
>
> The other factors that are likely to affect this are index correlation
> and effective cache size.  It's fairly remarkable that a full-table
> index scan only takes 3 times longer than a seqscan; you must have both
> a high correlation and a reasonably large cache.  You showed us your
> effective_cache_size setting, but what's the pg_stats entry for
> completechain.tlid contain?  Can you quantify what the physical
> ordering of tlid values is likely to be?
>
>             regards, tom lane
>

Re: Index Scan Costs versus Sort

From: Charlie Savage
Following up with some additional information.

The machine has 1Gb physical RAM.  When I run the query (with sort and
seqscan enabled), top reports (numbers are fairly consistent):

Mem:   1,032,972k total, 1,019,516k used, 13,412k free, 17,132k buffers

Swap:  2,032,140k total, 17,592k used, 2,014,548k free, 742,636k cached

The postmaster process is using 34.7% of RAM (359m virt, 349m res, 319m).
No other process is using more than 2% of the memory.

From vmstat:

r  b      swpd     free      buff        cache
1  0     17592    13568     17056        743676

vmstat also shows no swapping going on.

Note that I have part of the database, for just Colorado, on my Windows
XP laptop (the completechain table is 1GB in this case versus 18GB for
the whole US) for development purposes.  I see the same behavior on it;
it is a Dell D6100 laptop with 1GB of RAM, running 8.1, and a default
postgresql.conf file with three changes (shared_buffers set to 7000,
work_mem set to 8192, and effective_cache_size to 2500).
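
In other words, the laptop's postgresql.conf deltas from the defaults are
just:

shared_buffers = 7000
work_mem = 8192
effective_cache_size = 2500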

Out of curiosity, how much longer would an index scan be expected to
take versus a seq scan?  I was under the impression it would be about a
factor of 4, or is that not usually the case?

Thanks for the help,

Charlie

Re: Index Scan Costs versus Sort

From: Tom Lane
Charlie Savage <cfis@interserv.com> writes:
> Out of curiosity, how much longer would an index scan be expected to
> take versus a seq scan?  I was under the impression it would be about a
> factor of 4, or is that not usually the case?

No, it can easily be dozens or even hundreds of times worse, in the
worst case.  The factor of 4 you are thinking of is the random_page_cost
which is the assumed ratio between the cost of randomly fetching a page
and the cost of fetching it in a sequential scan of the whole table.
Not only is the sequential scan fetch normally much cheaper (due to less
seeking and the kernel probably catching on and doing read-ahead), but
if there are N tuples on a page then a seqscan reads them all with one
page fetch.  In the worst case an indexscan might fetch the page from
disk N separate times, if all its tuples are far apart in the index
order.  This is all on top of the extra cost to read the index itself,
too.

The planner's estimate of 50x higher cost is not out of line for small
tuples (large N) and a badly-out-of-order table.  What's puzzling is
that you seem to be getting near best-case behavior in what does not
seem to be a best-case scenario for an indexscan.
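
(A rough back-of-the-envelope check against the numbers above, ignoring
index and CPU costs: a worst-case index scan that fetches a heap page for
every tuple would be charged roughly 48,199,164 fetches * random_page_cost
(2.5) ≈ 120 million cost units, the same ballpark as the planner's 112.9
million estimate, versus ~2.2 million for the seq scan, which is where the
~50x ratio in question 1 comes from.  This is not the planner's exact
formula, just the worst case described above.)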

            regards, tom lane