Re: Sort performance on large tables - Mailing list pgsql-performance

From Marc Morin
Subject Re: Sort performance on large tables
Date
Msg-id 2BCEB9A37A4D354AA276774EE13FB8C263B0BF@mailserver.sandvine.com
Whole thread Raw
In response to Sort performance on large tables  (Charlie Savage <cfis@interserv.com>)
Responses Re: Sort performance on large tables
List pgsql-performance
I have run into this type of query problem as well.  I solved it in my
application by the following type of query.

SELECT tlid
FROM completechain AS o
WHERE not exists (
    SELECT 1
    FROM completechain
    WHERE tlid=o.tlid and ogc_fid!=o.ogc_fid
);

Assumes of course that you have an index on tlid.

> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
> [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of
> Charlie Savage
> Sent: Tuesday, November 08, 2005 2:05 AM
> To: pgsql-performance@postgresql.org
> Subject: [PERFORM] Sort performance on large tables
>
> Hi everyone,
>
> I have a question about the performance of sort.
>
> Setup: Dell Dimension 3000, Suse 10, 1GB ram, PostgreSQL 8.1
> RC 1 with PostGIS, 1 built-in 80 GB IDE drive, 1 SATA Seagate
> 400GB drive.  The IDE drive has the OS and the WAL files, the
> SATA drive the database.
>  From hdparm the max IO for the IDE drive is about 50Mb/s and
> the SATA drive is about 65Mb/s.  Thus a very low-end machine
> - but it used just for development (i.e., it is not a
> production machine) and the only thing it does is run a
> PostgresSQL database.
>
> I have a staging table called completechain that holds US
> tiger data (i.e., streets and addresses for the US).  The
> table is approximately 18GB.  Its big because there is a lot
> of data, but also because the table is not normalized (it
> comes that way).
>
> I want to extract data out of the file, with the most
> important values being stored in a column called tlid.  The
> tlid field is an integer, and the values are 98% unique.
> There is a second column called ogc_fid which is unique (it
> is a serial field).  I need to extract out unique TLID's
> (doesn't matter which duplicate I get rid of).  To do this I
> am running this query:
>
> SELECT tlid, min(ogc_fid)
> FROM completechain
> GROUP BY tlid;
>
> The results from explain analyze are:
>
> "GroupAggregate  (cost=10400373.80..11361807.88 rows=48071704
> width=8) (actual time=7311682.715..8315746.835 rows=47599910 loops=1)"
> "  ->  Sort  (cost=10400373.80..10520553.06 rows=48071704
> width=8) (actual time=7311682.682..7972304.777 rows=48199165 loops=1)"
> "        Sort Key: tlid"
> "        ->  Seq Scan on completechain  (cost=0.00..2228584.04
> rows=48071704 width=8) (actual time=27.514..773245.046
> rows=48199165 loops=1)"
> "Total runtime: 8486057.185 ms"
>
> Doing a similar query produces the same results:
>
> SELECT DISTINCT ON (tlid), tlid, ogc_fid FROM completechain;
>
> Note it takes over 10 times longer to do the sort than the
> full sequential scan.
>
> Should I expect results like this?  I realize that the
> computer is quite low-end and is very IO bound for this
> query, but I'm still surprised that the sort operation takes so long.
>
> Out of curiosity, I setup an Oracle database on the same
> machine with the same data and ran the same query.  Oracle
> was over an order of magnitude faster. Looking at its query
> plan, it avoided the sort by using "HASH GROUP BY."  Does
> such a construct exist in PostgreSQL (I see only hash joins)?
>
> Also as an experiment I forced oracle to do a sort by running
> this query:
>
> SELECT tlid, min(ogc_fid)
> FROM completechain
> GROUP BY tlid
> ORDER BY tlid;
>
> Even with this, it was more than a magnitude faster than Postgresql.
> Which makes me think I have somehow misconfigured postgresql
> (see the relevant parts of postgresql.conf below).
>
> Any idea/help appreciated.
>
> Thanks,
>
> Charlie
>
>
> -------------------------------
>
> #-------------------------------------------------------------
> --------------
> # RESOURCE USAGE (except WAL)
> #-------------------------------------------------------------
> --------------
>
> shared_buffers = 40000                  # 40000 buffers * 8192
> bytes/buffer = 327,680,000 bytes
> #shared_buffers = 1000            # min 16 or
> max_connections*2, 8KB each
>
> temp_buffers = 5000
> #temp_buffers = 1000            # min 100, 8KB each
> #max_prepared_transactions = 5        # can be 0 or more
> # note: increasing max_prepared_transactions costs ~600 bytes
> of shared memory # per transaction slot, plus lock space (see
> max_locks_per_transaction).
>
> work_mem =  16384                        # in Kb
> #work_mem = 1024            # min 64, size in KB
>
> maintenance_work_mem = 262144            # in kb
> #maintenance_work_mem = 16384        # min 1024, size in KB
> #max_stack_depth = 2048            # min 100, size in KB
>
> # - Free Space Map -
>
> max_fsm_pages = 60000
> #max_fsm_pages = 20000            # min
> max_fsm_relations*16, 6 bytes each
>
> #max_fsm_relations = 1000        # min 100, ~70 bytes each
>
> # - Kernel Resource Usage -
>
> #max_files_per_process = 1000        # min 25
> #preload_libraries = ''
>
> # - Cost-Based Vacuum Delay -
>
> #vacuum_cost_delay = 0            # 0-1000 milliseconds
> #vacuum_cost_page_hit = 1        # 0-10000 credits
> #vacuum_cost_page_miss = 10        # 0-10000 credits
> #vacuum_cost_page_dirty = 20        # 0-10000 credits
> #vacuum_cost_limit = 200        # 0-10000 credits
>
> # - Background writer -
>
> #bgwriter_delay = 200            # 10-10000 milliseconds
> between rounds
> #bgwriter_lru_percent = 1.0        # 0-100% of LRU buffers
> scanned/round
> #bgwriter_lru_maxpages = 5        # 0-1000 buffers max
> written/round
> #bgwriter_all_percent = 0.333        # 0-100% of all buffers
> scanned/round
> #bgwriter_all_maxpages = 5        # 0-1000 buffers max
> written/round
>
>
> #-------------------------------------------------------------
> --------------
> # WRITE AHEAD LOG
> #-------------------------------------------------------------
> --------------
>
> # - Settings -
>
> fsync = on                # turns forced
> synchronization on or off
> #wal_sync_method = fsync        # the default is the
> first option
>                     # supported by the
> operating system:
>                     #   open_datasync
>                     #   fdatasync
>                     #   fsync
>                     #   fsync_writethrough
>                     #   open_sync
> #full_page_writes = on            # recover from
> partial page writes
>
> wal_buffers = 128
> #wal_buffers = 8            # min 4, 8KB each
>
> #commit_delay = 0            # range 0-100000, in
> microseconds
> #commit_siblings = 5            # range 1-1000
>
> # - Checkpoints -
>
> checkpoint_segments = 256               # 256 * 16Mb =
> 4,294,967,296 bytes
> checkpoint_timeout = 1200        # 1200 seconds (20 minutes)
> checkpoint_warning = 30            # in seconds, 0 is off
>
> #checkpoint_segments = 3        # in logfile segments,
> min 1, 16MB each
> #checkpoint_timeout = 300        # range 30-3600, in seconds
> #checkpoint_warning = 30        # in seconds, 0 is off
>
> # - Archiving -
>
> #archive_command = ''            # command to use to
> archive a logfile
>                     # segment
>
>
> #-------------------------------------------------------------
> --------------
> # QUERY TUNING
> #-------------------------------------------------------------
> --------------
>
> # - Planner Method Configuration -
>
> #enable_bitmapscan = on
> #enable_hashagg = on
> #enable_hashjoin = on
> #enable_indexscan = on
> #enable_mergejoin = on
> #enable_nestloop = on
> #enable_seqscan = on
> #enable_sort = on
> #enable_tidscan = on
>
> # - Planner Cost Constants -
>
> effective_cache_size = 80000        # 80000 * 8192 =
> 655,360,000 bytes
> #effective_cache_size = 1000        # typically 8KB each
>
> random_page_cost = 2.5            # units are one
> sequential page fetch
> #random_page_cost = 4            # units are one
> sequential page fetch
>                     # cost
> #cpu_tuple_cost = 0.01            # (same)
> #cpu_index_tuple_cost = 0.001        # (same)
> #cpu_operator_cost = 0.0025        # (same)
>
> # - Genetic Query Optimizer -
>
> #geqo = on
> #geqo_threshold = 12
> #geqo_effort = 5            # range 1-10
> #geqo_pool_size = 0            # selects default based
> on effort
> #geqo_generations = 0            # selects default based
> on effort
> #geqo_selection_bias = 2.0        # range 1.5-2.0
>
> # - Other Planner Options -
>
> default_statistics_target = 100        # range 1-1000
> #default_statistics_target = 10        # range 1-1000
> #constraint_exclusion = off
> #from_collapse_limit = 8
> #join_collapse_limit = 8        # 1 disables collapsing
> of explicit
>                     # JOINs
>
>
> #-------------------------------------------------------------
> --------------
> #-------------------------------------------------------------
> --------------
> # RUNTIME STATISTICS
> #-------------------------------------------------------------
> --------------
>
> # - Statistics Monitoring -
>
> #log_parser_stats = off
> #log_planner_stats = off
> #log_executor_stats = off
> #log_statement_stats = off
>
> # - Query/Index Statistics Collector -
>
> stats_start_collector = on
> stats_command_string = on
> stats_block_level = on
> stats_row_level = on
>
> #stats_start_collector = on
> #stats_command_string = off
> #stats_block_level = off
> #stats_row_level = off
> #stats_reset_on_server_start = off
>
>
> #-------------------------------------------------------------
> --------------
> # AUTOVACUUM PARAMETERS
> #-------------------------------------------------------------
> --------------
>
> autovacuum = true
> autovacuum_naptime = 600
>
> #autovacuum = false            # enable autovacuum subprocess?
> #autovacuum_naptime = 60        # time between
> autovacuum runs, in secs
> #autovacuum_vacuum_threshold = 1000    # min # of tuple updates before
>                     # vacuum
> #autovacuum_analyze_threshold = 500    # min # of tuple updates before
>                     # analyze
> #autovacuum_vacuum_scale_factor = 0.4    # fraction of rel size before
>                     # vacuum
> #autovacuum_analyze_scale_factor = 0.2    # fraction of
> rel size before
>                     # analyze
> #autovacuum_vacuum_cost_delay = -1    # default vacuum cost delay for
>                     # autovac, -1 means use
>                     # vacuum_cost_delay
> #autovacuum_vacuum_cost_limit = -1    # default vacuum cost limit for
>                     # autovac, -1 means use
>                     # vacuum_cost_
>
>
> ----------------------
>
> CREATE TABLE tiger.completechain
> (
>    ogc_fid int4 NOT NULL DEFAULT
> nextval('completechain_ogc_fid_seq'::regclass),
>    module varchar(8) NOT NULL,
>    tlid int4 NOT NULL,
>    side1 int4,
>    source varchar(1) NOT NULL,
>    fedirp varchar(2),
>    fename varchar(30),
>    fetype varchar(4),
>    fedirs varchar(2),
>    cfcc varchar(3) NOT NULL,
>    fraddl varchar(11),
>    toaddl varchar(11),
>    fraddr varchar(11),
>    toaddr varchar(11),
>    friaddl varchar(1),
>    toiaddl varchar(1),
>    friaddr varchar(1),
>    toiaddr varchar(1),
>    zipl int4,
>    zipr int4,
>    aianhhfpl int4,
>    aianhhfpr int4,
>    aihhtlil varchar(1),
>    aihhtlir varchar(1),
>    census1 varchar(1),
>    census2 varchar(1),
>    statel int4,
>    stater int4,
>    countyl int4,
>    countyr int4,
>    cousubl int4,
>    cousubr int4,
>    submcdl int4,
>    submcdr int4,
>    placel int4,
>    placer int4,
>    tractl int4,
>    tractr int4,
>    blockl int4,
>    blockr int4,
>    wkb_geometry public.geometry NOT NULL,
>    CONSTRAINT enforce_dims_wkb_geometry CHECK
> (ndims(wkb_geometry) = 2),
>    CONSTRAINT enforce_geotype_wkb_geometry CHECK
> (geometrytype(wkb_geometry) = 'LINESTRING'::text OR
> wkb_geometry IS NULL),
>    CONSTRAINT enforce_srid_wkb_geometry CHECK
> (srid(wkb_geometry) = 4269)
> )
> WITHOUT OIDS;
> ALTER TABLE tiger.completechain OWNER TO postgres;
>
>
>
>
>
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings
>
>

pgsql-performance by date:

Previous
From: "Luke Lonergan"
Date:
Subject: Re: Sort performance on large tables
Next
From: Stephan Szabo
Date:
Subject: Re: Sort performance on large tables