Re: Caching (was Re: choosing the right platform) - Mailing list pgsql-performance
From: Ron Mayer
Subject: Re: Caching (was Re: choosing the right platform)
Date:
Msg-id: POEDIPIPKGJJLDNIEMBEOEAACJAA.ron@intervideo.com
In response to: Re: Caching (was Re: choosing the right platform) (Josh Berkus <josh@agliodbs.com>)
Responses:
  Re: Caching (was Re: choosing the right platform)
  Re: Caching (was Re: choosing the right platform)
List: pgsql-performance
Short summary...

  I think sort_mem matters quite a bit (20-40%) on my data-warehousing
  applications.  Am I doing something wrong to need so much sort_mem?

Josh wrote:
>> ** It's better to not use huge amount of sort-mem...
>
> ...However, I have never seen a database use more than 32mb sort mem
> in a single process, so I don't think the 2GB limit will hurt you much

Do you think this is true in data warehousing applications as well?

During the ETL part of data warehousing, large sorts are often used to
get the "new" values that need to be inserted into "dimension" tables,
like this:

  INSERT INTO dimension_val (id, val)
  SELECT nextval('val_seq'), val
    FROM (SELECT DISTINCT val FROM import_table
          EXCEPT
          SELECT val FROM dimension_val) AS a;

As far as I can tell, this query typically does two sorts: one for the
DISTINCT and one for the EXCEPT.

In a data warehouse we have here, we load about 3 million rows each week;
load time improved from about 9 hours to 7 by breaking such queries up
into expressions that only require one sort at a time, and surrounding
the expressions with "set sort_mem=something_big" statements to give
each sort enough space to not hit the disk (a sketch of the rest of the
broken-up version is at the end of this message):

  SET SORT_MEM=300000;
  CREATE TEMPORARY TABLE potential_new_values AS
    SELECT DISTINCT val FROM import_table;
  ...
  SET SORT_MEM=1000;

Anyone else have similar experience, or am I doing something wrong to
need so much SORT_MEM?

   Ron

PS: Below is an example of another real-world query from the same
reporting system that benefits from a sort_mem over 32MB.  EXPLAIN
ANALYZE (below) shows a 40% improvement when the sort fits in memory:
with 10MB or 32MB of sort_mem the query takes over 22 seconds; with
100MB it takes 14.

====================================================================

logs2=# set sort_mem=10000;
SET VARIABLE
logs2=# explain analyze select distinct category from c_transaction_credit;
NOTICE:  QUERY PLAN:

Unique  (cost=71612.82..72838.69 rows=49035 width=17) (actual time=20315.47..22457.21 rows=2914 loops=1)
  ->  Sort  (cost=71612.82..71612.82 rows=490348 width=17) (actual time=20315.46..21351.42 rows=511368 loops=1)
        ->  Seq Scan on c_transaction_credit  (cost=0.00..14096.48 rows=490348 width=17) (actual time=0.08..2932.72 rows=511368 loops=1)
Total runtime: 22475.63 msec

EXPLAIN
logs2=# set sort_mem=32000;
SET VARIABLE
logs2=# explain analyze select distinct category from c_transaction_credit;
NOTICE:  QUERY PLAN:

Unique  (cost=60442.82..61668.69 rows=49035 width=17) (actual time=22657.31..24794.19 rows=2914 loops=1)
  ->  Sort  (cost=60442.82..60442.82 rows=490348 width=17) (actual time=22657.30..23714.43 rows=511368 loops=1)
        ->  Seq Scan on c_transaction_credit  (cost=0.00..14096.48 rows=490348 width=17) (actual time=0.07..3020.83 rows=511368 loops=1)
Total runtime: 24811.65 msec

EXPLAIN
logs2=# set sort_mem=100000;
SET VARIABLE
logs2=# explain analyze select distinct category from c_transaction_credit;
NOTICE:  QUERY PLAN:

Unique  (cost=60442.82..61668.69 rows=49035 width=17) (actual time=12205.19..14012.57 rows=2914 loops=1)
  ->  Sort  (cost=60442.82..60442.82 rows=490348 width=17) (actual time=12205.18..12710.16 rows=511368 loops=1)
        ->  Seq Scan on c_transaction_credit  (cost=0.00..14096.48 rows=490348 width=17) (actual time=0.08..3001.05 rows=511368 loops=1)
Total runtime: 14187.96 msec

EXPLAIN
logs2=#
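PPS: For completeness, here is roughly what the broken-up version of the
dimension load looks like when written out in full.  The first two
statements are the ones shown above; the new_values temp table and the
final INSERT are only an illustrative sketch (not copied verbatim from
our load script), but the point is that each statement needs just one
big sort while sort_mem is raised:

  SET SORT_MEM=300000;

  -- first big sort: the DISTINCT over the weekly import
  CREATE TEMPORARY TABLE potential_new_values AS
    SELECT DISTINCT val FROM import_table;

  -- second big sort: the EXCEPT against the existing dimension table,
  -- now with no DISTINCT nested inside it
  CREATE TEMPORARY TABLE new_values AS
    SELECT val FROM potential_new_values
    EXCEPT
    SELECT val FROM dimension_val;

  -- no large sort needed here; just assign ids and insert
  INSERT INTO dimension_val (id, val)
    SELECT nextval('val_seq'), val FROM new_values;

  SET SORT_MEM=1000;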