Re: Hash join on int takes 8..114 seconds - Mailing list pgsql-performance
From | Andrus |
---|---|
Subject | Re: Hash join on int takes 8..114 seconds |
Date | |
Msg-id | EDAB641B4D004D1BB5126A75CFBDD79F@andrusnotebook |
In response to | Re: Hash join on int takes 8..114 seconds (tv@fuzzy.cz) |
Responses | Re: Hash join on int takes 8..114 seconds |
List | pgsql-performance |
Thomas,

Thank you.

> Just the most important points:
>
> 1) "dok" table contains 1235086 row versions in 171641 pages (with 8kB
> pages this means 1.4 GB of data), but there are 1834279 unused item
> pointers (i.e. about 60% of the space is wasted)
>
> 2) "rid" table contains 3275189 rows in 165282 pages (with 8kB pages this
> means about 1.3 GB of data), but there are 1878923 unused item pointers
> (i.e. about 30% of the space is wasted)
>
> 3) don't forget to execute analyze after vacuuming (or vacuum analyze)

autovacuum is running.

So if I understand properly, I must run

VACUUM FULL ANALYZE dok;
VACUUM FULL ANALYZE rid;

Those commands will probably cause the server to stop responding to other
clients, like vacuum full pg_shdepend did.

Should vacuum_cost_delay = 2000 allow other users to work while those
commands are running?

> 4) I'm not sure why the sizes reported by you (for example 2.3GB vs 1.5GB
> for "doc" table) - the difference seems too large for me.

I used pg_total_relation_size(), so the 2.3 GB includes the indexes as well:

 8  44286     dok_tasudok_idx            245 MB
10  44283     dok_klient_idx             142 MB
18  44288     dok_tasumata_idx            91 MB
19  44289     dok_tellimus_idx            89 MB
20  44284     dok_krdokumnr_idx           89 MB
21  44285     dok_kuupaev_idx             84 MB
22  43531     makse_pkey                  77 MB
23  43479     dok_pkey                    74 MB
24  44282     dok_dokumnr_idx             74 MB
26  18663923  dok_yksus_pattern_idx       43 MB
27  18801591  dok_sihtyksus_pattern_idx   42 MB

> Anyway the amount of wasted rows seems significant to me - I'd try to
> solve this first. Either by VACUUM FULL or by CLUSTER. The CLUSTER will
> lock the table exclusively, but the results may be better (when sorting by
> a well chosen index). Don't forget to run ANALYZE afterwards.

How should I invoke those commands so that other clients can continue
working? I'm using 8.1.4. The log files show that autovacuum is running.

I'm planning the following solution:

1. Set vacuum_cost_delay = 2000
2. Run the following commands periodically, in this order:

VACUUM FULL;
vacuum full pg_shdepend;
CLUSTER rid on (toode);
CLUSTER dok on (kuupaev);
REINDEX DATABASE mydb;
REINDEX SYSTEM mydb;
ANALYZE;

Are all those commands required, or can something be left out?

> Several other things to consider:
>
> 1) Regarding the toode column - why are you using CHAR(20) when the values
> are actually shorter? This may significantly increase the amount of space
> required.

There may be some products whose codes are up to 20 characters long.
PostgreSQL does not store trailing spaces in the database, so this does
*not* affect the space used.

> 2) I've noticed the CPU used is Celeron, which may negatively affect the
> speed of hash computation. I'd try to replace it by something faster - say
> INTEGER as an artificial primary key of the "toode" table and using it as
> a FK in other tables. This might improve the "Bitmap Heap Scan on rid"
> part, but yes - it's just a minor improvement compared to the "Hash Join"
> part of the query.

The natural key Toode CHAR(20) is used widely in different queries.
Replacing it with an INT surrogate key would require a major application
rewrite.

Should I add surrogate INT key columns to the toode and rid tables and
measure the test query speed in that case?
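What I have in mind is roughly the following; the column, sequence and
index names here are only examples, not objects that exist yet:

-- add a surrogate integer key to the product table
ALTER TABLE toode ADD COLUMN toode_id integer;
CREATE SEQUENCE toode_toode_id_seq;
UPDATE toode SET toode_id = nextval('toode_toode_id_seq');
ALTER TABLE toode ADD CONSTRAINT toode_toode_id_key UNIQUE (toode_id);

-- add the matching column to rid and fill it from toode
ALTER TABLE rid ADD COLUMN toode_id integer;
UPDATE rid SET toode_id = t.toode_id FROM toode t WHERE rid.toode = t.toode;

-- index the new column so the join can use the integer key
CREATE INDEX rid_toode_id_idx ON rid (toode_id);

ANALYZE toode;
ANALYZE rid;

The test query would then join rid to toode on toode_id instead of on the
CHAR(20) column. The UPDATE on rid touches every row, so it would have to
run during a quiet period and be followed by a vacuum.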
> Materialized views seem like a good idea to me, but maybe I'm not seeing
> something. What do you mean by "reports are different"? If there is a lot
> of rows for a given product / day, then creating an aggregated table with
> (product code / day) as a primary key is quite simple. It may require a
> lot of disk space, but it'll remove the hash join overhead. But if the
> queries are very different, then it may be difficult to build such
> materialized view(s).

The log file shows that mostly only queries of this form are slow:

SELECT ...
FROM dok
JOIN rid USING (dokumnr)
JOIN ProductId USING (ProductId)
WHERE rid.ProductId LIKE :p1 || '%'
  AND dok.SaleDate >= :p2

:p1 and :p2 are parameters that differ from query to query. dok contains
several years of data; :p2 usually covers only the last few months, or at
most the last year. The SELECT column list is a fixed list of known columns
from all the tables.

How can I create an index or a materialized view to optimize this type of
query?

Andrus.
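P.S. If I understand the aggregated-table idea correctly, it would be
something like the following sketch. The kogus (quantity) column is only an
illustrative guess at what would be aggregated, and since 8.1 has no
built-in materialized views the table would have to be refreshed
periodically, for example from a nightly cron job:

-- aggregate per product code and day (illustrative names and columns)
CREATE TABLE product_day_summary AS
SELECT r.toode,
       d.kuupaev,
       count(*)     AS row_count,
       sum(r.kogus) AS total_kogus   -- hypothetical quantity column in rid
  FROM dok d
  JOIN rid r USING (dokumnr)
 GROUP BY r.toode, d.kuupaev;

-- (product code, day) as the primary key, as suggested above
ALTER TABLE product_day_summary ADD PRIMARY KEY (toode, kuupaev);

ANALYZE product_day_summary;

Reports that filter on a product code prefix and a sale date range could
then read this much smaller table instead of joining dok and rid directly.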