Re: Hash Anti Join performance degradation - Mailing list pgsql-performance

From Tom Lane
Subject Re: Hash Anti Join performance degradation
Date
Msg-id 15099.1306269513@sss.pgh.pa.us
Whole thread Raw
In response to Hash Anti Join performance degradation  (panam <panam@gmx.net>)
List pgsql-performance
panam <panam@gmx.net> writes:
> In my DB, there is a large table containing messages and one that contains
> message boxes.
> Messages are assigned to boxes via a child parent link m->b.
> In order to obtain the last message for a specific box, I use the following
> SQL:

> SELECT m1.id FROM message m1 LEFT JOIN message m2 ON (m1.box_id = m2.box_id
> AND m1.id < m2.id) WHERE m2.id IS NULL AND m1.box_id = id;

BTW, this query doesn't actually match the EXPLAIN outputs...

> So from my limited experience, the only significant difference I see is that
> the Hash Anti Join takes a lot more time in plan 2, but I do not understand
> why.

Whatever's going on is below the level that EXPLAIN can show.  I can
think of a couple of possibilities:

1. The "extra" rows in the slower case all manage to come out to the
same hash value, or some very small number of distinct hash values, such
that we spend a lot of time searching a single hash chain.  But it's
hard to credit that adding 0.4% more rows could result in near 100x
slowdown, no matter how bad their distribution.

2. There's some inefficiency in the use of the temp files, though again
it's far from clear why your two cases would be noticeably different
there.  Possibly enabling log_temp_files would tell you something useful.

One other thing I'm not following is how come it's using hash temp files
at all, when you claim in your later message that you've got work_mem
set to 1GB.  It should certainly not take more than a couple meg to hold
20K rows at 16 payload bytes per row.  You might want to check whether
that setting actually took effect.

            regards, tom lane

pgsql-performance by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: Performance degradation of inserts when database size grows
Next
From: Greg Smith
Date:
Subject: Re: Performance degradation of inserts when database size grows