Thread: Hash Anti Join performance degradation

Hash Anti Join performance degradation

From
panam
Date:
Hi,

In my DB, there is a large table containing messages and one that contains
message boxes.
Messages are assigned to boxes via a child parent link m->b.
In order to obtain the last message for a specific box, I use the following
SQL:

SELECT m1.id FROM message m1 LEFT JOIN message m2 ON (m1.box_id = m2.box_id
AND m1.id < m2.id) WHERE m2.id IS NULL AND m1.box_id = id;

This worked quite well for a long time. But now, suddenly the performance of
this query drastically degraded as new messages were added.
If these new messages are removed again, everything is back to normal. If
other messages are removed instead, the problem remains, so it does not seem
to be a memory issue. I'm afraid I have difficulty understanding what is going
wrong here.

This is the query plan when everything is fine:

"Seq Scan on public.box this_  (cost=0.00..10467236.32 rows=128 width=696)
(actual time=0.169..7683.978 rows=128 loops=1)"
"  Output: this_.id, this_.login, (SubPlan 1)"
"  Buffers: shared hit=188413 read=94635 written=135, temp read=22530
written=22374"
"  SubPlan 1"
"    ->  Hash Anti Join  (cost=41323.25..81775.25 rows=20427 width=8)
(actual time=59.571..59.877 rows=1 loops=128)"
"          Output: m1.id"
"          Hash Cond: (m1.box_id = m2.box_id)"
"          Join Filter: (m1.id < m2.id)"
"          Buffers: shared hit=188412 read=94633 written=135, temp
read=22530 written=22374"
"          ->  Bitmap Heap Scan on public.message m1  (cost=577.97..40212.28
rows=30640 width=16) (actual time=3.152..9.514 rows=17982 loops=128)"
"                Output: m1.id, m1.box_id"
"                Recheck Cond: (m1.box_id = $0)"
"                Buffers: shared hit=131993 read=9550 written=23"
"                ->  Bitmap Index Scan on "message_box_Idx"
(cost=0.00..570.31 rows=30640 width=0) (actual time=2.840..2.840 rows=18193
loops=128)"
"                      Index Cond: (m1.box_id = $0)"
"                      Buffers: shared hit=314 read=6433 written=23"
"          ->  Hash  (cost=40212.28..40212.28 rows=30640 width=16) (actual
time=26.840..26.840 rows=20014 loops=115)"
"                Output: m2.box_id, m2.id"
"                Buckets: 4096  Batches: 4 (originally 2)  Memory Usage:
5444kB"
"                Buffers: shared hit=56419 read=85083 written=112, temp
written=7767"
"                ->  Bitmap Heap Scan on public.message m2
(cost=577.97..40212.28 rows=30640 width=16) (actual time=2.419..20.007
rows=20014 loops=115)"
"                      Output: m2.box_id, m2.id"
"                      Recheck Cond: (m2.box_id = $0)"
"                      Buffers: shared hit=56419 read=85083 written=112"
"                      ->  Bitmap Index Scan on "message_box_Idx"
(cost=0.00..570.31 rows=30640 width=0) (actual time=2.166..2.166 rows=20249
loops=115)"
"                            Index Cond: (m2.box_id = $0)"
"                            Buffers: shared hit=6708"
"Total runtime: 7685.202 ms"

This is the plan when the query gets sluggish:

"Seq Scan on public.box this_  (cost=0.00..10467236.32 rows=128 width=696)
(actual time=0.262..179333.086 rows=128 loops=1)"
"  Output: this_.id, this_.login, (SubPlan 1)"
"  Buffers: shared hit=189065 read=93983 written=10, temp read=22668
written=22512"
"  SubPlan 1"
"    ->  Hash Anti Join  (cost=41323.25..81775.25 rows=20427 width=8)
(actual time=1264.700..1400.886 rows=1 loops=128)"
"          Output: m1.id"
"          Hash Cond: (m1.box_id = m2.box_id)"
"          Join Filter: (m1.id < m2.id)"
"          Buffers: shared hit=189064 read=93981 written=10, temp read=22668
written=22512"
"          ->  Bitmap Heap Scan on public.message m1  (cost=577.97..40212.28
rows=30640 width=16) (actual time=3.109..9.850 rows=18060 loops=128)"
"                Output: m1.id, m1.box_id"
"                Recheck Cond: (m1.box_id = $0)"
"                Buffers: shared hit=132095 read=9448"
"                ->  Bitmap Index Scan on "message_box_Idx"
(cost=0.00..570.31 rows=30640 width=0) (actual time=2.867..2.867 rows=18193
loops=128)"
"                      Index Cond: (m1.box_id = $0)"
"                      Buffers: shared hit=312 read=6435"
"          ->  Hash  (cost=40212.28..40212.28 rows=30640 width=16) (actual
time=27.533..27.533 rows=20102 loops=115)"
"                Output: m2.box_id, m2.id"
"                Buckets: 4096  Batches: 4 (originally 2)  Memory Usage:
5522kB"
"                Buffers: shared hit=56969 read=84533 written=10, temp
written=7811"
"                ->  Bitmap Heap Scan on public.message m2
(cost=577.97..40212.28 rows=30640 width=16) (actual time=2.406..20.492
rows=20102 loops=115)"
"                      Output: m2.box_id, m2.id"
"                      Recheck Cond: (m2.box_id = $0)"
"                      Buffers: shared hit=56969 read=84533 written=10"
"                      ->  Bitmap Index Scan on "message_box_Idx"
(cost=0.00..570.31 rows=30640 width=0) (actual time=2.170..2.170 rows=20249
loops=115)"
"                            Index Cond: (m2.box_id = $0)"
"                            Buffers: shared hit=6708"
"Total runtime: 179334.310 ms"


So from my limited experience, the only significant difference I see is that
the Hash Anti Join takes a lot more time in plan 2, but I do not understand
why.
Ideas somebody?

Thanks
panam


--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Hash-Anti-Join-performance-degradation-tp4420974p4420974.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.

Re: Hash Anti Join performance degradation

From
Craig Ringer
Date:
On 24/05/11 12:14, panam wrote:
> Hi,
>
> In my DB, there is a large table containing messages and one that contains
> message boxes.
> Messages are assigned to boxes via a child parent link m->b.
> In order to obtain the last message for a specific box, I use the following
> SQL:
>
> SELECT m1.id FROM message m1 LEFT JOIN message m2 ON (m1.box_id = m2.box_id
> AND m1.id < m2.id) WHERE m2.id IS NULL AND m1.box_id = id;
>
> This worked quite well for a long time. But now, suddenly the performance of
> this query drastically degraded as new messages were added.
> If these new messages are removed again, everything is back to normal. If
> other messages are removed instead, the problem remains, so it does not seem
> to be a memory issue. I'm afraid I have difficulty understanding what is going
> wrong here.

The usual cause is that the statistics for estimated row counts cross a
threshold that makes the query planner think that a different kind of
plan will be faster.

If the query planner is using bad information about the performance of
the storage, then it will be making bad decisions about which approach
is faster. So the usual thing to do is to adjust seq_page_cost and
random_page_cost to more closely reflect the real performance of your
hardware, and to make sure that effective_cache_size matches the real
amount of memory your computer has free for disk cache use.
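
As a very rough sketch (the numbers below are placeholders to experiment
with per session, not recommendations):

    -- example only: make random I/O look cheaper relative to sequential I/O
    SET seq_page_cost = 1.0;
    SET random_page_cost = 2.0;          -- default is 4.0
    -- tell the planner roughly how much RAM is available for disk caching
    SET effective_cache_size = '4GB';    -- adjust to the machine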

Newer versions of PostgreSQL always include query planning and
statistics improvements too.

BTW, it can be really helpful to paste your query plans into
http://explain.depesz.com/ , which will provide an easier to read visual
analysis of the plan. This will only work with query plans that haven't
been butchered by mail client word wrapping, so I can't do it for you,
but if you paste them there and post the links that'd be really handy.

Also have a look at http://wiki.postgresql.org/wiki/SlowQueryQuestions

I found the plans you posted a bit hard to read. Not your fault; it's
stupid mail clients.  Maybe depesz.com needs to be taught to de-munge
the damage done to plans by common mail clients.

--
Craig Ringer

Re: Hash Anti Join performance degradation

From
Cédric Villemain
Date:
2011/5/24 panam <panam@gmx.net>:
> Hi,
>
> In my DB, there is a large table containing messages and one that contains
> message boxes.
> Messages are assigned to boxes via a child parent link m->b.
> In order to obtain the last message for a specific box, I use the following
> SQL:
>
> SELECT m1.id FROM message m1 LEFT JOIN message m2 ON (m1.box_id = m2.box_id
> AND m1.id < m2.id) WHERE m2.id IS NULL AND m1.box_id = id;
>
> This worked quite well for a long time. But now, suddenly the performance of
> this query drastically degraded as new messages were added.
> If these new messages are removed again, everything is back to normal. If
> other messages are removed instead, the problem remains, so it does not seem
> to be a memory issue. I'm afraid I have difficulty understanding what is going
> wrong here.

We need more information here; the case is in fact interesting.
What's the PostgreSQL version, and do you have logs of vacuum and
checkpoint activity? (No VACUUM FULL, CLUSTER or anything like that running?)

Obviously, Craig's suggestion to read
http://wiki.postgresql.org/wiki/SlowQueryQuestions is relevant, as it
helps to gather all the common information required to analyze the issue.
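
If it is unclear whether that activity is being logged at all, a quick check
(sketch; both parameters live in postgresql.conf and need a reload to change):

    SHOW log_checkpoints;                 -- 'on' writes a log line per checkpoint
    SHOW log_autovacuum_min_duration;     -- 0 logs every autovacuum run, -1 disables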

>
> This is the query plan when everything is fine:
>
> "Seq Scan on public.box this_  (cost=0.00..10467236.32 rows=128 width=696)
> (actual time=0.169..7683.978 rows=128 loops=1)"
> "  Output: this_.id, this_.login, (SubPlan 1)"
> "  Buffers: shared hit=188413 read=94635 written=135, temp read=22530
> written=22374"
> "  SubPlan 1"
> "    ->  Hash Anti Join  (cost=41323.25..81775.25 rows=20427 width=8)
> (actual time=59.571..59.877 rows=1 loops=128)"
> "          Output: m1.id"
> "          Hash Cond: (m1.box_id = m2.box_id)"
> "          Join Filter: (m1.id < m2.id)"
> "          Buffers: shared hit=188412 read=94633 written=135, temp
> read=22530 written=22374"
> "          ->  Bitmap Heap Scan on public.message m1  (cost=577.97..40212.28
> rows=30640 width=16) (actual time=3.152..9.514 rows=17982 loops=128)"
> "                Output: m1.id, m1.box_id"
> "                Recheck Cond: (m1.box_id = $0)"
> "                Buffers: shared hit=131993 read=9550 written=23"
> "                ->  Bitmap Index Scan on "message_box_Idx"
> (cost=0.00..570.31 rows=30640 width=0) (actual time=2.840..2.840 rows=18193
> loops=128)"
> "                      Index Cond: (m1.box_id = $0)"
> "                      Buffers: shared hit=314 read=6433 written=23"
> "          ->  Hash  (cost=40212.28..40212.28 rows=30640 width=16) (actual
> time=26.840..26.840 rows=20014 loops=115)"
> "                Output: m2.box_id, m2.id"
> "                Buckets: 4096  Batches: 4 (originally 2)  Memory Usage:
> 5444kB"
> "                Buffers: shared hit=56419 read=85083 written=112, temp
> written=7767"
> "                ->  Bitmap Heap Scan on public.message m2
> (cost=577.97..40212.28 rows=30640 width=16) (actual time=2.419..20.007
> rows=20014 loops=115)"
> "                      Output: m2.box_id, m2.id"
> "                      Recheck Cond: (m2.box_id = $0)"
> "                      Buffers: shared hit=56419 read=85083 written=112"
> "                      ->  Bitmap Index Scan on "message_box_Idx"
> (cost=0.00..570.31 rows=30640 width=0) (actual time=2.166..2.166 rows=20249
> loops=115)"
> "                            Index Cond: (m2.box_id = $0)"
> "                            Buffers: shared hit=6708"
> "Total runtime: 7685.202 ms"
>
> This is the plan when the query gets sluggish:
>
> "Seq Scan on public.box this_  (cost=0.00..10467236.32 rows=128 width=696)
> (actual time=0.262..179333.086 rows=128 loops=1)"
> "  Output: this_.id, this_.login, (SubPlan 1)"
> "  Buffers: shared hit=189065 read=93983 written=10, temp read=22668
> written=22512"
> "  SubPlan 1"
> "    ->  Hash Anti Join  (cost=41323.25..81775.25 rows=20427 width=8)
> (actual time=1264.700..1400.886 rows=1 loops=128)"
> "          Output: m1.id"
> "          Hash Cond: (m1.box_id = m2.box_id)"
> "          Join Filter: (m1.id < m2.id)"
> "          Buffers: shared hit=189064 read=93981 written=10, temp read=22668
> written=22512"
> "          ->  Bitmap Heap Scan on public.message m1  (cost=577.97..40212.28
> rows=30640 width=16) (actual time=3.109..9.850 rows=18060 loops=128)"
> "                Output: m1.id, m1.box_id"
> "                Recheck Cond: (m1.box_id = $0)"
> "                Buffers: shared hit=132095 read=9448"
> "                ->  Bitmap Index Scan on "message_box_Idx"
> (cost=0.00..570.31 rows=30640 width=0) (actual time=2.867..2.867 rows=18193
> loops=128)"
> "                      Index Cond: (m1.box_id = $0)"
> "                      Buffers: shared hit=312 read=6435"
> "          ->  Hash  (cost=40212.28..40212.28 rows=30640 width=16) (actual
> time=27.533..27.533 rows=20102 loops=115)"
> "                Output: m2.box_id, m2.id"
> "                Buckets: 4096  Batches: 4 (originally 2)  Memory Usage:
> 5522kB"
> "                Buffers: shared hit=56969 read=84533 written=10, temp
> written=7811"
> "                ->  Bitmap Heap Scan on public.message m2
> (cost=577.97..40212.28 rows=30640 width=16) (actual time=2.406..20.492
> rows=20102 loops=115)"
> "                      Output: m2.box_id, m2.id"
> "                      Recheck Cond: (m2.box_id = $0)"
> "                      Buffers: shared hit=56969 read=84533 written=10"
> "                      ->  Bitmap Index Scan on "message_box_Idx"
> (cost=0.00..570.31 rows=30640 width=0) (actual time=2.170..2.170 rows=20249
> loops=115)"
> "                            Index Cond: (m2.box_id = $0)"
> "                            Buffers: shared hit=6708"
> "Total runtime: 179334.310 ms"
>
>
> So from my limited experience, the only significant difference I see is that
> the Hash Anti Join takes a lot more time in plan 2, but I do not understand
> why.
> Ideas somebody?
>
> Thanks
> panam
>
>



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Re: Hash Anti Join performance degradation

From
panam
Date:
Hi Craig and Cédric,

Thanks for the very informative introduction to the netiquette here and
thanks for sharing your time.
I wasn't aware of http://explain.depesz.com/, very useful.
So, here are the query plans:
http://explain.depesz.com/s/6AU (1st from previous post, good)
http://explain.depesz.com/s/YPS (2nd from previous post, bad)

> The usual cause is that the statistics for estimated row counts cross a
> threshold that makes the query planner think that a different kind of
> plan will be faster.

Hm, as far as I understand the plans, they are equivalent, aren't they?

> If the query planner is using bad information about the performance of
> the storage, then it will be making bad decisions about which approach
> is faster. So the usual thing to do is to adjust seq_page_cost and
> random_page_cost to more closely reflect the real performance of your
> hardware, and to make sure that effective_cache_size matches the real
> amount of memory your computer has free for disk cache use.

Will this make any difference even when the plans are equivalent as assumed
above?

The table creation SQL is as follows:
http://pastebin.com/qFDUP7Aa (message table); ~2328680 rows, growing
constantly (~10000 new rows each day),
http://pastebin.com/vEmh4hb8 (box table); ~128 rows (growing very slowly, 1
row every two days, each row updated about 2x a day)

The DB contains the same data, except that for the "good" query, the last
10976 rows (0.4%) of message are removed by doing a

DELETE FROM message where timestamp > TO_DATE ('05/23/2011','mm/dd/yyyy');


This speeds up the query by a factor of ~27 (207033.081 ms (bad) vs. 7683.978 ms
(good)).

Each query was run before and after a VACUUM ANALYZE, once to create
appropriate statistics and a second time to do the actual measurement.
All tests were made on the dev machine, which is an 8GB, Core i7, Windows 7 box.

I experienced the issue first in the "production" environment, which is
64-bit Ubuntu running PostgreSQL 9.0.1 on x86_64-unknown-linux-gnu,
compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit,
and later, for analysis, in the dev environment, which is
64-bit Windows 7 running PostgreSQL 9.0.4, compiled by Visual C++ build
1500, 64-bit.
For testing, I've increased the buffers that I judge important for the issue
to the following values:
effective_cache_size: 4GB
shared_buffers: 1GB
work_mem: 1GB
temp_buffers: 32MB
After that, the configuration was reloaded and the PostgreSQL service was
restarted using pgAdmin.
Interestingly, there was no performance gain compared to the default
settings; the "bad" query even took about 30 seconds (15%) longer.
It also seems that all data fits into memory, so there is not much disk I/O
involved.

@Cédric
> do you have logs of vacuum and checkpoint activity?
> (No VACUUM FULL, CLUSTER or anything like that running?)
There is no clustering involved here; it's a pretty basic setup.
How can I obtain the information you require here? I could send you the
output of the analyse vacuum command from pgAdmin, but is there a way to
make it output the information in English (rather than German)?

Thanks for your interest in this issue.

Regards,
panam

--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Hash-Anti-Join-performance-degradation-tp4420974p4422247.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.

Re: Hash Anti Join performance degradation

From
Tom Lane
Date:
panam <panam@gmx.net> writes:
> In my DB, there is a large table containing messages and one that contains
> message boxes.
> Messages are assigned to boxes via a child parent link m->b.
> In order to obtain the last message for a specific box, I use the following
> SQL:

> SELECT m1.id FROM message m1 LEFT JOIN message m2 ON (m1.box_id = m2.box_id
> AND m1.id < m2.id) WHERE m2.id IS NULL AND m1.box_id = id;

BTW, this query doesn't actually match the EXPLAIN outputs...

> So from my limited experience, the only significant difference I see is that
> the Hash Anti Join takes a lot more time in plan 2, but I do not understand
> why.

Whatever's going on is below the level that EXPLAIN can show.  I can
think of a couple of possibilities:

1. The "extra" rows in the slower case all manage to come out to the
same hash value, or some very small number of distinct hash values, such
that we spend a lot of time searching a single hash chain.  But it's
hard to credit that adding 0.4% more rows could result in near 100x
slowdown, no matter how bad their distribution.

2. There's some inefficiency in the use of the temp files, though again
it's far from clear why your two cases would be noticeably different
there.  Possibly enabling log_temp_files would tell you something useful.

One other thing I'm not following is how come it's using hash temp files
at all, when you claim in your later message that you've got work_mem
set to 1GB.  It should certainly not take more than a couple meg to hold
20K rows at 16 payload bytes per row.  You might want to check whether
that setting actually took effect.
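
Something along these lines would show what the server is actually using:

    SHOW work_mem;
    SELECT name, setting, unit, source
      FROM pg_settings
     WHERE name = 'work_mem';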

            regards, tom lane

Re: Hash Anti Join performance degradation

From
Craig Ringer
Date:
On 24/05/11 22:34, panam wrote:

>> The usual cause is that the statistics for estimated row counts cross a
>> threshold that makes the query planner think that a different kind of
>> plan will be faster.
>
> Hm, as far as I understand the plans, they are equivalent, aren't they?

Yes, they are, and the estimates are too. This isn't the usual case
where the planner trips over a threshold and switches to a totally
different plan type that it thinks is faster, but isn't.

The estimates are actually IDENTICAL for the hash anti-join node of
interest, and so are the actual loop count and row count. Temp file
activity is much the same across both plans too.

You can reproduce this behaviour consistently? It's _seriously_ weird,
and the sort of thing where, when I encounter it myself, I tend to ask "what
else is going on that I'm missing?".

What happens if you DELETE more rows? Or fewer? What's the threshold?

What happens if you DELETE rows from the start not the end, or a random
selection?

Does the problem persist if you DELETE the rows then CLUSTER the table
before running the query?

Does the problem persist if you DELETE the rows then REINDEX?
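
Roughly, those experiments would look like this (a sketch; the index name and
the cutoff value here are only assumptions):

    DELETE FROM message WHERE id > 2500000;   -- example cutoff; vary it, or delete elsewhere
    CLUSTER message_pkey ON message;          -- physically reorder the table by its primary key
    REINDEX TABLE message;
    ANALYZE message;                          -- refresh statistics afterwards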

>> If the query planner is using bad information about the performance of
>> the storage, then it will be making bad decisions about which approach
>> is faster. [snip]
>
> Will this make any difference even when the plans are equivalent as assumed
> above?

Nope. It doesn't seem to be a problem with plan selection.

> This speeds up the query by a factor of ~27. (207033.081 (bad) vs. 7683.978
> (good)).

That's a serious WTF.

> @Cédric
>> do you have logs of vacuum and checkpoint activity?
>> (No VACUUM FULL, CLUSTER or anything like that running?)
> There is no clustering involved here; it's a pretty basic setup.

He means 'CLUSTER', the SQL command that tells PostgreSQL to re-organize
a table.

The answer from the rest of your post would appear to be 'no, it's being
run in an otherwise-idle stand-alone test environment'. Right?

> How can I obtain the information you require here? I could send you the
> output of the analyse vacuum command from pgAdmin, but is there a way to
> make it output the information in English (rather than German)?

It's easy enough to read familiar output like that in German, if needs be.
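
That said, a sketch for forcing untranslated (English) server messages in a
session, in case it ever matters (lc_messages controls this; it may require
superuser privileges):

    SET lc_messages = 'C';   -- 'C' falls back to the untranslated, i.e. English, messages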

--
Craig Ringer

Re: Hash Anti Join performance degradation

From
panam
Date:
Hi all,

@Tom,
> BTW, this query doesn't actually match the EXPLAIN outputs...
You're right, it is actually just the "heavy" subquery of a larger query
which can be found here:
http://pastebin.com/fuGrt0tB

> One other thing I'm not following is how come it's using hash temp files
> at all, when you claim in your later message that you've got work_mem
> set to 1GB.  It should certainly not take more than a couple meg to hold
> 20K rows at 16 payload bytes per row.  You might want to check whether
> that setting actually took effect.
As I said, I drastically increased the buffer sizes (at least I intended
to) to see if it changed something. At first I thought it didn't. But
yesterday (I think it was after a reboot), the "bad" queries suddenly were
much faster (~20 secs, still at least 3 times slower than the "good" queries
though). Today, they were very slow again (I replayed the dumps in between).
So I am not sure whether Postgres actually picks up the altered
configuration (even after a reboot). Is there a way to determine the values
actually used?

@Craig
> You can reproduce this behaviour consistently? It's _seriously_ weird,
> and the sort of thing where, when I encounter it myself, I tend to ask "what
> else is going on that I'm missing?".
Yes, I can reproduce it consistently using a dumpfile.

> What happens if you DELETE more rows? Or fewer? What's the threshold?
> What happens if you DELETE rows from the start not the end, or a random
> selection?
From some experiments I made earlier, I conclude that the rows added last
are the problem. Deleting the first 10% did not seem to speed up the bad
query. However, I haven't checked that systematically.

> Does the problem persist if you DELETE the rows then CLUSTER the table
> before running the query?
Wow, I wasn't aware of CLUSTER. Applying it (clustering on the id PK) to the
table involved in the "bad" query worked wonders. It now needs just 4.3 secs,
compared to 600 secs before (now with one day of data added compared to the
previous post), and 4.0 secs for the "good" query (also clustered), which
is faster than the unclustered "good" query (about 6-8 secs).

> Does the problem persist if you DELETE the rows then REINDEX?
No, not noticeably.

> The answer from the rest of your post would appear to be 'no, it's being
> run in an otherwise-idle stand-alone test environment'. Right?
Correct.

It seems my issue is solved (at least for now). My impression is that it was
just somehow "bad luck" that the rows, as originally stored and as replayed
from the dumps, were kind of messed up with regard to their ids, especially,
it seems, the newly added ones. This is consistent with the peculiarities of
the query, which contains a pairwise id comparison that should greatly
benefit from an ordered set of ids.
This also made me wonder how the internal plan is carried out. Is the engine
able to leverage the fact that a part/range of the rows is totally or
partially ordered on disk, e.g. using some kind of binary search or even a
"nearest neighbor" search in that section (i.e. a special "micro-plan" or
algorithm)? Or is the speed-up "just" because related data is usually
"nearby" and most of the standard algorithms work best with clustered data?
If the first is not the case, would that be a potential point for
improvement? Maybe it would even be more efficient if there were some sort
of constraint that guarantees "ordered row" sections on disk, i.e. one
preventing the addition of a row whose index value falls between two row
values of an already ordered/clustered section. In the simplest case, it
would start with the "first" row and end with the "last" row (at the time of
doing the equivalent of CLUSTER). So there would be a small list saying that
rows with id x through id y are guaranteed to be ordered on disk (by id,
for example), now and for all time.

So, would you like to further investigate my previous issue (I think it is
still strange that performance suddenly dropped that dramatically)?

Many thanks and regards,
panam

--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Hash-Anti-Join-performance-degradation-tp4420974p4425890.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.

Re: Hash Anti Join performance degradation

From
"Kevin Grittner"
Date:
panam <panam@gmx.net> wrote:

> Is there a way to determine the values actually used?

The pg_settings view.  Try the query shown here:

http://wiki.postgresql.org/wiki/Server_Configuration
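
The query on that page is along these lines (a sketch), listing every setting
that differs from its built-in default:

    SELECT name, current_setting(name), source
      FROM pg_settings
     WHERE source NOT IN ('default', 'override');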

-Kevin

Re: Hash Anti Join performance degradation

From
Craig Ringer
Date:
On 05/26/2011 12:42 AM, panam wrote:

> So, would you like to further investigate my previous issue (I think it is
> still strange that performance suddenly dropped that dramatically)?

It's a bit beyond me, but I suspect that it'd be best if you could hang
onto the dump file in case someone has the time and enthusiasm to
investigate it. I take it you can't distribute the dump file, even
privately?

--
Craig Ringer

Re: Hash Anti Join performance degradation

From
panam
Date:
Hi there,


Kevin Grittner wrote:
>
>> Is there a way to determine the values actually used?
> The pg_settings view.  Try the query shown here:
> http://wiki.postgresql.org/wiki/Server_Configuration
>
Thanks Kevin, very useful. Here is the output:

"version";"PostgreSQL 9.0.4, compiled by Visual C++ build 1500, 64-bit"
"bytea_output";"escape"
"client_encoding";"UNICODE"
"effective_cache_size";"4GB"
"lc_collate";"German_Germany.1252"
"lc_ctype";"German_Germany.1252"
"listen_addresses";"*"
"log_destination";"stderr"
"log_line_prefix";"%t "
"logging_collector";"on"
"max_connections";"100"
"max_stack_depth";"2MB"
"port";"5432"
"server_encoding";"UTF8"
"shared_buffers";"1GB"
"temp_buffers";"4096"
"TimeZone";"CET"
"work_mem";"1GB"


Craig Ringer wrote:
>
> On 05/26/2011 12:42 AM, panam wrote:
> It's a bit beyond me, but I suspect that it'd be best if you could hang
> onto the dump file in case someone has the time and enthusiasm to
> investigate it. I take it you can't distribute the dump file, even
> privately?
>
Fortunately, I managed to reduce it to the absolute minimum (i.e. only
meaningless ids), and the issue is still observable.
You can download it from here:
http://www.zumodrive.com/file/460997770?key=cIdeODVlNz

Some things to try:
* tune your PostgreSQL settings if you want
* REINDEX, VACUUM ANALYZE if you want (example below)
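
For example (a sketch, using the table from the dump):

    REINDEX TABLE message;
    VACUUM ANALYZE message;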

"Patholgical" query:

select
    b.id,
    (SELECT
        m1.id
    FROM
        message m1
    LEFT JOIN
        message m2
            ON (
                m1.box_id = m2.box_id
                AND m1.id < m2.id
            )
    WHERE
        m2.id IS NULL
        AND m1.box_id = b.id)
from
    box b

=> takes almost "forever" (~600 seconds on my system)

Try

delete from message where id > 2550000;

=> deletes 78404 rows
Do the "pathological" query again
=> speed is back (~4 seconds on my system)

Replay the dump
Try

delete from message where id < 1000000;

=> deletes 835844 rows (about 10 times as many as before). Maybe you can
delete many more; I haven't tested this systematically.
Do the "pathological" query again
=> takes almost "forever" (didn't wait...)

Replay the dump
Cluster:

cluster message_pkey on message;

Do the "pathological" query again
=> speed is back (~3 seconds on my system)

Any third party confirmation?

Thanks
panam

--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Hash-Anti-Join-performance-degradation-tp4420974p4428435.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.

Re: Hash Anti Join performance degradation

From
Cédric Villemain
Date:
2011/5/26 panam <panam@gmx.net>:
> Hi there,
>
>
> Kevin Grittner wrote:
>>
>>> Is there a way to determine the values actually used?
>> The pg_settings view.  Try the query shown here:
>> http://wiki.postgresql.org/wiki/Server_Configuration
>>
> Thanks Kevin, very useful. Here is the output:
>
> "version";"PostgreSQL 9.0.4, compiled by Visual C++ build 1500, 64-bit"
> "bytea_output";"escape"
> "client_encoding";"UNICODE"
> "effective_cache_size";"4GB"
> "lc_collate";"German_Germany.1252"
> "lc_ctype";"German_Germany.1252"
> "listen_addresses";"*"
> "log_destination";"stderr"
> "log_line_prefix";"%t "
> "logging_collector";"on"
> "max_connections";"100"
> "max_stack_depth";"2MB"
> "port";"5432"
> "server_encoding";"UTF8"
> "shared_buffers";"1GB"
> "temp_buffers";"4096"
> "TimeZone";"CET"
> "work_mem";"1GB"
>
>
> Craig Ringer wrote:
>>
>> On 05/26/2011 12:42 AM, panam wrote:
>> It's a bit beyond me, but I suspect that it'd be best if you could hang
>> onto the dump file in case someone has the time and enthusiasm to
>> investigate it. I take it you can't distribute the dump file, even
>> privately?
>>
> Fortunately, I managed to reduce it to the absolute minimum (i.e. only
> meaningless ids), and the issue is still observable.
> You can download it from here:
> http://www.zumodrive.com/file/460997770?key=cIdeODVlNz
>
> Some things to try:
> * tune your PostgreSQL settings if you want
> * REINDEX, VACUUM ANALYZE if you want
>
> "Patholgical" query:
>
> select
>        b.id,
>        (SELECT
>                m1.id
>        FROM
>                message m1
>        LEFT JOIN
>                message m2
>                        ON (
>                                m1.box_id = m2.box_id
>                                AND m1.id < m2.id
>                        )
>        WHERE
>                m2.id IS NULL
>                AND m1.box_id = b.id)
> from
>        box b
>
> => takes almost "forever" (~600 seconds on my system)
>
> Try
>
> delete from message where id > 2550000;
>
> => deletes 78404 rows
> Do the "pathological" query again
> => speed is back (~4 seconds on my system)
>
> Replay the dump
> Try
>
> delete from message where id < 1000000;
>
> => deletes 835844 rows (about 10 times as many as before). Maybe you can
> delete many more; I haven't tested this systematically.
> Do the "pathological" query again
> => takes almost "forever" (didn't wait...)
>
> Replay the dump
> Cluster:
>
> cluster message_pkey on message;
>
> Do the "pathological" query again
> => speed is back (~3 seconds on my system)
>
> Any third party confirmation?

without explaining further why the antijoin has bad performance
without cluster, I wonder why you don't use this query :

SELECT  b.id,
                  max(m.id)
FROM box b, message m
WHERE m.box_id = b.id
GROUP BY b.id;

looks similar and fastest.

>
> Thanks
> panam
>



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Re: Hash Anti Join performance degradation

From
"Kevin Grittner"
Date:
Cédric Villemain<cedric.villemain.debian@gmail.com> wrote:
> 2011/5/26 panam <panam@gmx.net>:

>> "max_connections";"100"

>> "work_mem";"1GB"

Each connection can allocate work_mem, potentially several times.
On a machine without hundreds of GB of RAM, that pair of settings
could cause severe swapping (100 connections at 1GB each is already
100GB before any query uses work_mem more than once).

>> "Patholgical" query:
>>
>> select
>>        b.id,
>>        (SELECT
>>                m1.id
>>        FROM
>>                message m1
>>        LEFT JOIN
>>                message m2
>>                        ON (
>>                                m1.box_id = m2.box_id
>>                                AND m1.id < m2.id
>>                        )
>>        WHERE
>>                m2.id IS NULL
>>                AND m1.box_id = b.id)
>> from
>>        box b

> without explaining further why the antijoin has bad performance
> without cluster, I wonder why you don't use this query :
>
> SELECT  b.id,
>                   max(m.id)
> FROM box b, message m
> WHERE m.box_id = b.id
> GROUP BY b.id;
>
> looks similar and fastest.

I think you would need a left join to actually get identical
results:

SELECT  b.id, max(m.id)
  FROM box b
  LEFT JOIN message m ON m.box_id = b.id
  GROUP BY b.id;

But yeah, I would expect this approach to be much faster.  Rather
easier to understand and harder to get wrong, too.

-Kevin

Re: Hash Anti Join performance degradation

From
panam
Date:
Hi all,


Cédric Villemain-3 wrote:
>
> without explaining further why the antijoin has bad performance
> without cluster, I wonder why you don't use this query :
>
> SELECT  b.id,
>                   max(m.id)
> FROM box b, message m
> WHERE m.box_id = b.id
> GROUP BY b.id;
>
> looks similar and fastest.
>
I actually did use a similar strategy in the meantime (during my problem
with the "left join" query we are talking about here all the time) for
mitigation.
It was
SELECT MAX(e.id) FROM event_message e WHERE e.box_id = id
and it performed worse in comparison to the "left join" query in the general
case (i.e. before my problems began).
At the end of this post is an explanation why I think I cannot use the
solution you suggested above.


Kevin Grittner wrote:
>
>  Each connection can allocate work_mem, potentially several times.
> On a machine without hundreds of GB of RAM, that pair of settings
> could cause severe swapping.
>
Indeed, thanks for the warning. These settings are not for production but to
rule out performance degradation caused by small cache sizes.


Kevin Grittner wrote:
>
> I think you would need a left join to actually get identical
> results:
>
> SELECT  b.id, max(m.id)
>   FROM box b
>   LEFT JOIN message m ON m.box_id = b.id
>   GROUP BY b.id;
>
> But yeah, I would expect this approach to be much faster.  Rather
> easier to understand and harder to get wrong, too.
>
>
Correct, it is much faster, even with unclustered ids.
However, I think I cannot use it because of the way that query is generated
(by Hibernate).
The (simplified) base query is just

SELECT b.id FROM box b

the subquery

(SELECT  m1.id FROM message m1
   LEFT JOIN message m2
      ON (m1.box_id = m2.box_id  AND m1.id < m2.id )
   WHERE m2.id IS NULL AND m1.box_id = b.id) as lastMessageId

is due to a Hibernate formula (containing more or less plain SQL) used to
determine the last message id for that box. It ought to return just one row,
not multiple. So I am constrained to the subquery in all optimization
attempts (I cannot combine them as you did), at least I do not see how. If
you have an idea for a more performant subquery though, let me know, as this
can easily be replaced.

Thanks for your help and suggestions
panam

--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Hash-Anti-Join-performance-degradation-tp4420974p4429125.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.

Re: Hash Anti Join performance degradation

From
"Kevin Grittner"
Date:
panam <panam@gmx.net> wrote:

> I cannot use it because of the way that query is generated
> (by hibernate).
>
> The (simplified) base query is just
>
> SELECT b.id FROM box b
>
> the subquery
>
> (SELECT  m1.id FROM message m1
>    LEFT JOIN message m2
>       ON (m1.box_id = m2.box_id  AND m1.id < m2.id )
>    WHERE m2.id IS NULL AND m1.box_id = b.id) as lastMessageId
>
> is due to a hibernate formula (containing more or less plain SQL)
> to determine the last message id for that box. It ought to return
> just one row, not multiple. So I am constrained to the subquery in
> all optimization attempts (I cannot combine them as you did), at
> least I do not see how. If you have an idea for a more performant
> subquery though, let me know, as this can easily be replaced.

Maybe:

(SELECT max(m1.id) FROM message m1 WHERE m1.box_id = b.id)
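
Plugged into the generated outer query, that would be (sketch):

    SELECT b.id,
           (SELECT max(m1.id) FROM message m1 WHERE m1.box_id = b.id)
      FROM box b;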

-Kevin

Re: Hash Anti Join performance degradation

From
panam
Date:
Sorry,

SELECT MAX(e.id) FROM event_message e WHERE e.box_id = id

as posted previously should actually read

(SELECT max(m1.id) FROM message m1 WHERE m1.box_id = b.id)

so I tried this already.

Regards,
panam




--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Hash-Anti-Join-performance-degradation-tp4420974p4429475.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.

Re: Hash Anti Join performance degradation

From
Cédric Villemain
Date:
2011/5/26 panam <panam@gmx.net>:
> Hi all,
>
>
> Cédric Villemain-3 wrote:
>>
>> without explaining further why the antijoin has bad performance
>> without cluster, I wonder why you don't use this query :
>>
>> SELECT  b.id,
>>                   max(m.id)
>> FROM box b, message m
>> WHERE m.box_id = b.id
>> GROUP BY b.id;
>>
>> looks similar and fastest.
>>
> I actually did use a similar strategy in the meantime (during my problem
> with the "left join" query we are talking about here all the time) for
> mitigation.
> It was
> SELECT MAX(e.id) FROM event_message e WHERE e.box_id = id
> and it performed worse in comparison to the "left join" query in the general
> case (i.e. before my problems began).
> At the end of this post is an explanation why I think I cannot use the
> solution you suggested above.
>
>
> Kevin Grittner wrote:
>>
>>  Each connection can allocate work_mem, potentially several times.
>> On a machine without hundreds of GB of RAM, that pair of settings
>> could cause severe swapping.
>>
> Indeed, thanks for the warning. These settings are not for production but to
> rule out performance degradation caused by small cache sizes.
>
>
> Kevin Grittner wrote:
>>
>> I think you would need a left join to actually get identical
>> results:
>>
>> SELECT  b.id, max(m.id)
>>   FROM box b
>>   LEFT JOIN message m ON m.box_id = b.id
>>   GROUP BY b.id;
>>
>> But yeah, I would expect this approach to be much faster.  Rather
>> easier to understand and harder to get wrong, too.
>>
>>
> Correct, it is much faster, even with unclustered ids.
> However, I think I cannot use it because of the way that query is generated
> (by hibernate).
> The (simplified) base query is just
>
> SELECT b.id FROM box b
>
> the subquery
>
> (SELECT  m1.id FROM message m1
>   LEFT JOIN message m2
>      ON (m1.box_id = m2.box_id  AND m1.id < m2.id )
>   WHERE m2.id IS NULL AND m1.box_id = b.id) as lastMessageId
>
> is due to a hibernate formula (containing more or less plain SQL) to
> determine the last message id for that box. It ought to return just one row,
> not multiple. So I am constrained to the subquery in all optimization
> attempts (I cannot combine them as you did), at least I do not see how. If
> you have an idea for a more performant subquery though, let me know, as this
> can easily be replaced.

In production, if you have a decent I/O system, you can lower
random_page_cost and the query may be faster using an index (by default,
with the use case you provided, it chooses a seqscan). It can be a bit
tricky if you have to lower random_page_cost so much that it hurts other
queries' plans while improving this one; if that happens, post again :)
(Sometimes other cost parameters need changing as well, but that has to
be handled with care.)
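
For example, to experiment in a single session without affecting anything
else (a sketch; the value is only an example):

    SET random_page_cost = 2.0;    -- default is 4.0
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT b.id,
           (SELECT m1.id
              FROM message m1
              LEFT JOIN message m2
                ON m1.box_id = m2.box_id AND m1.id < m2.id
             WHERE m2.id IS NULL
               AND m1.box_id = b.id)
      FROM box b;
    RESET random_page_cost;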

I am not a Hibernate expert, but I'd be surprised if you cannot drive
Hibernate to do what you want.

>
> Thanks for your help and suggestions
> panam
>



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Re: Hash Anti Join performance degradation

From
Craig Ringer
Date:
On 05/27/2011 02:13 AM, Cédric Villemain wrote:

> I am not a Hibernate expert, but I'd be surprised if you cannot drive
> Hibernate to do what you want.

If nothing else, you can do a native query in hand-written SQL through
Hibernate. ORMs are useful tools for some jobs, but it's good to be able
to bypass them when needed too.

--
Craig Ringer

Re: Hash Anti Join performance degradation

From
Robert Haas
Date:
On Thu, May 26, 2011 at 8:33 AM, panam <panam@gmx.net> wrote:
> Any third party confirmation?

Yeah, it definitely looks like there is some kind of bug here.  Or if
not a bug, then a very surprising feature.  EXPLAIN ANALYZE outputs
from your proposed test attached.  Here's a unified diff of the two
outputs:


  QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------
- Seq Scan on box b  (cost=0.00..3669095.76 rows=128 width=8) (actual
time=0.147..431517.693 rows=128 loops=1)
+ Seq Scan on box b  (cost=0.00..3669095.76 rows=128 width=8) (actual
time=0.047..6938.165 rows=128 loops=1)
    SubPlan 1
-     ->  Hash Anti Join  (cost=14742.77..28664.79 rows=19239 width=8)
(actual time=2960.176..3370.425 rows=1 loops=128)
+     ->  Hash Anti Join  (cost=14742.77..28664.79 rows=19239 width=8)
(actual time=48.385..53.361 rows=1 loops=128)
            Hash Cond: (m1.box_id = m2.box_id)
            Join Filter: (m1.id < m2.id)
-           ->  Bitmap Heap Scan on message m1  (cost=544.16..13696.88
rows=28858 width=16) (actual time=2.320..6.204 rows=18487 loops=128)
+           ->  Bitmap Heap Scan on message m1  (cost=544.16..13696.88
rows=28858 width=16) (actual time=1.928..5.502 rows=17875 loops=128)
                  Recheck Cond: (box_id = b.id)
-                 ->  Bitmap Index Scan on "message_box_Idx"
(cost=0.00..536.94 rows=28858 width=0) (actual time=2.251..2.251
rows=18487 loops=128)
+                 ->  Bitmap Index Scan on "message_box_Idx"
(cost=0.00..536.94 rows=28858 width=0) (actual time=1.797..1.797
rows=18487 loops=128)
                        Index Cond: (box_id = b.id)
-           ->  Hash  (cost=13696.88..13696.88 rows=28858 width=16)
(actual time=12.632..12.632 rows=19720 loops=120)
-                 Buckets: 4096  Batches: 4 (originally 2)  Memory Usage: 1787kB
-                 ->  Bitmap Heap Scan on message m2
(cost=544.16..13696.88 rows=28858 width=16) (actual time=1.668..6.619
rows=19720 loops=120)
+           ->  Hash  (cost=13696.88..13696.88 rows=28858 width=16)
(actual time=11.603..11.603 rows=20248 loops=113)
+                 Buckets: 4096  Batches: 4 (originally 2)  Memory Usage: 1423kB
+                 ->  Bitmap Heap Scan on message m2
(cost=544.16..13696.88 rows=28858 width=16) (actual time=1.838..6.886
rows=20248 loops=113)
                        Recheck Cond: (box_id = b.id)
-                       ->  Bitmap Index Scan on "message_box_Idx"
(cost=0.00..536.94 rows=28858 width=0) (actual time=1.602..1.602
rows=19720 loops=120)
+                       ->  Bitmap Index Scan on "message_box_Idx"
(cost=0.00..536.94 rows=28858 width=0) (actual time=1.743..1.743
rows=20903 loops=113)
                              Index Cond: (box_id = b.id)
- Total runtime: 431520.186 ms
+ Total runtime: 6940.369 ms

That's pretty odd.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Hash Anti Join performance degradation

From
Cédric Villemain
Date:
2011/5/31 Robert Haas <robertmhaas@gmail.com>:
> On Thu, May 26, 2011 at 8:33 AM, panam <panam@gmx.net> wrote:
>> Any third party confirmation?
>
> Yeah, it definitely looks like there is some kind of bug here.  Or if
> not a bug, then a very surprising feature.  EXPLAIN ANALYZE outputs
> from your proposed test attached.  Here's a unified diff of the two
> outputs:
>
>
>  QUERY PLAN
>
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
> - Seq Scan on box b  (cost=0.00..3669095.76 rows=128 width=8) (actual
> time=0.147..431517.693 rows=128 loops=1)
> + Seq Scan on box b  (cost=0.00..3669095.76 rows=128 width=8) (actual
> time=0.047..6938.165 rows=128 loops=1)
>    SubPlan 1
> -     ->  Hash Anti Join  (cost=14742.77..28664.79 rows=19239 width=8)
> (actual time=2960.176..3370.425 rows=1 loops=128)
> +     ->  Hash Anti Join  (cost=14742.77..28664.79 rows=19239 width=8)
> (actual time=48.385..53.361 rows=1 loops=128)
>            Hash Cond: (m1.box_id = m2.box_id)
>            Join Filter: (m1.id < m2.id)
> -           ->  Bitmap Heap Scan on message m1  (cost=544.16..13696.88
> rows=28858 width=16) (actual time=2.320..6.204 rows=18487 loops=128)
> +           ->  Bitmap Heap Scan on message m1  (cost=544.16..13696.88
> rows=28858 width=16) (actual time=1.928..5.502 rows=17875 loops=128)
>                  Recheck Cond: (box_id = b.id)
> -                 ->  Bitmap Index Scan on "message_box_Idx"
> (cost=0.00..536.94 rows=28858 width=0) (actual time=2.251..2.251
> rows=18487 loops=128)
> +                 ->  Bitmap Index Scan on "message_box_Idx"
> (cost=0.00..536.94 rows=28858 width=0) (actual time=1.797..1.797
> rows=18487 loops=128)
>                        Index Cond: (box_id = b.id)
> -           ->  Hash  (cost=13696.88..13696.88 rows=28858 width=16)
> (actual time=12.632..12.632 rows=19720 loops=120)
> -                 Buckets: 4096  Batches: 4 (originally 2)  Memory Usage: 1787kB
> -                 ->  Bitmap Heap Scan on message m2
> (cost=544.16..13696.88 rows=28858 width=16) (actual time=1.668..6.619
> rows=19720 loops=120)
> +           ->  Hash  (cost=13696.88..13696.88 rows=28858 width=16)
> (actual time=11.603..11.603 rows=20248 loops=113)
> +                 Buckets: 4096  Batches: 4 (originally 2)  Memory Usage: 1423kB
> +                 ->  Bitmap Heap Scan on message m2
> (cost=544.16..13696.88 rows=28858 width=16) (actual time=1.838..6.886
> rows=20248 loops=113)
>                        Recheck Cond: (box_id = b.id)
> -                       ->  Bitmap Index Scan on "message_box_Idx"
> (cost=0.00..536.94 rows=28858 width=0) (actual time=1.602..1.602
> rows=19720 loops=120)
> +                       ->  Bitmap Index Scan on "message_box_Idx"
> (cost=0.00..536.94 rows=28858 width=0) (actual time=1.743..1.743
> rows=20903 loops=113)
>                              Index Cond: (box_id = b.id)
> - Total runtime: 431520.186 ms
> + Total runtime: 6940.369 ms
>
> That's pretty odd.

Yes. While I was at it, I noticed that the query takes a long time to be
cancelled. I added a CHECK_FOR_INTERRUPTS() in the for(;;) loop in
nodeHashjoin.c. It fixes the delay when cancelling, but I don't know what
performance impact this may have in that part of the code.

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
