Thread: Large CASE-statement is pretty slow?

Large CASE-statement is pretty slow?

From
Arjen van der Meijden
Date:
Hi list,

I was more or less toying with an idea for a project I have, which
includes renumbering a primary key (don't ask, it's necessary :/ )

Anyway, I was looking into the usefullness of a INSERT INTO newtable
SELECT field, field, CASE pkey WHEN x1 THEN y1 WHEN x2 THEN y2 etc END
FROM oldtable

The resulting select was about 1.7MB of query-text, mostly composed of
the CASE-statement. So I discarded that idea, I still wanted to know how
much time it would take on my database (MySQL) and found it to take
about 1100 seconds, in contrast to simply selecting the data, which'd
take about 0.7 seconds orso... The table I tested this on is about 30MB.

Of course I wanted to know how long it'd take on postgresql, selecting
the pkey-field only (without the case) took also some 0.7 seconds (the
entire table may have been more).
But the CASE-version took 9026139.201 ms, i.e. over 9000 seconds about 8
times slower than MySQL.

What I'm wondering about:
Although I was not expecting Postgresql to heavily beat MySQL, I was
surprised to see it so much slower. Is the CASE-statement in Postgresql
that inefficient? Or is it simply not very scalable (i.e. don't try to
have 100000 cases like I did)?

The database is a lightly optimised gentoo-compile of 7.4.2, the
mysql-version was 4.0.18 in case anyone wanted to know that.


Best regards,

Arjen van der Meijden


PS, don't try to "help improve the query" I discarded the idea as too
inefficient and went along with a simple left join to get the "new pkey"
out of a temporary table ;)



Re: Large CASE-statement is pretty slow?

From
Tom Lane
Date:
Arjen van der Meijden <acmmailing@vulcanus.its.tudelft.nl> writes:
> Anyway, I was looking into the usefullness of a INSERT INTO newtable
> SELECT field, field, CASE pkey WHEN x1 THEN y1 WHEN x2 THEN y2 etc END
> FROM oldtable

> The resulting select was about 1.7MB of query-text, mostly composed of
> the CASE-statement.

Hm, you mean one single SELECT, one single CASE?  How many WHEN clauses
exactly?  Exactly what did a typical clause of the CASE look like?

I wouldn't be too surprised to find some bit of code that's O(N^2) in
the number of arms of the CASE, or something like that; it's not an area
that we've ever felt the need to optimize.  But I'd like a fairly
specific test case before trying to look into it.

            regards, tom lane

Re: Large CASE-statement is pretty slow?

From
Arjen van der Meijden
Date:
Tom Lane wrote:

> Arjen van der Meijden <acmmailing@vulcanus.its.tudelft.nl> writes:
>
>>Anyway, I was looking into the usefullness of a INSERT INTO newtable
>>SELECT field, field, CASE pkey WHEN x1 THEN y1 WHEN x2 THEN y2 etc END
>>FROM oldtable
>
>
>>The resulting select was about 1.7MB of query-text, mostly composed of
>>the CASE-statement.
>
>
> Hm, you mean one single SELECT, one single CASE?  How many WHEN clauses
> exactly?  Exactly what did a typical clause of the CASE look like?
Yes, one SELECT-query with one single CASE-statement.
The CASE-statement had the simple-case-structure like:
SELECT CASE UserID WHEN 1 THEN 1 WHEN 34 THEN 2 ... etc

I noticed, by the way, that the ordering is on the THEN y parameter, the
x parameter (WHEN x THEN y) is "more or less increasing".

But some numbers:
The table I did my tests on has 88291 rows, I did the select on the
integer primary key, so the CASE was the only column in the select.
I'm running the query again on a table that has only the primary key of
my original table and it seems to be as slow.
I'm not really sure how many WHEN's there are in that CASE, but it is
supposed to be a relocation of all primary key-values to some other
value, so it will contain some number close to that 88291.

> I wouldn't be too surprised to find some bit of code that's O(N^2) in
> the number of arms of the CASE, or something like that; it's not an area
> that we've ever felt the need to optimize.  But I'd like a fairly
> specific test case before trying to look into it.

Well, I have discarded this type of query as "too inefficient" and found
a better way, so don't feel the need to optimize it just because I
noticed it is slow with very large CASEs. Although CASEs with a few
hundred WHENs wont be that uncommon and might improve a bit as well?

I can send you the "primary key only"-table and the query off list if
you want to. That won't make me violate any privacy rule and is probably
a good test case?

Best regards,

Arjen van der Meijden



Re: Large CASE-statement is pretty slow?

From
Greg Stark
Date:
Arjen van der Meijden <acmmailing@vulcanus.its.tudelft.nl> writes:

>
> Of course I wanted to know how long it'd take on postgresql, selecting the
> pkey-field only (without the case) took also some 0.7 seconds (the entire table
> may have been more).
> But the CASE-version took 9026139.201 ms, i.e. over 9000 seconds about 8 times
> slower than MySQL.

Was this the select with the CASE, or the update?

If you did the update and have lots of foreign key references to the table
then every record that's updated forces a check to make sure there are no
references to that record (or an update if it's ON UPDATE CASCADE). If there
are no indexes on the referencing table columns that will be very slow.

--
greg

Re: Large CASE-statement is pretty slow?

From
Greg Stark
Date:
Arjen van der Meijden <acmmailing@vulcanus.its.tudelft.nl> writes:

> Well, I have discarded this type of query as "too inefficient" and found a
> better way

Loading the mapping into a table with an index and doing an update using
"from" to do a join seems likely to end up being the most efficient method.
Postgres would probably not even bother with the index and do a hash join.


--
greg

Re: Large CASE-statement is pretty slow?

From
Arjen van der Meijden
Date:
Greg Stark wrote:

> Arjen van der Meijden <acmmailing@vulcanus.its.tudelft.nl> writes:
>
>
> Was this the select with the CASE, or the update?

It was just the select to see how long it'd take. I already anticipated
it to be possibly a "slow query", so I only did the select first.

Best regards,

Arjen van der Meijden



Re: Large CASE-statement is pretty slow?

From
Tom Lane
Date:
Arjen van der Meijden <acmmailing@vulcanus.its.tudelft.nl> writes:
> [ huge CASE is pretty slow ]

I did some profiling of the test case that Arjen was kind enough to send
me.  It seems there are two distinct problems.  One is that the parser
uses repeated lappend()'s to construct the list of CASE arms; this
makes building the structure O(N^2) in the number of arms.  (If you
simply try to EXPLAIN the query, you find out that the parse time is
about a third of the run time :-( ... and 90% of that is spent inside
nconc() which is the guts of lappend.)  This problem is slated to be
fixed by Neil Conway's upcoming rewrite of the list support, which will
convert lappend into a constant-time operation.

The other difficulty is that the evaluation machinery for arithmetic
expressions has a lot of overhead.  The profile run shows:

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 38.15     41.92    41.92   229646     0.00     0.00  nconc
 21.76     65.84    23.92 199054454     0.00     0.00  ExecEvalExpr
 11.38     78.34    12.50    10000     0.00     0.00  ExecEvalCase
  8.43     87.61     9.27 66348151     0.00     0.00  ExecEvalFuncArgs
  8.12     96.54     8.93 66348151     0.00     0.00  ExecMakeFunctionResult
  2.96     99.78     3.25 66348151     0.00     0.00  ExecEvalVar
  1.23    101.14     1.36    10058     0.00     0.00  AllocSetCheck
  1.23    102.49     1.35 66348151     0.00     0.00  ExecEvalOper
  1.12    103.72     1.24    76537     0.00     0.00  OpernameGetCandidates
  0.85    104.66     0.94 66424693     0.00     0.00  int4eq

(Note: I added LIMIT 10000 to the query so that the CASE is only carried
out 10000 times, rather than nearly 90000 times as in Arjen's original
test case.  Without this, the call-counter overflows for ExecEvalExpr,
and the time percentages seem to get confused.  One must recognize
though that this overstates the parser overhead compared to the original
test case.)

Clearly the useful work (int4eq) is getting a bit swamped by the ExecEval
mechanism.  I have some ideas about reducing the overhead, which I'll
post to the pghackers list in a bit.

            regards, tom lane