Thread: Large CASE-statement is pretty slow?
Hi list,

I was more or less toying with an idea for a project I have, which includes renumbering a primary key (don't ask, it's necessary :/ ). Anyway, I was looking into the usefulness of an

    INSERT INTO newtable
    SELECT field, field, CASE pkey WHEN x1 THEN y1 WHEN x2 THEN y2 etc END
    FROM oldtable

The resulting select was about 1.7MB of query text, mostly composed of the CASE statement. So I discarded that idea, but I still wanted to know how much time it would take on my database (MySQL), and found it to take about 1100 seconds, in contrast to simply selecting the data, which took about 0.7 seconds or so. The table I tested this on is about 30MB.

Of course I wanted to know how long it'd take on PostgreSQL. Selecting the pkey field only (without the CASE) also took some 0.7 seconds (the entire table may have been more). But the CASE version took 9026139.201 ms, i.e. over 9000 seconds, about 8 times slower than MySQL.

What I'm wondering about: although I was not expecting PostgreSQL to heavily beat MySQL, I was surprised to see it so much slower. Is the CASE statement in PostgreSQL that inefficient? Or is it simply not very scalable (i.e. don't try to have 100000 cases like I did)?

The database is a lightly optimised Gentoo compile of 7.4.2; the MySQL version was 4.0.18, in case anyone wanted to know that.

Best regards,

Arjen van der Meijden

PS: don't try to "help improve the query"; I discarded the idea as too inefficient and went along with a simple left join to get the "new pkey" out of a temporary table ;)
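PPS: for the curious, the join-based version is roughly the following (a sketch only; the "mapping" table and its columns are invented for illustration, and field1/field2 stand in for the other columns):

    -- Temporary table holding old-key -> new-key pairs:
    CREATE TEMP TABLE mapping (old_pkey integer PRIMARY KEY, new_pkey integer);
    -- ... fill it with the renumbering pairs ...

    INSERT INTO newtable
    SELECT o.field1, o.field2, m.new_pkey
      FROM oldtable o
      LEFT JOIN mapping m ON m.old_pkey = o.pkey;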
Arjen van der Meijden <acmmailing@vulcanus.its.tudelft.nl> writes:
> Anyway, I was looking into the usefulness of an INSERT INTO newtable
> SELECT field, field, CASE pkey WHEN x1 THEN y1 WHEN x2 THEN y2 etc END
> FROM oldtable

> The resulting select was about 1.7MB of query text, mostly composed of
> the CASE statement.

Hm, you mean one single SELECT, one single CASE? How many WHEN clauses exactly? Exactly what did a typical clause of the CASE look like?

I wouldn't be too surprised to find some bit of code that's O(N^2) in the number of arms of the CASE, or something like that; it's not an area that we've ever felt the need to optimize. But I'd like a fairly specific test case before trying to look into it.

			regards, tom lane
Tom Lane wrote:
> Arjen van der Meijden <acmmailing@vulcanus.its.tudelft.nl> writes:
>> Anyway, I was looking into the usefulness of an INSERT INTO newtable
>> SELECT field, field, CASE pkey WHEN x1 THEN y1 WHEN x2 THEN y2 etc END
>> FROM oldtable
>
>> The resulting select was about 1.7MB of query text, mostly composed of
>> the CASE statement.
>
> Hm, you mean one single SELECT, one single CASE? How many WHEN clauses
> exactly? Exactly what did a typical clause of the CASE look like?

Yes, one SELECT query with one single CASE statement. The CASE statement had the simple-case structure, like:

    SELECT CASE UserID
             WHEN 1 THEN 1
             WHEN 34 THEN 2
             ... etc

I noticed, by the way, that the ordering is on the THEN y parameter; the x parameter (WHEN x THEN y) is "more or less increasing".

But some numbers: the table I did my tests on has 88291 rows. I did the select on the integer primary key, so the CASE was the only column in the select. I'm running the query again on a table that has only the primary key of my original table, and it seems to be as slow. I'm not really sure how many WHENs there are in that CASE, but it is supposed to be a relocation of all primary-key values to some other value, so it will contain some number close to that 88291.

> I wouldn't be too surprised to find some bit of code that's O(N^2) in
> the number of arms of the CASE, or something like that; it's not an area
> that we've ever felt the need to optimize. But I'd like a fairly
> specific test case before trying to look into it.

Well, I have discarded this type of query as "too inefficient" and found a better way, so don't feel the need to optimize it just because I noticed it is slow with very large CASEs. Although CASEs with a few hundred WHENs won't be that uncommon and might improve a bit as well?

I can send you the "primary key only" table and the query off-list if you want to. That won't make me violate any privacy rule and is probably a good test case?

Best regards,

Arjen van der Meijden
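PS: if it helps, a self-contained reproduction can be built along these lines (a rough sketch; "casetest" and "userid" are made-up names, and the query text itself would be emitted by a small script rather than typed in):

    -- Test table with ~88000 integer keys:
    CREATE TABLE casetest (userid integer PRIMARY KEY);
    -- ... populate with the values 1..88291, e.g. via COPY ...

    -- Generated query: one CASE arm per key, mapping old key to new key:
    SELECT CASE userid
             WHEN 1  THEN 1
             WHEN 34 THEN 2
             -- ... ~88000 more WHEN arms ...
           END
    FROM casetest;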
Arjen van der Meijden <acmmailing@vulcanus.its.tudelft.nl> writes:
> Of course I wanted to know how long it'd take on PostgreSQL. Selecting the
> pkey field only (without the CASE) also took some 0.7 seconds (the entire
> table may have been more).

> But the CASE version took 9026139.201 ms, i.e. over 9000 seconds, about 8
> times slower than MySQL.

Was this the select with the CASE, or the update? If you did the update and have lots of foreign key references to the table, then every record that's updated forces a check to make sure there are no references to that record (or an update, if it's ON UPDATE CASCADE). If there are no indexes on the referencing table columns, that will be very slow.

--
greg
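PS: for concreteness, the usual fix for that is an index on the referencing column. A minimal sketch, assuming a hypothetical referencing table "othertable" with a foreign key column "ref_pkey":

    -- Without this index, every renumbered key forces a sequential scan
    -- of "othertable" during the foreign-key check:
    CREATE INDEX othertable_ref_pkey_idx ON othertable (ref_pkey);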
Arjen van der Meijden <acmmailing@vulcanus.its.tudelft.nl> writes:
> Well, I have discarded this type of query as "too inefficient" and found a
> better way

Loading the mapping into a table with an index and doing an update using "from" to do a join seems likely to end up being the most efficient method. Postgres would probably not even bother with the index and do a hash join.

--
greg
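PS: spelled out, that looks something like this (a sketch only; "mapping" and its column names are invented):

    -- Load the old-key -> new-key pairs into a table:
    CREATE TABLE mapping (old_pkey integer PRIMARY KEY, new_pkey integer);
    -- ... COPY the pairs in ...

    -- Renumber via a join; with tables this size the planner will
    -- probably hash-join rather than touch the index:
    UPDATE oldtable
       SET pkey = mapping.new_pkey
      FROM mapping
     WHERE oldtable.pkey = mapping.old_pkey;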
Greg Stark wrote:
> Arjen van der Meijden <acmmailing@vulcanus.its.tudelft.nl> writes:
>
> Was this the select with the CASE, or the update?

It was just the select, to see how long it'd take. I already anticipated it to be possibly a "slow query", so I only did the select first.

Best regards,

Arjen van der Meijden
Arjen van der Meijden <acmmailing@vulcanus.its.tudelft.nl> writes:
> [ huge CASE is pretty slow ]

I did some profiling of the test case that Arjen was kind enough to send me. It seems there are two distinct problems.

One is that the parser uses repeated lappend()'s to construct the list of CASE arms; this makes building the structure O(N^2) in the number of arms. (If you simply try to EXPLAIN the query, you find out that the parse time is about a third of the run time :-( ... and 90% of that is spent inside nconc(), which is the guts of lappend.) This problem is slated to be fixed by Neil Conway's upcoming rewrite of the list support, which will convert lappend into a constant-time operation.

The other difficulty is that the evaluation machinery for arithmetic expressions has a lot of overhead. The profile run shows:

      %   cumulative   self               self     total
     time   seconds   seconds     calls  s/call   s/call  name
    38.15     41.92     41.92    229646    0.00     0.00  nconc
    21.76     65.84     23.92 199054454    0.00     0.00  ExecEvalExpr
    11.38     78.34     12.50     10000    0.00     0.00  ExecEvalCase
     8.43     87.61      9.27  66348151    0.00     0.00  ExecEvalFuncArgs
     8.12     96.54      8.93  66348151    0.00     0.00  ExecMakeFunctionResult
     2.96     99.78      3.25  66348151    0.00     0.00  ExecEvalVar
     1.23    101.14      1.36     10058    0.00     0.00  AllocSetCheck
     1.23    102.49      1.35  66348151    0.00     0.00  ExecEvalOper
     1.12    103.72      1.24     76537    0.00     0.00  OpernameGetCandidates
     0.85    104.66      0.94  66424693    0.00     0.00  int4eq

(Note: I added LIMIT 10000 to the query so that the CASE is only carried out 10000 times, rather than nearly 90000 times as in Arjen's original test case. Without this, the call counter overflows for ExecEvalExpr, and the time percentages seem to get confused. One must recognize, though, that this overstates the parser overhead compared to the original test case.)

Clearly the useful work (int4eq) is getting a bit swamped by the ExecEval mechanism. I have some ideas about reducing the overhead, which I'll post to the pghackers list in a bit.

			regards, tom lane
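PS: if anyone wants to see the parse-time cost in isolation: timing a bare EXPLAIN of the big-CASE query is enough, since EXPLAIN parses and plans the query but never evaluates the CASE. A sketch, reusing the made-up "casetest"/"userid" names from upthread:

    \timing
    -- Parse + plan only; this is where the O(N^2) lappend()/nconc()
    -- cost shows up, with no executor overhead mixed in:
    EXPLAIN SELECT CASE userid WHEN 1 THEN 1 WHEN 34 THEN 2
                   -- ... ~88000 more WHEN arms ...
                   END
            FROM casetest;

Running the same query without EXPLAIN then adds the executor overhead (ExecEvalExpr and friends) on top.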