Thread: performance bug in instrument.c
Hi, it seems there's a issue in instrument.c that may have significant performance impact for some plans when running explain analyze with "TIMING OFF". There's this piece of code in InstrStartNode: if (instr->need_timer && INSTR_TIME_IS_ZERO(instr->starttime)) INSTR_TIME_SET_CURRENT(instr->starttime); else elog(DEBUG2, "InstrStartNode called twice in a row"); but it should actually be like this if (instr->need_timer && INSTR_TIME_IS_ZERO(instr->starttime)) INSTR_TIME_SET_CURRENT(instr->starttime); elseif (instr->need_timer) elog(DEBUG2, "InstrStartNode called twice in a row"); or maybe if (instr->need_timer) { if (INSTR_TIME_IS_ZERO(instr->starttime)) INSTR_TIME_SET_CURRENT(instr->starttime); else elog(DEBUG2, "InstrStartNode called twice in a row"); } The current code calls the "else" branch everytime when "TIMING OFF" is used, and this may lead to serious performance degradation - see for example this: CREATE TABLE x AS SELECT id FROM generate_series(1,15000) s(id); ANALYZE x; EXPLAIN ANALYZE SELECT a.id FROM x a, x b; Current code: QUERY PLAN -----------------------------------------------------------------------Nested Loop (cost=0.00..2812971.50 rows=225000000width=4) (actual time=0.013..29611.859 rows=225000000 loops=1) -> Seq Scan on x a (cost=0.00..217.00 rows=15000 width=4) (actual time=0.006..2.847 rows=15000 loops=1) -> Materialize (cost=0.00..292.00 rows=15000 width=0) (actual time=0.000..0.740 rows=15000 loops=15000) -> Seq Scan on x b (cost=0.00..217.00 rows=15000 width=0) (actual time=0.003..1.186 rows=15000 loops=1)Total runtime: 36054.079 ms (5 rows) and with the fix QUERY PLAN -----------------------------------------------------------------------Nested Loop (cost=0.00..1458333.00 rows=116640000width=4) (actual time=0.021..13158.399 rows=100000000 loops=1) -> Seq Scan on x a (cost=0.00..153.00 rows=10800 width=4) (actual time=0.014..1.960 rows=10000 loops=1) -> Materialize (cost=0.00..207.00 rows=10800 width=0) (actual time=0.000..0.499 rows=10000 loops=10000) -> Seq Scan on x b (cost=0.00..153.00 rows=10800 width=0) (actual time=0.003..1.037 rows=10000 loops=1)Total runtime: 16017.953 ms (5 rows) Obviously this is a quite extreme example, most plans won't be hit this hard. This was discovered by Pavel Stehule when backporting my "TIMING OFF" patch (which introduced the bug). regards Tomas
Tomas Vondra <tv@fuzzy.cz> writes: > There's this piece of code in InstrStartNode: > if (instr->need_timer && INSTR_TIME_IS_ZERO(instr->starttime)) > INSTR_TIME_SET_CURRENT(instr->starttime); > else > elog(DEBUG2, "InstrStartNode called twice in a row"); > but it should actually be like this > if (instr->need_timer) > { > if (INSTR_TIME_IS_ZERO(instr->starttime)) > INSTR_TIME_SET_CURRENT(instr->starttime); > else > elog(DEBUG2, "InstrStartNode called twice in a row"); > } Hm. It's a bit annoying that we can't detect the "called twice" condition when !need_timer, but I suppose that any such bug would be a caller logic error that would probably not be sensitive to need_timer, so it's likely not worth adding overhead to handle that. A bigger question is why this is elog(DEBUG2) and not elog(ERROR). Had it been the latter, we'd have noticed the mistake immediately. The current choice might be masking any caller-logic errors that exist, too. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > Tomas Vondra <tv(at)fuzzy(dot)cz> writes: > There's this piece of code in InstrStartNode: > >> if (instr->need_timer && INSTR_TIME_IS_ZERO(instr->starttime)) >> INSTR_TIME_SET_CURRENT(instr->starttime); >> else >> elog(DEBUG2, "InstrStartNode called twice in a row"); > >> but it should actually be like this > >> if (instr->need_timer) >> { >> if (INSTR_TIME_IS_ZERO(instr->starttime)) >> INSTR_TIME_SET_CURRENT(instr->starttime); >> else >> elog(DEBUG2, "InstrStartNode called twice in a row"); >> } > > Hm. It's a bit annoying that we can't detect the "called twice" > condition when !need_timer, but I suppose that any such bug would be a > caller logic error that would probably not be sensitive to need_timer, > so it's likely not worth adding overhead to handle that. Yes, that's annoying. But if we need / want to detect that, wouldn't it be cleaner to add there a separate "already executed" flag, instead of misusing starttime for that? > > A bigger question is why this is elog(DEBUG2) and not elog(ERROR). > Had it been the latter, we'd have noticed the mistake immediately. > The current choice might be masking any caller-logic errors that > exist, too. Not sure why it's DEBUG2, but if it really is an error then it should be logged as ERROR I guess. Tomas
Tomas Vondra <tv@fuzzy.cz> writes: > Tom Lane <tgl@sss.pgh.pa.us> wrote: >> A bigger question is why this is elog(DEBUG2) and not elog(ERROR). >> Had it been the latter, we'd have noticed the mistake immediately. >> The current choice might be masking any caller-logic errors that >> exist, too. > Not sure why it's DEBUG2, but if it really is an error then it should be > logged as ERROR I guess. Perhaps there was a reason for that once, but AFAICT the logic is clean now. I changed these elogs to ERROR and ran all the regression tests under auto_explain, and nothing failed. That's hardly conclusive of course, but it's enough evidence to persuade me to make the change in HEAD. regards, tom lane