A colleague, Korry Douglas, observed a table partitioning scenario where
deserializing pg_constraint.ccbin is a hot spot. The following test case, a
simplification of a typical partitioning setup, spends 28% of its time in
stringToNode() and callees thereof:
\timing on
\set n 600000
BEGIN;
CREATE TABLE bench_check_constr_parent (c int);
CREATE TABLE bench_check_constr_child (
CHECK (c > 0 AND c <= 100000000)
) INHERITS (bench_check_constr_parent);
CREATE FUNCTION trig() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
INSERT INTO bench_check_constr_child VALUES (NEW.*);
RETURN NULL;
END
$$;
CREATE TRIGGER redir BEFORE INSERT ON bench_check_constr_parent
FOR EACH ROW EXECUTE PROCEDURE trig();
-- Main benchmark
INSERT INTO bench_check_constr_parent SELECT * FROM generate_series(1, :n);
TRUNCATE bench_check_constr_parent;
INSERT INTO bench_check_constr_parent SELECT * FROM generate_series(1, :n);
TRUNCATE bench_check_constr_parent;
INSERT INTO bench_check_constr_parent SELECT * FROM generate_series(1, :n);
TRUNCATE bench_check_constr_parent;
-- Compare direct insert performance @ 10x volume
INSERT INTO bench_check_constr_child SELECT * FROM generate_series(1, :n * 10);
TRUNCATE bench_check_constr_parent;
INSERT INTO bench_check_constr_child SELECT * FROM generate_series(1, :n * 10);
TRUNCATE bench_check_constr_parent;
INSERT INTO bench_check_constr_child SELECT * FROM generate_series(1, :n * 10);
TRUNCATE bench_check_constr_parent;
ROLLBACK;
The executor caches each CHECK constraint in ResultRelInfo as a planned
expression. That cache is highly effectively for long-running statements, but
the trivial INSERTs effectively work without a cache. Korry devised this
patch to cache the stringToNode() form of the constraint in the relcache. It
improves the benchmark's partitioned scenario by 33%:
-- Timings (seconds) --
master, INSERT parent: 14.2, 14.4, 14.4
patched, INSERT parent: 9.6, 9.7, 9.7
master, INSERT*10 child: 9.9, 9.9, 10.2
patched, INSERT*10 child: 10.0, 10.2, 10.2
There's still not much to like about that tenfold overhead from use of the
partition routing trigger, but this patch makes a nice cut into that overhead
without doing anything aggressive. The profile no longer shows low-hanging
fruit; running an entire SQL statement per row piles on the runtime from a
wide range of sources. For anyone curious, I've attached output from "perf
report -s parent -g graph,1,caller" with the patch applied; I suggest browsing
under "less -S".
Some call sites need to modify the node tree, so the patch has them do
copyObject(). I ran a microbenchmark of copyObject() on the cached node tree
vs. redoing stringToNode(), and copyObject() still won by a factor of four.
Thanks,
nm
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com