Home > mailing lists

Re: Slow index creation - Mailing list pgsql-general

From	Peter J. Holzer
Subject	Re: Slow index creation
Date	February 19, 2021 13:54:59
Msg-id	20210219135459.GB10464@hjp.at Whole thread Raw
In response to	Slow index creation (Paul van der Linden <paul.doskabouter@gmail.com>)
List	pgsql-general

Tree view

On 2021-02-16 19:30:23 +0100, Paul van der Linden wrote:
> I have 2 functions:
>
> CREATE FUNCTION foo(a text, b text, c text) RETURNS text AS
> $func$
> DECLARE
>     retVal text;
> BEGIN
>     SELECT
>       CASE
>         ... snip long list containing various tests on a,b and c
>       END INTO retval;
>     RETURN retVal;
> END
> $func$ LANGUAGE PLPGSQL IMMUTABLE;
>
> CREATE FUNCTION bar(r text, geom geometry) RETURNS int AS
> $func$
> DECLARE
>     retVal int;
> BEGIN
>     SELECT
>       CASE
>         WHEN r='r1' AND st_area(geom)>100 THEN 1
>         ... snip long list containing various tests on r and st_area(geom)
>       END INTO retval;
>     RETURN retVal;
> END
> $func$ LANGUAGE PLPGSQL IMMUTABLE;
>
> and a large table t (100M+ records) with columns a, b, c and geom running on PG
> 11, on spinning disks with 64GB memory and 28 cores.
>
> When I create a simple geom index with CREATE INDEX ON t USING gist(geom) it
> finishes in about an hour, but when I create a partial index using these 2
> functions
> CREATE INDEX ON t USING gist(geom) WHERE bar(foo(a,b,c),geom)<12 it takes over
> 20 hours...

It has to call these functions for each of those 100M+ lines. So that's
about 0.7 ms per line. Is that to be expected for what those functions
do? I don't know.

As depesz wrote, use smaller test case, like 1000 or 10000 rows. You can
test the speed of those functions in isolation. Compare

select a, b, c from t;
select foo(a, b, c) from t;
select bar(foo(a, b, c), geom) from t;

That tells you how much time is spent in foo and bar. If the sum is
close to those 0.7 ms, you know that you need to make those functions
faster.

I think SQL doesn't short-circuit, so in your function bar, that
st_area(geom) in the first WHEN clause will always be called, regardless
of the value of r. You can either call that once (as depesz suggested)
or you can avoid calling it by nesting the cases:

    case
        when r = 'r1' then
            case
                when st_area(geom) > 100 then 1
                else 2
            end
        when r = 'r2' then
            ....
    end

You can then go one step furthe:

    case
        when r like 'r_' then
            -- handle r1 ... r9
        when r like 'r1_' then
            -- handle r10 ... r19
        when r like 'r2_' then
            -- handle r20 ... r29
    end

That reduces the average number of comparisons from 50+ to 8.

Depesz measured an overhead of 0.016ms per log message. That's low
enough that you can probably afford a few messages, even if each
function invocation only takes a few milliseconds. So definitely try
that if you need to know where your functions spend their time.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Attachment

signature.asc

pgsql-general by date:

From: Laurenz Albe
Date: 19 February 2021, 13:31:24
Subject: Re: how does PostgreSQL determine how many parallel processes to start

From: "Peter J. Holzer"
Date: 19 February 2021, 14:11:18
Subject: Re: Slow while inserting and retrieval (compared to SQL Server)

Re: Slow index creation - Mailing list pgsql-general

Attachment

Previous

Next