Thread: Slow index creation

Slow index creation

From
Paul van der Linden
Date:
Hi,

I have 2 functions:
CREATE FUNCTION foo(a text, b text, c text) RETURNS text AS
$func$
DECLARE
    retVal text;
BEGIN
    SELECT
      CASE
        WHEN a='v1' AND b='b1' THEN 'r1'
        WHEN a='v1' THEN 'r2'
        ... snip long list containing various tests on a,b and c
        WHEN a='v50' THEN 'r50'
      END INTO retval;
    RETURN retVal;
END
$func$ LANGUAGE PLPGSQL IMMUTABLE;

CREATE FUNCTION bar(r text, geom geometry) RETURNS int AS
$func$
DECLARE
    retVal int;
BEGIN
    SELECT
      CASE
        WHEN r='r1' AND st_area(geom)>100 THEN 1
        WHEN r='r1' THEN 2
        ... snip long list containing various tests on r and st_area(geom)
        WHEN r='r50' THEN 25
      END INTO retval;
    RETURN retVal;
END
$func$ LANGUAGE PLPGSQL IMMUTABLE;

and a large table t (100M+ records) with columns a, b, c and geom running on PG 11, on spinning disks with 64GB memory and 28 cores.

When I create a simple geom index with CREATE INDEX ON t USING gist(geom) it finishes in about an hour, but when I create a partial index using these 2 functions
CREATE INDEX ON t USING gist(geom) WHERE bar(foo(a,b,c),geom)<12 it takes over 20 hours...

Is that because I'm using functions in the WHERE clause, or because my CASE lists are quite long, or both?
Is there any way to speed up that index creation? Is upgrading to a newer postgres a viable option (so the JITTER can speed up the functions)?

Paul

Re: Slow index creation

From
Michael Lewis
Date:
What is your concern with it taking 20 hours vs 1 hour? Is this index re-created on a regular basis? 

Would it make any sense to materialize the value of foo(a,b,c) as a generated column (PG12+ natively, or maintained by a trigger before)? Or even bar(foo(a,b,c),geom)?

Do you know if parallel_workers are being used?

JIT is available in PG11, it is just off by default. If it is available, turning it on and trying it seems like the simplest check if it would speed up the index creation.

Re: Slow index creation

From
Paul van der Linden
Date:
Well, first off it's annoying if I have to change the function and a reindex afterwards, and secondly, lots of other queries are blocking on that reindex query (basically everything needing a queryplan on that table).

Materializing is also an option but that too is taking its time.

As far as I know there's no parallelism used currently, and as per documentation, only creating b-tree indices support parallelism..
Also my postgres installation (on windows) doesn't seem to do anything with JIT (even after setting all the jit-related values to 0)

I was more trying to get a feeling on where the slowness is, and how to improve that...

On Tue, Feb 16, 2021 at 7:45 PM Michael Lewis <mlewis@entrata.com> wrote:
What is your concern with it taking 20 hours vs 1 hour? Is this index re-created on a regular basis? 

Would it make any sense to materialize the value of foo(a,b,c) as a generated column (PG12+ natively, or maintained by a trigger before)? Or even bar(foo(a,b,c),geom)?

Do you know if parallel_workers are being used?

JIT is available in PG11, it is just off by default. If it is available, turning it on and trying it seems like the simplest check if it would speed up the index creation.

Re: Slow index creation

From
Ron
Date:
On 2/16/21 12:30 PM, Paul van der Linden wrote:
Hi,

I have 2 functions:
CREATE FUNCTION foo(a text, b text, c text) RETURNS text AS
$func$
DECLARE
    retVal text;
BEGIN
    SELECT
      CASE
        WHEN a='v1' AND b='b1' THEN 'r1'
        WHEN a='v1' THEN 'r2'
        ... snip long list containing various tests on a,b and c
        WHEN a='v50' THEN 'r50'
      END INTO retval;
    RETURN retVal;
END
$func$ LANGUAGE PLPGSQL IMMUTABLE;

CREATE FUNCTION bar(r text, geom geometry) RETURNS int AS
$func$
DECLARE
    retVal int;
BEGIN
    SELECT
      CASE
        WHEN r='r1' AND st_area(geom)>100 THEN 1
        WHEN r='r1' THEN 2
        ... snip long list containing various tests on r and st_area(geom)
        WHEN r='r50' THEN 25
      END INTO retval;
    RETURN retVal;
END
$func$ LANGUAGE PLPGSQL IMMUTABLE;

and a large table t (100M+ records) with columns a, b, c and geom running on PG 11, on spinning disks with 64GB memory and 28 cores.

When I create a simple geom index with CREATE INDEX ON t USING gist(geom) it finishes in about an hour, but when I create a partial index using these 2 functions
CREATE INDEX ON t USING gist(geom) WHERE bar(foo(a,b,c),geom)<12 it takes over 20 hours...

Is that because I'm using functions in the WHERE clause, or because my CASE lists are quite long, or both?

How long does SELECT a, b, c, foo(a, b, c) from blarg; take?
Ditto SELECT a, b, c, foo(a, b, c), bar(foo(a, b, c)) from blarg;
Ditto SELECT a, b, c, foo(a, b, c) from blarg where bar(foo(a, b, c)) < 12;

That'll narrow the problem.

Is there any way to speed up that index creation? Is upgrading to a newer postgres a viable option (so the JITTER can speed up the functions)?

Paul

--
Angular momentum makes the world go 'round.

Re: Slow index creation

From
hubert depesz lubaczewski
Date:
On Tue, Feb 16, 2021 at 07:30:23PM +0100, Paul van der Linden wrote:
> Hi,
> I have 2 functions:
> CREATE FUNCTION foo(a text, b text, c text) RETURNS text AS
> $func$
> DECLARE
>     retVal text;
> BEGIN
>     SELECT
>       CASE
>         WHEN a='v1' AND b='b1' THEN 'r1'
>         WHEN a='v1' THEN 'r2'
>         ... snip long list containing various tests on a,b and c
>         WHEN a='v50' THEN 'r50'
>       END INTO retval;
>     RETURN retVal;
> END
> $func$ LANGUAGE PLPGSQL IMMUTABLE;

If this function was converted to SQL function it could be faster, as it
could be inlined.

> CREATE FUNCTION bar(r text, geom geometry) RETURNS int AS
> $func$
> DECLARE
>     retVal int;
> BEGIN
>     SELECT
>       CASE
>         WHEN r='r1' AND st_area(geom)>100 THEN 1
>         WHEN r='r1' THEN 2
>         ... snip long list containing various tests on r and st_area(geom)
>         WHEN r='r50' THEN 25
>       END INTO retval;
>     RETURN retVal;
> END
> $func$ LANGUAGE PLPGSQL IMMUTABLE;

First thing that I notice is that it seems (from the code and comment)
that you run st_area(geom) multiple times.

If that's really the case - why don't you cache it in some variable?

declare
    v_area float := st_area( geom );
begin 
...

and then use v_area instead of st_area(geom)

depesz



Re: Slow index creation

From
Paul van der Linden
Date:

The st_area calculation is done mostly once or sometimes twice for each geom, and I suspect that can't explain the factor 20 slower.
Creating an index with only one st_area calculation is also done rather quickly.

On Wed, Feb 17, 2021 at 7:48 PM hubert depesz lubaczewski <depesz@depesz.com> wrote:
On Tue, Feb 16, 2021 at 07:30:23PM +0100, Paul van der Linden wrote:
> Hi,
> I have 2 functions:
> CREATE FUNCTION foo(a text, b text, c text) RETURNS text AS
> $func$
> DECLARE
>     retVal text;
> BEGIN
>     SELECT
>       CASE
>         WHEN a='v1' AND b='b1' THEN 'r1'
>         WHEN a='v1' THEN 'r2'
>         ... snip long list containing various tests on a,b and c
>         WHEN a='v50' THEN 'r50'
>       END INTO retval;
>     RETURN retVal;
> END
> $func$ LANGUAGE PLPGSQL IMMUTABLE;

If this function was converted to SQL function it could be faster, as it
could be inlined.

> CREATE FUNCTION bar(r text, geom geometry) RETURNS int AS
> $func$
> DECLARE
>     retVal int;
> BEGIN
>     SELECT
>       CASE
>         WHEN r='r1' AND st_area(geom)>100 THEN 1
>         WHEN r='r1' THEN 2
>         ... snip long list containing various tests on r and st_area(geom)
>         WHEN r='r50' THEN 25
>       END INTO retval;
>     RETURN retVal;
> END
> $func$ LANGUAGE PLPGSQL IMMUTABLE;

First thing that I notice is that it seems (from the code and comment)
that you run st_area(geom) multiple times.

If that's really the case - why don't you cache it in some variable?

declare
    v_area float := st_area( geom );
begin
...

and then use v_area instead of st_area(geom)

depesz

Re: Slow index creation

From
hubert depesz lubaczewski
Date:
On Wed, Feb 17, 2021 at 08:40:17PM +0100, Paul van der Linden wrote:
> The st_area calculation is done mostly once or sometimes twice for each geom, and I suspect that can't explain the
factor20 slower.
 
> Creating an index with only one st_area calculation is also done rather quickly.

In this case, make small test case, like 1000 rows, or something like
this. run create index without where, and then remake the functions with
profiling info, for example using this approach:
https://www.depesz.com/2010/03/18/profiling-stored-proceduresfunctions/
and check for yourself what takes this time, and if it's something you
can fix.

Best regards,

depesz




Re: Slow index creation

From
Michael Lewis
Date:
https://www.depesz.com/2010/03/18/profiling-stored-proceduresfunctions/

Thanks for this reference. I enjoy your blog, but haven't made the time to read all the archives somehow. Stuff doesn't stick very well when it isn't yet "needed" info besides.

I have seen overhead from 'raise notice' in small functions that are sometimes called many thousands of times in a single query, but hadn't done the test to verify if the same overhead still exists for raise debug or another level below both client_min_messages and log_min_messages. Using your examples, I saw about .006 ms for each call to RAISE DEBUG with a client/log_min as notice/warning.

Re: Slow index creation

From
hubert depesz lubaczewski
Date:
On Thu, Feb 18, 2021 at 10:24:25AM -0700, Michael Lewis wrote:
>   [1]https://www.depesz.com/2010/03/18/profiling-stored-proceduresfunctions/
> 
> Thanks for this reference. I enjoy your blog, but haven't made the time to read all the archives somehow. Stuff
doesn'tstick very
 
> well when it isn't yet "needed" info besides.
> I have seen overhead from 'raise notice' in small functions that are sometimes called many thousands of times in a
singlequery, but
 
> hadn't done the test to verify if the same overhead still exists for raise debug or another level below
both client_min_messages
> and log_min_messages. Using your examples, I saw about .006 ms for each call to RAISE DEBUG with a client/log_min as
notice/warning.

Sure, this overhead is definitely possible, but kinda besides the point
- there will be some slowdowns in other places, and it will be good to
track them.
That's why I suggested to do it on small sample of data.

Best regards,

depesz




Re: Slow index creation

From
"Peter J. Holzer"
Date:
On 2021-02-16 19:30:23 +0100, Paul van der Linden wrote:
> I have 2 functions:
>
> CREATE FUNCTION foo(a text, b text, c text) RETURNS text AS
> $func$
> DECLARE
>     retVal text;
> BEGIN
>     SELECT
>       CASE
>         ... snip long list containing various tests on a,b and c
>       END INTO retval;
>     RETURN retVal;
> END
> $func$ LANGUAGE PLPGSQL IMMUTABLE;
>
> CREATE FUNCTION bar(r text, geom geometry) RETURNS int AS
> $func$
> DECLARE
>     retVal int;
> BEGIN
>     SELECT
>       CASE
>         WHEN r='r1' AND st_area(geom)>100 THEN 1
>         ... snip long list containing various tests on r and st_area(geom)
>       END INTO retval;
>     RETURN retVal;
> END
> $func$ LANGUAGE PLPGSQL IMMUTABLE;
>
> and a large table t (100M+ records) with columns a, b, c and geom running on PG
> 11, on spinning disks with 64GB memory and 28 cores.
>
> When I create a simple geom index with CREATE INDEX ON t USING gist(geom) it
> finishes in about an hour, but when I create a partial index using these 2
> functions
> CREATE INDEX ON t USING gist(geom) WHERE bar(foo(a,b,c),geom)<12 it takes over
> 20 hours...

It has to call these functions for each of those 100M+ lines. So that's
about 0.7 ms per line. Is that to be expected for what those functions
do? I don't know.

As depesz wrote, use smaller test case, like 1000 or 10000 rows. You can
test the speed of those functions in isolation. Compare

select a, b, c from t;
select foo(a, b, c) from t;
select bar(foo(a, b, c), geom) from t;

That tells you how much time is spent in foo and bar. If the sum is
close to those 0.7 ms, you know that you need to make those functions
faster.

I think SQL doesn't short-circuit, so in your function bar, that
st_area(geom) in the first WHEN clause will always be called, regardless
of the value of r. You can either call that once (as depesz suggested)
or you can avoid calling it by nesting the cases:

    case
        when r = 'r1' then
            case
                when st_area(geom) > 100 then 1
                else 2
            end
        when r = 'r2' then
            ....
    end

You can then go one step furthe:

    case
        when r like 'r_' then
            -- handle r1 ... r9
        when r like 'r1_' then
            -- handle r10 ... r19
        when r like 'r2_' then
            -- handle r20 ... r29
    end

That reduces the average number of comparisons from 50+ to 8.

Depesz measured an overhead of 0.016ms per log message. That's low
enough that you can probably afford a few messages, even if each
function invocation only takes a few milliseconds. So definitely try
that if you need to know where your functions spend their time.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Attachment

Re: Slow index creation

From
Paul van der Linden
Date:
Extra tables is not something that will work out in my workflow...
I've managed to cut the time in half already, but perhaps there's more to be won.
In one of the calculations done on the st_area, I used a log(base,value), but since the only 2-param log function present in postgres takes numerics as parameters, i inserted a typecast to numeric.
Apparently that conversion takes quite some time, rewriting it to log(value)/log(base) (which are present with float params, so no need to convert to numeric) took only half the original time

On Wed, Feb 24, 2021 at 9:37 PM Bjornar Skinnes <bjornar_skinnes@trimble.com> wrote:
Why not create a table with cols a, b, c and d. Where you insert a row for each combination and key and index abc then return d? 

ons. 24. feb. 2021, 21:15 skrev Paul van der Linden <paul.doskabouter@gmail.com>:
Thanks for all the suggestions,

When the server is not in use for mission-critical work, I'll definitely going to do some testing based on your ideas.
Will let you know what comes out of that

Cheers,
Paul

On Thu, Feb 18, 2021 at 6:49 PM hubert depesz lubaczewski <depesz@depesz.com> wrote:
On Thu, Feb 18, 2021 at 10:24:25AM -0700, Michael Lewis wrote:
>   [1]https://www.depesz.com/2010/03/18/profiling-stored-proceduresfunctions/
>
> Thanks for this reference. I enjoy your blog, but haven't made the time to read all the archives somehow. Stuff doesn't stick very
> well when it isn't yet "needed" info besides.
> I have seen overhead from 'raise notice' in small functions that are sometimes called many thousands of times in a single query, but
> hadn't done the test to verify if the same overhead still exists for raise debug or another level below both client_min_messages
> and log_min_messages. Using your examples, I saw about .006 ms for each call to RAISE DEBUG with a client/log_min as notice/warning.

Sure, this overhead is definitely possible, but kinda besides the point
- there will be some slowdowns in other places, and it will be good to
track them.
That's why I suggested to do it on small sample of data.

Best regards,

depesz