Thread: Slow index creation
Hi,
I have 2 functions:
CREATE FUNCTION foo(a text, b text, c text) RETURNS text AS
$func$
DECLARE
retVal text;
BEGIN
SELECT
CASE
WHEN a='v1' AND b='b1' THEN 'r1'
WHEN a='v1' THEN 'r2'
... snip long list containing various tests on a,b and c
WHEN a='v50' THEN 'r50'
END INTO retval;
RETURN retVal;
END
$func$ LANGUAGE PLPGSQL IMMUTABLE;
CREATE FUNCTION bar(r text, geom geometry) RETURNS int AS
$func$
DECLARE
retVal int;
BEGIN
SELECT
CASE
WHEN r='r1' AND st_area(geom)>100 THEN 1
WHEN r='r1' THEN 2
... snip long list containing various tests on r and st_area(geom)
WHEN r='r50' THEN 25
END INTO retval;
RETURN retVal;
END
$func$ LANGUAGE PLPGSQL IMMUTABLE;
$func$
DECLARE
retVal text;
BEGIN
SELECT
CASE
WHEN a='v1' AND b='b1' THEN 'r1'
WHEN a='v1' THEN 'r2'
... snip long list containing various tests on a,b and c
WHEN a='v50' THEN 'r50'
END INTO retval;
RETURN retVal;
END
$func$ LANGUAGE PLPGSQL IMMUTABLE;
CREATE FUNCTION bar(r text, geom geometry) RETURNS int AS
$func$
DECLARE
retVal int;
BEGIN
SELECT
CASE
WHEN r='r1' AND st_area(geom)>100 THEN 1
WHEN r='r1' THEN 2
... snip long list containing various tests on r and st_area(geom)
WHEN r='r50' THEN 25
END INTO retval;
RETURN retVal;
END
$func$ LANGUAGE PLPGSQL IMMUTABLE;
and a large table t (100M+ records) with columns a, b, c and geom running on PG 11, on spinning disks with 64GB memory and 28 cores.
When I create a simple geom index with CREATE INDEX ON t USING gist(geom) it finishes in about an hour, but when I create a partial index using these 2 functions
CREATE INDEX ON t USING gist(geom) WHERE bar(foo(a,b,c),geom)<12 it takes over 20 hours...
Is that because I'm using functions in the WHERE clause, or because my CASE lists are quite long, or both?
Is there any way to speed up that index creation? Is upgrading to a newer postgres a viable option (so the JITTER can speed up the functions)?
Paul
What is your concern with it taking 20 hours vs 1 hour? Is this index re-created on a regular basis?
Would it make any sense to materialize the value of foo(a,b,c) as a generated column (PG12+ natively, or maintained by a trigger before)? Or even bar(foo(a,b,c),geom)?
Do you know if parallel_workers are being used?
JIT is available in PG11, it is just off by default. If it is available, turning it on and trying it seems like the simplest check if it would speed up the index creation.
Well, first off it's annoying if I have to change the function and a reindex afterwards, and secondly, lots of other queries are blocking on that reindex query (basically everything needing a queryplan on that table).
Materializing is also an option but that too is taking its time.
As far as I know there's no parallelism used currently, and as per documentation, only creating b-tree indices support parallelism..
Also my postgres installation (on windows) doesn't seem to do anything with JIT (even after setting all the jit-related values to 0)
I was more trying to get a feeling on where the slowness is, and how to improve that...
On Tue, Feb 16, 2021 at 7:45 PM Michael Lewis <mlewis@entrata.com> wrote:
What is your concern with it taking 20 hours vs 1 hour? Is this index re-created on a regular basis?Would it make any sense to materialize the value of foo(a,b,c) as a generated column (PG12+ natively, or maintained by a trigger before)? Or even bar(foo(a,b,c),geom)?Do you know if parallel_workers are being used?JIT is available in PG11, it is just off by default. If it is available, turning it on and trying it seems like the simplest check if it would speed up the index creation.
On 2/16/21 12:30 PM, Paul van der Linden wrote:
How long does SELECT a, b, c, foo(a, b, c) from blarg; take?
Ditto SELECT a, b, c, foo(a, b, c), bar(foo(a, b, c)) from blarg;
Ditto SELECT a, b, c, foo(a, b, c) from blarg where bar(foo(a, b, c)) < 12;
That'll narrow the problem.
Hi,I have 2 functions:CREATE FUNCTION foo(a text, b text, c text) RETURNS text AS
$func$
DECLARE
retVal text;
BEGIN
SELECT
CASE
WHEN a='v1' AND b='b1' THEN 'r1'
WHEN a='v1' THEN 'r2'
... snip long list containing various tests on a,b and c
WHEN a='v50' THEN 'r50'
END INTO retval;
RETURN retVal;
END
$func$ LANGUAGE PLPGSQL IMMUTABLE;
CREATE FUNCTION bar(r text, geom geometry) RETURNS int AS
$func$
DECLARE
retVal int;
BEGIN
SELECT
CASE
WHEN r='r1' AND st_area(geom)>100 THEN 1
WHEN r='r1' THEN 2
... snip long list containing various tests on r and st_area(geom)
WHEN r='r50' THEN 25
END INTO retval;
RETURN retVal;
END
$func$ LANGUAGE PLPGSQL IMMUTABLE;and a large table t (100M+ records) with columns a, b, c and geom running on PG 11, on spinning disks with 64GB memory and 28 cores.When I create a simple geom index with CREATE INDEX ON t USING gist(geom) it finishes in about an hour, but when I create a partial index using these 2 functionsCREATE INDEX ON t USING gist(geom) WHERE bar(foo(a,b,c),geom)<12 it takes over 20 hours...Is that because I'm using functions in the WHERE clause, or because my CASE lists are quite long, or both?
How long does SELECT a, b, c, foo(a, b, c) from blarg; take?
Ditto SELECT a, b, c, foo(a, b, c), bar(foo(a, b, c)) from blarg;
Ditto SELECT a, b, c, foo(a, b, c) from blarg where bar(foo(a, b, c)) < 12;
That'll narrow the problem.
Is there any way to speed up that index creation? Is upgrading to a newer postgres a viable option (so the JITTER can speed up the functions)?Paul
--
Angular momentum makes the world go 'round.
Angular momentum makes the world go 'round.
On Tue, Feb 16, 2021 at 07:30:23PM +0100, Paul van der Linden wrote: > Hi, > I have 2 functions: > CREATE FUNCTION foo(a text, b text, c text) RETURNS text AS > $func$ > DECLARE > retVal text; > BEGIN > SELECT > CASE > WHEN a='v1' AND b='b1' THEN 'r1' > WHEN a='v1' THEN 'r2' > ... snip long list containing various tests on a,b and c > WHEN a='v50' THEN 'r50' > END INTO retval; > RETURN retVal; > END > $func$ LANGUAGE PLPGSQL IMMUTABLE; If this function was converted to SQL function it could be faster, as it could be inlined. > CREATE FUNCTION bar(r text, geom geometry) RETURNS int AS > $func$ > DECLARE > retVal int; > BEGIN > SELECT > CASE > WHEN r='r1' AND st_area(geom)>100 THEN 1 > WHEN r='r1' THEN 2 > ... snip long list containing various tests on r and st_area(geom) > WHEN r='r50' THEN 25 > END INTO retval; > RETURN retVal; > END > $func$ LANGUAGE PLPGSQL IMMUTABLE; First thing that I notice is that it seems (from the code and comment) that you run st_area(geom) multiple times. If that's really the case - why don't you cache it in some variable? declare v_area float := st_area( geom ); begin ... and then use v_area instead of st_area(geom) depesz
The st_area calculation is done mostly once or sometimes twice for each geom, and I suspect that can't explain the factor 20 slower.
Creating an index with only one st_area calculation is also done rather quickly.
On Wed, Feb 17, 2021 at 7:48 PM hubert depesz lubaczewski <depesz@depesz.com> wrote:
On Tue, Feb 16, 2021 at 07:30:23PM +0100, Paul van der Linden wrote:
> Hi,
> I have 2 functions:
> CREATE FUNCTION foo(a text, b text, c text) RETURNS text AS
> $func$
> DECLARE
> retVal text;
> BEGIN
> SELECT
> CASE
> WHEN a='v1' AND b='b1' THEN 'r1'
> WHEN a='v1' THEN 'r2'
> ... snip long list containing various tests on a,b and c
> WHEN a='v50' THEN 'r50'
> END INTO retval;
> RETURN retVal;
> END
> $func$ LANGUAGE PLPGSQL IMMUTABLE;
If this function was converted to SQL function it could be faster, as it
could be inlined.
> CREATE FUNCTION bar(r text, geom geometry) RETURNS int AS
> $func$
> DECLARE
> retVal int;
> BEGIN
> SELECT
> CASE
> WHEN r='r1' AND st_area(geom)>100 THEN 1
> WHEN r='r1' THEN 2
> ... snip long list containing various tests on r and st_area(geom)
> WHEN r='r50' THEN 25
> END INTO retval;
> RETURN retVal;
> END
> $func$ LANGUAGE PLPGSQL IMMUTABLE;
First thing that I notice is that it seems (from the code and comment)
that you run st_area(geom) multiple times.
If that's really the case - why don't you cache it in some variable?
declare
v_area float := st_area( geom );
begin
...
and then use v_area instead of st_area(geom)
depesz
On Wed, Feb 17, 2021 at 08:40:17PM +0100, Paul van der Linden wrote: > The st_area calculation is done mostly once or sometimes twice for each geom, and I suspect that can't explain the factor20 slower. > Creating an index with only one st_area calculation is also done rather quickly. In this case, make small test case, like 1000 rows, or something like this. run create index without where, and then remake the functions with profiling info, for example using this approach: https://www.depesz.com/2010/03/18/profiling-stored-proceduresfunctions/ and check for yourself what takes this time, and if it's something you can fix. Best regards, depesz
https://www.depesz.com/2010/03/18/profiling-stored-proceduresfunctions/
Thanks for this reference. I enjoy your blog, but haven't made the time to read all the archives somehow. Stuff doesn't stick very well when it isn't yet "needed" info besides.
I have seen overhead from 'raise notice' in small functions that are sometimes called many thousands of times in a single query, but hadn't done the test to verify if the same overhead still exists for raise debug or another level below both client_min_messages and log_min_messages. Using your examples, I saw about .006 ms for each call to RAISE DEBUG with a client/log_min as notice/warning.
On Thu, Feb 18, 2021 at 10:24:25AM -0700, Michael Lewis wrote: > [1]https://www.depesz.com/2010/03/18/profiling-stored-proceduresfunctions/ > > Thanks for this reference. I enjoy your blog, but haven't made the time to read all the archives somehow. Stuff doesn'tstick very > well when it isn't yet "needed" info besides. > I have seen overhead from 'raise notice' in small functions that are sometimes called many thousands of times in a singlequery, but > hadn't done the test to verify if the same overhead still exists for raise debug or another level below both client_min_messages > and log_min_messages. Using your examples, I saw about .006 ms for each call to RAISE DEBUG with a client/log_min as notice/warning. Sure, this overhead is definitely possible, but kinda besides the point - there will be some slowdowns in other places, and it will be good to track them. That's why I suggested to do it on small sample of data. Best regards, depesz
On 2021-02-16 19:30:23 +0100, Paul van der Linden wrote: > I have 2 functions: > > CREATE FUNCTION foo(a text, b text, c text) RETURNS text AS > $func$ > DECLARE > retVal text; > BEGIN > SELECT > CASE > ... snip long list containing various tests on a,b and c > END INTO retval; > RETURN retVal; > END > $func$ LANGUAGE PLPGSQL IMMUTABLE; > > CREATE FUNCTION bar(r text, geom geometry) RETURNS int AS > $func$ > DECLARE > retVal int; > BEGIN > SELECT > CASE > WHEN r='r1' AND st_area(geom)>100 THEN 1 > ... snip long list containing various tests on r and st_area(geom) > END INTO retval; > RETURN retVal; > END > $func$ LANGUAGE PLPGSQL IMMUTABLE; > > and a large table t (100M+ records) with columns a, b, c and geom running on PG > 11, on spinning disks with 64GB memory and 28 cores. > > When I create a simple geom index with CREATE INDEX ON t USING gist(geom) it > finishes in about an hour, but when I create a partial index using these 2 > functions > CREATE INDEX ON t USING gist(geom) WHERE bar(foo(a,b,c),geom)<12 it takes over > 20 hours... It has to call these functions for each of those 100M+ lines. So that's about 0.7 ms per line. Is that to be expected for what those functions do? I don't know. As depesz wrote, use smaller test case, like 1000 or 10000 rows. You can test the speed of those functions in isolation. Compare select a, b, c from t; select foo(a, b, c) from t; select bar(foo(a, b, c), geom) from t; That tells you how much time is spent in foo and bar. If the sum is close to those 0.7 ms, you know that you need to make those functions faster. I think SQL doesn't short-circuit, so in your function bar, that st_area(geom) in the first WHEN clause will always be called, regardless of the value of r. You can either call that once (as depesz suggested) or you can avoid calling it by nesting the cases: case when r = 'r1' then case when st_area(geom) > 100 then 1 else 2 end when r = 'r2' then .... end You can then go one step furthe: case when r like 'r_' then -- handle r1 ... r9 when r like 'r1_' then -- handle r10 ... r19 when r like 'r2_' then -- handle r20 ... r29 end That reduces the average number of comparisons from 50+ to 8. Depesz measured an overhead of 0.016ms per log message. That's low enough that you can probably afford a few messages, even if each function invocation only takes a few milliseconds. So definitely try that if you need to know where your functions spend their time. hp -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | hjp@hjp.at | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!"
Attachment
Extra tables is not something that will work out in my workflow...
I've managed to cut the time in half already, but perhaps there's more to be won.
In one of the calculations done on the st_area, I used a log(base,value), but since the only 2-param log function present in postgres takes numerics as parameters, i inserted a typecast to numeric.
Apparently that conversion takes quite some time, rewriting it to log(value)/log(base) (which are present with float params, so no need to convert to numeric) took only half the original time
On Wed, Feb 24, 2021 at 9:37 PM Bjornar Skinnes <bjornar_skinnes@trimble.com> wrote:
Why not create a table with cols a, b, c and d. Where you insert a row for each combination and key and index abc then return d?ons. 24. feb. 2021, 21:15 skrev Paul van der Linden <paul.doskabouter@gmail.com>:Thanks for all the suggestions,When the server is not in use for mission-critical work, I'll definitely going to do some testing based on your ideas.Will let you know what comes out of thatCheers,PaulOn Thu, Feb 18, 2021 at 6:49 PM hubert depesz lubaczewski <depesz@depesz.com> wrote:On Thu, Feb 18, 2021 at 10:24:25AM -0700, Michael Lewis wrote:
> [1]https://www.depesz.com/2010/03/18/profiling-stored-proceduresfunctions/
>
> Thanks for this reference. I enjoy your blog, but haven't made the time to read all the archives somehow. Stuff doesn't stick very
> well when it isn't yet "needed" info besides.
> I have seen overhead from 'raise notice' in small functions that are sometimes called many thousands of times in a single query, but
> hadn't done the test to verify if the same overhead still exists for raise debug or another level below both client_min_messages
> and log_min_messages. Using your examples, I saw about .006 ms for each call to RAISE DEBUG with a client/log_min as notice/warning.
Sure, this overhead is definitely possible, but kinda besides the point
- there will be some slowdowns in other places, and it will be good to
track them.
That's why I suggested to do it on small sample of data.
Best regards,
depesz