Thread: Missing free_var() at end of accum_sum_final()?

Missing free_var() at end of accum_sum_final()?

From
"Joel Jacobson"
Date:
Hi,

I noticed the NumericVar's pos_var and neg_var are not free_var()'d at the end of accum_sum_final().

The potential memory leak seems small, since the function is called only once per sum() per worker (and from a few more places), but maybe they should be free'd anyways for correctness?

/Joel

Re: Missing free_var() at end of accum_sum_final()?

From
Michael Paquier
Date:
On Thu, Feb 16, 2023 at 06:59:13AM +0100, Joel Jacobson wrote:
> I noticed the NumericVar's pos_var and neg_var are not free_var()'d
> at the end of accum_sum_final().
>
> The potential memory leak seems small, since the function is called
> only once per sum() per worker (and from a few more places), but
> maybe they should be free'd anyways for correctness?

Indeed, it is true that any code path of numeric.c that relies on a
NumericVar with an allocation done in its buffer is careful enough to
free it, except for generate_series's SRF where one step of the
computation is done.  I don't see directly why you could not do the
following:
@@ -11973,6 +11973,9 @@ accum_sum_final(NumericSumAccum *accum, NumericVar *result)
    /* And add them together */
    add_var(&pos_var, &neg_var, result);

+   free_var(&pos_var);
+   free_var(&neg_var);
+
    /* Remove leading/trailing zeroes */
    strip_var(result);
--
Michael

Attachment

Re: Missing free_var() at end of accum_sum_final()?

From
"Joel Jacobson"
Date:
On Thu, Feb 16, 2023, at 07:26, Michael Paquier wrote:
> Indeed, it is true that any code path of numeric.c that relies on a
> NumericVar with an allocation done in its buffer is careful enough to
> free it, except for generate_series's SRF where one step of the
> computation is done.  I don't see directly why you could not do the
> following:
> @@ -11973,6 +11973,9 @@ accum_sum_final(NumericSumAccum *accum, 
> NumericVar *result)
>     /* And add them together */
>     add_var(&pos_var, &neg_var, result);
> 
> +   free_var(&pos_var);
> +   free_var(&neg_var);
> +

Thanks for looking and explaining.

I added the free_var() calls after strip_var() to match the similar existing code at the end of sqrt_var().
Not that it matters, but thought it looked nicer to keep them in the same order.

/Joel
Attachment

Re: Missing free_var() at end of accum_sum_final()?

From
Michael Paquier
Date:
On Thu, Feb 16, 2023 at 08:08:37AM +0100, Joel Jacobson wrote:
> I added the free_var() calls after strip_var() to match the similar
> existing code at the end of sqrt_var().
> Not that it matters, but thought it looked nicer to keep them in the
> same order.

WFM.  Thanks!
--
Michael

Attachment

Re: Missing free_var() at end of accum_sum_final()?

From
Andres Freund
Date:
Hi,

On 2023-02-16 15:26:26 +0900, Michael Paquier wrote:
> On Thu, Feb 16, 2023 at 06:59:13AM +0100, Joel Jacobson wrote:
> > I noticed the NumericVar's pos_var and neg_var are not free_var()'d
> > at the end of accum_sum_final().
> > 
> > The potential memory leak seems small, since the function is called
> > only once per sum() per worker (and from a few more places), but
> > maybe they should be free'd anyways for correctness? 
> 
> Indeed, it is true that any code path of numeric.c that relies on a
> NumericVar with an allocation done in its buffer is careful enough to
> free it, except for generate_series's SRF where one step of the
> computation is done.  I don't see directly why you could not do the
> following:
> @@ -11973,6 +11973,9 @@ accum_sum_final(NumericSumAccum *accum, NumericVar *result)
>     /* And add them together */
>     add_var(&pos_var, &neg_var, result);
>  
> +   free_var(&pos_var);
> +   free_var(&neg_var);
> +
>     /* Remove leading/trailing zeroes */
>     strip_var(result);

But why do we need it? Most SQL callable functions don't need to be careful
about not leaking O(1) memory, the exception being functions backing btree
opclasses.

In fact, the detailed memory management often is *more* expensive than just
relying on the calling memory context being reset.

Of course, numeric.c doesn't really seem to have gotten that message, so
there's a consistency argument here.

Greetings,

Andres Freund



Re: Missing free_var() at end of accum_sum_final()?

From
Michael Paquier
Date:
On Thu, Feb 16, 2023 at 01:35:54PM -0800, Andres Freund wrote:
> But why do we need it? Most SQL callable functions don't need to be careful
> about not leaking O(1) memory, the exception being functions backing btree
> opclasses.
>
> In fact, the detailed memory management often is *more* expensive than just
> relying on the calling memory context being reset.
>
> Of course, numeric.c doesn't really seem to have gotten that message, so
> there's a consistency argument here.

I don't know which final result is better.  The arguments go two ways:
1) Should numeric.c be simplified so as its memory structure with extra
pfree()s, making it more consistent with more global assumptions than
just this file?  This has the disadvantage of creating more noise in
backpatching, while increasing the risk of leaks if some of the
removed parts are allocated in a tight loop within the same context.
This makes memory management less complicated.  That's how I am
understanding your point.
2) Should the style within numeric.c be more consistent?  This is how
I am understanding this proposal.  As you quote, this makes memory
management more complicated (not convinced about that for the internal
of numerics?), while making the file more consistent.

At the end, perhaps that's not worth bothering, but 2) prevails when
it comes to the rule of making some code consistent with its
surroundings.  1) has more risks seeing how old this code is.
--
Michael

Attachment

Re: Missing free_var() at end of accum_sum_final()?

From
Andres Freund
Date:
Hi,

On 2023-02-17 11:48:14 +0900, Michael Paquier wrote:
> On Thu, Feb 16, 2023 at 01:35:54PM -0800, Andres Freund wrote:
> > But why do we need it? Most SQL callable functions don't need to be careful
> > about not leaking O(1) memory, the exception being functions backing btree
> > opclasses.
> > 
> > In fact, the detailed memory management often is *more* expensive than just
> > relying on the calling memory context being reset.
> > 
> > Of course, numeric.c doesn't really seem to have gotten that message, so
> > there's a consistency argument here.
> 
> I don't know which final result is better.  The arguments go two ways:
> 1) Should numeric.c be simplified so as its memory structure with extra
> pfree()s, making it more consistent with more global assumptions than
> just this file?  This has the disadvantage of creating more noise in
> backpatching, while increasing the risk of leaks if some of the
> removed parts are allocated in a tight loop within the same context.
> This makes memory management less complicated.  That's how I am
> understanding your point.

It's not just simplification, it's just faster to free via context reset. I
whipped up a random query exercising numeric math a bunch:

SELECT max(a + b + '17'::numeric + c) FROM (SELECT generate_series(1::numeric, 1000::numeric)) aa(a), (SELECT
generate_series(1::numeric,100::numeric)) bb(b), (SELECT generate_series(1::numeric, 10::numeric)) cc(c);
 

Removing the free_var()s from numeric_add_opt_error() speeds it up from ~361ms
to ~338ms.


This code really needs some memory management overhead reduction love. Many
allocation could be avoided by having a small on-stack "buffer" that's used
unless the numeric is large.


> 2) Should the style within numeric.c be more consistent?  This is how
> I am understanding this proposal.  As you quote, this makes memory
> management more complicated (not convinced about that for the internal
> of numerics?), while making the file more consistent.

> At the end, perhaps that's not worth bothering, but 2) prevails when
> it comes to the rule of making some code consistent with its
> surroundings.  1) has more risks seeing how old this code is.

I'm a bit wary that this will trigger a stream of patches to pointlessly free
things, causing churn and slowdowns. I suspect there's other places in
numeric.c where we don't free, and there certainly are a crapton in other
functions.

Greetings,

Andres Freund



Re: Missing free_var() at end of accum_sum_final()?

From
"Joel Jacobson"
Date:
On Fri, Feb 17, 2023, at 21:26, Andres Freund wrote:
> Removing the free_var()s from numeric_add_opt_error() speeds it up from ~361ms
> to ~338ms.

I notice numeric_add_opt_error() is extern and declared in numeric.h,
called from e.g. timestamp.c and jsonpath_exec.c. Is that a problem,
i.e. is there a risk it could be used in a for loop by some code outside Numeric?

> This code really needs some memory management overhead reduction love. Many
> allocation could be avoided by having a small on-stack "buffer" that's used
> unless the numeric is large.

Nice idea!
Could something like the attached patch work?

/Joel
Attachment

Re: Missing free_var() at end of accum_sum_final()?

From
"Joel Jacobson"
Date:
Hi again,

Ignore previous patch, new correct version attached, that also keeps track of if the buf-field is in use or not.

Seems like your idea gives a signifiant speed-up!
On my machine, numeric_in() is 22% faster!

I ran a benchmark with 100 tests measuring execution-time of numeric_in() with two significant digits time precision.

Benchmark:

CREATE EXTENSION pit;
SELECT count(pit.async('numeric_in','{1234,0,-1}',2)) FROM generate_series(1,100);
CALL pit.work(true);
 SELECT
    format('%s(%s)',pit.test_params.function_name, array_to_string(pit.test_params.input_values,',')),
    pit.pretty_time(pit.tests.final_result),
    count(*)
FROM pit.tests
JOIN pit.test_params USING (id)
GROUP BY 1,2
ORDER BY 1,2;

HEAD:

        format         | pretty_time | count
-----------------------+-------------+-------
 numeric_in(1234,0,-1) | 31 ns       |    34
 numeric_in(1234,0,-1) | 32 ns       |    66
(2 rows)

0002-fixed-buf.patch:

        format         | pretty_time | count
-----------------------+-------------+-------
 numeric_in(1234,0,-1) | 24 ns       |     4
 numeric_in(1234,0,-1) | 25 ns       |    71
 numeric_in(1234,0,-1) | 26 ns       |    25
(3 rows)


The benchmark results was produced with: https://github.com/joelonsql/pg-timeit

Now, the question is how big of a fixed_buf we want. I will run some more benchmarks.

/Joel

On Sun, Feb 19, 2023, at 20:54, Joel Jacobson wrote:
> On Fri, Feb 17, 2023, at 21:26, Andres Freund wrote:
>> Removing the free_var()s from numeric_add_opt_error() speeds it up from ~361ms
>> to ~338ms.
>
> I notice numeric_add_opt_error() is extern and declared in numeric.h,
> called from e.g. timestamp.c and jsonpath_exec.c. Is that a problem,
> i.e. is there a risk it could be used in a for loop by some code 
> outside Numeric?
>
>> This code really needs some memory management overhead reduction love. Many
>> allocation could be avoided by having a small on-stack "buffer" that's used
>> unless the numeric is large.
>
> Nice idea!
> Could something like the attached patch work?
>
> /Joel
> Attachments:
> * 0001-fixed-buf.patch

-- 
Kind regards,

Joel
Attachment

Re: Missing free_var() at end of accum_sum_final()?

From
"Joel Jacobson"
Date:
On Sun, Feb 19, 2023, at 23:16, Joel Jacobson wrote:
> Hi again,
>
> Ignore previous patch, new correct version attached, that also keeps 
> track of if the buf-field is in use or not.

Ops! One more thinko, detected when trying fixed_buf[8], which caused a seg fault.

To fix, a init_var() in alloc_var() is needed when we will use the fixed_buf instead of allocating,
since alloc_var() could be called in a loop for existing values, like in sqrt_var() for instance.

Attached new version produces similar benchmark results, even with the extra init_var():

HEAD:

        format         | pretty_time | count
-----------------------+-------------+-------
 numeric_in(1234,0,-1) | 31 ns       |    74
 numeric_in(1234,0,-1) | 32 ns       |    26
(2 rows)

0003-fixed-buf.patch:

        format         | pretty_time | count
-----------------------+-------------+-------
 numeric_in(1234,0,-1) | 24 ns       |     1
 numeric_in(1234,0,-1) | 25 ns       |    84
 numeric_in(1234,0,-1) | 26 ns       |    15
(3 rows)

/Joel
Attachment

Re: Missing free_var() at end of accum_sum_final()?

From
Michael Paquier
Date:
On Sun, Feb 19, 2023 at 11:55:38PM +0100, Joel Jacobson wrote:
> To fix, a init_var() in alloc_var() is needed when we will use the
> fixed_buf instead of allocating,
> since alloc_var() could be called in a loop for existing values,
> like in sqrt_var() for instance.
>
> Attached new version produces similar benchmark results, even with the extra init_var():

Perhaps you should register this patch to the commit of March?  Here
it is:
https://commitfest.postgresql.org/42/
--
Michael

Attachment

Re: Missing free_var() at end of accum_sum_final()?

From
"Joel Jacobson"
Date:
On Mon, Feb 20, 2023, at 08:38, Michael Paquier wrote:
> Perhaps you should register this patch to the commit of March?  Here
> it is:
> https://commitfest.postgresql.org/42/

Thanks, done.

/Joel



Re: Missing free_var() at end of accum_sum_final()?

From
Dean Rasheed
Date:
On Mon, 20 Feb 2023 at 08:03, Joel Jacobson <joel@compiler.org> wrote:
>
> On Mon, Feb 20, 2023, at 08:38, Michael Paquier wrote:
> > Perhaps you should register this patch to the commit of March?  Here
> > it is:
> > https://commitfest.postgresql.org/42/
>
> Thanks, done.
>

I have been testing this a bit, and I get less impressive results than
the ones reported so far.

Testing Andres' example:

SELECT max(a + b + '17'::numeric + c)
  FROM (SELECT generate_series(1::numeric, 1000::numeric)) aa(a),
       (SELECT generate_series(1::numeric, 100::numeric)) bb(b),
       (SELECT generate_series(1::numeric, 10::numeric)) cc(c);

with HEAD, I get:

Time: 216.978 ms
Time: 215.376 ms
Time: 214.973 ms
Time: 216.288 ms
Time: 216.494 ms

and removing the free_var() from numeric_add_opt_error() I get:

Time: 212.706 ms
Time: 212.684 ms
Time: 211.378 ms
Time: 213.383 ms
Time: 213.050 ms

That's 1-2% faster, not the 6-7% that Andres saw.

Testing the same example with the latest 0003-fixed-buf.patch, I get:

Time: 224.115 ms
Time: 225.382 ms
Time: 225.691 ms
Time: 224.135 ms
Time: 225.412 ms

which is now 4% slower.

I think the problem is that if you increase the size of NumericVar,
you increase the stack space required, as well as adding some overhead
to alloc_var(). Also, primitive operations like add_var() directly
call digitbuf_alloc(), so as it stands, they don't benefit from the
static buffer. Also, I'm not convinced that a 4-digit static buffer
would really be of much benefit to many numeric computations anyway.

To try to test the real-world benefit to numeric_in(), I re-ran one of
the tests I used while testing the non-decimal integer patches, using
COPY to read a large number of random numerics from a file:

CREATE TEMP TABLE foo(c1 numeric, c2 numeric, c3 numeric, c4 numeric,
                      c5 numeric, c6 numeric, c7 numeric, c8 numeric,
                      c9 numeric, c10 numeric);

INSERT INTO foo
  SELECT trim_scale(round(random()::numeric*1e4, 4)),
         trim_scale(round(random()::numeric*1e4, 4)),
         trim_scale(round(random()::numeric*1e4, 4)),
         trim_scale(round(random()::numeric*1e4, 4)),
         trim_scale(round(random()::numeric*1e4, 4)),
         trim_scale(round(random()::numeric*1e4, 4)),
         trim_scale(round(random()::numeric*1e4, 4)),
         trim_scale(round(random()::numeric*1e4, 4)),
         trim_scale(round(random()::numeric*1e4, 4)),
         trim_scale(round(random()::numeric*1e4, 4))
  FROM generate_series(1,10000000);
COPY foo TO '/tmp/random-numerics.csv';

\timing on
TRUNCATE foo; COPY foo FROM '/tmp/random-numerics.csv';
TRUNCATE foo; COPY foo FROM '/tmp/random-numerics.csv';
TRUNCATE foo; COPY foo FROM '/tmp/random-numerics.csv';
TRUNCATE foo; COPY foo FROM '/tmp/random-numerics.csv';
TRUNCATE foo; COPY foo FROM '/tmp/random-numerics.csv';

With HEAD, this gave:

Time: 10750.298 ms (00:10.750)
Time: 10746.248 ms (00:10.746)
Time: 10772.277 ms (00:10.772)
Time: 10758.282 ms (00:10.758)
Time: 10760.425 ms (00:10.760)

and with 0003-fixed-buf.patch, it gave:

Time: 10623.254 ms (00:10.623)
Time: 10463.814 ms (00:10.464)
Time: 10461.700 ms (00:10.462)
Time: 10429.436 ms (00:10.429)
Time: 10438.359 ms (00:10.438)

So that's a 2-3% gain, which might be worthwhile, if not for the
slowdown in the other case.

I actually had a slightly different idea to improve numeric.c's memory
management, which gives a noticeable improvement for a few of the
simple numeric functions:

A common pattern in these numeric functions is to allocate memory for
the NumericVar's digit buffer while doing the computation, allocate
more memory for the Numeric result, copy the digits across, and then
free the NumericVar's digit buffer.

That can be reduced to just 1 palloc() and no pfree()'s by ensuring
that the initial allocation is large enough to hold the final Numeric
result, and then re-using that memory instead of allocating more. That
can be applied to all the numeric functions, saving a palloc() and a
pfree() in each case, and it fits quite well with the way
make_result() is used in all but one case (generate_series() needs to
be a little more careful to avoid trampling on the digit buffer of the
current value).

In Andres' generate_series() example, this gave:

Time: 203.838 ms
Time: 206.623 ms
Time: 204.672 ms
Time: 202.434 ms
Time: 204.893 ms

which is around 5-6% faster.

In the COPY test, it gave:

Time: 10511.293 ms (00:10.511)
Time: 10504.831 ms (00:10.505)
Time: 10521.736 ms (00:10.522)
Time: 10513.039 ms (00:10.513)
Time: 10511.979 ms (00:10.512)

which is around 2% faster than HEAD, and around 0.3% slower than
0003-fixed-buf.patch

None of this had any noticeable impact on the time to run the
regression tests, and I tried a few other simple examples, but it was
difficult to get consistent results, above the normal variation of the
test timings.

TBH, I'm yet to be convinced that any of this is actually worthwhile.
We might shave a few percent off some simple numeric operations, but I
doubt it will make much difference to more complex computations. I'd
need to see some more realistic test results, or some real evidence of
palloc/pfree causing significant overhead in a numeric computation.

Regards,
Dean

Attachment

Re: Missing free_var() at end of accum_sum_final()?

From
"Joel Jacobson"
Date:
Hi,

I found another small but significant improvement of the previous patch:

else if (ndigits < var->buf_len)
{
-    memset(var->buf, 0, var->buf_len);
+    var->buf[0] = 0;
    var->digits = var->buf + 1;
    var->ndigits = ndigits;
}


We don't need to set all buf elements to zero, only the first one.
This is not an improvement of HEAD, it's just a mistake I made in my previous patch.


COPY foo FROM '/tmp/random-numerics.csv';

HEAD:

Time: 8431.325 ms (00:08.431)
Time: 8424.749 ms (00:08.425)
Time: 8425.387 ms (00:08.425)
Time: 8519.869 ms (00:08.520)
Time: 8452.585 ms (00:08.453)

0004-fixed-buf.patch:
Time: 8539.475 ms (00:08.539)
Time: 8401.628 ms (00:08.402)
Time: 8399.440 ms (00:08.399)
Time: 8373.861 ms (00:08.374)
Time: 8388.002 ms (00:08.388)

0005-fixed-buf.patch:

Time: 8038.218 ms (00:08.038)
Time: 8082.898 ms (00:08.083)
Time: 7999.950 ms (00:08.000)
Time: 8039.640 ms (00:08.040)
Time: 7994.816 ms (00:07.995)

Almost half a second faster!

/Joel
Attachment

Re: Missing free_var() at end of accum_sum_final()?

From
"Joel Jacobson"
Date:
Hi,

My apologies, it seems my email didn't reach the list, probably due to the
benchmark plot images being to large. Here is the email again, but with
URLs to the images instead, and benchmark updated with results for the
0005-fixed-buf.patch.

--

On Mon, Feb 20, 2023, at 12:32, Dean Rasheed wrote:
> TBH, I'm yet to be convinced that any of this is actually worthwhile.
> We might shave a few percent off some simple numeric operations, but I
> doubt it will make much difference to more complex computations. I'd
> need to see some more realistic test results, or some real evidence of
> palloc/pfree causing significant overhead in a numeric computation.

Thanks for testing! Good point, I agree with your conclusion;
the small change to alloc_var() suggested in 0003-fixed-buf.patch probably
didn't provide enough speedups to motivate the change.

In the new attached patch, Andres fixed buffer idea has been implemented
throughout the entire numeric.c code base.

CHANGES:
--------

* Instead of a bool, we now have a buf_len struct field, to keep track of the
allocated capacity, which can be different from the numbers of digits
currently used. buf_len == 0 indicates no memory is allocated, and fixed_buf
is possibly used instead.

* alloc_var() now reuses the allocated buffer if there is one big enough.

* free_var() and zero_var() only call digitbuf_free() if they need to.

* set_var_from_var() now also uses the fixed buffer and also reuses the
allocated buffer if there is one and it's big enough.

* To allow the NumericVar's on the stack to be used without having to allocate
digits buf, we have to change callers e.g. add_var(), div_var(), etc, to ensure
the result variable isn't the same as one of the operand NumericVars.
This wasn't necessary before, since we always allocated a new digits buf for the
result, which could then be assigned to the digits field when done with
computations. However, this prevented us from relying solely on the existing
NumericVar stack variables, so we needed a few new temporary NumericVar stack
variables, to hold intermediate results, and set_var_from_var() to copy the
operand into the temp var.

Assert()'s have been added to such functions, add_abs(), sub_abs(), div_var(),
div_var_fast(), div_var_int64() and div_var_int64(), that enforce result being
a different object than the two operands.

Here is one example from ceil_var() on this from the new 0004-fixed-buf.patch:

HEAD:

    if (var->sign == NUMERIC_POS && cmp_var(var, &tmp) != 0)
        add_var(&tmp, &const_one, &tmp);
    set_var_from_var(&tmp, result);

The add_var() is a problem since &tmp is both the first operand and the result.
Funnily enough, the fix in this particular case, and in floor_var(),
is simpler and should be faster:

0005-fixed-buf.patch:

    if (var->sign == NUMERIC_POS && cmp_var(var, &tmp) != 0)
        add_var(&tmp, &const_one, result);
    else
        set_var_from_var(&tmp, result);

This is avoids the set_var_from_var() if the if-branch is taken, as the result
is written directly to result.

Another example from sqrt_var():

HEAD:

    add_var(&q_var, &a1_var, &q_var);

0005-fixed-buf.patch:

    set_var_from_var(&q_var, &tmp_var);
    add_var(&tmp_var, &a1_var, &q_var);

The extra set_var_from_var() seems to be a net-win in many cases,
except for the generate_series() example which doesn't have any
NumericVar's on the stack, except for the first iteration.

BENCHMARK:
----------

SELECT max(a + b + '17'::numeric + c)
FROM
(SELECT generate_series(1::numeric, 1000::numeric)) aa(a),
(SELECT generate_series(1::numeric, 100::numeric)) bb(b),
(SELECT generate_series(1::numeric, 10::numeric)) cc(c);

HEAD:

Time: 158.698 ms
Time: 142.486 ms
Time: 141.443 ms
Time: 142.044 ms
Time: 141.651 ms

0005-fixed-buf.patch:

Time: 156.371 ms
Time: 149.857 ms
Time: 150.674 ms
Time: 150.409 ms
Time: 150.461 ms

SELECT setseed(0.1234);
CREATE TABLE t (n1 numeric, n2 numeric, n3 numeric, n4 numeric);
INSERT INTO t (n1, n2, n3, n4)
SELECT
    round(random()::numeric,2),
    round(random()::numeric*10,2),
    round(random()::numeric*100,2),
    round(random()::numeric*1000,2)
FROM generate_series(1,1e7);
CHECKPOINT;
SELECT SUM(n1+n2+n3+n4) FROM t;

HEAD:

Time: 758.489 ms
Time: 646.794 ms
Time: 643.237 ms
Time: 642.620 ms
Time: 646.218 ms

0005-fixed-buf.patch:

Time: 748.093 ms
Time: 628.799 ms
Time: 629.853 ms
Time: 629.166 ms
Time: 627.768 ms


CREATE TABLE ledger (amount numeric);
INSERT INTO ledger (amount)
SELECT generate_series(-100000.00,100000,0.01);
CHECKPOINT;
SELECT SUM(amount*1.25 + 0.5) FROM ledger;

HEAD:

Time: 1113.080 ms (00:01.113)
Time: 931.998 ms
Time: 931.009 ms
Time: 932.476 ms
Time: 933.509 ms

0005-fixed-buf.patch:

Time: 1067.298 ms (00:01.067)
Time: 883.972 ms
Time: 880.165 ms
Time: 882.465 ms
Time: 893.646 ms

CREATE TEMP TABLE foo(c1 numeric, c2 numeric, c3 numeric, c4 numeric,
                      c5 numeric, c6 numeric, c7 numeric, c8 numeric,
                      c9 numeric, c10 numeric);

INSERT INTO foo
  SELECT trim_scale(round(random()::numeric*1e4, 4)),
         trim_scale(round(random()::numeric*1e4, 4)),
         trim_scale(round(random()::numeric*1e4, 4)),
         trim_scale(round(random()::numeric*1e4, 4)),
         trim_scale(round(random()::numeric*1e4, 4)),
         trim_scale(round(random()::numeric*1e4, 4)),
         trim_scale(round(random()::numeric*1e4, 4)),
         trim_scale(round(random()::numeric*1e4, 4)),
         trim_scale(round(random()::numeric*1e4, 4)),
         trim_scale(round(random()::numeric*1e4, 4))
  FROM generate_series(1,10000000);
COPY foo TO '/tmp/random-numerics.csv';

TRUNCATE foo; COPY foo FROM '/tmp/random-numerics.csv';

HEAD:

Time: 8515.644 ms (00:08.516)
Time: 8405.150 ms (00:08.405)
Time: 8399.067 ms (00:08.399)
Time: 8678.949 ms (00:08.679)
Time: 8388.152 ms (00:08.388)

0005-fixed-buf.patch:

Time: 8255.290 ms (00:08.255)
Time: 7986.409 ms (00:07.986)
Time: 8005.748 ms (00:08.006)
Time: 8004.352 ms (00:08.004)
Time: 8160.537 ms (00:08.161)


FIXED_BUF_LEN:
--------------

In 0005-fixed-buf.patch, the new FIXED_BUF_LEN def has been set to 8.

I've benchmarked other values as well, 2, 4, 16, 32, but 8 seems to be the sweet
spot. div_var() would benefit from FIXED_BUF_LEN 16 though.

Here is the test results using different FIXED_BUF_LEN values.

Pardon the images, but it's difficult to textify the two dimensional results,
as we need to vary the digit length of both operands.

This first image shows the execution time for numeric_add() with
operands up to 20 decdigits:


https://gist.githubusercontent.com/joelonsql/59bd2642d577fe4aaf2fb8b1ab7f67c4/raw/d99d8fdc6f34dbed255f310dfc998a429117301a/up-to-20-digits-add.pdf.png

The color scale is execution_time, where more reddish/hotter means slower,
and more blueish/cooler means faster.

As we can see, it gets notably cooler already with FIXED_BUF_LEN 4,
up to 8 decdigits, but it also gets a bit hotter for larger numbers.
With FIXED_BUF_LEN 8, it's cooler both for small numbers,
but it's also cooler for larger numbers.

If instead looking at numeric_div(),
we can see how FIXED_BUF_LEN 16 would be an improvement:


https://gist.githubusercontent.com/joelonsql/59bd2642d577fe4aaf2fb8b1ab7f67c4/raw/d99d8fdc6f34dbed255f310dfc998a429117301a/up-to-20-digits-div.pdf.png

The plot for numeric_mul() is a bit more difficult to read,
but we can see some improvement already at FIXED_BUF_LEN 4:


https://gist.githubusercontent.com/joelonsql/59bd2642d577fe4aaf2fb8b1ab7f67c4/raw/d99d8fdc6f34dbed255f310dfc998a429117301a/up-to-20-digits-mul.pdf.png

Note the scales are different for all these three plots.

In the plot below, the scale is the same for all three operators,
which can be nice to understand their relative execution time:


https://gist.githubusercontent.com/joelonsql/59bd2642d577fe4aaf2fb8b1ab7f67c4/raw/d99d8fdc6f34dbed255f310dfc998a429117301a/up-to-20-digits-overview.pdf.png

In these plots we have only studied operands with up to 20 decdigits.

For completion, here is plots up to 200 decdigits:


https://gist.githubusercontent.com/joelonsql/59bd2642d577fe4aaf2fb8b1ab7f67c4/raw/d99d8fdc6f34dbed255f310dfc998a429117301a/up-to-200-digits-overview.pdf.png

As we can see, there is no significant observable difference, as expected,
since the fixed buffer only improves moderately big numbers.

And finally, here is a plot of up to 131072 decdigits:


https://gist.githubusercontent.com/joelonsql/59bd2642d577fe4aaf2fb8b1ab7f67c4/raw/d99d8fdc6f34dbed255f310dfc998a429117301a/full-range-overview.pdf.png

Some additional plots can be viewed at the end of this gist:
https://gist.github.com/joelonsql/59bd2642d577fe4aaf2fb8b1ab7f67c4

It would be nice to avoid the additional tmp vars, as it glutters the interface.
One idea maybe could be to use an additional struct field for the writing of the result.
Or at least add helper-functions to avoid an extra line of code for the affected places in the code.

/Joel

Attachment

Re: Missing free_var() at end of accum_sum_final()?

From
Peter Eisentraut
Date:
On 20.02.23 23:16, Joel Jacobson wrote:
> In the new attached patch, Andres fixed buffer idea has been implemented
> throughout the entire numeric.c code base.

I think the definition of the "preinitialized constants" could be 
adapted to this as well.  For example, instead of

     static const NumericDigit const_one_data[1] = {1};
     static const NumericVar const_one =
     {1, 0, NUMERIC_POS, 0, NULL, (NumericDigit *) const_one_data};

it could be something like

     static const NumericVar const_one =
     {0, 0, NUMERIC_POS, 0, NULL, NULL, 1, {1}};

Or perhaps with designators:

     static const NumericVar const_one =
     {.sign = NUMERIC_POS, .buflen = 1, .fixed_buf = {1}};




Re: Missing free_var() at end of accum_sum_final()?

From
Dean Rasheed
Date:
> On 20.02.23 23:16, Joel Jacobson wrote:
> > In the new attached patch, Andres fixed buffer idea has been implemented
> > throughout the entire numeric.c code base.
>

I have been going over this patch, and IMO it's far too invasive for
the fairly modest performance gains (and performance regressions in
some cases) that it gives (which seem to be somewhat smaller on my
machine).

One code change that I am particularly opposed to is changing all the
low-level functions like add_var(), mul_var(), etc., so that they no
longer accept the result being the same variable as any of the inputs.
That is a particularly convenient thing to be able to do, and without
it, various functions become more complex and less readable, and have
to resort to using more temporary variables.

I actually find the whole business of attaching a static buffer and
new buf_len fields to NumericVar quite ugly, and the associated extra
complexity in alloc_var(), free_var(), zero_var(), and
set_var_from_var() is all part of that. Now that might be worth it, if
this gave significant performance gains across the board, but the
trouble is it doesn't. AFAICS, it seems to be just as likely to
degrade performance. For example:

SELECT sqrt(6*sum(1/x/x)) FROM generate_series(1::numeric
,10000000::numeric) g(x);

is consistently 1-2% slower for me, with this patch. That's not much,
but then neither are most of the gains. In a lot of cases, it's so
close to the level of noise that I don't think most users will notice
one way or the other.

So IMO the results just don't justify such an extensive set of
changes, and I think we should abandon this fixed buffer approach.

Having said that, I think the results from the COPY test are worth
looking at more closely. Your results seem to suggest around a 5%
improvement. On my machine it was only around 3%, but that still might
be worth having, if it didn't involve such invasive changes throughout
the rest of the code.

As an experiment, I took another look at my earlier patch, making
make_result() construct the result using the same allocated memory as
the variable's digit buffer (if allocated). That eliminates around a
third of all free_var() calls from numeric.c, and for most functions,
it saves both a palloc() and a pfree(). In the case of numeric_in(), I
realised that it's possible to go further, by reusing the decimal
digits buffer for the NumericVar's digits, and then later for the
final Numeric result. Also, by carefully aligning things, it's
possible to arrange it so that the final make_result() doesn't need to
copy/move the digits at all. With that I get something closer to a 15%
improvement in the COPY test, which is definitely worth having.

In the pi series above, it gave a 3-4% performance improvement, and
that seemed to be a common pattern across a number of other tests.
It's also a much less invasive change, since it's only really changing
make_result(), which makes the knock-on effects much more manageable,
and reduces the chances of any performance regressions.

I didn't do all the tests that you did though, so it would be
interesting to see how it fares in those.

Regards,
Dean

Attachment

Re: Missing free_var() at end of accum_sum_final()?

From
"Joel Jacobson"
Date:
On Fri, Mar 3, 2023, at 16:11, Dean Rasheed wrote:
> So IMO the results just don't justify such an extensive set of
> changes, and I think we should abandon this fixed buffer approach.

I agree. I was hoping it would be possible to reduce the invasiveness,
but I think it's difficult and probably not worth it.

> Having said that, I think the results from the COPY test are worth
> looking at more closely. Your results seem to suggest around a 5%
> improvement. On my machine it was only around 3%, but that still might
> be worth having, if it didn't involve such invasive changes throughout
> the rest of the code.
>
> As an experiment, I took another look at my earlier patch, making
> make_result() construct the result using the same allocated memory as
> the variable's digit buffer (if allocated). That eliminates around a
> third of all free_var() calls from numeric.c, and for most functions,
> it saves both a palloc() and a pfree(). In the case of numeric_in(), I
> realised that it's possible to go further, by reusing the decimal
> digits buffer for the NumericVar's digits, and then later for the
> final Numeric result. Also, by carefully aligning things, it's
> possible to arrange it so that the final make_result() doesn't need to
> copy/move the digits at all. With that I get something closer to a 15%
> improvement in the COPY test, which is definitely worth having.

Nice! Patch LGTM.

> I didn't do all the tests that you did though, so it would be
> interesting to see how it fares in those.

SELECT count(*) FROM generate_series(1::numeric, 10000000::numeric);
Time: 1196.801 ms (00:01.197) -- HEAD
Time: 1278.376 ms (00:01.278) -- make-result-using-vars-buf-v2.patch

TRUNCATE foo; COPY foo FROM '/tmp/random-numerics.csv';
Time: 8450.551 ms (00:08.451) -- HEAD
Time: 7176.838 ms (00:07.177) -- make-result-using-vars-buf-v2.patch

SELECT SUM(n1+n2+n3+n4) FROM t;
Time: 643.961 ms -- HEAD
Time: 620.303 ms -- make-result-using-vars-buf-v2.patch

SELECT max(a + b + '17'::numeric + c)
FROM
(SELECT generate_series(1::numeric, 1000::numeric)) aa(a),
(SELECT generate_series(1::numeric, 100::numeric)) bb(b),
(SELECT generate_series(1::numeric, 10::numeric)) cc(c);
Time: 141.070 ms -- HEAD
Time: 139.562 ms -- make-result-using-vars-buf-v2.patch

SELECT SUM(amount*1.25 + 0.5) FROM ledger;
Time: 933.461 ms -- HEAD
Time: 862.619 ms -- make-result-using-vars-buf-v2.patch

Looks like a win in all cases except the first one.

Great work!

/Joel



Re: Missing free_var() at end of accum_sum_final()?

From
"Joel Jacobson"
Date:
On Fri, Mar 3, 2023, at 16:11, Dean Rasheed wrote:
> Attachments:
> * make-result-using-vars-buf-v2.patch

One suggestion: maybe add a comment explaining why the allocated buffer
which size is based on strlen(cp) for the decimal digit values,
is guaranteed to be large enough also for the result's digit buffer?

I.e. some kind of proof why

   (NUMERIC_HDRSZ + strlen(cp) + DEC_DIGITS * 2) >= ((ndigits + 1) * sizeof(NumericDigit)))

holds true in general.

/Joel



Re: Missing free_var() at end of accum_sum_final()?

From
Peter Eisentraut
Date:
On 05.03.23 09:53, Joel Jacobson wrote:
> On Fri, Mar 3, 2023, at 16:11, Dean Rasheed wrote:
>> Attachments:
>> * make-result-using-vars-buf-v2.patch
> 
> One suggestion: maybe add a comment explaining why the allocated buffer
> which size is based on strlen(cp) for the decimal digit values,
> is guaranteed to be large enough also for the result's digit buffer?
> 
> I.e. some kind of proof why
> 
>     (NUMERIC_HDRSZ + strlen(cp) + DEC_DIGITS * 2) >= ((ndigits + 1) * sizeof(NumericDigit)))
> 
> holds true in general.

It looks like this thread has fizzled out, and no one is really working 
on the various proposed patch variants.  Unless someone indicates that 
they are still seriously pursuing this, I will mark this patch as 
"Returned with feedback".