Thread: Add LogicalTapeSetExtend() to logtape.c
Attached is a patch that exports a new function from logtape.c:

extern LogicalTapeSet *LogicalTapeSetExtend(
        LogicalTapeSet *lts, int nAdditional);

The purpose is to allow adding new tapes dynamically for the upcoming
Hash Aggregation work[0]. HashAgg doesn't know in advance how many
tapes it will need, though only a limited number are actually active at
a time.

This was proposed and originally written by Adam Lee[1] (extract only
the changes to logtape.c/h from his posted patch). Strangely, I'm
seeing a ~5% regression with his version when doing:

-- t20m_1_int4 has 20 million random integers
select * from t20m_1_int4 order by i offset 100000000;

Which seems to be due to using a pointer rather than a flexible array
member (I'm using gcc[2]). My version (attached) still uses a flexible
array member, which doesn't see the regression; but I repalloc() the
whole struct, so the caller needs to save the new pointer to the tape
set.

That doesn't entirely make sense to me, and I'm wondering if someone
else can repro that result and/or make a suggestion, because I don't
have a good explanation. I'm fine with my version of the patch, but it
would be nice to know why there's such a big difference between using a
pointer and a flexible array member.

Regards,
	Jeff Davis

[0] https://postgr.es/m/6e7c269b9a84ff75fefcad8ab2d4758f03581e98.camel%40j-davis.com
[1] https://postgr.es/m/20200108071202.GA1511@mars.local
[2] gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
Attachment
On Thu, Feb 27, 2020 at 01:01:08PM -0800, Jeff Davis wrote:
> Attached is a patch that exports a new function from logtape.c:
>
> extern LogicalTapeSet *LogicalTapeSetExtend(
>         LogicalTapeSet *lts, int nAdditional);
>
> The purpose is to allow adding new tapes dynamically for the upcoming
> Hash Aggregation work[0]. HashAgg doesn't know in advance how many
> tapes it will need, though only a limited number are actually active at
> a time.
>
> This was proposed and originally written by Adam Lee[1] (extract only
> the changes to logtape.c/h from his posted patch). Strangely, I'm
> seeing ~5% regression with his version when doing:
>
> -- t20m_1_int4 has 20 million random integers
> select * from t20m_1_int4 order by i offset 100000000;
>
> Which seems to be due to using a pointer rather than a flexible array
> member (I'm using gcc[2]). My version (attached) still uses a flexible
> array member, which doesn't see the regression; but I repalloc the
> whole struct so the caller needs to save the new pointer to the tape
> set.
>
> That doesn't entirely make sense to me, and I'm wondering if someone
> else can repro that result and/or make a suggestion, because I don't
> have a good explanation. I'm fine with my version of the patch, but it
> would be nice to know why there's such a big difference using a pointer
> versus a flexible array member.

I noticed another difference: I was using palloc0(), which could be one
of the reasons, but I'm not sure.
Tested your hashagg-20200226.patch on my laptop (Apple clang version
11.0.0); the average time is 25.9s:

```
create table t20m_1_int4(i int);
copy t20m_1_int4 from program 'shuf -i 1-4294967295 -n 20000000';
analyze t20m_1_int4;
```

```
explain analyze select * from t20m_1_int4 order by i offset 100000000;
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3310741.20..3310741.20 rows=1 width=4) (actual time=25666.471..25666.471 rows=0 loops=1)
   ->  Sort  (cost=3260740.96..3310741.20 rows=20000096 width=4) (actual time=20663.065..24715.269 rows=20000000 loops=1)
         Sort Key: i
         Sort Method: external merge  Disk: 274056kB
         ->  Seq Scan on t20m_1_int4  (cost=0.00..288496.96 rows=20000096 width=4) (actual time=0.075..2749.385 rows=20000000 loops=1)
 Planning Time: 0.109 ms
 Execution Time: 25911.542 ms
(7 rows)
```

But if I use palloc0(), or do a MemSet() like:

```
lts = (LogicalTapeSet *) palloc0(offsetof(LogicalTapeSet, tapes) +
                                 ntapes * sizeof(LogicalTape));
...
MemSet(lts->tapes, 0, lts->nTapes * sizeof(LogicalTape));  <--- this line
doesn't matter as far as I observed, which makes a little sense: the
compiler might know it's already zero.
```

the average time goes up to 26.6s:

```
explain analyze select * from t20m_1_int4 order by i offset 100000000;
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3310741.20..3310741.20 rows=1 width=4) (actual time=26419.712..26419.712 rows=0 loops=1)
   ->  Sort  (cost=3260740.96..3310741.20 rows=20000096 width=4) (actual time=21430.044..25468.661 rows=20000000 loops=1)
         Sort Key: i
         Sort Method: external merge  Disk: 274056kB
         ->  Seq Scan on t20m_1_int4  (cost=0.00..288496.96 rows=20000096 width=4) (actual time=0.060..2839.452 rows=20000000 loops=1)
 Planning Time: 0.111 ms
 Execution Time: 26652.255 ms
(7 rows)
```

--
Adam Lee
On Fri, 2020-02-28 at 14:16 +0800, Adam Lee wrote:
> I noticed another difference, I was using palloc0(), which could be
> one of the reasons, but not sure.

I changed the palloc0()'s in your code to plain palloc(), and it didn't
make any perceptible difference. It's still slower than the version I
posted that keeps the flexible array.

Did you compare all 3: master, your patch, and my patch? I'd like to see
if you're seeing the same thing that I am.

> Tested your hashagg-20200226.patch on my laptop (Apple clang version
> 11.0.0); the average time is 25.9s:

That sounds high -- my runs take about half that time. Is that with a
debug build or an optimized one?

Regards,
	Jeff Davis
On Fri, Feb 28, 2020 at 12:38:55PM -0800, Jeff Davis wrote:
> I changed the palloc0()'s in your code to plain palloc(), and it didn't
> make any perceptible difference. It's still slower than the version I
> posted that keeps the flexible array.
>
> Did you compare all 3: master, your patch, and my patch? I'd like to
> see if you're seeing the same thing that I am.
>
> That sounds high -- my runs take about half that time. Is that with a
> debug build or an optimized one?

Yes, I was running a debug version; I usually build with 'CFLAGS=-O0
-g3' '--enable-cassert' '--enable-debug'.

Tests with a regular optimized build:

Master:                           12729ms 12970ms 12999ms
With my patch (a pointer):        12965ms 13273ms 13116ms
With your patch (flexible array): 12906ms 12991ms 13043ms

No obvious difference, I suppose. Anyway, your patch looks good to me.

--
Adam Lee
On Tue, 2020-03-03 at 09:49 +0800, Adam Lee wrote:
> Master:                           12729ms 12970ms 12999ms
> With my patch (a pointer):        12965ms 13273ms 13116ms
> With your patch (flexible array): 12906ms 12991ms 13043ms

Hmm, it looks like you didn't reproduce the difference I saw. What
compiler/version are you using?

Regards,
	Jeff Davis
On Tue, Mar 03, 2020 at 08:46:24AM -0800, Jeff Davis wrote:
> Hmm, it looks like you didn't reproduce the difference I saw. What
> compiler/version are you using?

It was "Apple clang version 11.0.0 (clang-1100.0.33.17)". I changed to
"gcc-9 (Homebrew GCC 9.2.0_3) 9.2.0" this morning:

Master (e537aed61d):              13342.844 ms 13195.982 ms 13271.023 ms
With my patch (a pointer):        13020.029 ms 13008.158 ms 13063.658 ms
With your patch (flexible array): 12870.117 ms 12814.725 ms 13119.255 ms

--
Adam Lee
On Wed, 2020-03-04 at 11:57 +0800, Adam Lee wrote:
> Master (e537aed61d):              13342.844 ms 13195.982 ms 13271.023 ms
> With my patch (a pointer):        13020.029 ms 13008.158 ms 13063.658 ms
> With your patch (flexible array): 12870.117 ms 12814.725 ms 13119.255 ms

I tracked the problem down.

When we change from a flexible array to a pointer and a separately-
allocated chunk, some unrelated code in LogicalTapeWrite() gets
optimized differently in my environment. Specifically, the memcpy() in
LogicalTapeWrite() gets turned into an inline implementation using the
"rep movsq" instruction. For my particular case, actually calling
memcpy() (rather than using the inlined version) is a lot faster. In my
test case, LogicalTapeWrite() was being called with a size of 4 or 10,
so perhaps those small values are simply handled much more efficiently
by the real memcpy().

To get it to use the real memcpy(), I had to '#undef _FORTIFY_SOURCE'
at the top of the file and pass -fno-builtin-memcpy. The regression
then went away.

I don't care for the version I posted that repalloc()s the entire
struct. That seems needlessly odd, could cause confusion, and it also
changes the API so that the caller needs to update its pointers.

I'm inclined to use a version similar to Adam's, with a pointer and a
separately palloc()'d chunk (attached). That causes a regression in my
setup, but it's not a terrible one, and it's not really the "fault" of
the change. Twisting the code around to satisfy a particular
compiler/libc combination seems like a bad idea. The result might also
depend on the exact query, and may even be faster for some.

Any objections?

Regards,
	Jeff Davis
Attachment
On Wed, 2020-03-04 at 12:06 -0800, Jeff Davis wrote:
> I tracked the problem down.

Because of the name _FORTIFY_SOURCE, I became somewhat concerned that
this change (logtape-20200303) was interfering with a safety mechanism
in the compiler.

There's a mechanism in GCC called object size checking[1]. memcpy() is
replaced by __builtin___memcpy_chk(), which verifies that the amount
being copied doesn't exceed the destination object size -- but of
course this only works if GCC knows the destination object size. If it
doesn't know the object size, or if it can prove at compile time that
the size will never be exceeded, it replaces the checked memcpy() with
a call to normal memcpy() (at least, [1] seems to suggest that it
will).

But I did some experiments, and GCC is clearly not able to determine
the destination object size either before or after the
logtape-20200303 change:

* palloc() (and friends) lack the alloc_size function attribute[2],
  which GCC requires in order to track the size (aside: maybe we
  should add it as an unrelated change?)

* if I add the alloc_size attribute to palloc(), GCC can do very basic
  tracking of the object size, but not in more complex cases like the
  buffer in logtape.c

Therefore, from [1], I'd expect the call to checked memcpy() to be
replaced with a call to normal memcpy() both before and after the
change. In fact, it is replaced with normal memcpy() before the change,
but not after. I conclude that this decision is arbitrary and not
fundamentally related to object size checking or _FORTIFY_SOURCE.

I don't think I should hold up this change because of an arbitrary
decision by the compiler.

Regards,
	Jeff Davis

[1] https://gcc.gnu.org/onlinedocs/gcc/Object-Size-Checking.html
[2] https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-alloc_005fsize-function-attribute