Thread: Add LogicalTapeSetExtend() to logtape.c
Attached is a patch that exports a new function from logtape.c:

extern LogicalTapeSet *LogicalTapeSetExtend(
        LogicalTapeSet *lts, int nAdditional);

The purpose is to allow adding new tapes dynamically for the upcoming
Hash Aggregation work[0]. HashAgg doesn't know in advance how many
tapes it will need, though only a limited number are actually active at
a time.

This was proposed and originally written by Adam Lee[1] (extract only
the changes to logtape.c/h from his posted patch). Strangely, I'm
seeing a ~5% regression with his version when doing:

-- t20m_1_int4 has 20 million random integers
select * from t20m_1_int4 order by i offset 100000000;

Which seems to be due to using a pointer rather than a flexible array
member (I'm using gcc[2]). My version (attached) still uses a flexible
array member, which doesn't see the regression; but I repalloc() the
whole struct, so the caller needs to save the new pointer to the tape
set.

That doesn't entirely make sense to me, and I'm wondering if someone
else can repro that result and/or make a suggestion, because I don't
have a good explanation. I'm fine with my version of the patch, but it
would be nice to know why there's such a big difference between using a
pointer and a flexible array member.

Regards,
	Jeff Davis

[0] https://postgr.es/m/6e7c269b9a84ff75fefcad8ab2d4758f03581e98.camel%40j-davis.com
[1] https://postgr.es/m/20200108071202.GA1511@mars.local
[2] gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
Attachment
On Thu, Feb 27, 2020 at 01:01:08PM -0800, Jeff Davis wrote:
> Attached is a patch that exports a new function from logtape.c:
>
> extern LogicalTapeSet *LogicalTapeSetExtend(
>         LogicalTapeSet *lts, int nAdditional);
>
> The purpose is to allow adding new tapes dynamically for the upcoming
> Hash Aggregation work[0]. HashAgg doesn't know in advance how many
> tapes it will need, though only a limited number are actually active at
> a time.
>
> This was proposed and originally written by Adam Lee[1] (extract only
> the changes to logtape.c/h from his posted patch). Strangely, I'm
> seeing ~5% regression with his version when doing:
>
> -- t20m_1_int4 has 20 million random integers
> select * from t20m_1_int4 order by i offset 100000000;
>
> Which seems to be due to using a pointer rather than a flexible array
> member (I'm using gcc[2]). My version (attached) still uses a flexible
> array member, which doesn't see the regression; but I repalloc the
> whole struct so the caller needs to save the new pointer to the tape
> set.
>
> That doesn't entirely make sense to me, and I'm wondering if someone
> else can repro that result and/or make a suggestion, because I don't
> have a good explanation. I'm fine with my version of the patch, but it
> would be nice to know why there's such a big difference using a pointer
> versus a flexible array member.

I noticed another difference: I was using palloc0(), which could be one
of the reasons, but I'm not sure.
Tested your hashagg-20200226.patch on my laptop (Apple clang version
11.0.0); the average time is 25.9s:

```
create table t20m_1_int4(i int);
copy t20m_1_int4 from program 'shuf -i 1-4294967295 -n 20000000';
analyze t20m_1_int4;
```

```
explain analyze select * from t20m_1_int4 order by i offset 100000000;
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3310741.20..3310741.20 rows=1 width=4) (actual time=25666.471..25666.471 rows=0 loops=1)
   ->  Sort  (cost=3260740.96..3310741.20 rows=20000096 width=4) (actual time=20663.065..24715.269 rows=20000000 loops=1)
         Sort Key: i
         Sort Method: external merge  Disk: 274056kB
         ->  Seq Scan on t20m_1_int4  (cost=0.00..288496.96 rows=20000096 width=4) (actual time=0.075..2749.385 rows=20000000 loops=1)
 Planning Time: 0.109 ms
 Execution Time: 25911.542 ms
(7 rows)
```

But if I use palloc0(), or do a MemSet() like:

```
lts = (LogicalTapeSet *) palloc0(offsetof(LogicalTapeSet, tapes) +
                                 ntapes * sizeof(LogicalTape));
...
MemSet(lts->tapes, 0, lts->nTapes * sizeof(LogicalTape));  <--- this line
doesn't matter as far as I observed, which makes a little sense: the
compiler might know it's already zero.
```

the average time goes up to 26.6s:

```
explain analyze select * from t20m_1_int4 order by i offset 100000000;
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3310741.20..3310741.20 rows=1 width=4) (actual time=26419.712..26419.712 rows=0 loops=1)
   ->  Sort  (cost=3260740.96..3310741.20 rows=20000096 width=4) (actual time=21430.044..25468.661 rows=20000000 loops=1)
         Sort Key: i
         Sort Method: external merge  Disk: 274056kB
         ->  Seq Scan on t20m_1_int4  (cost=0.00..288496.96 rows=20000096 width=4) (actual time=0.060..2839.452 rows=20000000 loops=1)
 Planning Time: 0.111 ms
 Execution Time: 26652.255 ms
(7 rows)
```

--
Adam Lee
On Fri, 2020-02-28 at 14:16 +0800, Adam Lee wrote:
> I noticed another difference, I was using palloc0(), which could be
> one of the reasons, but not sure.

I changed the palloc0()'s in your code to plain palloc(), and it didn't
make any perceptible difference. It's still slower than the version I
posted that keeps the flexible array.

Did you compare all 3: master, your patch, and my patch? I'd like to see
if you're seeing the same thing that I am.

> Tested your hashagg-20200226.patch on my laptop (Apple clang version
> 11.0.0); the average time is 25.9s:

That sounds high -- my runs take about half that time. Is that with a
debug build or an optimized one?

Regards,
	Jeff Davis
On Fri, Feb 28, 2020 at 12:38:55PM -0800, Jeff Davis wrote:
> I changed the palloc0()'s in your code to plain palloc(), and it didn't
> make any perceptible difference. It's still slower than the version I
> posted that keeps the flexible array.
>
> Did you compare all 3: master, your patch, and my patch? I'd like to
> see if you're seeing the same thing that I am.
>
> That sounds high -- my runs take about half that time. Is that with a
> debug build or an optimized one?

Yes, I was running a debug version; I usually build with 'CFLAGS=-O0
-g3' '--enable-cassert' '--enable-debug'.

Tests with a regular optimized build:

Master:                           12729ms 12970ms 12999ms
With my patch (a pointer):        12965ms 13273ms 13116ms
With your patch (flexible array): 12906ms 12991ms 13043ms

No obvious difference, I suppose. Anyway, your patch looks good to me.

--
Adam Lee
On Tue, 2020-03-03 at 09:49 +0800, Adam Lee wrote:
> Master:                           12729ms 12970ms 12999ms
> With my patch (a pointer):        12965ms 13273ms 13116ms
> With your patch (flexible array): 12906ms 12991ms 13043ms

Hmm, it looks like you didn't reproduce the difference I saw. What
compiler/version are you using?

Regards,
	Jeff Davis
On Tue, Mar 03, 2020 at 08:46:24AM -0800, Jeff Davis wrote:
> Hmm, it looks like you didn't reproduce the difference I saw. What
> compiler/version are you using?

It was "Apple clang version 11.0.0 (clang-1100.0.33.17)". I changed to
"gcc-9 (Homebrew GCC 9.2.0_3) 9.2.0" this morning:

Master (e537aed61d):              13342.844 ms 13195.982 ms 13271.023 ms
With my patch (a pointer):        13020.029 ms 13008.158 ms 13063.658 ms
With your patch (flexible array): 12870.117 ms 12814.725 ms 13119.255 ms

--
Adam Lee
On Wed, 2020-03-04 at 11:57 +0800, Adam Lee wrote:
> Master (e537aed61d):              13342.844 ms 13195.982 ms 13271.023 ms
> With my patch (a pointer):        13020.029 ms 13008.158 ms 13063.658 ms
> With your patch (flexible array): 12870.117 ms 12814.725 ms 13119.255 ms

I tracked the problem down.

When we change from a flexible array to a pointer and a separately-
allocated chunk, some unrelated code in LogicalTapeWrite() gets
optimized differently in my environment. Specifically, the memcpy() in
LogicalTapeWrite() gets turned into an inline implementation using the
"rep movsq" instruction. For my particular case, actually calling
memcpy() (rather than using the inlined version) is a lot faster. In my
test case, LogicalTapeWrite() was being called with a size of 4 or 10,
so perhaps those small values are simply handled much more efficiently
by the real memcpy().

To get it to use the real memcpy(), I had to '#undef _FORTIFY_SOURCE'
at the top of the file and pass -fno-builtin-memcpy. The regression
then went away.

I don't care for the version I posted that repalloc()s the entire
struct. That seems needlessly odd, could cause confusion, and it also
changes the API so that the caller needs to update its pointers.

I'm inclined to use a version similar to Adam's, with a pointer and a
separately palloc()'d chunk (attached). That causes a regression in my
setup, but it's not a terrible one, and it's not really the "fault" of
the change. Twisting the code around to satisfy a particular
compiler/libc combination seems like a bad idea. The result might also
depend on the exact query, and may even be faster for some.

Any objections?

Regards,
	Jeff Davis
Attachment
On Wed, 2020-03-04 at 12:06 -0800, Jeff Davis wrote:
> I tracked the problem down.

Because of the name _FORTIFY_SOURCE, I became somewhat concerned that
this change (logtape-20200303) was interfering with a safety mechanism
in the compiler.

There's a mechanism in GCC called object size checking[1]. memcpy() is
replaced by __builtin___memcpy_chk(), which verifies that the amount
being copied doesn't exceed the destination object size -- but of
course this only works if GCC knows the destination object size. If it
doesn't know the object size, or if it can prove at compile time that
the size will never be exceeded, it replaces the checked memcpy() with
a call to normal memcpy() (at least, [1] seems to suggest that it
will).

But I did some experiments, and GCC is clearly not able to determine
the destination object size either before or after the
logtape-20200303 change:

* palloc() (and friends) lack the alloc_size function attribute[2],
  which GCC requires in order to track the size (aside: maybe we
  should add it as an unrelated change?)

* if I add the alloc_size attribute to palloc(), GCC can do very basic
  tracking of the object size, but not in more complex cases like the
  buffer in logtape.c

Therefore, from [1], I'd expect the call to checked memcpy() to be
replaced with a call to normal memcpy() both before and after the
change. In fact, it is replaced with normal memcpy() before the change,
but not after. I conclude that this decision is arbitrary and not
fundamentally related to object size checking or _FORTIFY_SOURCE.

I don't think I should hold up this change because of an arbitrary
decision by the compiler.

Regards,
	Jeff Davis

[1] https://gcc.gnu.org/onlinedocs/gcc/Object-Size-Checking.html
[2] https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-alloc_005fsize-function-attribute