[HACKERS] Poor memory context performance in large hash joins

From
Jeff Janes
Date:

When doing a hash join with large work_mem, you can have a large number of chunks.  Then if ExecHashIncreaseNumBatches gets called, those chunks are walked through, moving the tuples to new chunks (or to disk, if they no longer match the batch's bitmask), and freeing the old chunks.

The number of new chunks can be almost as large as the number of old chunks, especially if there is a very popular value.  The problem is that every time an old chunk is freed, the code in aset.c around line 968 has to walk over all the newly allocated chunks in the linked list before it can find the old one being freed.  This is an N^2 operation, and I think it has horrible CPU cache hit rates as well.

Is there a good solution to this?  Could the new chunks be put in a different memory context, and then destroy the old context and install the new one at the end of ExecHashIncreaseNumBatches? I couldn't find a destroy method for memory contexts; it looks like you just reset the parent instead. But I don't think that would work here.

Thanks,

Jeff

Re: [HACKERS] Poor memory context performance in large hash joins

From
Peter Geoghegan
Date:
On Thu, Feb 23, 2017 at 2:13 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Is there a good solution to this?  Could the new chunks be put in a
> different memory context, and then destroy the old context and install the
> new one at the end of ExecHashIncreaseNumBatches? I couldn't find a destroy
> method for memory contexts, it looks like you just reset the parent instead.
> But I don't think that would work here.

Are you aware of the fact that tuplesort.c got a second memory context
for 9.6, entirely on performance grounds?


-- 
Peter Geoghegan



Re: [HACKERS] Poor memory context performance in large hash joins

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> The number of new chunks can be almost as large as the number of old
> chunks, especially if there is a very popular value.  The problem is that
> every time an old chunk is freed, the code in aset.c around line 968 has to
> walk over all the newly allocated chunks in the linked list before it can
> find the old one being freed.  This is an N^2 operation, and I think it has
> horrible CPU cache hit rates as well.

Maybe it's time to convert that to a doubly-linked list.  Although if the
hash code is producing a whole lot of requests that are only a bit bigger
than the separate-block threshold, I'd say It's Doing It Wrong.  It should
learn to aggregate them into larger requests.
        regards, tom lane



Re: [HACKERS] Poor memory context performance in large hash joins

From
Andres Freund
Date:
On 2017-02-23 17:28:26 -0500, Tom Lane wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
> > The number of new chunks can be almost as large as the number of old
> > chunks, especially if there is a very popular value.  The problem is that
> > every time an old chunk is freed, the code in aset.c around line 968 has to
> > walk over all the newly allocated chunks in the linked list before it can
> > find the old one being freed.  This is an N^2 operation, and I think it has
> > horrible CPU cache hit rates as well.
> 
> Maybe it's time to convert that to a doubly-linked list.

Yes, I do think so. Given that we only have that for full blocks, not
for small chunks, the cost seems negligible.

That would also, partially, address the performance issue
http://archives.postgresql.org/message-id/d15dff83-0b37-28ed-0809-95a5cc7292ad%402ndquadrant.com
addresses, in a more realistically backpatchable manner.

Jeff, do you have a handy demonstrator?


> Although if the
> hash code is producing a whole lot of requests that are only a bit bigger
> than the separate-block threshold, I'd say It's Doing It Wrong.  It should
> learn to aggregate them into larger requests.

That's probably right, but we can't really address that in the
back-branches.  And to me this sounds like something we should address
in the branches, not just in master.  Even if we'd also fix the
hash-aggregation logic, I think such an O(n^2) behaviour in the
allocator is a bad idea in general, and we should fix it anyway.

Regards,

Andres



Re: [HACKERS] Poor memory context performance in large hash joins

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-02-23 17:28:26 -0500, Tom Lane wrote:
>> Maybe it's time to convert that to a doubly-linked list.

> Yes, I do think so. Given that we only have that for full blocks, not
> for small chunks, the cost seems negligible.
> That would also, partially, address the performance issue
> http://archives.postgresql.org/message-id/d15dff83-0b37-28ed-0809-95a5cc7292ad%402ndquadrant.com
> addresses, in a more realistically backpatchable manner.

Yeah, I was wondering if we could get away with back-patching such a
change.  In principle, nothing outside aset.c should know what's in the
header of an AllocBlock, but ...

> Jeff, do you have a handy demonstrator?

A solid test case would definitely help to convince people that it was
worth taking some risk here.
        regards, tom lane



Re: [HACKERS] Poor memory context performance in large hash joins

From
Thomas Munro
Date:
On Fri, Feb 24, 2017 at 12:17 PM, Andres Freund <andres@anarazel.de> wrote:
> Jeff, do you have a handy demonstrator?

If you want to hit ExecHashIncreaseNumBatches a bunch of times, see
the "ugly" example in the attached.

-- 
Thomas Munro
http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Poor memory context performance in large hash joins

From
Andres Freund
Date:
On 2017-02-24 01:59:01 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-02-23 17:28:26 -0500, Tom Lane wrote:
> >> Maybe it's time to convert that to a doubly-linked list.
>
> > Yes, I do think so. Given that we only have that for full blocks, not
> > for small chunks, the cost seems negligible.
> > That would also, partially, address the performance issue
> > http://archives.postgresql.org/message-id/d15dff83-0b37-28ed-0809-95a5cc7292ad%402ndquadrant.com
> > addresses, in a more realistically backpatchable manner.
>
> Yeah, I was wondering if we could get away with back-patching such a
> change.  In principle, nothing outside aset.c should know what's in the
> header of an AllocBlock, but ...

You'd need to go through a fair amount of intentional pain to be
affected by a change to AllocBlockData's structure.  We could add the
->prev pointer to the end of AllocBlockData's definition to make it less
likely that one would be affected in that unlikely case - but I'm a bit
doubtful it's worth the trouble.

- Andres



Re: [HACKERS] Poor memory context performance in large hash joins

From
Jeff Janes
Date:
On Thu, Feb 23, 2017 at 2:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
> The number of new chunks can be almost as large as the number of old
> chunks, especially if there is a very popular value.  The problem is that
> every time an old chunk is freed, the code in aset.c around line 968 has to
> walk over all the newly allocated chunks in the linked list before it can
> find the old one being freed.  This is an N^2 operation, and I think it has
> horrible CPU cache hit rates as well.

Maybe it's time to convert that to a doubly-linked list. 


I don't think that would help.  You would need some heuristic to guess whether the chunk you are looking for is near the front, or near the end.  And in this case, the desired chunk starts out at the front, and then keeps moving down the list with each iteration as new things are added to the front, until near the end of ExecHashIncreaseNumBatches it is freeing them from near the end.  But in between, it is freeing them from the middle, so I don't think a doubly-linked list would change it from N^2, just lower the constant, even if you always knew which end to start at.  Or am I misunderstanding what the implications of a doubly-linked list are?

What would really help here is if it remembered the next pointer of the just-freed chunk, and started the scan from that location the next time, cycling around to the head pointer if it doesn't find anything.  But I don't think that that is a very general solution.  

Or, if you could pass a flag when creating the context telling it whether memory will be freed mostly-LIFO or mostly-FIFO, and have it use a stack or a queue accordingly.
 
Although if the
hash code is producing a whole lot of requests that are only a bit bigger
than the separate-block threshold, I'd say It's Doing It Wrong.  It should
learn to aggregate them into larger requests.

Right now it is using compiled-in 32KB chunks.  Should it use something like max(32KB, work_mem/128) instead?

Cheers,

Jeff

Re: [HACKERS] Poor memory context performance in large hash joins

From
Jeff Janes
Date:

On Thu, Feb 23, 2017 at 10:47 PM, Andres Freund <andres@anarazel.de> wrote:
On 2017-02-23 17:28:26 -0500, Tom Lane wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
> > The number of new chunks can be almost as large as the number of old
> > chunks, especially if there is a very popular value.  The problem is that
> > every time an old chunk is freed, the code in aset.c around line 968 has to
> > walk over all the newly allocated chunks in the linked list before it can
> > find the old one being freed.  This is an N^2 operation, and I think it has
> > horrible CPU cache hit rates as well.
>
> Maybe it's time to convert that to a doubly-linked list.

Yes, I do think so. Given that we only have that for full blocks, not
for small chunks, the cost seems negligible.

That would also, partially, address the performance issue
http://archives.postgresql.org/message-id/d15dff83-0b37-28ed-0809-95a5cc7292ad%402ndquadrant.com
addresses, in a more realistically backpatchable manner.

Jeff, do you have a handy demonstrator?


Not exactly "handy", as it takes about 45 minutes to set up and uses >50GB of disk and 16GB of RAM, but here is a demonstration.

create table foobar as select CASE when random()<(420.0/540.0) then 1 else floor(random()*880000)::int END as titleid from generate_series(1,540000000);

create table foobar2 as select distinct titleid from foobar ;

analyze;

set work_mem TO "13GB";

select count(*) from foobar2 where not exists (select 1 from foobar t where t.titleid=foobar2.titleid);

This will run for effectively forever, unless it gets killed or dies due to OOM first.  If I have other things consuming some of my 16GB of RAM, then it gets OOM-killed (which is not a complaint: it is what one should expect, given that I told it work_mem was 13GB).  If I have no other demands on RAM, then I can't tell whether it would eventually OOM, because of how long it would take to get that far.


I ran into this while evaluating Tom's responding patch, but you don't need to apply that patch to run this example and see the effect.  I don't have an example of a case that demonstrates the problem in the absence of a degenerate hash bucket.  I think there should be non-degenerate cases that trigger it, but I haven't been able to produce one yet.
 
> Although if the
> hash code is producing a whole lot of requests that are only a bit bigger
> than the separate-block threshold, I'd say It's Doing It Wrong.  It should
> learn to aggregate them into larger requests.

That's probably right, but we can't really address that in the
back-branches.  And to me this sounds like something we should address
in the branches, not just in master.  Even if we'd also fix the
hash-aggregation logic, I think such an O(n^2) behaviour in the
allocator is a bad idea in general, and we should fix it anyway.

I don't know how important a back-patch would be.  This is a toy case for me.  Presumably it isn't for David Hinkle, except that he doesn't want a hash join in the first place, while I want one that finishes in a reasonable time.  It might depend on what happens to Tom's OOM patch.

It would be great if the allocator was made bullet-proof, but I don't think adding reverse links (or anything else back-patchable) is going to be enough to do that.

Cheers,

Jeff

Re: [HACKERS] Poor memory context performance in large hash joins

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> On Thu, Feb 23, 2017 at 2:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Maybe it's time to convert that to a doubly-linked list.

> I don't think that would help.  You would need some heuristic to guess
> whether the chunk you are looking for is near the front, or near the end.

Uh, what?  In a doubly-linked list, you can remove an element in O(1)
time, you don't need any searching.  It basically becomes
      item->prev->next = item->next;
      item->next->prev = item->prev;
modulo possible special cases for the head and tail elements.

>> Although if the
>> hash code is producing a whole lot of requests that are only a bit bigger
>> than the separate-block threshold, I'd say It's Doing It Wrong.  It should
>> learn to aggregate them into larger requests.

> Right now it is using compiled-in 32KB chunks.  Should it use something
> like max(32kb,work_mem/128) instead?

I'd say it should double the size of the request each time.  That's what
we do in most places.
        regards, tom lane



Re: [HACKERS] Poor memory context performance in large hash joins

From
Jeff Janes
Date:
On Fri, Feb 24, 2017 at 10:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
> On Thu, Feb 23, 2017 at 2:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Maybe it's time to convert that to a doubly-linked list.

> I don't think that would help.  You would need some heuristic to guess
> whether the chunk you are looking for is near the front, or near the end.

Uh, what?  In a doubly-linked list, you can remove an element in O(1)
time, you don't need any searching.  It basically becomes
      item->prev->next = item->next;
      item->next->prev = item->prev;
modulo possible special cases for the head and tail elements.

Currently it is walking the chain to identify which block holds the chunk in the first place, not just to get the pointer to the previous block.  But I see that that could be fixed by pointer arithmetic once there is a reason to fix it, and there isn't a reason as long as you need to walk the chain to get the prev pointer anyway.  Like this?

targetblock = (AllocBlock) (((char*)chunk) - ALLOC_BLOCKHDRSZ);
 
>> Although if the
>> hash code is producing a whole lot of requests that are only a bit bigger
>> than the separate-block threshold, I'd say It's Doing It Wrong.  It should
>> learn to aggregate them into larger requests.

> Right now it is using compiled-in 32KB chunks.  Should it use something
> like max(32kb,work_mem/128) instead?

I'd say it should double the size of the request each time.  That's what
we do in most places.

I thought the design goal there was that the space in old chunks could get re-used into the new chunks in a reasonably fine-grained way. If the last chunk contains just over half of all the data, that couldn't happen.

Cheers,

Jeff

Re: [HACKERS] Poor memory context performance in large hash joins

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> On Fri, Feb 24, 2017 at 10:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Uh, what?  In a doubly-linked list, you can remove an element in O(1)
>> time, you don't need any searching.

> Currently it is walking the chain to identify which block holds the chunk
> in the first place, not just to get the pointer to the previous block.  But
> I see that that could be fixed by pointer arithmetic once there is a reason
> to fix it. Which there isn't a reason to as long as you need to walk the
> chain to get the prev pointer anyway.  Like this:?
> targetblock = (AllocBlock) (((char*)chunk) - ALLOC_BLOCKHDRSZ);

Right.  We're just turning around the existing address calculation.  It'd
be a good idea to provide some cross-check that we found a sane-looking
block header, but that's not that hard --- we know what ought to be in
aset, freeptr, and endptr for a single-chunk block's header, and that
seems like enough of a crosscheck to me.

Concretely, something like the attached.  This passes regression tests
but I've not pushed on it any harder than that.

One argument against this is that it adds some nonzero amount of overhead
to block management; but considering that we are calling malloc or realloc
or free alongside any one of these manipulations, that overhead should be
pretty negligible.  I'm a bit more worried about increasing the alloc
block header size by 8 bytes, but that would only really matter with a
very small allocChunkLimit.

            regards, tom lane

diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 4dfc3ec..c250bae 100644
*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
*************** typedef AllocSetContext *AllocSet;
*** 201,207 ****
  typedef struct AllocBlockData
  {
      AllocSet    aset;            /* aset that owns this block */
!     AllocBlock    next;            /* next block in aset's blocks list */
      char       *freeptr;        /* start of free space in this block */
      char       *endptr;            /* end of space in this block */
  }    AllocBlockData;
--- 201,208 ----
  typedef struct AllocBlockData
  {
      AllocSet    aset;            /* aset that owns this block */
!     AllocBlock    prev;            /* prev block in aset's blocks list, if any */
!     AllocBlock    next;            /* next block in aset's blocks list, if any */
      char       *freeptr;        /* start of free space in this block */
      char       *endptr;            /* end of space in this block */
  }    AllocBlockData;
*************** AllocSetContextCreate(MemoryContext pare
*** 522,528 ****
--- 523,532 ----
          block->aset = set;
          block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
          block->endptr = ((char *) block) + blksize;
+         block->prev = NULL;
          block->next = set->blocks;
+         if (block->next)
+             block->next->prev = block;
          set->blocks = block;
          /* Mark block as not to be released at reset time */
          set->keeper = block;
*************** AllocSetReset(MemoryContext context)
*** 604,609 ****
--- 608,614 ----
              VALGRIND_MAKE_MEM_NOACCESS(datastart, block->freeptr - datastart);
  #endif
              block->freeptr = datastart;
+             block->prev = NULL;
              block->next = NULL;
          }
          else
*************** AllocSetAlloc(MemoryContext context, Siz
*** 710,725 ****
  #endif

          /*
!          * Stick the new block underneath the active allocation block, so that
!          * we don't lose the use of the space remaining therein.
           */
          if (set->blocks != NULL)
          {
              block->next = set->blocks->next;
              set->blocks->next = block;
          }
          else
          {
              block->next = NULL;
              set->blocks = block;
          }
--- 715,734 ----
  #endif

          /*
!          * Stick the new block underneath the active allocation block, if any,
!          * so that we don't lose the use of the space remaining therein.
           */
          if (set->blocks != NULL)
          {
+             block->prev = set->blocks;
              block->next = set->blocks->next;
+             if (block->next)
+                 block->next->prev = block;
              set->blocks->next = block;
          }
          else
          {
+             block->prev = NULL;
              block->next = NULL;
              set->blocks = block;
          }
*************** AllocSetAlloc(MemoryContext context, Siz
*** 900,906 ****
--- 909,918 ----
          VALGRIND_MAKE_MEM_NOACCESS(block->freeptr,
                                     blksize - ALLOC_BLOCKHDRSZ);

+         block->prev = NULL;
          block->next = set->blocks;
+         if (block->next)
+             block->next->prev = block;
          set->blocks = block;
      }

*************** AllocSetFree(MemoryContext context, void
*** 960,988 ****
      {
          /*
           * Big chunks are certain to have been allocated as single-chunk
!          * blocks.  Find the containing block and return it to malloc().
           */
!         AllocBlock    block = set->blocks;
!         AllocBlock    prevblock = NULL;

!         while (block != NULL)
!         {
!             if (chunk == (AllocChunk) (((char *) block) + ALLOC_BLOCKHDRSZ))
!                 break;
!             prevblock = block;
!             block = block->next;
!         }
!         if (block == NULL)
              elog(ERROR, "could not find block containing chunk %p", chunk);
-         /* let's just make sure chunk is the only one in the block */
-         Assert(block->freeptr == ((char *) block) +
-                (chunk->size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ));

          /* OK, remove block from aset's list and free it */
!         if (prevblock == NULL)
              set->blocks = block->next;
          else
!             prevblock->next = block->next;
  #ifdef CLOBBER_FREED_MEMORY
          wipe_mem(block, block->freeptr - ((char *) block));
  #endif
--- 972,999 ----
      {
          /*
           * Big chunks are certain to have been allocated as single-chunk
!          * blocks.  Just unlink that block and return it to malloc().
           */
!         AllocBlock    block = (AllocBlock) (((char *) chunk) - ALLOC_BLOCKHDRSZ);

!         /*
!          * Try to verify that we have a sane block pointer: it should
!          * reference the correct aset, and freeptr and endptr should point
!          * just past the chunk.
!          */
!         if (block->aset != set ||
!             block->freeptr != block->endptr ||
!             block->freeptr != ((char *) block) +
!             (chunk->size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ))
              elog(ERROR, "could not find block containing chunk %p", chunk);

          /* OK, remove block from aset's list and free it */
!         if (block->prev == NULL)
              set->blocks = block->next;
          else
!             block->prev->next = block->next;
!         if (block->next != NULL)
!             block->next->prev = block->prev;
  #ifdef CLOBBER_FREED_MEMORY
          wipe_mem(block, block->freeptr - ((char *) block));
  #endif
*************** AllocSetRealloc(MemoryContext context, v
*** 1088,1114 ****
      if (oldsize > set->allocChunkLimit)
      {
          /*
!          * The chunk must have been allocated as a single-chunk block.  Find
!          * the containing block and use realloc() to make it bigger with
!          * minimum space wastage.
           */
!         AllocBlock    block = set->blocks;
!         AllocBlock    prevblock = NULL;
          Size        chksize;
          Size        blksize;

!         while (block != NULL)
!         {
!             if (chunk == (AllocChunk) (((char *) block) + ALLOC_BLOCKHDRSZ))
!                 break;
!             prevblock = block;
!             block = block->next;
!         }
!         if (block == NULL)
              elog(ERROR, "could not find block containing chunk %p", chunk);
-         /* let's just make sure chunk is the only one in the block */
-         Assert(block->freeptr == ((char *) block) +
-                (chunk->size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ));

          /* Do the realloc */
          chksize = MAXALIGN(size);
--- 1099,1122 ----
      if (oldsize > set->allocChunkLimit)
      {
          /*
!          * The chunk must have been allocated as a single-chunk block.  Use
!          * realloc() to make the containing block bigger with minimum space
!          * wastage.
           */
!         AllocBlock    block = (AllocBlock) (((char *) chunk) - ALLOC_BLOCKHDRSZ);
          Size        chksize;
          Size        blksize;

!         /*
!          * Try to verify that we have a sane block pointer: it should
!          * reference the correct aset, and freeptr and endptr should point
!          * just past the chunk.
!          */
!         if (block->aset != set ||
!             block->freeptr != block->endptr ||
!             block->freeptr != ((char *) block) +
!             (chunk->size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ))
              elog(ERROR, "could not find block containing chunk %p", chunk);

          /* Do the realloc */
          chksize = MAXALIGN(size);
*************** AllocSetRealloc(MemoryContext context, v
*** 1121,1130 ****
          /* Update pointers since block has likely been moved */
          chunk = (AllocChunk) (((char *) block) + ALLOC_BLOCKHDRSZ);
          pointer = AllocChunkGetPointer(chunk);
!         if (prevblock == NULL)
              set->blocks = block;
          else
!             prevblock->next = block;
          chunk->size = chksize;

  #ifdef MEMORY_CONTEXT_CHECKING
--- 1129,1140 ----
          /* Update pointers since block has likely been moved */
          chunk = (AllocChunk) (((char *) block) + ALLOC_BLOCKHDRSZ);
          pointer = AllocChunkGetPointer(chunk);
!         if (block->prev == NULL)
              set->blocks = block;
          else
!             block->prev->next = block;
!         if (block->next != NULL)
!             block->next->prev = block;
          chunk->size = chksize;

  #ifdef MEMORY_CONTEXT_CHECKING
*************** AllocSetCheck(MemoryContext context)
*** 1315,1323 ****
  {
      AllocSet    set = (AllocSet) context;
      char       *name = set->header.name;
      AllocBlock    block;

!     for (block = set->blocks; block != NULL; block = block->next)
      {
          char       *bpoz = ((char *) block) + ALLOC_BLOCKHDRSZ;
          long        blk_used = block->freeptr - bpoz;
--- 1325,1336 ----
  {
      AllocSet    set = (AllocSet) context;
      char       *name = set->header.name;
+     AllocBlock    prevblock;
      AllocBlock    block;

!     for (prevblock = NULL, block = set->blocks;
!          block != NULL;
!          prevblock = block, block = block->next)
      {
          char       *bpoz = ((char *) block) + ALLOC_BLOCKHDRSZ;
          long        blk_used = block->freeptr - bpoz;
*************** AllocSetCheck(MemoryContext context)
*** 1335,1340 ****
--- 1348,1363 ----
          }

          /*
+          * Check block header fields
+          */
+         if (block->aset != set ||
+             block->prev != prevblock ||
+             block->freeptr < bpoz ||
+             block->freeptr > block->endptr)
+             elog(WARNING, "problem in alloc set %s: corrupt header in block %p",
+                  name, block);
+
+         /*
           * Chunk walker
           */
          while (bpoz < block->freeptr)


Re: [HACKERS] Poor memory context performance in large hash joins

From
Andres Freund
Date:
On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
> Concretely, something like the attached.  This passes regression tests
> but I've not pushed on it any harder than that.

Heh, I'd just gotten something that didn't immediately crash anymore ;)

Running your patch against Jeff's test-case, verified before that I
could easily reproduce the O(N^2) cost.

- Andres



Re: [HACKERS] Poor memory context performance in large hash joins

From
Andres Freund
Date:
On 2017-02-24 15:12:37 -0800, Andres Freund wrote:
> On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
> > Concretely, something like the attached.  This passes regression tests
> > but I've not pushed on it any harder than that.
> 
> Heh, I'd just gotten something that didn't immediately crash anymore ;)
> 
> Running your patch against Jeff's test-case, verified before that I
> could easily reproduce the O(N^2) cost.

Oh, that didn't take as long as I was afraid (optimized/non-assert build):

postgres[26268][1]=# SET work_mem = '13GB';
SET
Time: 2.591 ms
postgres[26268][1]=# select count(*) from foobar2 where not exists (select 1 from foobar t where
t.titleid=foobar2.titleid);
┌───────┐
│ count │
├───────┤
│     0 │
└───────┘
(1 row)

Time: 268043.710 ms (04:28.044)

Profile:
 13.49%  postgres      [.] ExecHashTableInsert
 11.16%  postgres      [.] BufFileRead
  9.20%  postgres      [.] ExecHashIncreaseNumBatches
  9.19%  libc-2.24.so  [.] __memmove_avx_unaligned_erms
  6.74%  postgres      [.] dense_alloc.isra.0
  5.53%  postgres      [.] ExecStoreMinimalTuple
  5.14%  postgres      [.] ExecHashJoinGetSavedTuple.isra.0
  4.68%  postgres      [.] AllocSetAlloc
  4.12%  postgres      [.] AllocSetFree
  2.31%  postgres      [.] palloc
  2.06%  [kernel]      [k] free_pcppages_bulk


I'd previously run the test for 30min, with something like 99% of the
profile being in AllocSetFree().

- Andres



Re: [HACKERS] Poor memory context performance in large hash joins

From
Andres Freund
Date:
On 2017-02-24 15:18:04 -0800, Andres Freund wrote:
> On 2017-02-24 15:12:37 -0800, Andres Freund wrote:
> > On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
> > > Concretely, something like the attached.  This passes regression tests
> > > but I've not pushed on it any harder than that.
> > 
> > Heh, I'd just gotten something that didn't immediately crash anymore ;)
> > 
> > Running your patch against Jeff's test-case, verified before that I
> > could easily reproduce the O(N^2) cost.
> 
> Oh, that didn't take as long as I was afraid (optimized/non-assert build):
> 
> postgres[26268][1]=# SET work_mem = '13GB';
> SET
> Time: 2.591 ms
> postgres[26268][1]=# select count(*) from foobar2 where not exists (select 1 from foobar t where
t.titleid=foobar2.titleid);
> Time: 268043.710 ms (04:28.044)

As another datapoint, I measured this patch against the problem from
https://www.postgresql.org/message-id/20170227111732.vrx5v72ighehwpkf@alap3.anarazel.de
(see top post in thread), and it indeed fixes the runtime issue -
there's still considerably higher memory usage and some runtime
overhead, but the quadratic behaviour is gone.

I think we should go forward with something like this patch in all
branches, and only use Tomas' patch in master, because they're
considerably larger.

Regards,

Andres



Re: [HACKERS] Poor memory context performance in large hash joins

From
Tomas Vondra
Date:
On 02/27/2017 12:55 PM, Andres Freund wrote:
> On 2017-02-24 15:18:04 -0800, Andres Freund wrote:
>> On 2017-02-24 15:12:37 -0800, Andres Freund wrote:
>>> On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
>>>> Concretely, something like the attached.  This passes regression tests
>>>> but I've not pushed on it any harder than that.
>>>
>>> Heh, I'd just gotten something that didn't immediately crash anymore ;)
>>>
>>> Running your patch against Jeff's test-case, verified before that I
>>> could easily reproduce the O(N^2) cost.
>>
>> Oh, that didn't take as long as I was afraid (optimized/non-assert build):
>>
>> postgres[26268][1]=# SET work_mem = '13GB';
>> SET
>> Time: 2.591 ms
>> postgres[26268][1]=# select count(*) from foobar2 where not exists (select 1 from foobar t where
t.titleid=foobar2.titleid);
>> Time: 268043.710 ms (04:28.044)
>
> As another datapoint, I measured this patch against the problem from
> https://www.postgresql.org/message-id/20170227111732.vrx5v72ighehwpkf@alap3.anarazel.de
> (see top post in thread), and it indeed fixes the runtime issue -
> there's still considerably higher memory usage and some runtime
> overhead, but the quadratic behaviour is gone.
>
> I think we should go forward with something like this patch in all
> branches, and only use Tomas' patch in master, because they're
> considerably larger.
>

So you've tried to switch hashjoin to the slab allocators? Or what have 
you compared?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Poor memory context performance in large hash joins

From
Andres Freund
Date:
On 2017-02-27 19:20:56 +0100, Tomas Vondra wrote:
> On 02/27/2017 12:55 PM, Andres Freund wrote:
> > On 2017-02-24 15:18:04 -0800, Andres Freund wrote:
> > > On 2017-02-24 15:12:37 -0800, Andres Freund wrote:
> > > > On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
> > > > > Concretely, something like the attached.  This passes regression tests
> > > > > but I've not pushed on it any harder than that.
> > > >
> > > > Heh, I'd just gotten something that didn't immediately crash anymore ;)
> > > >
> > > > Running your patch against Jeff's test-case, verified before that I
> > > > could easily reproduce the O(N^2) cost.
> > >
> > > Oh, that didn't take as long as I was afraid (optimized/non-assert build):
> > >
> > > postgres[26268][1]=# SET work_mem = '13GB';
> > > SET
> > > Time: 2.591 ms
> > > postgres[26268][1]=# select count(*) from foobar2 where not exists (select 1 from foobar t where t.titleid=foobar2.titleid);
> > > Time: 268043.710 ms (04:28.044)
> >
> > As another datapoint, I measured this patch against the problem from
> > https://www.postgresql.org/message-id/20170227111732.vrx5v72ighehwpkf@alap3.anarazel.de
> > (see top post in thread), and it indeed fixes the runtime issue -
> > there's still considerably higher memory usage and some runtime
> > overhead, but the quadratic behaviour is gone.
> >
> > I think we should go forward with something like this patch in all
> > branches, and only use Tomas' patch in master, because they're
> > considerably larger.
> >
>
> So you've tried to switch hashjoin to the slab allocators? Or what have you
> compared?

No, sorry for not being more explicit about this.  I meant that Tom's
patch addresses the performance issue in reorderbuffer.c to a good
degree (i.e. it gets rid of the quadratic cost, even though the
constants are higher than with your patches).  As the patch here is a
lot smaller, it seems like a better choice for the back-branches than
backporting slab.c/generation.c.

Andres



Re: [HACKERS] Poor memory context performance in large hash joins

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
>>> On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
>>>> Concretely, something like the attached.  This passes regression tests
>>>> but I've not pushed on it any harder than that.

> I think we should go forward with something like this patch in all
> branches, and only use Tomas' patch in master, because they're
> considerably larger.

While I'm willing to back-patch the proposed patch to some extent,
I'm a bit hesitant to shove it into all branches, because I don't
think that usage patterns that create a problem occurred till recently.
You won't get a whole lot of standalone-block allocations unless you
have code that allocates many chunks bigger than 8K without applying a
double-the-request-each-time type of strategy.  And even then, it doesn't
hurt unless you try to resize those chunks, or delete them in a non-LIFO
order.

The hashjoin issue is certainly new, and the reorderbuffer issue can't
go back further than 9.4.  So my inclination is to patch back to 9.4
and call it good.
        regards, tom lane



Re: [HACKERS] Poor memory context performance in large hash joins

From
Andres Freund
Date:
On 2017-03-07 13:06:39 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> >>> On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
> >>>> Concretely, something like the attached.  This passes regression tests
> >>>> but I've not pushed on it any harder than that.
> 
> > I think we should go forward with something like this patch in all
> > branches, and only use Tomas' patch in master, because they're
> > considerably larger.
> 
> While I'm willing to back-patch the proposed patch to some extent,
> I'm a bit hesitant to shove it into all branches, because I don't
> think that usage patterns that create a problem occurred till recently.
> You won't get a whole lot of standalone-block allocations unless you
> have code that allocates many chunks bigger than 8K without applying a
> double-the-request-each-time type of strategy.  And even then, it doesn't
> hurt unless you try to resize those chunks, or delete them in a non-LIFO
> order.
> 
> The hashjoin issue is certainly new, and the reorderbuffer issue can't
> go back further than 9.4.  So my inclination is to patch back to 9.4
> and call it good.

That works for me.  If we find further cases later, we can easily enough
backpatch it then.

Regards,

Andres



Re: [HACKERS] Poor memory context performance in large hash joins

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-03-07 13:06:39 -0500, Tom Lane wrote:
>> The hashjoin issue is certainly new, and the reorderbuffer issue can't
>> go back further than 9.4.  So my inclination is to patch back to 9.4
>> and call it good.

> That works for me.  If we find further cases later, we can easily enough
> backpatch it then.

Done like that.
        regards, tom lane