Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations
Date
Msg-id 20230217034000.ddcklsr6hfdifyqk@awork3.anarazel.de
Whole thread Raw
In response to Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations  ("Jonah H. Harris" <jonah.harris@gmail.com>)
Responses Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations
List pgsql-hackers
Hi,

On 2023-02-16 21:34:18 -0500, Jonah H. Harris wrote:
> On Thu, Feb 16, 2023 at 7:32 PM Andres Freund <andres@anarazel.de> wrote:
> Given not much changed regarding that allocation context IIRC, I’d think
> all recents. It was observed in 13, 14, and 15.

We did have a fair bit of changes in related code in the last few years,
including some in 16. I wouldn't expect them to help *hugely*, but also
wouldn't be surprised if it showed.


> I’ll have to create one - it was most evident on a TPC-C or sysbench test
> using the Postgres, MySQL, SQLite, and Oracle FDWs. It may be reproducible
> with pgbench as well.

I'd like a workload that hits a perf issue with this, because I think there
likely are some general performance improvements that we could make, without
changing the initial size or the "growth rate".

Perhaps, as a starting point, you could get
  MemoryContextStats(queryDesc->estate->es_query_cxt)
both at the end of standard_ExecutorStart() and at the beginning of
standard_ExecutorFinish(), for one of the queries triggering the performance
issues?


> > If the issue is a specific FDW needing to make a lot of allocations, I can
> > see
> > adding an API to tell a memory context that it ought to be ready to
> > allocate a
> > certain amount of memory efficiently (e.g. by increasing the block size of
> > the
> > next allocation by more than 2x).
>
>
> While I’m happy to be wrong, it seems is an inherent problem not really
> specific to FDW implementations themselves but the general expectation that
> all FDWs are using more of that context than non-FDW cases for similar
> types of operations, which wasn’t really a consideration in that allocation
> over time.

Lots of things can end up in the query context, it's really not FDW specific
for it to be of nontrivial size. E.g. most tuples passed around end up in it.

Similar performance issues also exists for plenty other memory contexts. Which
for me that's a reason *not* to make it configurable just for
CreateExecutorState. Or were you proposing to make ALLOCSET_DEFAULT_INITSIZE
configurable? That would end up with a lot of waste, I think.


The executor context case might actually be a comparatively easy case to
address. There's really two "phases" of use for es_query_ctx. First, we create
the entire executor tree in it, during standard_ExecutorStart(). Second,
during query execution, we allocate things with query lifetime (be that
because they need to live till the end, or because they are otherwise
managed, like tuples).

Even very simple queries end up with multiple blocks at the end:
E.g.
  SELECT relname FROM pg_class WHERE relkind = 'r' AND relname = 'frak';
yields:
  ExecutorState: 43784 total in 3 blocks; 8960 free (5 chunks); 34824 used
    ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
  Grand total: 51976 bytes in 4 blocks; 16888 free (5 chunks); 35088 used

So quite justifiably we could just increase the hardcoded size in
CreateExecutorState. I'd expect that starting a few size classes up would help
noticeably.


But I think we likely could do better here. The amount of memory that ends up
in es_query_cxt during "phase 1" strongly correlates with the complexity of
the statement, as the whole executor tree ends up in it.  Using information
about the complexity of the planned statement to influence es_query_cxt's
block sizes would make sense to me.  I suspect it's a decent enough proxy for
"phase 2" as well.


Medium-long term I really want to allocate at least all the executor nodes
themselves in a single allocation. But that's a bit further out than what
we're talking about here.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: David Rowley
Date:
Subject: Re: Introduce list_reverse() to make lcons() usage less inefficient
Next
From: Andres Freund
Date:
Subject: Should CreateExprContext() be using ALLOCSET_DEFAULT_SIZES?