Thread: Re: Trim the heap free memory
Hi Shawn,
On Fri, Aug 23, 2024 at 2:24 PM shawn wang <shawn.wang.pg@gmail.com> wrote:
>
> Hi hackers,
>
> Currently, all processes in PostgreSQL use malloc to allocate and free memory. With long-lived connections that execute queries over extended periods, the process's memory layout can become extremely complex.
>
> Under certain circumstances, a common issue in memory usage due to the caching strategy of malloc may arise: even if memory is released through the free function, it may not be returned to the OS in a timely manner. This can lead to high system memory usage, affecting performance and the operation of other applications, and may even result in Out-Of-Memory (OOM) errors.
>
> To address this issue, I have developed a new function called pg_trim_backend_heap_free_memory, based on the existing pg_log_backend_memory_contexts function. This function triggers the specified process to execute the malloc_trim operation by sending signals, thereby releasing as much unreturned memory to the operating system as possible. This not only helps to optimize memory usage but can also significantly enhance system performance under memory pressure.
>
> Here is an example of using the pg_trim_backend_heap_free_memory function to demonstrate its effect:
>>
>> CREATE OR REPLACE FUNCTION public.partition_create(schemaname character varying, numberofpartition integer)
>> RETURNS integer
>> LANGUAGE plpgsql
>> AS $function$
>> declare
>> currentTableId integer;
>> currentSchemaName varchar(100);
>> currentTableName varchar(100);
>> begin
>> execute 'create schema ' || schemaname;
>> execute 'create table ' || schemaname || '.' || schemaname || 'hashtable (p1 text, p2 text, p3 text, p4 int, p5 int, p6 int, p7 int, p8 text, p9 name, p10 varchar, p11 text, p12 text, p13 text) PARTITION BY HASH(p1);';
>> currentTableId := 1;
>> loop
>> currentTableName := schemaname || '.' || schemaname || 'hashtable' || ltrim(currentTableId::varchar(10));
>> execute 'create table ' || currentTableName || ' PARTITION OF ' || schemaname || '.' || schemaname || 'hashtable' || ' FOR VALUES WITH(MODULUS ' || numberofpartition || ', REMAINDER ' || currentTableId - 1 || ')';
>> currentTableId := currentTableId + 1;
>> if (currentTableId > numberofpartition) then exit; end if;
>> end loop;
>> return currentTableId - 1;
>> END $function$;
>>
>> select public.partition_create('test3', 5000);
>> select public.partition_create('test4', 5000);
>> select count(*) from test4.test4hashtable a, test3.test3hashtable b where a.p1=b.p1;
>
> You can now observe the memory footprint of the process executing the query:
>>
>> postgres 68673 1.2 0.0 610456 124768 ? Ss 08:25 0:01 postgres: postgres postgres [local] idle
>> Size: 89600 kB
>> KernelPageSize: 4 kB
>> MMUPageSize: 4 kB
>> Rss: 51332 kB
>> Pss: 51332 kB
>>
>> 02b65000-082e5000 rw-p 00000000 00:00 0 [heap]
>
>
>
> After calling pg_trim_backend_heap_free_memory, you will see:
>>
>> postgres=# select pg_trim_backend_heap_free_memory(pg_backend_pid());
>> 2024-08-23 08:27:53.958 UTC [68673] LOG: trimming heap free memory of PID 68673
>> pg_trim_backend_heap_free_memory
>> ----------------------------------
>> t
>> (1 row)
>> 02b65000-082e5000 rw-p 00000000 00:00 0 [heap]
>> Size: 89600 kB
>> KernelPageSize: 4 kB
>> MMUPageSize: 4 kB
>> Rss: 4888 kB
>> Pss: 4888 kB
>>
>> postgres 68673 1.2 0.0 610456 75244 ? Ss 08:26 0:01 postgres: postgres postgres [local] idle
>
>
> Looking forward to your feedback,
Looks useful.
How much time does malloc_trim() take to finish? Does it affect the
current database activity in that backend? It may be good to see
effect of this function by firing the function on random backends
while the query is running through pgbench.
In the patch I don't see definitions of
ProcessTrimHeapFreeMemoryInterrupt() and
HandleTrimHeapFreeMemoryInterrupt(). Am I missing something?
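
For reference, the pg_log_backend_memory_contexts() machinery this patch
is modeled on splits the work into a signal-time part that only sets
flags and a CHECK_FOR_INTERRUPTS()-time part that does the real work.
The missing definitions are presumably along these lines - a sketch by
analogy with Handle/ProcessLogMemoryContextInterrupt(), not the patch's
actual code:

volatile sig_atomic_t TrimHeapFreeMemoryPending = false;

/* signal-handler part: runs in the SIGUSR1 procsignal handler */
void
HandleTrimHeapFreeMemoryInterrupt(void)
{
	InterruptPending = true;
	TrimHeapFreeMemoryPending = true;
}

/* worker part: runs at the next CHECK_FOR_INTERRUPTS() */
void
ProcessTrimHeapFreeMemoryInterrupt(void)
{
	TrimHeapFreeMemoryPending = false;

	ereport(LOG,
			(errmsg("trimming heap free memory of PID %d", MyProcPid)));

	malloc_trim(0);		/* release as much free heap to the OS as possible */
}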
--
Best Wishes,
Ashutosh Bapat
Hi Shawn,

It will be good to document usage of this function. Please add
document changes in your patch. We need to document the impact of this
function so that users can judiciously decide whether or not to use
this function and under what conditions. Also they would know what to
expect when they use this function.

Running it after a query finishes is one thing but that can't be
guaranteed because of the asynchronous nature of signal handlers.
malloc_trim() may be called while a query is being executed. We need
to assess that impact as well.

Can you please share some numbers - TPS, latency etc. with and without
this function invoked during a benchmark run?

--
Best Wishes,
Ashutosh Bapat

On Sat, Aug 24, 2024 at 8:12 AM shawn wang <shawn.wang.pg@gmail.com> wrote:
>
> Hi Ashutosh, thank you for your response.
> Firstly, the purpose of caching memory in malloc is for performance, so when we execute malloc_trim(), it will affect the efficiency of memory usage in the subsequent operation. Secondly, what malloc_trim() does is lock and traverse the bins, then execute madvise on the memory that can be released. When there is a lot of memory in the bins, the traversal time will also increase. I once placed malloc_trim() at the end of each query, which resulted in a 20% performance drop. Therefore, I made it an on-demand function instead. The new v2 patch has included the omitted code.
> [...]
#include <malloc.h>
         ^~~~~~~~~~
1 error generated.
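
The error above is what a non-glibc toolchain (apparently clang here)
reports, since <malloc.h> and malloc_trim() are glibc-specific. A
hypothetical guard along these lines would keep the build working
elsewhere - a sketch, not code from the patch:

#ifdef __GLIBC__
#include <malloc.h>
#endif

/* later, at the trim call site: */
#ifdef __GLIBC__
	malloc_trim(0);
#else
	/* malloc_trim() is unavailable on this platform; do nothing */
#endif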
Hi Ashutosh,

On Mon, Aug 26, 2024 at 7:05 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> Hi Shawn,
>
> It will be good to document usage of this function. Please add
> document changes in your patch. We need to document the impact of this
> function so that users can judiciously decide whether or not to use
> this function and under what conditions. Also they would know what to
> expect when they use this function.

I have already incorporated the usage of this function into the new patch.

Currently, there is no single memory statistic that tells you precisely
whether a trim operation should be performed. Here are two conditions
that can be used as references:

1. Check the difference between the process's memory usage (for
   example, from the top command; because of shared memory, it is
   necessary to subtract SHR from RES) and the statistics of the memory
   contexts. If the difference is very large, this function should be
   used to release memory.

2. Execute malloc_stats(). If the "system bytes" are greater than the
   "in use bytes", this indicates that this function can be used to
   release memory.

> Running it after a query finishes is one thing but that can't be
> guaranteed because of the asynchronous nature of signal handlers.
> malloc_trim() may be called while a query is being executed. We need
> to assess that impact as well.
>
> Can you please share some numbers - TPS, latency etc. with and without
> this function invoked during a benchmark run?

I have placed malloc_trim() at the end of the exec_simple_query
function, so that malloc_trim() is executed once for each SQL statement
executed. I used pgbench to reproduce the performance impact, and the
results are as follows.

Database preparation:

create database testc;
create user t1;
alter database testc owner to t1;
./pgbench testc -U t1 -i -s 100
./pgbench testc -U t1 -S -c 100 -j 100 -T 600

Without Trim:

$./pgbench testc -U t1 -S -c 100 -j 100 -T 600
pgbench (18devel)
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 100
query mode: simple
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 600 s
number of transactions actually processed: 551984376
number of failed transactions: 0 (0.000%)
latency average = 0.109 ms
initial connection time = 23.569 ms
tps = 920001.842189 (without initial connection time)

With Trim:

$./pgbench testc -U t1 -S -c 100 -j 100 -T 600
pgbench (18devel)
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 100
query mode: simple
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 600 s
number of transactions actually processed: 470690787
number of failed transactions: 0 (0.000%)
latency average = 0.127 ms
initial connection time = 23.632 ms
tps = 784511.901558 (without initial connection time)
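
One way to avoid paying that roughly 15% TPS penalty on every statement
would be to trim only when the allocator is actually holding a lot of
freed memory. A sketch of such a gate using glibc's mallinfo2()
(glibc 2.33+); the 64MB threshold is an arbitrary assumption, not
something from the patch:

#include <malloc.h>

static void
maybe_trim_heap(void)
{
	struct mallinfo2 mi = mallinfo2();

	/*
	 * fordblks is the total free space held inside the heap; only pay
	 * for malloc_trim() when a substantial amount could be returned.
	 */
	if (mi.fordblks > 64 * 1024 * 1024)
		malloc_trim(0);
}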
--
Rafia Sabih
On Thu, 12 Sept 2024 at 14:40, shawn wang <shawn.wang.pg@gmail.com> wrote:
> Could you please perform another round of testing to ensure that everything is functioning as expected with this modification?

One way to get a few machines with various build systems testing this
is to register the patch on the commitfest app in [1]. You can then
see if the patch is passing the continuous integration tests in [2].
One day soon the features of [2] should be combined with [1].

David

[1] https://commitfest.postgresql.org/50/
[2] http://cfbot.cputube.org/
shawn wang <shawn.wang.pg@gmail.com> writes:
> I have successfully registered my patch for the commitfest. However, upon
> integration, I encountered several errors during the testing phase. I am
> currently investigating the root causes of these issues and will work on
> providing the necessary fixes.

I should think the root cause is pretty obvious: malloc_trim() is
a glibc-ism. I'm fairly doubtful that this is something we should
spend time on. It can never work on any non-glibc platform.

Even granting that a Linux-only feature could be worth having, I'm
really doubtful that our memory allocation patterns are such that
malloc_trim() could be expected to free a useful amount of memory
mid-query. The single test case you showed suggested that maybe we
could usefully prod glibc to free memory at query completion, but we
don't need all this interrupt infrastructure to do that. I think we
could likely get 95% of the benefit with about a five-line patch.

			regards, tom lane
I wrote:
> The single test case you showed suggested that maybe we could
> usefully prod glibc to free memory at query completion, but we
> don't need all this interrupt infrastructure to do that. I think
> we could likely get 95% of the benefit with about a five-line
> patch.

To try to quantify that a little, I wrote a very quick-n-dirty
patch to apply malloc_trim during finish_xact_command and log
the effects. (I am not asserting this is the best place to
call malloc_trim; it's just one plausible possibility.) Patch
attached, as well as statistics collected from a run of the
core regression tests followed by

grep malloc_trim postmaster.log | sed 's/.*LOG:/LOG:/' | sort -k4n | uniq -c >trim_savings.txt

We can see that out of about 43K test queries, 32K saved nothing
whatever, and in only four was more than a couple of meg saved.
That's pretty discouraging IMO. It might be useful to look closer
at the behavior of those top four though. I see them as

2024-09-15 14:58:06.146 EDT [960138] LOG: malloc_trim saved 7228 kB
2024-09-15 14:58:06.146 EDT [960138] STATEMENT: ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);

2024-09-15 14:58:09.861 EDT [960949] LOG: malloc_trim saved 12488 kB
2024-09-15 14:58:09.861 EDT [960949] STATEMENT: with recursive search_graph(f, t, label, is_cycle, path) as (
	select *, false, array[row(g.f, g.t)] from graph g
	union distinct
	select g.*, row(g.f, g.t) = any(path), path || row(g.f, g.t)
	from graph g, search_graph sg
	where g.f = sg.t and not is_cycle
)
select * from search_graph;

2024-09-15 14:58:09.866 EDT [960949] LOG: malloc_trim saved 12488 kB
2024-09-15 14:58:09.866 EDT [960949] STATEMENT: with recursive search_graph(f, t, label) as (
	select * from graph g
	union distinct
	select g.*
	from graph g, search_graph sg
	where g.f = sg.t
) cycle f, t set is_cycle to 'Y' default 'N' using path
select * from search_graph;

2024-09-15 14:58:09.853 EDT [960949] LOG: malloc_trim saved 12616 kB
2024-09-15 14:58:09.853 EDT [960949] STATEMENT: with recursive search_graph(f, t, label) as (
	select * from graph0 g
	union distinct
	select g.*
	from graph0 g, search_graph sg
	where g.f = sg.t
) search breadth first by f, t set seq
select * from search_graph order by seq;

I don't understand why WITH RECURSIVE queries might be more prone
to leave non-garbage-collected memory behind than other queries,
but maybe that is worth looking into.

			regards, tom lane
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8bc6bea113..9efb4f7636 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -21,6 +21,7 @@
 #include <fcntl.h>
 #include <limits.h>
+#include <malloc.h>
 #include <signal.h>
 #include <unistd.h>
 #include <sys/resource.h>
@@ -2797,6 +2798,16 @@ finish_xact_command(void)
 		MemoryContextStats(TopMemoryContext);
 #endif
 
+		{
+			char	   *old_brk = sbrk(0);
+			char	   *new_brk;
+
+			malloc_trim(0);
+			new_brk = sbrk(0);
+			elog(LOG, "malloc_trim saved %zu kB",
+				 (old_brk - new_brk + 1023) / 1024);
+		}
+
 		xact_started = false;
 	}
 }

trim_savings.txt, condensed as occurrences:kB-saved pairs:

32293:0  4:4  12:8  7:12  57:16  3:20  22:24  14:28
288:32  20:36  26:40  18:44  26:48  27:52  37:56  26:60
218:64  20:68  44:72  44:76  45:80  29:84  91:88  31:92
191:96  30:100  81:104  24:108  214:112  32:116  178:120  86:124
4498:128  29:132  286:136  34:140  563:144  20:148  821:152  987:156
212:160  8:164  22:168  8:172  19:176  7:180  10:184  5:188
128:192  7:196  14:200  14:204  9:208  11:212  11:216  12:224
6:228  5:232  2:236  3:240  7:244  3:248  3:252  186:256
17:260  28:264  51:268  33:272  33:276  69:280  25:284  39:288
12:292  20:296  12:300  34:304  5:308  2:312  6:316  12:320
1:324  1:328  1:332  6:336  3:344  1:348  6:352  1:356
43:360  103:364  1:368  119:384  1:388  1:392  2:396  4:412
4:416  1:420  3:424  1:428  5:432  3:436  1:440  27:448
6:452  5:464  10:480  2:484  1:492  1:496  1:500  4:508
16:512  1:528  2:536  1:540  1:552  2:556  1:576  1:580
6:588  1:604  1:616  1:620  1:628  1:632  3:640  2:668
2:672  1:716  2:728  1:732  1:736  2:748  1:760  11:768
2:776  1:800  10:844  1:880  1:884  1:896  2:904  6:960
4:996  1:1024  3:1084  1:1088  1:1136  7:1152  1:1160  1:1248
1:1256  1:1280  1:1288  2:1296  3:1300  2:1412  2:1416  1:1512
3:1520  5:1536  2:1568  1:1608  2:1612  1:1648  1:1768  1:1772
1:1792  1:1860  1:1872  1:1908  1:1936  2:2004  1:2104  1:2112
1:2116  1:2176  1:2228  2:2240  1:2348  1:2496  1:2624  1:2676
1:3568  1:3712  1:3760  1:3836  1:7228  2:12488  1:12616
Thank you very much for your response and suggestions.
As you mentioned, the patch here is actually designed for glibc's
ptmalloc2 and is not applicable to other platforms. I will consider
supporting it only on the Linux platform in the future.

In the memory management strategy of ptmalloc2, there is a certain
amount of non-garbage-collected memory, which is closely related to the
order and method of memory allocation and release. To reduce the
performance overhead caused by frequent allocation and release of small
blocks of memory, ptmalloc2 intentionally retains this memory. The
malloc_trim function locks, traverses memory blocks, and uses madvise
to release this memory, but this process may also have a negative
impact on performance.

In the process of exploring solutions, I also considered a variety of
strategies, including scheduling malloc_trim to be executed at regular
intervals or triggering malloc_trim after a specific number of free
operations. However, we found that these methods are not optimal
solutions.
> We can see that out of about 43K test queries, 32K saved nothing
> whatever, and in only four was more than a couple of meg saved.
> That's pretty discouraging IMO. It might be useful to look closer
> at the behavior of those top four though. I see them as

I have previously encountered situations where the non-garbage-collected
memory of wal_sender was approximately hundreds of megabytes or even
exceeded 1GB, but I was unable to reproduce this situation using simple
SQL. Therefore, I introduced an asynchronous processing function, hoping
to manage memory more efficiently without affecting performance.
In addition, I have considered the following optimization strategies:
1. Adjust the configuration of ptmalloc2 through the mallopt function
   to use mmap rather than sbrk for memory allocation. This can
   immediately return the memory to the operating system when it is
   released, but it may affect performance due to the higher overhead
   of mmap.

2. Use other memory allocators such as jemalloc or tcmalloc, and adjust
   relevant parameters to reduce the generation of non-garbage-collected
   memory. However, these allocators are designed for multi-threaded use
   and may lead to increased memory usage per process.

3. Build a set of memory context allocation functions based on mmap,
   delegating the responsibility of memory management entirely to the
   database level (see the sketch after this list). Although this
   solution can effectively control memory allocation, it requires a
   large-scale engineering implementation.
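
As a minimal illustration of why mmap-backed allocation (strategies 1
and 3) returns memory promptly: an mmap()'d block goes back to the
kernel as soon as it is munmap()'d, with no allocator cache in between.
A sketch, not the proposed design:

#include <sys/mman.h>

	/* back one memory-context block with an anonymous mapping */
	size_t		len = 8 * 1024 * 1024;
	void	   *block = mmap(NULL, len, PROT_READ | PROT_WRITE,
							 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* ... carve palloc-style chunks out of block ... */

	munmap(block, len);			/* RSS drops immediately, nothing cached */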
I look forward to further discussing these solutions with you and exploring the best memory management practices together.
Best regards, Shawn
On 9/18/24 04:56, shawn wang wrote:
> [...]
>
> I have previously encountered situations where the non-garbage-collected
> memory of wal_sender was approximately hundreds of megabytes or even
> exceeded 1GB, but I was unable to reproduce this situation using simple
> SQL. Therefore, I introduced an asynchronous processing function, hoping
> to manage memory more efficiently without affecting performance.

I doubt a system function is the right approach to deal with these
memory allocation issues. The function has to be called by the user,
which means the user is expected to monitor the system and decide when
to invoke the function. That seems far from trivial - it would require
collecting OS-level information about memory usage, and I suppose it'd
need to happen fairly often to actually help with OOM reliably.

> In addition, I have considered the following optimization strategies:
>
> 1. Adjust the configuration of ptmalloc2 through the mallopt function
>    to use mmap rather than sbrk for memory allocation. This can
>    immediately return the memory to the operating system when it is
>    released, but it may affect performance due to the higher overhead
>    of mmap.

Sure, forcing the system to release memory more aggressively may affect
performance - that's the tradeoff done by glibc. But calling the new
pg_trim_backend_heap_free_memory() function is not free either.

But why would it force the memory to be returned immediately? The
decision whether to trim memory is driven by M_TRIM_THRESHOLD, and
that does not need to be 0. In fact, it's 128kB by default, i.e. glibc
trims memory automatically, if it can trim at least 128kB.

Yes, by default the thresholds are adjusted dynamically, which I guess
is one way to get excessive memory usage that could have been solved by
calling malloc_trim(). But setting the option to any value disables the
dynamic behavior; it doesn't need to be set to 0.

> 2. Use other memory allocators such as jemalloc or tcmalloc, and adjust
>    relevant parameters to reduce the generation of non-garbage-collected
>    memory. However, these allocators are designed for multi-threaded use
>    and may lead to increased memory usage per process.

Right, that's kinda the opposite of trying to not waste memory.

But it also suggests syscalls (done by malloc) may be a problem under
high concurrency - not just with multi-threading, but even with regular
processes. And for glibc that matters too, of course - in fact, it may
be pretty important to allow glibc to cache more memory (by increasing
M_TOP_PAD) to get good throughput in certain workloads ...

> 3. Build a set of memory context allocation functions based on mmap,
>    delegating the responsibility of memory management entirely to the
>    database level. Although this solution can effectively control
>    memory allocation, it requires a large-scale engineering
>    implementation.

Why would it be complex? You could just as well set M_MMAP_THRESHOLD to
some low value, so that all malloc() calls are handled by mmap()
internally. Not sure it's a good idea, though.

> I look forward to further discussing these solutions with you and
> exploring the best memory management practices together.

Adjusting the glibc malloc() behavior may be important, but I don't
think a system function is a good approach. It's possible to change the
behavior by setting environment variables, which is pretty easy, but
maybe we could have something that does the same thing using mallopt().

That's what Ronan Dunklau proposed in thread [1] a year ago ... I like
that approach much more, it's much simpler for the user.

[1] https://www.postgresql.org/message-id/flat/3424675.QJadu78ljV%40aivenlaptop

regards

--
Tomas Vondra
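
For concreteness, the knobs discussed above are plain mallopt() calls;
per glibc's documentation, setting any of them disables the
corresponding dynamic adjustment. A sketch with illustrative values,
not recommendations from this thread:

#include <malloc.h>

	/* trim automatically whenever ~4MB could be returned to the OS */
	mallopt(M_TRIM_THRESHOLD, 4 * 1024 * 1024);

	/* keep ~4MB of extra heap as a cache for future allocations */
	mallopt(M_TOP_PAD, 4 * 1024 * 1024);

	/* serve allocations >= 8MB via mmap(); glibc otherwise adjusts this
	 * threshold dynamically, up to 32MB on 64-bit systems */
	mallopt(M_MMAP_THRESHOLD, 8 * 1024 * 1024);

The same settings can be applied without code via glibc's environment
variables (MALLOC_TRIM_THRESHOLD_, MALLOC_TOP_PAD_,
MALLOC_MMAP_THRESHOLD_ - note the trailing underscores).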
On 12/8/24 05:23, Tomas Vondra wrote:
> [...]
>
> Adjusting the glibc malloc() behavior may be important, but I don't
> think a system function is a good approach. It's possible to change the
> behavior by setting environment variables, which is pretty easy, but
> maybe we could have something that does the same thing using mallopt().
>
> That's what Ronan Dunklau proposed in thread [1] a year ago ... I like
> that approach much more, it's much simpler for the user.

To propose something less abstract / more tangible, I think we should
do something like this:

1) add a bit of code for glibc-based systems, that adjusts selected
   malloc parameters using mallopt() during startup

2) add a GUC that enables this, with the default being the regular
   glibc behavior (with dynamic adjustment of various thresholds)

Exactly which parameters this would set is an open question, but based
on my earlier experiments, Ronan's earlier patches, etc. I think it
should adjust at least

M_TRIM_THRESHOLD - to make sure we trim heap regularly

M_TOP_PAD - to make sure we cache some allocated memory

I wonder if maybe we should tune M_MMAP_THRESHOLD, which on 64-bit
systems defaults to 32MB, so we don't really mmap() very often for
regular memory contexts. But I don't know if that's a good idea, that
would need some experiments.

I believe that's essentially what Ronan Dunklau proposed, but it
stalled. Not because of some inherent complexity, but because of
concerns about introducing glibc-specific code. Based on my recent
experiments I think it's clearly worth it (esp. with high concurrency
workloads). If glibc was a niche, it'd be a different situation, but
I'd guess the vast majority of databases run on glibc.

Yes, it's possible to do these changes without new code (e.g. by
setting the environment variables), but that's rather inconvenient.

Perhaps it'd be possible to make it a bit smarter by looking at malloc
stats, and adjusting the trim/pad thresholds, but I'd leave that for
the future. It might even lead to similar issues with excessive memory
usage just like the logic built into glibc. But maybe we could at least
print / provide some debugging information? That would help with
adjusting the GUC ...

regards

--
Tomas Vondra
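
A rough sketch of how 1) and 2) might fit together; the GUC name, the
chosen values, and the call site are hypothetical, not from any posted
patch:

#ifdef __GLIBC__
#include <malloc.h>
#endif

/* hypothetical GUC: off by default, i.e. stock glibc behavior */
bool		glibc_malloc_tuning = false;

/* called once during backend startup, e.g. from PostgresMain() */
static void
maybe_tune_glibc_malloc(void)
{
#ifdef __GLIBC__
	if (!glibc_malloc_tuning)
		return;

	/* fixed thresholds; setting them disables glibc's dynamic tuning */
	mallopt(M_TRIM_THRESHOLD, 4 * 1024 * 1024); /* trim heap regularly */
	mallopt(M_TOP_PAD, 4 * 1024 * 1024);		/* cache some memory */
#endif
}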