Thread: Stack overflow issue

Stack overflow issue

From
Егор Чиндяскин
Date:
Hello, I recently got a server crash (bug #17583 [1]) caused by a stack overflow. 
 
Tom Lane and Richard Guo, in a discussion of this bug, suggested that there could be more such places. 
Therefore, Alexander Lakhin and I decided to deal with this issue and Alexander developed a methodology. We processed src/backend/*/*.c with "clang -emit-llvm  ... | opt -analyze -print-calgraph" to find all the functions that call themselves directly. I checked each of them for features that protect against stack overflows.
We analyzed 4 catalogs: regex, tsearch, snowball and adt.
Firstly, we decided to test the regex catalog functions and found 6 of them that lack the check_stach_depth() call.
 
zaptreesubs
markst
next
nfatree
numst
repeat
 
We have tried to exploit the recursion in the function zaptreesubs():
select regexp_matches('a' || repeat(' a', 11000), '(.)(' || repeat(' \1', 11000) || ')?');
 
ERROR:  invalid regular expression: regular expression is too complex
 
repeat():
select regexp_match('abc01234xyz',repeat('a{0,2}',100001));
 
ERROR:  invalid regular expression: regular expression is too complex
 
numst():
select regexp_match('abc01234xyz',repeat('(.)\1e',100001));
 
ERROR:  invalid regular expression: regular expression is too complex
 
markst():
markst is called in the code after v->tree = parse(...);
it is necessary that the tree be successfully parsed, but with a nesting level of about 100,000 this will not work - stack protection will work during parsing and v->ntree = numst(...); is also there.
 
next():
we were able to crash the server with the following query:
(printf "SELECT regexp_match('abc', 'a"; for ((i=1;i<1000000;i++)); do printf "(?#)"; done; printf "b')" ) | psql
 
Secondly, we have tried to exploit the recursion in the adt catalog functions and Alexander was able to crash the server with the following query:
 
regex_selectivity_sub(): 
SELECT * FROM pg_proc WHERE proname ~ ('(a' || repeat('|', 200000) || 'b)');
 
And this query:
 
(n=100000;
printf "SELECT polygon '((0,0),(0,1000000))' <@ polygon '((-200000,1000000),";
for ((i=1;i<$n;i++)); do printf "(100000,$(( 300000 + $i))),(-100000,$((800000 + $i))),"; done;
printf "(200000,900000),(200000,0))';"
) | psql
 
Thirdly, the snowball catalog, Alexander has tried to exploit the recursion in the r_stem_suffix_chain_before_ki function and crashed a server using this query:
 
r_stem_suffix_chain_before_ki():
SELECT ts_lexize('turkish_stem', repeat('lerdeki', 1000000));
 
The last one is the tsearch catalog. We have found 4 functions that didn't have check_stach_depth() function: 
 
SplitToVariants
mkANode
mkSPNode
LexizeExec
 
We have tried to exploit the recursion in the SplitToVariants function and Alexander crashed a server using this:
 
SplitToVariants():
CREATE TEXT SEARCH DICTIONARY ispell (Template=ispell, DictFile=ispell_sample,AffFile=ispell_sample);
SELECT ts_lexize('ispell', repeat('bally', 10000));
 
After trying to exploit the recursion in the LexizeExec function Alexander made this conlusion: 
 
LexizeExec has two branches "ld->curDictId == InvalidOid" (usual mode) and "ld->curDictId != InvalidOid" (multiword mode) - we start with the first one, then make recursive call to switch to the multiword mode, but then we return to the usual mode again.
 
mkANode and mkSPNode deal with the dictionary structs, not with user-supplied data, so we believe these functions are not vulnerable.
 
[1] https://www.postgresql.org/message-id/flat/CAMbWs499ytQiH4mLMhRxRWP-iEUz3-DSinpAD-cUCtVo_23Wtg%40mail.gmail.com#03ad703cf4bc8d28ccba69913e1e8106

Re: Stack overflow issue

From
mahendrakar s
Date:
Hi,
Can we have a parameter to control the recursion depth in these cases to avoid crashes? 
Just a thought.

Thanks,
Mahendrakar.

On Wed, 24 Aug, 2022, 3:21 pm Егор Чиндяскин, <kyzevan23@mail.ru> wrote:
Hello, I recently got a server crash (bug #17583 [1]) caused by a stack overflow. 
 
Tom Lane and Richard Guo, in a discussion of this bug, suggested that there could be more such places. 
Therefore, Alexander Lakhin and I decided to deal with this issue and Alexander developed a methodology. We processed src/backend/*/*.c with "clang -emit-llvm  ... | opt -analyze -print-calgraph" to find all the functions that call themselves directly. I checked each of them for features that protect against stack overflows.
We analyzed 4 catalogs: regex, tsearch, snowball and adt.
Firstly, we decided to test the regex catalog functions and found 6 of them that lack the check_stach_depth() call.
 
zaptreesubs
markst
next
nfatree
numst
repeat
 
We have tried to exploit the recursion in the function zaptreesubs():
select regexp_matches('a' || repeat(' a', 11000), '(.)(' || repeat(' \1', 11000) || ')?');
 
ERROR:  invalid regular expression: regular expression is too complex
 
repeat():
select regexp_match('abc01234xyz',repeat('a{0,2}',100001));
 
ERROR:  invalid regular expression: regular expression is too complex
 
numst():
select regexp_match('abc01234xyz',repeat('(.)\1e',100001));
 
ERROR:  invalid regular expression: regular expression is too complex
 
markst():
markst is called in the code after v->tree = parse(...);
it is necessary that the tree be successfully parsed, but with a nesting level of about 100,000 this will not work - stack protection will work during parsing and v->ntree = numst(...); is also there.
 
next():
we were able to crash the server with the following query:
(printf "SELECT regexp_match('abc', 'a"; for ((i=1;i<1000000;i++)); do printf "(?#)"; done; printf "b')" ) | psql
 
Secondly, we have tried to exploit the recursion in the adt catalog functions and Alexander was able to crash the server with the following query:
 
regex_selectivity_sub(): 
SELECT * FROM pg_proc WHERE proname ~ ('(a' || repeat('|', 200000) || 'b)');
 
And this query:
 
(n=100000;
printf "SELECT polygon '((0,0),(0,1000000))' <@ polygon '((-200000,1000000),";
for ((i=1;i<$n;i++)); do printf "(100000,$(( 300000 + $i))),(-100000,$((800000 + $i))),"; done;
printf "(200000,900000),(200000,0))';"
) | psql
 
Thirdly, the snowball catalog, Alexander has tried to exploit the recursion in the r_stem_suffix_chain_before_ki function and crashed a server using this query:
 
r_stem_suffix_chain_before_ki():
SELECT ts_lexize('turkish_stem', repeat('lerdeki', 1000000));
 
The last one is the tsearch catalog. We have found 4 functions that didn't have check_stach_depth() function: 
 
SplitToVariants
mkANode
mkSPNode
LexizeExec
 
We have tried to exploit the recursion in the SplitToVariants function and Alexander crashed a server using this:
 
SplitToVariants():
CREATE TEXT SEARCH DICTIONARY ispell (Template=ispell, DictFile=ispell_sample,AffFile=ispell_sample);
SELECT ts_lexize('ispell', repeat('bally', 10000));
 
After trying to exploit the recursion in the LexizeExec function Alexander made this conlusion: 
 
LexizeExec has two branches "ld->curDictId == InvalidOid" (usual mode) and "ld->curDictId != InvalidOid" (multiword mode) - we start with the first one, then make recursive call to switch to the multiword mode, but then we return to the usual mode again.
 
mkANode and mkSPNode deal with the dictionary structs, not with user-supplied data, so we believe these functions are not vulnerable.
 

Re: Stack overflow issue

From
Alvaro Herrera
Date:
On 2022-Aug-24, mahendrakar s wrote:

> Hi,
> Can we have a parameter to control the recursion depth in these cases to
> avoid crashes?

We already have one (max_stack_depth).  The problem is lack of calling
the control function in a few places.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/



Re: Stack overflow issue

From
mahendrakar s
Date:
Thanks. 

On Wed, 24 Aug, 2022, 4:19 pm Alvaro Herrera, <alvherre@alvh.no-ip.org> wrote:
On 2022-Aug-24, mahendrakar s wrote:

> Hi,
> Can we have a parameter to control the recursion depth in these cases to
> avoid crashes?

We already have one (max_stack_depth).  The problem is lack of calling
the control function in a few places.

--
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/

Re: Stack overflow issue

From
Richard Guo
Date:

On Wed, Aug 24, 2022 at 6:49 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2022-Aug-24, mahendrakar s wrote:

> Hi,
> Can we have a parameter to control the recursion depth in these cases to
> avoid crashes?

We already have one (max_stack_depth).  The problem is lack of calling
the control function in a few places.
 
Thanks Egor and Alexander for the work! I think we can just add
check_stack_depth checks in these cases.

Thanks
Richard

Re: Stack overflow issue

From
Richard Guo
Date:

On Wed, Aug 24, 2022 at 7:12 PM Richard Guo <guofenglinux@gmail.com> wrote:

On Wed, Aug 24, 2022 at 6:49 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2022-Aug-24, mahendrakar s wrote:

> Hi,
> Can we have a parameter to control the recursion depth in these cases to
> avoid crashes?

We already have one (max_stack_depth).  The problem is lack of calling
the control function in a few places.
 
Thanks Egor and Alexander for the work! I think we can just add
check_stack_depth checks in these cases.
 
Attached adds the checks in these places. But I'm not sure about the
snowball case. Can we edit src/backend/snowball/libstemmer/*.c directly?

Thanks
Richard
Attachment

Re: Stack overflow issue

From
mahendrakar s
Date:
Hi Richard,

Patch is looking good to me. Would request others to take a look at it as well.

Thanks,
Mahendrakar.

On Wed, 24 Aug 2022 at 17:24, Richard Guo <guofenglinux@gmail.com> wrote:

On Wed, Aug 24, 2022 at 7:12 PM Richard Guo <guofenglinux@gmail.com> wrote:

On Wed, Aug 24, 2022 at 6:49 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2022-Aug-24, mahendrakar s wrote:

> Hi,
> Can we have a parameter to control the recursion depth in these cases to
> avoid crashes?

We already have one (max_stack_depth).  The problem is lack of calling
the control function in a few places.
 
Thanks Egor and Alexander for the work! I think we can just add
check_stack_depth checks in these cases.
 
Attached adds the checks in these places. But I'm not sure about the
snowball case. Can we edit src/backend/snowball/libstemmer/*.c directly?

Thanks
Richard

Re: Stack overflow issue

From
Tom Lane
Date:
=?UTF-8?B?0JXQs9C+0YAg0KfQuNC90LTRj9GB0LrQuNC9?= <kyzevan23@mail.ru> writes:
> Therefore, Alexander Lakhin and I decided to deal with this issue and Alexander developed a methodology. We processed
src/backend/*/*.cwith "clang -emit-llvm  ... | opt -analyze -print-calgraph" to find all the functions that call
themselvesdirectly. I checked each of them for features that protect against stack overflows. 
> We analyzed 4 catalogs: regex, tsearch, snowball and adt.
> Firstly, we decided to test the regex catalog functions and found 6 of them that lack the check_stach_depth() call.

Nice work!  I wonder if you can make the regex crashes reachable by
reducing the value of max_stack_depth enough that it's hit before
reaching the "regular expression is too complex" limit.

            regards, tom lane



Re: Stack overflow issue

From
Tom Lane
Date:
Richard Guo <guofenglinux@gmail.com> writes:
> Attached adds the checks in these places. But I'm not sure about the
> snowball case. Can we edit src/backend/snowball/libstemmer/*.c directly?

No, that file is generated code, as it says right at the top.

I think most likely we should report this to Snowball upstream
and see what they think is an appropriate fix.

            regards, tom lane



Re: Stack overflow issue

From
Tom Lane
Date:
=?UTF-8?B?0JXQs9C+0YAg0KfQuNC90LTRj9GB0LrQuNC9?= <kyzevan23@mail.ru> writes:
> Firstly, we decided to test the regex catalog functions and found 6 of them that lack the check_stach_depth() call.

> zaptreesubs
> markst
> next
> nfatree
> numst
> repeat

I took a closer look at these.  I think the markst, numst, and nfatree
cases are non-issues.  They are recursing on a subre tree that was just
built by parse(), so parse() must have successfully recursed the same
number of levels.  parse() surely has a larger stack frame, and it
does have a stack overflow guard (in subre()), so it would have failed
cleanly before making a data structure that could be hazardous here.
Also, having markst error out would be problematic for the reasons
discussed in its comment, so I'm disinclined to try to add checks
that have no use.

BTW, I wonder why your test didn't notice freesubre()?  But the
same analysis applies there, as does the concern that we can't
just error out.

Likewise, zaptreesubs() can't recurse more levels than cdissect() did,
and that has a stack check, so I'm not very excited about adding
another one there.

I believe that repeat() is a non-issue because (a) the number of
recursion levels in it is limited by DUPMAX, which is generally going
to be 255, or at least not enormous, and (b) it will recurse at most
once before calling dupnfa(), which contains stack checks.

I think next() is a legit issue, although your example doesn't crash
for me.  I suppose that's because my compiler turned the tail recursion
into a loop, and I suggest that we fix it by doing that manually.
(Richard's proposed fix is wrong anyway: we can't just throw elog(ERROR)
in the regex code without creating memory leaks.)

            regards, tom lane



Re: Stack overflow issue

From
Tom Lane
Date:
I wrote:
> I think most likely we should report this to Snowball upstream
> and see what they think is an appropriate fix.

Done at [1], and I pushed the other fixes.  Thanks again for the report!

            regards, tom lane

[1] https://lists.tartarus.org/pipermail/snowball-discuss/2022-August/001734.html



Re: Stack overflow issue

From
Tom Lane
Date:
I wrote:
>> I think most likely we should report this to Snowball upstream
>> and see what they think is an appropriate fix.

> Done at [1], and I pushed the other fixes.  Thanks again for the report!

The upstream recommendation, which seems pretty sane to me, is to
simply reject any string exceeding some threshold length as not
possibly being a word.  Apparently it's common to use thresholds
as small as 64 bytes, but in the attached I used 1000 bytes.

            regards, tom lane

diff --git a/src/backend/snowball/dict_snowball.c b/src/backend/snowball/dict_snowball.c
index 68c9213f69..aaf4ff72b6 100644
--- a/src/backend/snowball/dict_snowball.c
+++ b/src/backend/snowball/dict_snowball.c
@@ -272,11 +272,25 @@ dsnowball_lexize(PG_FUNCTION_ARGS)
     DictSnowball *d = (DictSnowball *) PG_GETARG_POINTER(0);
     char       *in = (char *) PG_GETARG_POINTER(1);
     int32        len = PG_GETARG_INT32(2);
-    char       *txt = lowerstr_with_len(in, len);
     TSLexeme   *res = palloc0(sizeof(TSLexeme) * 2);
+    char       *txt;

+    /*
+     * Reject strings exceeding 1000 bytes, as they're surely not words in any
+     * human language.  This restriction avoids wasting cycles on stuff like
+     * base64-encoded data, and it protects us against possible inefficiency
+     * or misbehavior in the stemmers (for example, the Turkish stemmer has an
+     * indefinite recursion so it can crash on long-enough strings).
+     */
+    if (len <= 0 || len > 1000)
+        PG_RETURN_POINTER(res);
+
+    txt = lowerstr_with_len(in, len);
+
+    /* txt is probably not zero-length now, but we'll check anyway */
     if (*txt == '\0' || searchstoplist(&(d->stoplist), txt))
     {
+        /* empty or stopword, so reject */
         pfree(txt);
     }
     else

Re: Stack overflow issue

From
Tom Lane
Date:
I wrote:
> The upstream recommendation, which seems pretty sane to me, is to
> simply reject any string exceeding some threshold length as not
> possibly being a word.  Apparently it's common to use thresholds
> as small as 64 bytes, but in the attached I used 1000 bytes.

On further thought: that coding treats anything longer than 1000
bytes as a stopword, but maybe we should just accept it unmodified.
The manual says "A Snowball dictionary recognizes everything, whether
or not it is able to simplify the word".  While "recognizes" formally
includes the case of "recognizes as a stopword", people might find
this behavior surprising.  We could alternatively do it as attached,
which accepts overlength words but does nothing to them except
case-fold.  This is closer to the pre-patch behavior, but gives up
the opportunity to avoid useless downstream processing of long words.

            regards, tom lane

diff --git a/src/backend/snowball/dict_snowball.c b/src/backend/snowball/dict_snowball.c
index 68c9213f69..1d5dfff5a0 100644
--- a/src/backend/snowball/dict_snowball.c
+++ b/src/backend/snowball/dict_snowball.c
@@ -275,8 +275,24 @@ dsnowball_lexize(PG_FUNCTION_ARGS)
     char       *txt = lowerstr_with_len(in, len);
     TSLexeme   *res = palloc0(sizeof(TSLexeme) * 2);

-    if (*txt == '\0' || searchstoplist(&(d->stoplist), txt))
+    /*
+     * Do not pass strings exceeding 1000 bytes to the stemmer, as they're
+     * surely not words in any human language.  This restriction avoids
+     * wasting cycles on stuff like base64-encoded data, and it protects us
+     * against possible inefficiency or misbehavior in the stemmer.  (For
+     * example, the Turkish stemmer has an indefinite recursion, so it can
+     * crash on long-enough strings.)  However, Snowball dictionaries are
+     * defined to recognize all strings, so we can't reject the string as an
+     * unknown word.
+     */
+    if (len > 1000)
+    {
+        /* return the lexeme lowercased, but otherwise unmodified */
+        res->lexeme = txt;
+    }
+    else if (*txt == '\0' || searchstoplist(&(d->stoplist), txt))
     {
+        /* empty or stopword, so report as stopword */
         pfree(txt);
     }
     else

Re: Stack overflow issue

From
Richard Guo
Date:

On Wed, Aug 31, 2022 at 6:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
> The upstream recommendation, which seems pretty sane to me, is to
> simply reject any string exceeding some threshold length as not
> possibly being a word.  Apparently it's common to use thresholds
> as small as 64 bytes, but in the attached I used 1000 bytes.

On further thought: that coding treats anything longer than 1000
bytes as a stopword, but maybe we should just accept it unmodified.
The manual says "A Snowball dictionary recognizes everything, whether
or not it is able to simplify the word".  While "recognizes" formally
includes the case of "recognizes as a stopword", people might find
this behavior surprising.  We could alternatively do it as attached,
which accepts overlength words but does nothing to them except
case-fold.  This is closer to the pre-patch behavior, but gives up
the opportunity to avoid useless downstream processing of long words.
 
This patch looks good to me. It avoids overly-long words (> 1000 bytes)
going through the stemmer so the stack overflow issue in Turkish stemmer
should not exist any more.

Thanks
Richard

Re: Stack overflow issue

From
Egor Chindyaskin
Date:
24.08.2022 20:58, Tom Lane writes:
> Nice work!  I wonder if you can make the regex crashes reachable by
> reducing the value of max_stack_depth enough that it's hit before
> reaching the "regular expression is too complex" limit.
>
>             regards, tom lane
Hello everyone! It's been a while since me and Alexander Lakhin have 
published a list of functions that have a stack overflow illness. We are 
back to tell you more about such places.
During our analyze we made a conclusion that some functions can be 
crashed without changing any of the parameters and some can be crashed 
only if we change some stuff.

The first function crashes without any changes:

# CheckAttributeType

(n=60000; printf "create domain dint as int; create domain dinta0 as 
dint[];"; for ((i=1;i<=$n;i++)); do printf "create domain dinta$i as 
dinta$(( $i - 1 ))[]; "; done; ) | psql
psql -c "create table t(f1 dinta60000[]);"

Some of the others crash if we change "max_locks_per_transaction" 
parameter:

# findDependentObjects

max_locks_per_transaction = 200

(n=10000; printf "create table t (i int); create view v0 as select * 
from t;"; for ((i=1;i<$n;i++)); do printf "create view v$i as select * 
from v$(( $i - 1 )); "; done; ) | psql
psql -c "drop table t"

# ATExecDropColumn

max_locks_per_transaction = 300

(n=50000; printf "create table t0 (a int, b int); "; for 
((i=1;i<=$n;i++)); do printf "create table t$i() inherits(t$(( $i - 1 
))); "; done; printf "alter table t0 drop b;" ) | psql

# ATExecDropConstraint

max_locks_per_transaction = 300

(n=50000; printf "create table t0 (a int, b int, constraint bc check (b 
 > 0));"; for ((i=1;i<=$n;i++)); do printf "create table t$i() 
inherits(t$(( $i - 1 ))); "; done; printf "alter table t0 drop 
constraint bc;" ) | psql

# ATExecAddColumn

max_locks_per_transaction = 200

(n=50000; printf "create table t0 (a int, b int);"; for 
((i=1;i<=$n;i++)); do printf "create table t$i() inherits(t$(( $i - 1 
))); "; done; printf "alter table t0 add column c int;" ) | psql

# ATExecAlterConstrRecurse

max_locks_per_transaction = 300

(n=50000;
printf "create table t(a int primary key); create table pt (a int 
primary key, foreign key(a) references t) partition by range (a);";
printf "create table pt0 partition of pt for values from (0) to (100000) 
partition by range (a);";
for ((i=1;i<=$n;i++)); do printf "create table pt$i partition of pt$(( 
$i - 1 )) for values from ($i) to (100000) partition by range (a); "; done;
printf "alter table pt alter constraint pt_a_fkey deferrable initially 
deferred;"
) | psql

This is where the fun begins. According to Tom Lane, a decrease in 
max_stack_depth could lead to new crashes, but it turned out that 
Alexander was able to find new crashes precisely due to the increase in 
this parameter. Also, we had ulimit -s set to 8MB as the default value.

# eval_const_expressions_mutator

max_stack_depth = '7000kB'

(n=10000; printf "select 'a' "; for ((i=1;i<$n;i++)); do printf " 
collate \"C\" "; done; ) | psql

If you didn’t have a crash, like me, when Alexander shared his find, 
then probably you configured your cluster with an optimization flag -Og. 
In the process of trying to break this function, we came to the 
conclusion that the maximum stack depth depends on the optimization flag 
(-O0/-Og). As it turned out, when optimizing, the function frame on the 
stack becomes smaller and because of this, the limit is reached more 
slowly, therefore, the system can withstand more torment. Therefore, 
this query will fail if you have a cluster configured with the -O0 
optimization flag.

The crash of the next function not only depends on the optimization 
flag, but also on a number of other things. While researching, we 
noticed that postgres enforces a distance ~400kB from max_stack_depth to 
ulimit -s. We thought we could hit the max_stack_depth limit and then 
hit the OS limit as well. Therefore, Alexander wrote a recursive SQL 
function, that eats up a stack within max_stack_depth, including a query 
that eats up the remaining ~400kB. And this causes a crash.

# executeBoolItem

max_stack_depth = '7600kB'

create function infinite_recurse(i int) returns int as $$
begin
   raise notice 'Level %', i;
   begin
     perform jsonb_path_query('{"a":[1]}'::jsonb, ('$.a[*] ? (' || 
repeat('!(', 4800) || '@ == @' || repeat(')', 4800) || ')')::jsonpath);
   exception
     when others then raise notice 'jsonb_path_query error at level %, 
%', i, sqlerrm;
   end;
   begin
     select infinite_recurse(i + 1) into i;
   exception
     when others then raise notice 'Max stack depth reached at level %, 
%', i, sqlerrm;
   end;
   return i;
end;
$$ language plpgsql;

select infinite_recurse(1);

To sum it all up, we have not yet decided on a general approach to such 
functions. Some functions are definitely subject to stack overflow. Some 
are definitely not. This can be seen from the code where the recurse 
flag is passed, or a function that checks the stack is called before a 
recursive call. Some require special conditions - for example, you need 
to parse the query and build a plan, and at that stage the stack is 
eaten faster (and checked) than by the function that we are interested in.

We keep researching and hope to come up with a good solution sooner or 
later.



Re: Stack overflow issue

From
Егор Чиндяскин
Date:
Среда, 26 октября 2022, 21:47 +07:00 от Egor Chindyaskin <kyzevan23@mail.ru>:
 
24.08.2022 20:58, Tom Lane writes:
> Nice work! I wonder if you can make the regex crashes reachable by
> reducing the value of max_stack_depth enough that it's hit before
> reaching the "regular expression is too complex" limit.
>
> regards, tom lane
Hello everyone! It's been a while since me and Alexander Lakhin have
published a list of functions that have a stack overflow illness. We are
back to tell you more about such places.
During our analyze we made a conclusion that some functions can be
crashed without changing any of the parameters and some can be crashed
only if we change some stuff.

The first function crashes without any changes:

# CheckAttributeType

(n=60000; printf "create domain dint as int; create domain dinta0 as
dint[];"; for ((i=1;i<=$n;i++)); do printf "create domain dinta$i as
dinta$(( $i - 1 ))[]; "; done; ) | psql
psql -c "create table t(f1 dinta60000[]);"

Some of the others crash if we change "max_locks_per_transaction"
parameter:

# findDependentObjects

max_locks_per_transaction = 200

(n=10000; printf "create table t (i int); create view v0 as select *
from t;"; for ((i=1;i<$n;i++)); do printf "create view v$i as select *
from v$(( $i - 1 )); "; done; ) | psql
psql -c "drop table t"

# ATExecDropColumn

max_locks_per_transaction = 300

(n=50000; printf "create table t0 (a int, b int); "; for
((i=1;i<=$n;i++)); do printf "create table t$i() inherits(t$(( $i - 1
))); "; done; printf "alter table t0 drop b;" ) | psql

# ATExecDropConstraint

max_locks_per_transaction = 300

(n=50000; printf "create table t0 (a int, b int, constraint bc check (b
 > 0));"; for ((i=1;i<=$n;i++)); do printf "create table t$i()
inherits(t$(( $i - 1 ))); "; done; printf "alter table t0 drop
constraint bc;" ) | psql

# ATExecAddColumn

max_locks_per_transaction = 200

(n=50000; printf "create table t0 (a int, b int);"; for
((i=1;i<=$n;i++)); do printf "create table t$i() inherits(t$(( $i - 1
))); "; done; printf "alter table t0 add column c int;" ) | psql

# ATExecAlterConstrRecurse

max_locks_per_transaction = 300

(n=50000;
printf "create table t(a int primary key); create table pt (a int
primary key, foreign key(a) references t) partition by range (a);";
printf "create table pt0 partition of pt for values from (0) to (100000)
partition by range (a);";
for ((i=1;i<=$n;i++)); do printf "create table pt$i partition of pt$((
$i - 1 )) for values from ($i) to (100000) partition by range (a); "; done;
printf "alter table pt alter constraint pt_a_fkey deferrable initially
deferred;"
) | psql

This is where the fun begins. According to Tom Lane, a decrease in
max_stack_depth could lead to new crashes, but it turned out that
Alexander was able to find new crashes precisely due to the increase in
this parameter. Also, we had ulimit -s set to 8MB as the default value.

# eval_const_expressions_mutator

max_stack_depth = '7000kB'

(n=10000; printf "select 'a' "; for ((i=1;i<$n;i++)); do printf "
collate \"C\" "; done; ) | psql

If you didn’t have a crash, like me, when Alexander shared his find,
then probably you configured your cluster with an optimization flag -Og.
In the process of trying to break this function, we came to the
conclusion that the maximum stack depth depends on the optimization flag
(-O0/-Og). As it turned out, when optimizing, the function frame on the
stack becomes smaller and because of this, the limit is reached more
slowly, therefore, the system can withstand more torment. Therefore,
this query will fail if you have a cluster configured with the -O0
optimization flag.

The crash of the next function not only depends on the optimization
flag, but also on a number of other things. While researching, we
noticed that postgres enforces a distance ~400kB from max_stack_depth to
ulimit -s. We thought we could hit the max_stack_depth limit and then
hit the OS limit as well. Therefore, Alexander wrote a recursive SQL
function, that eats up a stack within max_stack_depth, including a query
that eats up the remaining ~400kB. And this causes a crash.

# executeBoolItem

max_stack_depth = '7600kB'

create function infinite_recurse(i int) returns int as $$
begin
   raise notice 'Level %', i;
   begin
     perform jsonb_path_query('{"a":[1]}'::jsonb, ('$.a[*] ? (' ||
repeat('!(', 4800) || '@ == @' || repeat(')', 4800) || ')')::jsonpath);
   exception
     when others then raise notice 'jsonb_path_query error at level %,
%', i, sqlerrm;
   end;
   begin
     select infinite_recurse(i + 1) into i;
   exception
     when others then raise notice 'Max stack depth reached at level %,
%', i, sqlerrm;
   end;
   return i;
end;
$$ language plpgsql;

select infinite_recurse(1);

To sum it all up, we have not yet decided on a general approach to such
functions. Some functions are definitely subject to stack overflow. Some
are definitely not. This can be seen from the code where the recurse
flag is passed, or a function that checks the stack is called before a
recursive call. Some require special conditions - for example, you need
to parse the query and build a plan, and at that stage the stack is
eaten faster (and checked) than by the function that we are interested in.

We keep researching and hope to come up with a good solution sooner or
later.
Hello, in continuation of the topic of the stack overflow problem, Alexander Lakhin was able to find a few more similar places.
 
An important point for the first function is that the server must be built with asserts enabled, otherwise the crash will not happen.
Also, the result in the form of a server crash will be achieved only after 2-3 hours.
 
#MemoryContextCheck
(n=1000000; printf "begin;"; for ((i=1;i<=$n;i++)); do printf "savepoint s$i;"; done; printf "release s1;" ) | psql >/dev/null
 
Other functions could be crashed without asserts enabled.
 
#CommitTransactionCommand
(n=1000000; printf "BEGIN;"; for ((i=1;i<=$n;i++)); do printf "SAVEPOINT s$i;"; done; printf "ERROR; COMMIT;") | psql >/dev/null
 
#MemoryContextStatsInternal
(n=1000000; printf "BEGIN;"; for ((i=1;i<=$n;i++)); do printf "SAVEPOINT s$i;"; done; printf "SELECT pg_log_backend_memory_contexts(pg_backend_pid())") | psql >/dev/null
 
#ShowTransactionStateRec
(n=1000000; printf "BEGIN;"; for ((i=1;i<=$n;i++)); do printf "SAVEPOINT s$i;"; done; printf "SET log_min_messages = 'DEBUG5'; SAVEPOINT sp;") | psql >/dev/null
 
The following next two functions call each other; the following way was found to overflow the stack (with modified server configuration):
 
#MemoryContextDeleteChildren with MemoryContextDelete
 
max_connections = 1000
max_stack_depth = '7600kB'
 
create table idxpart (a int) partition by range (a);
 
select 'create index on idxpart (a)' from generate_series(1, 40000);
\gexec
 
create table idxpart (a int) partition by range (a);
 
select 'create index on idxpart (a)' from generate_series(1, 40000);
\gexec
 
create function infinite_recurse(level int) returns int as $$
declare l int;
begin
   begin
     select infinite_recurse(level + 1) into level;
   exception
     when others then raise notice 'Max stack depth reached at level %, %', level, sqlerrm;
 
     create table idxpart1 partition of idxpart for values from (1) to (2) partition by range (a);  
 
   end;
   return level;
end;
$$ language plpgsql;
 
select infinite_recurse(1);
 
Finally, there are yet two recursive functions in mcxt.c:
 
#MemoryContextResetChildren - could be vulnerable but not used at all after eaa5808e.
 
#MemoryContextMemAllocated - at present called only with local contexts.

Re: Stack overflow issue

From
Sascha Kuhl
Date:
Great work. Max Stack depth is memory dependent? Processor dependent? 

Егор Чиндяскин <kyzevan23@mail.ru> schrieb am Mi., 24. Aug. 2022, 11:51:
Hello, I recently got a server crash (bug #17583 [1]) caused by a stack overflow. 
 
Tom Lane and Richard Guo, in a discussion of this bug, suggested that there could be more such places. 
Therefore, Alexander Lakhin and I decided to deal with this issue and Alexander developed a methodology. We processed src/backend/*/*.c with "clang -emit-llvm  ... | opt -analyze -print-calgraph" to find all the functions that call themselves directly. I checked each of them for features that protect against stack overflows.
We analyzed 4 catalogs: regex, tsearch, snowball and adt.
Firstly, we decided to test the regex catalog functions and found 6 of them that lack the check_stach_depth() call.
 
zaptreesubs
markst
next
nfatree
numst
repeat
 
We have tried to exploit the recursion in the function zaptreesubs():
select regexp_matches('a' || repeat(' a', 11000), '(.)(' || repeat(' \1', 11000) || ')?');
 
ERROR:  invalid regular expression: regular expression is too complex
 
repeat():
select regexp_match('abc01234xyz',repeat('a{0,2}',100001));
 
ERROR:  invalid regular expression: regular expression is too complex
 
numst():
select regexp_match('abc01234xyz',repeat('(.)\1e',100001));
 
ERROR:  invalid regular expression: regular expression is too complex
 
markst():
markst is called in the code after v->tree = parse(...);
it is necessary that the tree be successfully parsed, but with a nesting level of about 100,000 this will not work - stack protection will work during parsing and v->ntree = numst(...); is also there.
 
next():
we were able to crash the server with the following query:
(printf "SELECT regexp_match('abc', 'a"; for ((i=1;i<1000000;i++)); do printf "(?#)"; done; printf "b')" ) | psql
 
Secondly, we have tried to exploit the recursion in the adt catalog functions and Alexander was able to crash the server with the following query:
 
regex_selectivity_sub(): 
SELECT * FROM pg_proc WHERE proname ~ ('(a' || repeat('|', 200000) || 'b)');
 
And this query:
 
(n=100000;
printf "SELECT polygon '((0,0),(0,1000000))' <@ polygon '((-200000,1000000),";
for ((i=1;i<$n;i++)); do printf "(100000,$(( 300000 + $i))),(-100000,$((800000 + $i))),"; done;
printf "(200000,900000),(200000,0))';"
) | psql
 
Thirdly, the snowball catalog, Alexander has tried to exploit the recursion in the r_stem_suffix_chain_before_ki function and crashed a server using this query:
 
r_stem_suffix_chain_before_ki():
SELECT ts_lexize('turkish_stem', repeat('lerdeki', 1000000));
 
The last one is the tsearch catalog. We have found 4 functions that didn't have check_stach_depth() function: 
 
SplitToVariants
mkANode
mkSPNode
LexizeExec
 
We have tried to exploit the recursion in the SplitToVariants function and Alexander crashed a server using this:
 
SplitToVariants():
CREATE TEXT SEARCH DICTIONARY ispell (Template=ispell, DictFile=ispell_sample,AffFile=ispell_sample);
SELECT ts_lexize('ispell', repeat('bally', 10000));
 
After trying to exploit the recursion in the LexizeExec function Alexander made this conlusion: 
 
LexizeExec has two branches "ld->curDictId == InvalidOid" (usual mode) and "ld->curDictId != InvalidOid" (multiword mode) - we start with the first one, then make recursive call to switch to the multiword mode, but then we return to the usual mode again.
 
mkANode and mkSPNode deal with the dictionary structs, not with user-supplied data, so we believe these functions are not vulnerable.
 

Re: Stack overflow issue

From
Egor Chindyaskin
Date:
Hello! In continuation of the topic, I, under the leadership of 
Alexander Lakhin, prepared patches that fix these problems.
We decided that these checks would be enough and put them in the places 
we saw fit.
Attachment

Re: Stack overflow issue

From
Egor Chindyaskin
Date:
03.01.2023 22:45, Sascha Kuhl writes:
> Great work. Max Stack depth is memory dependent? Processor dependent?
Hello! These situations are not specific to the x86_64 architecture, but 
also manifest themselves, for example, on aarch64 architecture.
For example this query, ran on aarch64, (n=1000000;printf "begin;"; for 
((i=1;i<=$n;i++)); do printf "savepoint s$i;"; done; printf "release 
s1;" ) | psql > /dev/null
crashed the server on the savepoint174617 with the following stacktrace:

Core was generated by `postgres: test test [local] 
SAVEPOINT                     '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  AllocSetCheck (context=<error reading variable: Cannot access memory 
at address 0xffffe2397fe8>) at aset.c:1409
1409    {
(gdb) bt
#0  AllocSetCheck (context=<error reading variable: Cannot access memory 
at address 0xffffe2397fe8>) at aset.c:1409
#1  0x0000aaaad78c38c4 in MemoryContextCheck (context=0xaaab39ee16a0) at 
mcxt.c:740
#2  0x0000aaaad78c38dc in MemoryContextCheck (context=0xaaab39edf690) at 
mcxt.c:742
#3  0x0000aaaad78c38dc in MemoryContextCheck (context=0xaaab39edd680) at 
mcxt.c:742
#4  0x0000aaaad78c38dc in MemoryContextCheck (context=0xaaab39edb670) at 
mcxt.c:742
#5  0x0000aaaad78c38dc in MemoryContextCheck (context=0xaaab39ed9660) at 
mcxt.c:742
#6  0x0000aaaad78c38dc in MemoryContextCheck (context=0xaaab39ed7650) at 
mcxt.c:742
#7  0x0000aaaad78c38dc in MemoryContextCheck (context=0xaaab39ed5640) at 
mcxt.c:742
#8  0x0000aaaad78c38dc in MemoryContextCheck (context=0xaaab39ed3630) at 
mcxt.c:742
#9  0x0000aaaad78c38dc in MemoryContextCheck (context=0xaaab39ed1620) at 
mcxt.c:742
#10 0x0000aaaad78c38dc in MemoryContextCheck (context=0xaaab39ecf610) at 
mcxt.c:742
...
#174617 0x0000aaaad78c38dc in MemoryContextCheck 
(context=0xaaaae47994b0) at mcxt.c:742
#174618 0x0000aaaad78c38dc in MemoryContextCheck 
(context=0xaaaae476dcd0) at mcxt.c:742
#174619 0x0000aaaad78c38dc in MemoryContextCheck 
(context=0xaaaae46ead50) at mcxt.c:742
#174620 0x0000aaaad76c7e24 in finish_xact_command () at postgres.c:2739
#174621 0x0000aaaad76c55b8 in exec_simple_query 
(query_string=0xaaaae46f0540 "savepoint s174617;") at postgres.c:1238
#174622 0x0000aaaad76ca7a4 in PostgresMain (argc=1, argv=0xffffe2b96898, 
dbname=0xaaaae471c098 "test", username=0xaaaae471c078 "test") at 
postgres.c:4508
#174623 0x0000aaaad75e263c in BackendRun (port=0xaaaae4711470) at 
postmaster.c:4530
#174624 0x0000aaaad75e1f70 in BackendStartup (port=0xaaaae4711470) at 
postmaster.c:4252
#174625 0x0000aaaad75dd4c0 in ServerLoop () at postmaster.c:1745
#174626 0x0000aaaad75dcd3c in PostmasterMain (argc=3, 
argv=0xaaaae46eacb0) at postmaster.c:1417
#174627 0x0000aaaad74d462c in main (argc=3, argv=0xaaaae46eacb0) at 
main.c:209



Re: Stack overflow issue

From
Egor Chindyaskin
Date:
Hello! In continuation of the topic I would like to suggest solution. 
This patch adds several checks to the vulnerable functions above.
Attachment

Re: Stack overflow issue

From
Heikki Linnakangas
Date:
On 21/06/2023 16:45, Egor Chindyaskin wrote:
> Hello! In continuation of the topic I would like to suggest solution.
> This patch adds several checks to the vulnerable functions above.

I looked at this last patch. The depth checks are clearly better than 
segfaulting, but I think we can also avoid the recursions and having to 
error out. That seems nice especially for MemoryContextDelete(), which 
is called at transaction cleanup.

1. CommitTransactionCommand

This is just tail recursion. The compiler will almost certainly optimize 
it away unless you're using -O0. We can easily turn it into iteration 
ourselves to avoid that hazard, per attached 
0001-Turn-tail-recursion-into-iteration-in-CommitTransact.patch.

2. ShowTransactionStateRec

Since this is just a debugging aid, I think we can just stop recursing 
if we're about to run out of stack space. Seems nicer than erroring out, 
although it can still error if you run out of memory. See 
0002-Avoid-stack-overflow-in-ShowTransactionStateRec.patch.

3. All the MemoryContext functions

I'm reluctant to add stack checks to these, because they are called in 
places like cleaning up after transaction abort. MemoryContextDelete() 
in particular. If you ereport an error, it's not clear that you can 
recover cleanly; you'll leak memory if nothing else.

Fortunately MemoryContext contains pointers to parent and siblings, so 
we can traverse a tree of MemoryContexts iteratively, without using stack.

MemoryContextStats() is a bit tricky, but we can put a limit on how much 
it recurses, and just print a summary line if the limit is reached. 
That's what we already do if a memory context has a lot of children. 
(Actually, if we didn't try keep track of the # of children at each 
level, to trigger the summarization, we could traverse the tree without 
using stack. But a limit seems useful.)

What do you think?

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: Stack overflow issue

From
Egor Chindyaskin
Date:
On 24/11/2023 21:14, Heikki Linnakangas wrote:
> What do you think?
Hello! Thank you for researching the problem! I'm more of a tester than 
a developer, so I was able to check the patches from that side.
I've configured the server with CFLAGS=" -O0" and cassert enabled and 
checked the following queries:

#CommitTransactionCommand
(n=1000000; printf "BEGIN;"; for ((i=1;i<=$n;i++)); do printf "SAVEPOINT 
s$i;"; done; printf "ERROR; COMMIT;") | psql >/dev/null

#ShowTransactionStateRec
(n=1000000; printf "BEGIN;"; for ((i=1;i<=$n;i++)); do printf "SAVEPOINT 
s$i;"; done; printf "SET log_min_messages = 'DEBUG5'; SAVEPOINT sp;") | 
psql >/dev/null

#MemoryContextCheck
(n=1000000; printf "begin;"; for ((i=1;i<=$n;i++)); do printf "savepoint 
s$i;"; done; printf "release s1;" ) | psql >/dev/null

#MemoryContextStatsInternal
(n=1000000; printf "BEGIN;"; for ((i=1;i<=$n;i++)); do printf "SAVEPOINT 
s$i;"; done; printf "SELECT 
pg_log_backend_memory_contexts(pg_backend_pid())") | psql >/dev/null

On my system, every of that queries led to a server crash at a number of 
savepoints in the range from 174,400 to 174,700.
With your patches applied, the savepoint counter goes well beyond these 
values, I settled on an amount of approximately 300,000 savepoints.
Your patches look good to me.

Best regards,
Egor Chindyaskin
Postgres Professional: http://postgrespro.com/



Re: Stack overflow issue

From
Robert Haas
Date:
On Fri, Nov 24, 2023 at 10:47 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> What do you think?

At least for 0001 and 0002, I think we should just add the stack depth checks.

With regard to 0001, CommitTransactionCommand() and friends are hard
enough to understand as it is; they need "goto" like I need an extra
hole in my head.

With regard to 0002, this function isn't sufficiently important to
justify adding special-case code for an extremely rare event. We
should just handle it the way we do in general.

I agree that in the memory-context case it might be worth expending
some more code to be more clever. But I probably wouldn't do that for
MemoryContextStats(); check_stack_depth() seems fine for that one.

In general, I think we should try to keep the number of places that
handle stack overflow in "special" ways as small as possible.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Stack overflow issue

From
Andres Freund
Date:
Hi,

On 2024-01-05 12:23:25 -0500, Robert Haas wrote:
> I agree that in the memory-context case it might be worth expending
> some more code to be more clever. But I probably wouldn't do that for
> MemoryContextStats(); check_stack_depth() seems fine for that one.

We run MemoryContextStats() when we fail to allocate memory, including during
abort processing after a previous error. So I think it qualifies for being
somewhat special. Thus I suspect check_stack_depth() wouldn't be a good idea -
but we could make the stack_is_too_deep() path simpler and just return in the
existing MemoryContextStatsInternal() when that's the case.

Greetings,

Andres Freund



Re: Stack overflow issue

From
Robert Haas
Date:
On Fri, Jan 5, 2024 at 3:16 PM Andres Freund <andres@anarazel.de> wrote:
> On 2024-01-05 12:23:25 -0500, Robert Haas wrote:
> > I agree that in the memory-context case it might be worth expending
> > some more code to be more clever. But I probably wouldn't do that for
> > MemoryContextStats(); check_stack_depth() seems fine for that one.
>
> We run MemoryContextStats() when we fail to allocate memory, including during
> abort processing after a previous error. So I think it qualifies for being
> somewhat special.

OK.

> Thus I suspect check_stack_depth() wouldn't be a good idea -
> but we could make the stack_is_too_deep() path simpler and just return in the
> existing MemoryContextStatsInternal() when that's the case.

Since this kind of code will be exercised so rarely, it's highly
vulnerable to bugs, so I favor keeping it as simple as we can.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Stack overflow issue

From
Heikki Linnakangas
Date:
On 05/01/2024 19:23, Robert Haas wrote:
> On Fri, Nov 24, 2023 at 10:47 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> What do you think?
> 
> At least for 0001 and 0002, I think we should just add the stack depth checks.
> 
> With regard to 0001, CommitTransactionCommand() and friends are hard
> enough to understand as it is; they need "goto" like I need an extra
> hole in my head.
> 
> With regard to 0002, this function isn't sufficiently important to
> justify adding special-case code for an extremely rare event. We
> should just handle it the way we do in general.
> 
> I agree that in the memory-context case it might be worth expending
> some more code to be more clever. But I probably wouldn't do that for
> MemoryContextStats(); check_stack_depth() seems fine for that one.
> 
> In general, I think we should try to keep the number of places that
> handle stack overflow in "special" ways as small as possible.

The problem with CommitTransactionCommand (or rather 
AbortCurrentTransaction() which has the same problem)
and ShowTransactionStateRec is that they get called in a state where 
aborting can lead to a panic. If you add a "check_stack_depth()" to them 
and try to reproducer scripts that Egor posted, you still get a panic.

I'm not sure if MemoryContextStats() could safely elog(ERROR). But at 
least it would mask the "out of memory" that caused the stats to be 
printed in the first place.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Stack overflow issue

From
Robert Haas
Date:
On Wed, Jan 10, 2024 at 4:25 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> The problem with CommitTransactionCommand (or rather
> AbortCurrentTransaction() which has the same problem)
> and ShowTransactionStateRec is that they get called in a state where
> aborting can lead to a panic. If you add a "check_stack_depth()" to them
> and try to reproducer scripts that Egor posted, you still get a panic.

Hmm, that's unfortunate. I'm not sure what to do about that. But I'd
still suggest looking for a goto-free approach.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Stack overflow issue

From
Heikki Linnakangas
Date:
On 11/01/2024 19:37, Robert Haas wrote:
> On Wed, Jan 10, 2024 at 4:25 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> The problem with CommitTransactionCommand (or rather
>> AbortCurrentTransaction() which has the same problem)
>> and ShowTransactionStateRec is that they get called in a state where
>> aborting can lead to a panic. If you add a "check_stack_depth()" to them
>> and try to reproducer scripts that Egor posted, you still get a panic.
> 
> Hmm, that's unfortunate. I'm not sure what to do about that. But I'd
> still suggest looking for a goto-free approach.

Here's one goto-free attempt. It adds a local loop to where the 
recursion was, so that if you have a chain of subtransactions that need 
to be aborted in CommitTransactionCommand, they are aborted iteratively. 
The TBLOCK_SUBCOMMIT case already had such a loop.

I added a couple of comments in the patch marked with "REVIEWER NOTE", 
to explain why I changed some things. They are to be removed before 
committing.

I'm not sure if this is better than a goto. In fact, even if we commit 
this, I think I'd still prefer to replace the remaining recursive calls 
with a goto. Recursion feels a weird to me, when we're unwinding the 
states from the stack as we go.

Of course we could use a "for (;;) { ... continue }" construct around 
the whole function, instead of a goto, but I don't think that's better 
than a goto in this case.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: Stack overflow issue

From
Robert Haas
Date:
On Fri, Jan 12, 2024 at 10:12 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Here's one goto-free attempt. It adds a local loop to where the
> recursion was, so that if you have a chain of subtransactions that need
> to be aborted in CommitTransactionCommand, they are aborted iteratively.
> The TBLOCK_SUBCOMMIT case already had such a loop.
>
> I added a couple of comments in the patch marked with "REVIEWER NOTE",
> to explain why I changed some things. They are to be removed before
> committing.
>
> I'm not sure if this is better than a goto. In fact, even if we commit
> this, I think I'd still prefer to replace the remaining recursive calls
> with a goto. Recursion feels a weird to me, when we're unwinding the
> states from the stack as we go.

I'm not able to quickly verify whether this version is correct, but I
do think the code looks nicer this way.

I understand that's a question of opinion rather than fact, though.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Stack overflow issue

From
Alexander Korotkov
Date:
Hi!

On Fri, Jan 12, 2024 at 11:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 12, 2024 at 10:12 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > Here's one goto-free attempt. It adds a local loop to where the
> > recursion was, so that if you have a chain of subtransactions that need
> > to be aborted in CommitTransactionCommand, they are aborted iteratively.
> > The TBLOCK_SUBCOMMIT case already had such a loop.
> >
> > I added a couple of comments in the patch marked with "REVIEWER NOTE",
> > to explain why I changed some things. They are to be removed before
> > committing.
> >
> > I'm not sure if this is better than a goto. In fact, even if we commit
> > this, I think I'd still prefer to replace the remaining recursive calls
> > with a goto. Recursion feels a weird to me, when we're unwinding the
> > states from the stack as we go.
>
> I'm not able to quickly verify whether this version is correct, but I
> do think the code looks nicer this way.
>
> I understand that's a question of opinion rather than fact, though.

I'd like to revive this thread.  The attached 0001 patch represents my
attempt to remove recursion in
CommitTransactionCommand()/AbortCurrentTransaction() by adding a
wrapper function.  This method doesn't use goto, doesn't require much
code changes and subjectively provides good readability.

Regarding ShowTransactionStateRec() and memory context function, as I
get from this thread they are called in states where abortion can lead
to a panic.  So, it's preferable to change them into loops too rather
than just adding check_stack_depth().  The 0002 and 0003 patches by
Heikki posted in [1] look good to me.  Can we accept them?

Also there are a number of recursive functions, which seem to be not
used in critical states where abortion can lead to a panic.  I've
extracted them from [2] into an attached 0002 patch.  I'd like to push
it if there is no objection.

Links.
1. https://www.postgresql.org/message-id/6b48c746-9704-46dc-b9be-01fe4137c824%40iki.fi
2. https://www.postgresql.org/message-id/4530546a-3216-eaa9-4c92-92d33290a211%40mail.ru

------
Regards,
Alexander Korotkov

Attachment

Re: Stack overflow issue

From
Alexander Korotkov
Date:
On Wed, Feb 14, 2024 at 2:00 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Fri, Jan 12, 2024 at 11:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Fri, Jan 12, 2024 at 10:12 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > > Here's one goto-free attempt. It adds a local loop to where the
> > > recursion was, so that if you have a chain of subtransactions that need
> > > to be aborted in CommitTransactionCommand, they are aborted iteratively.
> > > The TBLOCK_SUBCOMMIT case already had such a loop.
> > >
> > > I added a couple of comments in the patch marked with "REVIEWER NOTE",
> > > to explain why I changed some things. They are to be removed before
> > > committing.
> > >
> > > I'm not sure if this is better than a goto. In fact, even if we commit
> > > this, I think I'd still prefer to replace the remaining recursive calls
> > > with a goto. Recursion feels a weird to me, when we're unwinding the
> > > states from the stack as we go.
> >
> > I'm not able to quickly verify whether this version is correct, but I
> > do think the code looks nicer this way.
> >
> > I understand that's a question of opinion rather than fact, though.
>
> I'd like to revive this thread.  The attached 0001 patch represents my
> attempt to remove recursion in
> CommitTransactionCommand()/AbortCurrentTransaction() by adding a
> wrapper function.  This method doesn't use goto, doesn't require much
> code changes and subjectively provides good readability.
>
> Regarding ShowTransactionStateRec() and memory context function, as I
> get from this thread they are called in states where abortion can lead
> to a panic.  So, it's preferable to change them into loops too rather
> than just adding check_stack_depth().  The 0002 and 0003 patches by
> Heikki posted in [1] look good to me.  Can we accept them?
>
> Also there are a number of recursive functions, which seem to be not
> used in critical states where abortion can lead to a panic.  I've
> extracted them from [2] into an attached 0002 patch.  I'd like to push
> it if there is no objection.

The revised set of remaining patches is attached.

0001 Turn tail recursion into iteration in CommitTransactionCommand()
I did minor revision of comments and code blocks order to improve the
readability.

0002 Avoid stack overflow in ShowTransactionStateRec()
I didn't notice any issues, leave this piece as is.

0003 Avoid recursion in MemoryContext functions
I've renamed MemoryContextTraverse() => MemoryContextTraverseNext(),
which I think is a bit more intuitive.  Also I fixed
MemoryContextMemConsumed(), which was still trying to use the removed
argument "print" of MemoryContextStatsInternal() function.

Generally, I think this patchset fixes important stack overflow holes.
It is quite straightforward, clear and the code has a good shape.  I'm
going to push this if no objections.

------
Regards,
Alexander Korotkov

Attachment

Re: Stack overflow issue

From
Tom Lane
Date:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> The revised set of remaining patches is attached.
> ...
> 0003 Avoid recursion in MemoryContext functions
> I've renamed MemoryContextTraverse() => MemoryContextTraverseNext(),
> which I think is a bit more intuitive.  Also I fixed
> MemoryContextMemConsumed(), which was still trying to use the removed
> argument "print" of MemoryContextStatsInternal() function.

This patch still doesn't compile for me --- MemoryContextMemConsumed
got modified some more by commit 743112a2e, and needs minor fixes.

I initially didn't like the definition of MemoryContextTraverseNext
because it requires two copies of the "process node" logic.  However,
that seems fine for most of the callers, and even where we are
duplicating logic it's just a line or so, so I guess it's ok.
However, MemoryContextTraverseNext seems undercommented to me, plus
the claim that it traverses in depth-first order is just wrong.

I found some bugs in MemoryContextStatsInternal too: the old
logic assumed that ichild exceeding max_children was the only
way to get into the summarization logic, but now ichild minus
max_children could very well be negative.  Fortunately we can
just reset ichild to zero and not worry about having any
connection between the first loop and the second.

Here's a v5 of 0003 with those issues and some more-cosmetic ones
cleaned up.  I didn't look at 0001 or 0002.

            regards, tom lane

diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index 1a615becae..5d426795d9 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -145,9 +145,10 @@ MemoryContext CurTransactionContext = NULL;
 /* This is a transient link to the active portal's memory context: */
 MemoryContext PortalContext = NULL;

+static void MemoryContextDeleteOnly(MemoryContext context);
 static void MemoryContextCallResetCallbacks(MemoryContext context);
 static void MemoryContextStatsInternal(MemoryContext context, int level,
-                                       bool print, int max_children,
+                                       int max_level, int max_children,
                                        MemoryContextCounters *totals,
                                        bool print_to_stderr);
 static void MemoryContextStatsPrint(MemoryContext context, void *passthru,
@@ -219,6 +220,50 @@ GetMemoryChunkHeader(const void *pointer)
     return header;
 }

+/*
+ * MemoryContextTraverseNext
+ *        Helper function to traverse all descendants of a memory context
+ *        without recursion.
+ *
+ * Recursion could lead to out-of-stack errors with deep context hierarchies,
+ * which would be unpleasant in error cleanup code paths.
+ *
+ * To process 'context' and all its descendants, use a loop like this:
+ *
+ *     <process 'context'>
+ *     for (MemoryContext curr = context->firstchild;
+ *          curr != NULL;
+ *          curr = MemoryContextTraverseNext(curr, context))
+ *     {
+ *         <process 'curr'>
+ *     }
+ *
+ * This visits all the contexts in pre-order, that is a node is visited
+ * before its children.
+ */
+static MemoryContext
+MemoryContextTraverseNext(MemoryContext curr, MemoryContext top)
+{
+    /* After processing a node, traverse to its first child if any */
+    if (curr->firstchild != NULL)
+        return curr->firstchild;
+
+    /*
+     * After processing a childless node, traverse to its next sibling if
+     * there is one.  If there isn't, traverse back up to the parent (which
+     * has already been visited, and now so have all its descendants).  We're
+     * done if that is "top", otherwise traverse to its next sibling if any,
+     * otherwise repeat moving up.
+     */
+    while (curr->nextchild == NULL)
+    {
+        curr = curr->parent;
+        if (curr == top)
+            return NULL;
+    }
+    return curr->nextchild;
+}
+
 /*
  * Support routines to trap use of invalid memory context method IDs
  * (from calling pfree or the like on a bogus pointer).  As a possible
@@ -375,14 +420,13 @@ MemoryContextResetOnly(MemoryContext context)
 void
 MemoryContextResetChildren(MemoryContext context)
 {
-    MemoryContext child;
-
     Assert(MemoryContextIsValid(context));

-    for (child = context->firstchild; child != NULL; child = child->nextchild)
+    for (MemoryContext curr = context->firstchild;
+         curr != NULL;
+         curr = MemoryContextTraverseNext(curr, context))
     {
-        MemoryContextResetChildren(child);
-        MemoryContextResetOnly(child);
+        MemoryContextResetOnly(curr);
     }
 }

@@ -392,21 +436,60 @@ MemoryContextResetChildren(MemoryContext context)
  *        allocated therein.
  *
  * The type-specific delete routine removes all storage for the context,
- * but we have to recurse to handle the children.
- * We must also delink the context from its parent, if it has one.
+ * but we have to deal with descendant nodes here.
  */
 void
 MemoryContextDelete(MemoryContext context)
+{
+    MemoryContext curr;
+
+    Assert(MemoryContextIsValid(context));
+
+    /*
+     * Delete subcontexts from the bottom up.
+     *
+     * Note: Do not use recursion here.  A "stack depth limit exceeded" error
+     * would be unpleasant if we're already in the process of cleaning up from
+     * transaction abort.  We also cannot use MemoryContextTraverseNext() here
+     * because we modify the tree as we go.
+     */
+    curr = context;
+    for (;;)
+    {
+        MemoryContext parent;
+
+        /* Descend down until we find a leaf context with no children */
+        while (curr->firstchild != NULL)
+            curr = curr->firstchild;
+
+        /*
+         * We're now at a leaf with no children. Free it and continue from the
+         * parent.  Or if this was the original node, we're all done.
+         */
+        parent = curr->parent;
+        MemoryContextDeleteOnly(curr);
+
+        if (curr == context)
+            break;
+        curr = parent;
+    }
+}
+
+/*
+ * Subroutine of MemoryContextDelete,
+ * to delete a context that has no children.
+ * We must also delink the context from its parent, if it has one.
+ */
+static void
+MemoryContextDeleteOnly(MemoryContext context)
 {
     Assert(MemoryContextIsValid(context));
     /* We had better not be deleting TopMemoryContext ... */
     Assert(context != TopMemoryContext);
     /* And not CurrentMemoryContext, either */
     Assert(context != CurrentMemoryContext);
-
-    /* save a function call in common case where there are no children */
-    if (context->firstchild != NULL)
-        MemoryContextDeleteChildren(context);
+    /* All the children should've been deleted already */
+    Assert(context->firstchild == NULL);

     /*
      * It's not entirely clear whether 'tis better to do this before or after
@@ -672,12 +755,12 @@ MemoryContextMemAllocated(MemoryContext context, bool recurse)

     if (recurse)
     {
-        MemoryContext child;
-
-        for (child = context->firstchild;
-             child != NULL;
-             child = child->nextchild)
-            total += MemoryContextMemAllocated(child, true);
+        for (MemoryContext curr = context->firstchild;
+             curr != NULL;
+             curr = MemoryContextTraverseNext(curr, context))
+        {
+            total += curr->mem_allocated;
+        }
     }

     return total;
@@ -691,9 +774,20 @@ void
 MemoryContextMemConsumed(MemoryContext context,
                          MemoryContextCounters *consumed)
 {
+    Assert(MemoryContextIsValid(context));
+
     memset(consumed, 0, sizeof(*consumed));

-    MemoryContextStatsInternal(context, 0, false, 0, consumed, false);
+    /* Examine the context itself */
+    context->methods->stats(context, NULL, NULL, consumed, false);
+
+    /* Examine children, using iteration not recursion */
+    for (MemoryContext curr = context->firstchild;
+         curr != NULL;
+         curr = MemoryContextTraverseNext(curr, context))
+    {
+        curr->methods->stats(curr, NULL, NULL, consumed, false);
+    }
 }

 /*
@@ -707,8 +801,8 @@ MemoryContextMemConsumed(MemoryContext context,
 void
 MemoryContextStats(MemoryContext context)
 {
-    /* A hard-wired limit on the number of children is usually good enough */
-    MemoryContextStatsDetail(context, 100, true);
+    /* Hard-wired limits are usually good enough */
+    MemoryContextStatsDetail(context, 100, 100, true);
 }

 /*
@@ -720,14 +814,16 @@ MemoryContextStats(MemoryContext context)
  * with fprintf(stderr), otherwise use ereport().
  */
 void
-MemoryContextStatsDetail(MemoryContext context, int max_children,
+MemoryContextStatsDetail(MemoryContext context,
+                         int max_level, int max_children,
                          bool print_to_stderr)
 {
     MemoryContextCounters grand_totals;

     memset(&grand_totals, 0, sizeof(grand_totals));

-    MemoryContextStatsInternal(context, 0, true, max_children, &grand_totals, print_to_stderr);
+    MemoryContextStatsInternal(context, 0, max_level, max_children,
+                               &grand_totals, print_to_stderr);

     if (print_to_stderr)
         fprintf(stderr,
@@ -736,7 +832,7 @@ MemoryContextStatsDetail(MemoryContext context, int max_children,
                 grand_totals.freespace, grand_totals.freechunks,
                 grand_totals.totalspace - grand_totals.freespace);
     else
-
+    {
         /*
          * Use LOG_SERVER_ONLY to prevent the memory contexts from being sent
          * to the connected client.
@@ -754,22 +850,22 @@ MemoryContextStatsDetail(MemoryContext context, int max_children,
                                  grand_totals.totalspace, grand_totals.nblocks,
                                  grand_totals.freespace, grand_totals.freechunks,
                                  grand_totals.totalspace - grand_totals.freespace)));
+    }
 }

 /*
  * MemoryContextStatsInternal
  *        One recursion level for MemoryContextStats
  *
- * Print this context if print is true, but in any case accumulate counts into
- * *totals (if given).
+ * Print stats for this context if possible, but in any case accumulate counts
+ * into *totals (if not NULL).
  */
 static void
 MemoryContextStatsInternal(MemoryContext context, int level,
-                           bool print, int max_children,
+                           int max_level, int max_children,
                            MemoryContextCounters *totals,
                            bool print_to_stderr)
 {
-    MemoryContextCounters local_totals;
     MemoryContext child;
     int            ichild;

@@ -777,65 +873,72 @@ MemoryContextStatsInternal(MemoryContext context, int level,

     /* Examine the context itself */
     context->methods->stats(context,
-                            print ? MemoryContextStatsPrint : NULL,
+                            MemoryContextStatsPrint,
                             (void *) &level,
                             totals, print_to_stderr);

     /*
-     * Examine children.  If there are more than max_children of them, we do
-     * not print the rest explicitly, but just summarize them.
+     * Examine children.
+     *
+     * If we are past the recursion depth limit or already running low on
+     * stack, do not print them explicitly but just summarize them. Similarly,
+     * if there are more than max_children of them, we do not print the rest
+     * explicitly, but just summarize them.
      */
-    memset(&local_totals, 0, sizeof(local_totals));
-
-    for (child = context->firstchild, ichild = 0;
-         child != NULL;
-         child = child->nextchild, ichild++)
+    child = context->firstchild;
+    ichild = 0;
+    if (level < max_level && !stack_is_too_deep())
     {
-        if (ichild < max_children)
+        for (; child != NULL && ichild < max_children;
+             child = child->nextchild, ichild++)
+        {
             MemoryContextStatsInternal(child, level + 1,
-                                       print, max_children,
+                                       max_level, max_children,
                                        totals,
                                        print_to_stderr);
-        else
-            MemoryContextStatsInternal(child, level + 1,
-                                       false, max_children,
-                                       &local_totals,
-                                       print_to_stderr);
+        }
     }

-    /* Deal with excess children */
-    if (ichild > max_children)
+    if (child != NULL)
     {
-        if (print)
+        /* Summarize the rest of the children, avoiding recursion. */
+        MemoryContextCounters local_totals;
+
+        memset(&local_totals, 0, sizeof(local_totals));
+
+        ichild = 0;
+        while (child != NULL)
+        {
+            child->methods->stats(child, NULL, NULL, &local_totals, false);
+            ichild++;
+            child = MemoryContextTraverseNext(child, context);
+        }
+
+        if (print_to_stderr)
         {
-            if (print_to_stderr)
-            {
-                int            i;
-
-                for (i = 0; i <= level; i++)
-                    fprintf(stderr, "  ");
-                fprintf(stderr,
-                        "%d more child contexts containing %zu total in %zu blocks; %zu free (%zu chunks); %zu
used\n",
-                        ichild - max_children,
-                        local_totals.totalspace,
-                        local_totals.nblocks,
-                        local_totals.freespace,
-                        local_totals.freechunks,
-                        local_totals.totalspace - local_totals.freespace);
-            }
-            else
-                ereport(LOG_SERVER_ONLY,
-                        (errhidestmt(true),
-                         errhidecontext(true),
-                         errmsg_internal("level: %d; %d more child contexts containing %zu total in %zu blocks; %zu
free(%zu chunks); %zu used", 
-                                         level,
-                                         ichild - max_children,
-                                         local_totals.totalspace,
-                                         local_totals.nblocks,
-                                         local_totals.freespace,
-                                         local_totals.freechunks,
-                                         local_totals.totalspace - local_totals.freespace)));
+            for (int i = 0; i <= level; i++)
+                fprintf(stderr, "  ");
+            fprintf(stderr,
+                    "%d more child contexts containing %zu total in %zu blocks; %zu free (%zu chunks); %zu used\n",
+                    ichild,
+                    local_totals.totalspace,
+                    local_totals.nblocks,
+                    local_totals.freespace,
+                    local_totals.freechunks,
+                    local_totals.totalspace - local_totals.freespace);
         }
+        else
+            ereport(LOG_SERVER_ONLY,
+                    (errhidestmt(true),
+                     errhidecontext(true),
+                     errmsg_internal("level: %d; %d more child contexts containing %zu total in %zu blocks; %zu free
(%zuchunks); %zu used", 
+                                     level,
+                                     ichild,
+                                     local_totals.totalspace,
+                                     local_totals.nblocks,
+                                     local_totals.freespace,
+                                     local_totals.freechunks,
+                                     local_totals.totalspace - local_totals.freespace)));

         if (totals)
         {
@@ -928,7 +1031,7 @@ MemoryContextStatsPrint(MemoryContext context, void *passthru,

 /*
  * MemoryContextCheck
- *        Check all chunks in the named context.
+ *        Check all chunks in the named context and its children.
  *
  * This is just a debugging utility, so it's not fancy.
  */
@@ -936,13 +1039,16 @@ MemoryContextStatsPrint(MemoryContext context, void *passthru,
 void
 MemoryContextCheck(MemoryContext context)
 {
-    MemoryContext child;
-
     Assert(MemoryContextIsValid(context));
-
     context->methods->check(context);
-    for (child = context->firstchild; child != NULL; child = child->nextchild)
-        MemoryContextCheck(child);
+
+    for (MemoryContext curr = context->firstchild;
+         curr != NULL;
+         curr = MemoryContextTraverseNext(curr, context))
+    {
+        Assert(MemoryContextIsValid(curr));
+        curr->methods->check(curr);
+    }
 }
 #endif

@@ -1183,14 +1289,15 @@ ProcessLogMemoryContextInterrupt(void)
     /*
      * When a backend process is consuming huge memory, logging all its memory
      * contexts might overrun available disk space. To prevent this, we limit
-     * the number of child contexts to log per parent to 100.
+     * the depth of the hierarchy, as well as the number of child contexts to
+     * log per parent to 100.
      *
      * As with MemoryContextStats(), we suppose that practical cases where the
      * dump gets long will typically be huge numbers of siblings under the
      * same parent context; while the additional debugging value from seeing
      * details about individual siblings beyond 100 will not be large.
      */
-    MemoryContextStatsDetail(TopMemoryContext, 100, false);
+    MemoryContextStatsDetail(TopMemoryContext, 100, 100, false);
 }

 void *
diff --git a/src/include/utils/memutils.h b/src/include/utils/memutils.h
index 7fd41d20ca..6e5fa72b0e 100644
--- a/src/include/utils/memutils.h
+++ b/src/include/utils/memutils.h
@@ -87,7 +87,8 @@ extern Size MemoryContextMemAllocated(MemoryContext context, bool recurse);
 extern void MemoryContextMemConsumed(MemoryContext context,
                                      MemoryContextCounters *consumed);
 extern void MemoryContextStats(MemoryContext context);
-extern void MemoryContextStatsDetail(MemoryContext context, int max_children,
+extern void MemoryContextStatsDetail(MemoryContext context,
+                                     int max_level, int max_children,
                                      bool print_to_stderr);
 extern void MemoryContextAllowInCriticalSection(MemoryContext context,
                                                 bool allow);

Re: Stack overflow issue

From
Alexander Korotkov
Date:
On Thu, Mar 7, 2024 at 12:52 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> The revised set of remaining patches is attached.
> ...
> 0003 Avoid recursion in MemoryContext functions
> I've renamed MemoryContextTraverse() => MemoryContextTraverseNext(),
> which I think is a bit more intuitive.  Also I fixed
> MemoryContextMemConsumed(), which was still trying to use the removed
> argument "print" of MemoryContextStatsInternal() function.

This patch still doesn't compile for me --- MemoryContextMemConsumed
got modified some more by commit 743112a2e, and needs minor fixes.

I initially didn't like the definition of MemoryContextTraverseNext
because it requires two copies of the "process node" logic.  However,
that seems fine for most of the callers, and even where we are
duplicating logic it's just a line or so, so I guess it's ok.
However, MemoryContextTraverseNext seems undercommented to me, plus
the claim that it traverses in depth-first order is just wrong.

I found some bugs in MemoryContextStatsInternal too: the old
logic assumed that ichild exceeding max_children was the only
way to get into the summarization logic, but now ichild minus
max_children could very well be negative.  Fortunately we can
just reset ichild to zero and not worry about having any
connection between the first loop and the second.

Here's a v5 of 0003 with those issues and some more-cosmetic ones
cleaned up.  I didn't look at 0001 or 0002.

Tom, thank you for your revision of this patch!

Sorry for tediousness, but isn't pre-order a variation of depth-first order [1]?

Links.

------
Regards,
Alexander Korotkov 

Re: Stack overflow issue

From
Tom Lane
Date:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> Sorry for tediousness, but isn't pre-order a variation of depth-first order
> [1]?

To me, depth-first implies visiting children before parents.
Do I have the terminology wrong?

            regards, tom lane



Re: Stack overflow issue

From
Alexander Korotkov
Date:
Hi, Egor!

On Thu, Mar 7, 2024 at 9:53 AM Egor Chindyaskin <kyzevan23@mail.ru> wrote:
>
> > 6 march 2024 г., at 19:17, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> >
> > The revised set of remaining patches is attached.
> >
> > 0001 Turn tail recursion into iteration in CommitTransactionCommand()
> > I did minor revision of comments and code blocks order to improve the
> > readability.
> >
> > 0002 Avoid stack overflow in ShowTransactionStateRec()
> > I didn't notice any issues, leave this piece as is.
> >
> > 0003 Avoid recursion in MemoryContext functions
> > I've renamed MemoryContextTraverse() => MemoryContextTraverseNext(),
> > which I think is a bit more intuitive.  Also I fixed
> > MemoryContextMemConsumed(), which was still trying to use the removed
> > argument "print" of MemoryContextStatsInternal() function.
> >
> > Generally, I think this patchset fixes important stack overflow holes.
> > It is quite straightforward, clear and the code has a good shape.  I'm
> > going to push this if no objections.
>
> I have tested the scripts from message [1]. After applying these patches and Tom Lane’s patch from message [2], all
ofthe above mentioned functions no longer caused the server to crash. I also tried increasing the values in the
presentedscripts, which also did not lead to server crashes. Thank you! 
> Also, I would like to clarify something. Will fixes from message [3] and others be backported to all other branches,
notjust the master branch? As far as I remember, Tom Lane made corrections to all branches. For example [4]. 
>
> Links:
> 1. https://www.postgresql.org/message-id/343ff14f-3060-4f88-9cc6-efdb390185df%40mail.ru
> 2. https://www.postgresql.org/message-id/386032.1709765547%40sss.pgh.pa.us
> 3. https://www.postgresql.org/message-id/CAPpHfduZqAjF%2B7rDRP-RGNHjOXy7nvFROQ0MGS436f8FPY5DpQ%40mail.gmail.com
> 4. https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e07ebd4b

Thank you for your feedback!

Initially I didn't intend to backpatch any of these.  But on second
thought with the references you provided, I think we should backpatch
simple check_stack_depth() checks from d57b7cc333 to all supported
branches, but apply refactoring of memory contextes and transaction
commit/abort just to master.  Opinions?

------
Regards,
Alexander Korotkov



Re: Stack overflow issue

From
Alexander Korotkov
Date:
On Thu, Mar 7, 2024 at 1:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Alexander Korotkov <aekorotkov@gmail.com> writes:
> > Sorry for tediousness, but isn't pre-order a variation of depth-first order
> > [1]?
>
> To me, depth-first implies visiting children before parents.
> Do I have the terminology wrong?

According to Wikipedia, depth-first is a general term describing the
tree traversal algorithm, which goes as deep as possible in one branch
before visiting other branches.  The order of between parents and
children, and between siblings specifies the variation of depth-first
search, and pre-order is one of them.  But "pre-order" is the most
accurate term for MemoryContextTraverseNext() anyway.

------
Regards,
Alexander Korotkov



Re: Stack overflow issue

From
Alexander Korotkov
Date:
On Thu, Mar 7, 2024 at 11:07 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Thu, Mar 7, 2024 at 9:53 AM Egor Chindyaskin <kyzevan23@mail.ru> wrote:
> >
> > > 6 march 2024 г., at 19:17, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > >
> > > The revised set of remaining patches is attached.
> > >
> > > 0001 Turn tail recursion into iteration in CommitTransactionCommand()
> > > I did minor revision of comments and code blocks order to improve the
> > > readability.
> > >
> > > 0002 Avoid stack overflow in ShowTransactionStateRec()
> > > I didn't notice any issues, leave this piece as is.
> > >
> > > 0003 Avoid recursion in MemoryContext functions
> > > I've renamed MemoryContextTraverse() => MemoryContextTraverseNext(),
> > > which I think is a bit more intuitive.  Also I fixed
> > > MemoryContextMemConsumed(), which was still trying to use the removed
> > > argument "print" of MemoryContextStatsInternal() function.
> > >
> > > Generally, I think this patchset fixes important stack overflow holes.
> > > It is quite straightforward, clear and the code has a good shape.  I'm
> > > going to push this if no objections.
> >
> > I have tested the scripts from message [1]. After applying these patches and Tom Lane’s patch from message [2], all
ofthe above mentioned functions no longer caused the server to crash. I also tried increasing the values in the
presentedscripts, which also did not lead to server crashes. Thank you! 
> > Also, I would like to clarify something. Will fixes from message [3] and others be backported to all other
branches,not just the master branch? As far as I remember, Tom Lane made corrections to all branches. For example [4]. 
> >
> > Links:
> > 1. https://www.postgresql.org/message-id/343ff14f-3060-4f88-9cc6-efdb390185df%40mail.ru
> > 2. https://www.postgresql.org/message-id/386032.1709765547%40sss.pgh.pa.us
> > 3. https://www.postgresql.org/message-id/CAPpHfduZqAjF%2B7rDRP-RGNHjOXy7nvFROQ0MGS436f8FPY5DpQ%40mail.gmail.com
> > 4. https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e07ebd4b
>
> Thank you for your feedback!
>
> Initially I didn't intend to backpatch any of these.  But on second
> thought with the references you provided, I think we should backpatch
> simple check_stack_depth() checks from d57b7cc333 to all supported
> branches, but apply refactoring of memory contextes and transaction
> commit/abort just to master.  Opinions?

I've just backpatched check_stack_depth() checks to all supported branches.

------
Regards,
Alexander Korotkov



Re: Stack overflow issue

From
Andres Freund
Date:
Hi,

On 2024-03-06 14:17:23 +0200, Alexander Korotkov wrote:
> 0001 Turn tail recursion into iteration in CommitTransactionCommand()
> I did minor revision of comments and code blocks order to improve the
> readability.

After sending
https://www.postgresql.org/message-id/20240414223305.m3i5eju6zylabvln%40awork3.anarazel.de
I looked some more at important areas where changes didn't have code
coverage. One thing I noticed was that the "non-internal" part of
AbortCurrentTransaction() is uncovered:
https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/transam/xact.c.gcov.html#L3403

Which made me try to understand fefd9a3fed2.  I'm a bit confused about why
some parts are handled in CommitCurrentTransaction()/AbortCurrentTransaction()
and others are in the *Internal functions.

I understand that fefd9a3fed2 needed to remove the recursion in
CommitTransactionCommand()/AbortCurrentTransaction(). But I don't understand
why that means having some code in in the non-internal and some in the
internal functions?  Wouldn't it be easier to just have all the state handling
code in the Internal() function and just break after the
CleanupSubTransaction() calls?


That's of course largely unrelated to the coverage aspects. I just got
curious.

Greetings,

Andres Freund



Re: Stack overflow issue

From
Alexander Korotkov
Date:
On Tue, Apr 16, 2024 at 1:48 AM Andres Freund <andres@anarazel.de> wrote:
> On 2024-03-06 14:17:23 +0200, Alexander Korotkov wrote:
> > 0001 Turn tail recursion into iteration in CommitTransactionCommand()
> > I did minor revision of comments and code blocks order to improve the
> > readability.
>
> After sending
> https://www.postgresql.org/message-id/20240414223305.m3i5eju6zylabvln%40awork3.anarazel.de
> I looked some more at important areas where changes didn't have code
> coverage. One thing I noticed was that the "non-internal" part of
> AbortCurrentTransaction() is uncovered:
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/transam/xact.c.gcov.html#L3403
>
> Which made me try to understand fefd9a3fed2.  I'm a bit confused about why
> some parts are handled in CommitCurrentTransaction()/AbortCurrentTransaction()
> and others are in the *Internal functions.
>
> I understand that fefd9a3fed2 needed to remove the recursion in
> CommitTransactionCommand()/AbortCurrentTransaction(). But I don't understand
> why that means having some code in in the non-internal and some in the
> internal functions?  Wouldn't it be easier to just have all the state handling
> code in the Internal() function and just break after the
> CleanupSubTransaction() calls?

I'm not sure I correctly get what you mean.  Do you think the attached
patch matches the direction you're pointing?  The patch itself is not
final, it requires cleanup and comments revision, just to check the
direction.

------
Regards,
Alexander Korotkov

Attachment

Re: Stack overflow issue

From
Andres Freund
Date:
Hi,

On 2024-04-16 15:45:42 +0300, Alexander Korotkov wrote:
> On Tue, Apr 16, 2024 at 1:48 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2024-03-06 14:17:23 +0200, Alexander Korotkov wrote:
> > > 0001 Turn tail recursion into iteration in CommitTransactionCommand()
> > > I did minor revision of comments and code blocks order to improve the
> > > readability.
> >
> > After sending
> > https://www.postgresql.org/message-id/20240414223305.m3i5eju6zylabvln%40awork3.anarazel.de
> > I looked some more at important areas where changes didn't have code
> > coverage. One thing I noticed was that the "non-internal" part of
> > AbortCurrentTransaction() is uncovered:
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/transam/xact.c.gcov.html#L3403
> >
> > Which made me try to understand fefd9a3fed2.  I'm a bit confused about why
> > some parts are handled in CommitCurrentTransaction()/AbortCurrentTransaction()
> > and others are in the *Internal functions.
> >
> > I understand that fefd9a3fed2 needed to remove the recursion in
> > CommitTransactionCommand()/AbortCurrentTransaction(). But I don't understand
> > why that means having some code in in the non-internal and some in the
> > internal functions?  Wouldn't it be easier to just have all the state handling
> > code in the Internal() function and just break after the
> > CleanupSubTransaction() calls?
> 
> I'm not sure I correctly get what you mean.  Do you think the attached
> patch matches the direction you're pointing?  The patch itself is not
> final, it requires cleanup and comments revision, just to check the
> direction.

Something like that, yea. The split does seem less confusing that way to me,
but also not 100% certain.

Greetings,

Andres Freund



Re: Stack overflow issue

From
Alexander Korotkov
Date:
On Tue, Apr 16, 2024 at 6:35 PM Andres Freund <andres@anarazel.de> wrote:
> On 2024-04-16 15:45:42 +0300, Alexander Korotkov wrote:
> > On Tue, Apr 16, 2024 at 1:48 AM Andres Freund <andres@anarazel.de> wrote:
> > > On 2024-03-06 14:17:23 +0200, Alexander Korotkov wrote:
> > > > 0001 Turn tail recursion into iteration in CommitTransactionCommand()
> > > > I did minor revision of comments and code blocks order to improve the
> > > > readability.
> > >
> > > After sending
> > > https://www.postgresql.org/message-id/20240414223305.m3i5eju6zylabvln%40awork3.anarazel.de
> > > I looked some more at important areas where changes didn't have code
> > > coverage. One thing I noticed was that the "non-internal" part of
> > > AbortCurrentTransaction() is uncovered:
> > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/transam/xact.c.gcov.html#L3403
> > >
> > > Which made me try to understand fefd9a3fed2.  I'm a bit confused about why
> > > some parts are handled in CommitCurrentTransaction()/AbortCurrentTransaction()
> > > and others are in the *Internal functions.
> > >
> > > I understand that fefd9a3fed2 needed to remove the recursion in
> > > CommitTransactionCommand()/AbortCurrentTransaction(). But I don't understand
> > > why that means having some code in in the non-internal and some in the
> > > internal functions?  Wouldn't it be easier to just have all the state handling
> > > code in the Internal() function and just break after the
> > > CleanupSubTransaction() calls?
> >
> > I'm not sure I correctly get what you mean.  Do you think the attached
> > patch matches the direction you're pointing?  The patch itself is not
> > final, it requires cleanup and comments revision, just to check the
> > direction.
>
> Something like that, yea. The split does seem less confusing that way to me,
> but also not 100% certain.

Thank you for your feedback.  I'm going to go ahead and polish this patch.

------
Regards,
Alexander Korotkov



Re: Stack overflow issue

From
Alexander Korotkov
Date:
On Tue, Apr 16, 2024 at 7:42 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> On Tue, Apr 16, 2024 at 6:35 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2024-04-16 15:45:42 +0300, Alexander Korotkov wrote:
> > > On Tue, Apr 16, 2024 at 1:48 AM Andres Freund <andres@anarazel.de> wrote:
> > > > On 2024-03-06 14:17:23 +0200, Alexander Korotkov wrote:
> > > > > 0001 Turn tail recursion into iteration in CommitTransactionCommand()
> > > > > I did minor revision of comments and code blocks order to improve the
> > > > > readability.
> > > >
> > > > After sending
> > > > https://www.postgresql.org/message-id/20240414223305.m3i5eju6zylabvln%40awork3.anarazel.de
> > > > I looked some more at important areas where changes didn't have code
> > > > coverage. One thing I noticed was that the "non-internal" part of
> > > > AbortCurrentTransaction() is uncovered:
> > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/transam/xact.c.gcov.html#L3403
> > > >
> > > > Which made me try to understand fefd9a3fed2.  I'm a bit confused about why
> > > > some parts are handled in CommitCurrentTransaction()/AbortCurrentTransaction()
> > > > and others are in the *Internal functions.
> > > >
> > > > I understand that fefd9a3fed2 needed to remove the recursion in
> > > > CommitTransactionCommand()/AbortCurrentTransaction(). But I don't understand
> > > > why that means having some code in in the non-internal and some in the
> > > > internal functions?  Wouldn't it be easier to just have all the state handling
> > > > code in the Internal() function and just break after the
> > > > CleanupSubTransaction() calls?
> > >
> > > I'm not sure I correctly get what you mean.  Do you think the attached
> > > patch matches the direction you're pointing?  The patch itself is not
> > > final, it requires cleanup and comments revision, just to check the
> > > direction.
> >
> > Something like that, yea. The split does seem less confusing that way to me,
> > but also not 100% certain.
>
> Thank you for your feedback.  I'm going to go ahead and polish this patch.

I've invested more time into polishing this.  I'm intended to push
this.  Could you, please, take a look before?

------
Regards,
Alexander Korotkov

Attachment

Re: Stack overflow issue

From
Alexander Korotkov
Date:
On Wed, Apr 17, 2024 at 2:37 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Tue, Apr 16, 2024 at 7:42 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > On Tue, Apr 16, 2024 at 6:35 PM Andres Freund <andres@anarazel.de> wrote:
> > > On 2024-04-16 15:45:42 +0300, Alexander Korotkov wrote:
> > > > On Tue, Apr 16, 2024 at 1:48 AM Andres Freund <andres@anarazel.de> wrote:
> > > > > On 2024-03-06 14:17:23 +0200, Alexander Korotkov wrote:
> > > > > > 0001 Turn tail recursion into iteration in CommitTransactionCommand()
> > > > > > I did minor revision of comments and code blocks order to improve the
> > > > > > readability.
> > > > >
> > > > > After sending
> > > > > https://www.postgresql.org/message-id/20240414223305.m3i5eju6zylabvln%40awork3.anarazel.de
> > > > > I looked some more at important areas where changes didn't have code
> > > > > coverage. One thing I noticed was that the "non-internal" part of
> > > > > AbortCurrentTransaction() is uncovered:
> > > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/transam/xact.c.gcov.html#L3403
> > > > >
> > > > > Which made me try to understand fefd9a3fed2.  I'm a bit confused about why
> > > > > some parts are handled in CommitCurrentTransaction()/AbortCurrentTransaction()
> > > > > and others are in the *Internal functions.
> > > > >
> > > > > I understand that fefd9a3fed2 needed to remove the recursion in
> > > > > CommitTransactionCommand()/AbortCurrentTransaction(). But I don't understand
> > > > > why that means having some code in in the non-internal and some in the
> > > > > internal functions?  Wouldn't it be easier to just have all the state handling
> > > > > code in the Internal() function and just break after the
> > > > > CleanupSubTransaction() calls?
> > > >
> > > > I'm not sure I correctly get what you mean.  Do you think the attached
> > > > patch matches the direction you're pointing?  The patch itself is not
> > > > final, it requires cleanup and comments revision, just to check the
> > > > direction.
> > >
> > > Something like that, yea. The split does seem less confusing that way to me,
> > > but also not 100% certain.
> >
> > Thank you for your feedback.  I'm going to go ahead and polish this patch.
>
> I've invested more time into polishing this.  I'm intended to push
> this.  Could you, please, take a look before?

Just after sending this I spotted a typo s/untill/until/.  The updated
version is attached.

------
Regards,
Alexander Korotkov

Attachment

Re: Stack overflow issue

From
Andres Freund
Date:
Hi,

On 2024-04-17 14:39:14 +0300, Alexander Korotkov wrote:
> On Wed, Apr 17, 2024 at 2:37 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > I've invested more time into polishing this.  I'm intended to push
> > this.  Could you, please, take a look before?
> 
> Just after sending this I spotted a typo s/untill/until/.  The updated
> version is attached.

Nice, I see you moved the code back to "where it was", the diff to 16 got
smaller this way.


> +    /*
> +     * Repeatedly call CommitTransactionCommandInternal() until all the work
> +     * is done.
> +     */
> +    while (!CommitTransactionCommandInternal());

Personally I'd use
{
}
instead of just ; here. The above scans weirdly for me. But it's also not
important.

Greetings,

Andres Freund