Thread: "memory exhausted" in query parser/simplifier for many nested parentheses

"memory exhausted" in query parser/simplifier for many nested parentheses

From
Niklas Hambüchen
Date:
Hello,

we noticed that when writing a large query of form

    (((A OR B) OR C) OR ...)

with many terms (e.g. 13000) this causes one of two errors:

1. memory exhausted at or near "("
2. stack depth limit exceeded

This seems buggy, or at least unfortunate, as writing

    A OR B OR C OR ...

does not cause the errors.
This means the query fails depending on whether or not the user provided parentheses in the query.

Note the query doesn't have to involve `OR`; it can be anything involving many parentheses.

I tracked down the cause of the two errors:

* In the Bison parser:

      #define YYMAXDEPTH 10000

  http://git.savannah.gnu.org/cgit/bison.git/tree/src/parse-gram.c?h=v3.8.2#n1376
  This causes error:
      ERROR:  memory exhausted at or near "("

* In the expression simplifier:

            simplify_function()                at clauses.c:3921
      calls expression_tree_mutator()          at nodeFuncs.c:3117
      calls eval_const_expressions_mutator()   at clauses.c:2449
      calls simplify_function()

  This causes error:
      stack depth limit exceeded
  from
      https://github.com/postgres/postgres/blob/REL_14_12/src/backend/tcop/postgres.c#L3481-L3498
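
  To illustrate the second error: `eval_const_expressions_mutator()` uses one C stack frame per nesting level, so the nesting depth of the expression directly limits what can be simplified. A toy Python analogue of the same pattern (just a sketch, not PostgreSQL code):

      # Toy analogue: recursively folding a deeply nested expression uses one
      # stack frame per nesting level, like eval_const_expressions_mutator().
      def nested(depth):
          """Build ("lower", ("lower", ... "abc")) nested `depth` times."""
          node = "abc"
          for _ in range(depth):
              node = ("lower", node)
          return node

      def simplify(node):
          """Constant-fold lower(...) calls recursively."""
          if isinstance(node, tuple):
              func, arg = node
              return getattr(str, func)(simplify(arg))
          return node

      print(simplify(nested(100)))       # fine
      try:
          simplify(nested(100_000))      # exceeds the recursion limit
      except RecursionError:
          print("stack depth limit exceeded (the Python analogue)")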

A minimal repro for this can be found in:

    https://github.com/questdb/questdb/issues/3841#issuecomment-2536599109

(I'm not affiliated with QuestDB; it seems to be a database that also uses Postgres's parser.)

Repeating the repro here:

    #!/usr/bin/env python3

    import sys

    N = int(sys.argv[1])

    exp = ""
    for i in range(N):
        exp += "lower("

    exp += "'ABCDEFGHI'"

    for i in range(N):
        exp += ")"

    assert exp.count("(") == exp.count(")")
    print(exp)

Example invocation:

    psql -h localhost -p 5432 postgres -c "select $(python3 expression.py 10000); select 1"
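
For comparison, a variant of the generator (a sketch; the filename `or_forms.py` and the `1 = i` terms are placeholders) that emits the same N-term predicate either flat or nested, to show that only the nested form hits the limits:

    #!/usr/bin/env python3
    # Sketch: print the same N-term predicate either flat (A OR B OR ...) or
    # nested ((((A OR B) OR C) ...)); only the nested form hits the limits.
    import sys

    N = int(sys.argv[1])
    form = sys.argv[2] if len(sys.argv) > 2 else "flat"

    terms = [f"1 = {i}" for i in range(N)]

    if form == "flat":
        predicate = " OR ".join(terms)
    else:
        predicate = terms[0]
        for t in terms[1:]:
            predicate = f"({predicate} OR {t})"

    print(f"SELECT 1 WHERE {predicate}")

Example invocation of that sketch:

    psql -h localhost -p 5432 postgres -c "$(python3 or_forms.py 13000 nested)"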

For playing with the limits in Postgres, the `YYMAXDEPTH 10000` constant can be changed by manually editing it in
`gram.c` (generated from `gram.y`).
 

Both of the above problems stem from processing deeply nested input recursively on a bounded stack (the C call stack, or Bison's
manual stack implemented with `goto`).
 

The Bison parser runs before the expression simplifier.
So if the expression is too large, we get `memory exhausted`.
If the expression is small enough to make it to the simplifier, we get `stack depth limit exceeded` if it's still too large
for the simplifier.
 
The simplifier stack limit can be adjusted with

    max_stack_depth = 2MB     (the default)
    max_stack_depth = 7680kB  (possible maximum)

and this maximum is of course still too low if we have tens of thousands of terms.

We consider this a bug because it means one cannot easily generate large queries that query a couple thousand entries
in a `SELECT ... WHERE` condition.
 

It would be great if postgres could relax these limits, or at least parse everything that succeeds without parentheses
equally well when it is written with parentheses.
 

Thank you!
Niklas & Ruben



Re: "memory exhausted" in query parser/simplifier for many nested parentheses

From
Greg Sabino Mullane
Date:
While not agreeing that this is a bug (perhaps an under-documented but totally sane limit?), here is a shorter repro for the archives:

do $$begin execute format('%s', repeat('(',9999));end$$;

Cheers,
Greg

Niklas Hambüchen <mail@nh2.me> writes:
> we noticed that when writing a large query of form
>     (((A OR B) OR C) OR ...)
> with many terms (e.g. 13000) this causes one of two errors:
> 1. memory exhausted at or near "("
> 2. stack depth limit exceeded
> ...
> We consider this a bug because it means one cannot easily generate large queries that query a couple thousand entries
> in a `SELECT ... WHERE` condition.

[ shrug... ]  Even if these examples didn't fail, they would perform
terribly.  Limits are a fact of life.  Write your query some other
way, for example using "x IN (list)" or other shortcut syntaxes.
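
For instance, one way to apply that from application code is to send the list as a single array parameter with `= ANY(%s)` instead of building huge SQL text at all; a minimal sketch, assuming psycopg2, a reachable local database, and placeholder table/column names:

    # Sketch: pass the value list as one bound array parameter, so the SQL
    # text (and the parser's nesting depth) stays constant regardless of N.
    import psycopg2

    ids = list(range(13_000))  # the "couple thousand entries" from the report

    conn = psycopg2.connect("host=localhost port=5432 dbname=postgres")
    with conn, conn.cursor() as cur:
        # my_table / id are placeholder names
        cur.execute("SELECT count(*) FROM my_table WHERE id = ANY(%s)", (ids,))
        print(cur.fetchone()[0])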

            regards, tom lane



Re: "memory exhausted" in query parser/simplifier for many nested parentheses

From
Niklas Hambüchen
Date:
Hi Tom,

> Write your query some other way, for example using "x IN (list)" or other shortcut syntaxes.

As a user, how can I know that 10000 list entries in "x in (A, B, ...)" will not also hit an arbitrary parser
implementation detail hardcoded limit?
 
How shall the user have confidence that the parser is better at handling multiple commas than multiple parens?

None of that seems to be documented anywhere (`max_stack_depth` is, but the code doesn't even get that far, and
`YYMAXDEPTH` isn't).
 

In the absence of such docs, one must build systems that later fail at arbitrary limits when e.g. the user clicks some
larger number of checkboxes that construct a batch query.
 
There might be a different query that works for that number, but a user of postgres cannot know which construct will
work unless they read GNU Bison's source code.
 

If I build some workaround today, e.g. splitting the query into multiple ones of max length N, how do I know it will
still work in the future, e.g. if Postgres changes the Bison version or switches to a different parser?
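
For concreteness, the kind of splitting workaround I mean looks roughly like this (a sketch; `my_table` and the integer values are placeholders):

    # Sketch of the splitting workaround: chunk the value list into groups of
    # at most n terms and issue one query per chunk, hoping n stays safely
    # below whatever the parser limit happens to be.
    def batched_in_queries(column, values, n=1000):
        for i in range(0, len(values), n):
            chunk = values[i:i + n]
            literals = ", ".join(str(v) for v in chunk)
            yield f"SELECT * FROM my_table WHERE {column} IN ({literals})"

    # e.g. list(batched_in_queries("id", range(13_000)))  -> 13 separate queries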
 

The hardcoding of arbitrary small limits that don't scale is itself also a problem:
One cannot configure postgres to allow queries that the hardware is perfectly capable of handling.

It would be a bit like GCC refusing to compile a file if it contains more than 10000 words.

> Limits are a fact of life.  

I agree limits can be a fact of life, but life is better if you know what those limits are, or if you can set them,
versus having to guess and hope (it's especially difficult to build robust systems with "hope-based programming").
 



Re: "memory exhausted" in query parser/simplifier for many nested parentheses

From
Greg Sabino Mullane
Date:
On Fri, Dec 13, 2024 at 8:53 AM Niklas Hambüchen <mail@nh2.me> wrote:
As a user, how can I know that 10000 list entries in "x in (A, B, ...)" will not also hit an arbitrary parser implementation detail hardcoded limit?
How shall the user have confidence that the parser is better at handling multiple commas than multiple parens?

As an application developer, you test it, especially if your application has the potential to generate insanely large queries. For the record, the catch point on my system using Postgres 17 seems to be around 8.4 *MILLION* items for IN() lists:

greg=# \set VERBOSITY terse
greg=# do $$begin execute format('SELECT 1 WHERE 1 IN (%s1)', repeat('123,',8_385_000));end$$;
DO
greg=# do $$begin execute format('SELECT 1 WHERE 1 IN (%s1)', repeat('123,',8_390_000));end$$;
ERROR:  invalid memory alloc request size 1073741824
 
None of that seems to be documented anywhere

Documentation patches are always welcome. Perhaps at https://www.postgresql.org/docs/current/limits.html ?

In absence of such docs, one must build systems that later fail at arbitrary limits when e.g. the user clicks some larger number of checkboxes that construct a batch query.

That's a bit of a strawman argument: queries this large are not going to be caused by checkboxes being clicked.

If I build some workaround today, e.g. splitting the query into multiple ones of max length N, how do I know it will still work in the future, e.g. if Postgres changes the Bison version or switches to a different parser?

You don't, that's the nature of software. But it's fairly unlikely we'd switch to something that was MORE constraining than in the past. Still, don't play on the edge.
 
The hardcoding of arbitrary small limits that don't scale is itself also a problem:
One cannot configure postgres to allow queries that the hardware is perfectly capable of handling.

It would be a bit like GCC refusing to compile a file if it contains more than 10000 words.

No, that's not an apt comparison at all. We cannot simply push the parser to accept anything, regardless of the impact on other parts of the system. Software engineering is all about tradeoffs. I agree with your point about documentation, but this seems like trying to pick a fight.

Cheers,
Greg

On 12/13/24 15:44, Greg Sabino Mullane wrote:
> On Fri, Dec 13, 2024 at 8:53 AM Niklas Hambüchen <mail@nh2.me> wrote:
> 
>     As a user, how can I know that 10000 list entries in "x in (A,
>     B, ...)" will not also hit an arbitrary parser implementation detail
>     hardcoded limit?
>     How shall the user have confidence that the parser is better at
>     handling multiple commas than multiple parens?
> 
> 
> As an application developer, you test it, especially if your application
> has the potential to generate insanely large queries. For the record,
> the catch point on my system using Postgres 17 seems to be around 8.4
> *MILLION* items for IN() lists:
> 
> greg=# \set VERBOSITY terse
> greg=# do $$begin execute format('SELECT 1 WHERE 1 IN (%s1)',
> repeat('123,',8_385_000));end$$;
> DO
> greg=# do $$begin execute format('SELECT 1 WHERE 1 IN (%s1)',
> repeat('123,',8_390_000));end$$;
> ERROR:  invalid memory alloc request size 1073741824
>  
> 
>     None of that seems to be documented anywhere
> 
> 
> Documentation patches are always welcome. Perhaps at https://www.postgresql.org/docs/current/limits.html ?
> 

FWIW I don't think it's practical to document such limits in detail.
Everyone knows the limits exist, and there's a lot of them - various
libraries we rely on (and we have plenty of them) may have a couple of
them, etc. It'd be a huge effort, and it's not clear where to draw the
line. And I'm not aware of other projects documenting such stuff in
detail - e.g. the kernel certainly has a lot of such limits.


>     In absence of such docs, one must build systems that later fail at
>     arbitrary limits when e.g. the user clicks some larger number of
>     checkboxes that construct a batch query.
> 
> 
> That's a bit of a strawman argument: queries this large are not going to
> be caused by checkboxes being clicked.
> 

Not sure I understand the suggestion that "one must build systems that
later fail at arbitrary limits". It's a trade off, and it's perfectly
reasonable to not spend time optimizing for weirdly complex queries when
there's a much better/faster way to write the query, not hitting those
limits ...

>     If I build some workaround today, e.g. splitting the query into
>     multiple ones of max length N, how do I know it will still work in
>     the future, e.g. if Postgres changes the Bison version or switches
>     to a different parser?
> 
> 
> You don't, that's the nature of software. But it's fairly unlikely we'd
> switch to something that was MORE constraining than in the past. Still,
> don't play on the edge.
>  

Right. A lot of the mailing list traffic is discussions about possible
regressions. But even with that we have little control over changes in
the dependencies. This just means it's important to set baselines and do
testing as part of an upgrade.

> 
>     The hardcoding of arbitrary small limits that don't scale is itself
>     also a problem:
>     One cannot configure postgres to allow queries that the hardware is
>     perfectly capable of handling.
> 
>     It would be a bit like GCC refusing to compile a file if it
>     contains more than 10000 words.
> 
> 
> No, that's not an apt comparison at all. We cannot simply push the
> parser to accept anything, regardless of the impact on other parts of
> the system. Software engineering is all about tradeoffs. I agree with
> your point about documentation, but this seems like trying to pick a fight.
> 

Yeah, it's a matter of trade offs. Not just technical ones, but also
(and maybe in the first place) economical - which improvements to spend
time on to get the most benefit. Time is a finite resource.


regards

-- 
Tomas Vondra




Tomas Vondra <tomas@vondra.me> writes:
> On 12/13/24 15:44, Greg Sabino Mullane wrote:
>> Documentation patches are always welcome. Perhaps at https://www.postgresql.org/docs/current/limits.html ?

> FWIW I don't think it's practical to document such limits in detail.

Yeah.  To take the current example:

1. The grammar nesting limit is imposed by bison, not by us, and it
would be odd for us to be the ones to document it.  We don't even
know whether it's the same across all versions of bison.

2. The stack-overflow limit during optimization is going to manifest
at some ridiculously platform-dependent depth.  It'll depend on
max_stack_depth to begin with, and then the actual number of stack
frames that fit into that will depend on a huge number of factors,
and probably will change in every PG release.  The best the
documentation could possibly do is to mention that there's a limit,
which seems unhelpful.

> Everyone knows the limits exist, and there's a lot of them

This.

            regards, tom lane



Re: "memory exhausted" in query parser/simplifier for many nested parentheses

From
Niklas Hambüchen
Date:
Hi Greg,

thanks for testing the parser behaviour of "x in (A, B, ...)", it is useful to know that.

> queries this large are not going to be caused by checkboxes being clicked.

This is how I encountered it:
Doing a selection of 10k files (equivalent to the Shift-click range-selection in most GUI file managers), and using
them in a query.
 

> We cannot simply push the parser to accept anything, regardless of the impact on other parts of the system.

I'm not suggesting to remove limits regardless of impact.
There are good limits and bad limits:

I think the happy path is when programs scale to the users' machines and tasks, constrained by limits that achieve
specific goals.
 
Many of postgres's limits from https://www.postgresql.org/docs/17/runtime-config-resource.html achieve the goal of not
having queries blow up unreasonably.
 
They are good limits, set to sensible defaults by developers, and tunable by users to fit their workloads.
Same for most of Linux's limits, such as `ulimit`.


In contrast, the impact of `YYMAXDEPTH` on postgres looks like a historical accident.

* It was set to `YMAXDEPTH 500` by Richard Stallman in _1988_ in Bison commit f791019c "entered into RCS" (thus
  probably even older).
 
* Then changed to `YYMAXDEPTH 10000` in 1993.

So we're looking at a more than 35-year-old hardcoded value; a bad limit that probably should always have been a runtime-configurable
parameter.

Maybe in 1993 "10000 of something" was a huge limit, but it isn't today.
Most modern parsers do not hardcode key aspects such as nesting depth, because what counts as "a lot" of course depends
on the application, and it often isn't 4 or 500 or 10000.
 
It also doesn't fit well with what Postgres usually delivers:
As you showed, postgres by default supports "millions of things" in queries, and it really feels like a surprise bug
that this limit is much lower when you write the query in an equally long but slightly different way, because that
specific syntax's code path goes through a library with hardcoded values from 1993.
 

So I think the ideal solution would be:

* Make `YYMAXDEPTH` configurable at run-time (it always should have been), and add a postgres config for it with a good
default.

Of course it's not for me to judge whether this is a high-priority topic for postgres, but I think it would be fair to
classify it as a "one of the tools we use hardcodes a silly limit that doesn't fit with our other limits and thus triggers
user surprise" bug, as opposed to "postgres authors designed it this way".
 

Greetings,
Niklas



Re: "memory exhausted" in query parser/simplifier for many nested parentheses

From
"David G. Johnston"
Date:
On Fri, Dec 13, 2024 at 6:53 AM Niklas Hambüchen <mail@nh2.me> wrote:
If I build some workaround today, e.g. splitting the query into multiple ones of max length N, how do I know it will still work in the future, e.g. if Postgres changes the Bison version or switches to a different parser?

If you work around it by doing "create temp table" and "copy as many rows into it as you'd like", the reality of the limit here disappears.
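
A minimal sketch of that approach, assuming psycopg2, a reachable local database, and placeholder table/column names:

    # Sketch: load the values into a temp table with COPY, then join against
    # it, so no large SQL text ever reaches the parser.
    import io
    import psycopg2

    ids = range(13_000)

    conn = psycopg2.connect("host=localhost port=5432 dbname=postgres")
    with conn, conn.cursor() as cur:
        cur.execute("CREATE TEMP TABLE wanted_ids (id bigint) ON COMMIT DROP")
        buf = io.StringIO("".join(f"{i}\n" for i in ids))
        cur.copy_expert("COPY wanted_ids FROM STDIN", buf)
        # my_table / id are placeholder names
        cur.execute("SELECT t.* FROM my_table t JOIN wanted_ids w ON w.id = t.id")
        rows = cur.fetchall()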

You also gain the added benefit of having less potential exposure to SQL-injection by using a code path that doesn't require placing many potentially user-supplied string literals into a query body.

In some ways this is a design choice that encourages the user to write a better query form that the system has been optimized around.

I don't disagree with the premise that such hard-coded limits are undesirable but they also aren't always worth getting rid of, especially if they are inherited from an upstream dependency.

David J.

"David G. Johnston" <david.g.johnston@gmail.com> writes:
> I don't disagree with the premise that such hard-coded limits are
> undesirable but they also aren't always worth getting rid of, especially if
> they are inherited from an upstream dependency.

Having said all that ... I think there is a case here for raising
the core parser's YYMAXDEPTH, which would be a trivial matter of
inserting a #define for it.  Bison's default value of 10000
was probably set decades ago for much smaller hardware.  If I'm
computing it right, the space consumed per stack entry in PG
is 22 bytes (on 64-bit hardware), so that that limit corresponds
to only 220KB in stack, a fairly negligible number.  We could
make it 10 or 100 times larger without anyone noticing ---
especially since this is the upper limit for resizing the stack,
not the amount of space used to parse simple inputs.  That would
get us to a point where expression complexity would almost always
be limited by max_stack_depth, which the user has control over.

In theory we could make the limit as high as around 45M stack
levels, which would approach palloc's 1GB chunk limit.  But
I can't foresee a reason to let the parser stack get anywhere near
that big.  If you wrote millions of unmatched left parentheses,
you're just trying to cause trouble.  So I'd vote for going to
100K or 1M.
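
For reference, the arithmetic above, spelled out (approximate, using the 22-bytes-per-entry figure):

    # Rough check of the stack-size figures above (22 bytes per entry, 64-bit):
    per_entry = 22
    print(per_entry * 10_000)      # 220000 bytes ~= 220 KB at the default YYMAXDEPTH of 10000
    print(per_entry * 1_000_000)   # 22 MB even at a 100x larger limit of 1M entries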

            regards, tom lane