Thread: BUG #18274: Error 'invalid XML content'

BUG #18274: Error 'invalid XML content'

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      18274
Logged by:          Dmitry Koval
Email address:      d.koval@postgrespro.ru
PostgreSQL version: 16.1
Operating system:   Ubuntu 22.04
Description:

Hello!
It's easy to get an 'invalid XML content' error when using UTF-8 special
characters:

>select length((repeat('ї', 10 * 1000 * 1000))::xml::text::bytea);
ERROR:  invalid XML content
DETAIL:  line 1: xmlSAX2Characters: huge text node
їїїїїїїїїїїїїїїїїїїїїїїїїїїїїїїїїїїїїїїї

This error is not directly related to UTF-8, since this query is processed
without an error:

>select length((repeat('a', 100 * 1000 * 1000))::xml::text::bytea);
  length   
-----------
 100000000
(1 row)


The problem is in the libxml2 library (in xmlParseBalancedChunkMemory
function), which is used in PostgreSQL and does not support the
XML_PARSE_HUGE flag.
There have been attempts to correct this problem [1].
Apparently they were unsuccessful because libxml2 technical support refused
to fix the xmlParseBalancedChunkMemory function.

I'd like to know what the community's opinion is regarding this error:
1) the error is correct and does not need to be corrected;
2) corrections should be made in the libxml2 library;
3) corrections should be made in PostgreSQL (maybe need to stop using the
xmlParseBalancedChunkMemory function or make other corrections);
4) ...?

[1] https://gitlab.gnome.org/GNOME/libxml2/-/issues/167
----
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com


Re: BUG #18274: Error 'invalid XML content'

From
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes:
> The problem is in the libxml2 library (in xmlParseBalancedChunkMemory
> function), which is used in PostgreSQL and does not support the
> XML_PARSE_HUGE flag.
> There have been attempts to correct this problem [1].
> Apparently they were unsuccessful because libxml2 technical support refused
> to fix the xmlParseBalancedChunkMemory function.

> I'd like to know what the community's opinion is regarding this error:
> 1) the error is correct and does not need to be corrected;
> 2) corrections should be made in the libxml2 library;
> 3) corrections should be made in PostgreSQL (maybe need to stop using the
> xmlParseBalancedChunkMemory function or make other corrections);
> 4) ...?

libxml2 seems to be a little moribund, so I'm not sure that waiting
for #2 to happen will yield results.  On the other hand, XML is
pretty much a development backwater in Postgres too, so if you are
hoping for somebody else to do #3 you are unlikely to get anywhere.
If you want to work on #3 yourself, have at it.

            regards, tom lane



Re: BUG #18274: Error 'invalid XML content'

From
Dmitry Koval
Date:
Hi!

Attached a patch that adds the use of XML_PARSE_HUGE flag for libxml2 
functions and replaces some functions (that do not support this flag) 
with their equivalents.

Using libxml2 library functions with support of XML_PARSE_HUGE flag 
increases maximum size allowed for a single text node from 10.000.000 to 
1.000.000.000 (see XML_MAX_TEXT_LENGTH macro, 
libxml2/include/libxml/parserInternals.h) which in most cases solves the 
problem with insufficient memory.

What do you think about the patch?
Maybe it would be a good idea to add a GUC-variable for using of the 
XML_PARSE_HUGE flag? (The current behavior without XML_PARSE_HUGE flag 
is default).

-- 
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com
Attachment

Re: BUG #18274: Error 'invalid XML content'

From
Tom Lane
Date:
Dmitry Koval <d.koval@postgrespro.ru> writes:
> Attached a patch that adds the use of XML_PARSE_HUGE flag for libxml2 
> functions and replaces some functions (that do not support this flag) 
> with their equivalents.

Thanks for doing this work!  Please add an entry to the upcoming
commitfest[1] to ensure we don't forget about it.

> Maybe it would be a good idea to add a GUC-variable for using of the 
> XML_PARSE_HUGE flag?

That strikes me as a completely awful idea.  Is there some downside
to XML_PARSE_HUGE?

            regards, tom lane

[1] https://commitfest.postgresql.org/47/



Re: BUG #18274: Error 'invalid XML content'

From
Dmitry Koval
Date:
Thanks!

 > Thanks for doing this work!  Please add an entry to the upcoming
 > commitfest[1] to ensure we don't forget about it.

Added.

 > Is there some downside to XML_PARSE_HUGE?

I didn't see any problems during simple testing of patch with the 
XML_PARSE_HUGE.
Extended testing will be performed soon.
Then (I hope) we will send a trial version of PostgreSQL with a patch to 
customers who use XML.

-- 
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com



Re: BUG #18274: Error 'invalid XML content'

From
Michael Paquier
Date:
On Sun, Jan 14, 2024 at 11:58:07PM +0300, Dmitry Koval wrote:
>> Is there some downside to XML_PARSE_HUGE?
>
> I didn't see any problems during simple testing of patch with the
> XML_PARSE_HUGE.
> Extended testing will be performed soon.
> Then (I hope) we will send a trial version of PostgreSQL with a patch to
> customers who use XML.

If one looks at the libxml2 like this mirror at [1], it is possible to
see that the flag is only used to lift internal hard limits, for stuff
like XML_MAX_TEXT_LENGTH and XML_MAX_NAME_LENGTH for size control, or
max node depths.

Knowing that we have full control of the memory contexts for the XML
nodes, just enforcing the huge flag does not seem like there's any
downside for us.  (Right?)

[1]: https://github.com/GNOME/libxml2
--
Michael

Attachment

Re: BUG #18274: Error 'invalid XML content'

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
>>> Is there some downside to XML_PARSE_HUGE?

> If one looks at the libxml2 like this mirror at [1], it is possible to
> see that the flag is only used to lift internal hard limits, for stuff
> like XML_MAX_TEXT_LENGTH and XML_MAX_NAME_LENGTH for size control, or
> max node depths.

I dug through those sources and I concur that mostly, setting that
flag just results in replacing arbitrary hard-coded limits with higher
arbitrary hard-coded limits.  The one place I found where it looks
like the clamps come off entirely is that the XML_MAX_DICTIONARY_LIMIT
on the number of entries in a dictionary is replaced by "unlimited".
Given that in our usage the input string will be limited to 1GB, the
number of entries you could possibly create is still pretty finite.

> Knowing that we have full control of the memory contexts for the XML
> nodes, just enforcing the huge flag does not seem like there's any
> downside for us.  (Right?)

Blowing out a backend's memory or CPU consumption is not something
we try hard to prevent, so I'm not terribly worried on that score.
The one thing I'm concerned about is that raising these limits could
make bugs (like integer overflow problems) reachable that were not
otherwise, and that such bugs might rise to the level of security
problems.  They've had such issues before (CVE-2022-40303) and it'd be
foolish to be sure that none remain.  Still, that's clearly their bug
not our bug.

On the whole I'm not too worried, and even if I were, I doubt that
an enabling GUC would be the answer.  We'd have to make it SUSET
and default to off for it to be a credible security defense, and that
seems like an excessive amount of paranoia.  Besides, I believe that
downstream packagers who don't trust libxml2 are already just building
PG without XML support.

            regards, tom lane



Re: BUG #18274: Error 'invalid XML content'

From
Michael Paquier
Date:
On Sun, Jan 14, 2024 at 10:16:33PM -0500, Tom Lane wrote:
> Blowing out a backend's memory or CPU consumption is not something
> we try hard to prevent, so I'm not terribly worried on that score.
> The one thing I'm concerned about is that raising these limits could
> make bugs (like integer overflow problems) reachable that were not
> otherwise, and that such bugs might rise to the level of security
> problems.  They've had such issues before (CVE-2022-40303) and it'd be
> foolish to be sure that none remain.  Still, that's clearly their bug
> not our bug.

Interesting.  We could always keep our coding more defensive, not sure
entirely how.  I am not sure that this is enough to not just use the
upper limit, though.  Being able to manipulate larger XML elements
sounds like a fair argument from the user perspective these days,
especially with memory being cheaper and larger.

1fb2e0dfc631 has added the huge option back in 2009 in libxml2, so
it's been around for some time.
--
Michael

Attachment

Re: BUG #18274: Error 'invalid XML content'

From
Dmitry Koval
Date:
Hi!

 > Knowing that we have full control of the memory contexts for the XML
 > nodes, just enforcing the huge flag does not seem like there's any
 > downside for us. (Right?)

I think that's right (flag XML_PARSE_HUGE shouldn't cause any problems).
My main doubts are related to the replacement of the 
xmlParseBalancedChunkMemory() function (that haven't argument for pass 
XML_PARSE_HUGE flag) with xmlNewNode() + xmlParseInNodeContext() 
functions (create a fake node and pass the XML_PARSE_HUGE flag to 
xmlParseInNodeContext function).

I'm not sure if this replacement is 100% equivalent (although simple 
tests work the same).

-- 
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com



Re: BUG #18274: Error 'invalid XML content'

From
Michael Paquier
Date:
On Mon, Jan 15, 2024 at 06:47:17PM +0300, Dmitry Koval wrote:
> I think that's right (flag XML_PARSE_HUGE shouldn't cause any problems).
> My main doubts are related to the replacement of the
> xmlParseBalancedChunkMemory() function (that haven't argument for pass
> XML_PARSE_HUGE flag) with xmlNewNode() + xmlParseInNodeContext() functions
> (create a fake node and pass the XML_PARSE_HUGE flag to
> xmlParseInNodeContext function).
>
> I'm not sure if this replacement is 100% equivalent (although simple tests
> work the same).

Hmm, it looks like this is actually equivalent in terms of parsing a
well-balanced chunk.  This was suggested in the upstream ticket you
have opened and I find that pretty cool, reusing the trick of a fake
root node to use the other API.  Now, there are two things that we'd
better do here:
- Document in a comment why a fake root node is necessary (aka the
current routines don't give enough control over the limits you'd like
to enforce).
- The top comment of xml_parse() still mentions
xmlParseBalancedChunkMemory() as an effect of 483bdb2afec9, so this
needs to be updated.

The switch xmlParseMemory() -> xmlReadMemory() is recommended by the
upstream docs and the former is deprecated:
https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-parser.html#xmlParseMemory

Also, it may be worth double-checking if there are other things marked
as deprecated in the upstream doc, and switch to "newer" things.

It seems like anything discussed here should only be done on HEAD.
I suspect that the buildfarm may get bumpy on that, but let's see.
--
Michael

Attachment

Re: BUG #18274: Error 'invalid XML content'

From
Dmitry Koval
Date:
Hi!

Thanks for the recommendations.

 >- Document in a comment why a fake root node is necessary (aka the
 > current routines don't give enough control over the limits you'd
 >like to enforce).

Comment expanded (see src/backend/utils/adt/xml.c).

 >- The top comment of xml_parse() still mentions
 >xmlParseBalancedChunkMemory() as an effect of 483bdb2afec9, so this
 >needs to be updated.

Updated.

 >Also, it may be worth double-checking if there are other things marked
 >as deprecated in the upstream doc, and switch to "newer" things.

I found another deprecated function xmlSubstituteEntitiesDefault [1] 
(contrib/xml2). xmlSubstituteEntitiesDefault was removed and added the 
XML_PARSE_NOENT flag in functions xmlReadMemory instead it.

[1]https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-parser.html#xmlSubstituteEntitiesDefault

-- 
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com
Attachment

Re: BUG #18274: Error 'invalid XML content'

From
Michael Paquier
Date:
On Tue, Jan 16, 2024 at 04:55:34PM +0300, Dmitry Koval wrote:
>> - Document in a comment why a fake root node is necessary (aka the
>>  current routines don't give enough control over the limits you'd
>> like to enforce).
>
> Comment expanded (see src/backend/utils/adt/xml.c).
>
>> - The top comment of xml_parse() still mentions
>> xmlParseBalancedChunkMemory() as an effect of 483bdb2afec9, so this
>> needs to be updated.
>
> Updated.

I still need to evaluate a bit more this part, though that looks fine.

>> Also, it may be worth double-checking if there are other things marked
>> as deprecated in the upstream doc, and switch to "newer" things.
>
> I found another deprecated function xmlSubstituteEntitiesDefault [1]
> (contrib/xml2). xmlSubstituteEntitiesDefault was removed and added the
> XML_PARSE_NOENT flag in functions xmlReadMemory instead it.

This one had better be done first because it is required by your
original issue, and that's what could make the buildfarm shaky.

I have checked the other XML calls in the tree and did not spot
anything else that ought to be changed, so I have extracted this stuff
from your v2 and applied it on HEAD.  Let's see how it goes.
--
Michael

Attachment

Re: BUG #18274: Error 'invalid XML content'

From
Michael Paquier
Date:
On Wed, Jan 17, 2024 at 08:59:26AM +0900, Michael Paquier wrote:
> This one had better be done first because it is required by your
> original issue, and that's what could make the buildfarm shaky.
>
> I have checked the other XML calls in the tree and did not spot
> anything else that ought to be changed, so I have extracted this stuff
> from your v2 and applied it on HEAD.  Let's see how it goes.

The security team has discussed 2197d06224a1 after a report from
coverity regarding the effects that issues like [1] would create in
the backend, and concluded that this patch should be reverted because
this could cause the backend to waste plently of CPU and/or memory
even if the application applied checks on the size of the data given
in input, and libxml2 does not offer guarantees that input limits are
respected under XML_PARSE_HUGE.

So I am planning to do do so in the next 24 hours.  Note that this
does not impact 65c5864d7fac, as XML_PARSE_NOENT is an immediate
substitute of xmlSubstituteEntitiesDefault().

[1]: https://en.wikipedia.org/wiki/Billion_laughs_attack
--
Michael

Attachment

Re: BUG #18274: Error 'invalid XML content'

From
Dmitry Koval
Date:
> he security team has discussed 2197d06224a1 after a report from
> coverity regarding the effects that issues like [1] would create in
> the backend, and concluded that this patch should be reverted because
> this could cause the backend to waste plently of CPU and/or memory
> even if the application applied checks on the size of the data given
> in input, and libxml2 does not offer guarantees that input limits are
> respected under XML_PARSE_HUGE.

Thanks for info!

I agree that reverting a patch is a good idea if there are concerns 
about server resources (XML is used by few users and there are even 
fewer users who need to parse elements larger than 10Mb).
For such users it is better to create custom PostgreSQL build.

-- 
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com



Re: BUG #18274: Error 'invalid XML content'

From
Michael Paquier
Date:
On Thu, Jan 25, 2024 at 03:12:07PM +0300, Dmitry Koval wrote:
> I agree that reverting a patch is a good idea if there are concerns about
> server resources (XML is used by few users and there are even fewer users
> who need to parse elements larger than 10Mb).
> For such users it is better to create custom PostgreSQL build.

And done with f2743a7d70e7.
--
Michael

Attachment

Re: BUG #18274: Error 'invalid XML content'

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> On Thu, Jan 25, 2024 at 03:12:07PM +0300, Dmitry Koval wrote:
>> I agree that reverting a patch is a good idea if there are concerns about
>> server resources (XML is used by few users and there are even fewer users
>> who need to parse elements larger than 10Mb).
>> For such users it is better to create custom PostgreSQL build.

> And done with f2743a7d70e7.

Related to this: I just read some interesting things in libxml2 2.12's
release notes:

    Most of the known issues leading to quadratic behavior in the XML parser
    were fixed. Internal hash tables were rewritten to reduce memory
    consumption.

    A new API function xmlCtxtSetMaxAmplification was added to allow parsing
    of files that would otherwise trigger the billion laughs protection.

Could it be that if we see this new function is available and use it,
we could allow more than we have done historically?  I don't have a
whole lot of faith here, but perhaps this is worth investigation.

(BTW, 2.12 has created some annoying API breaks, which seems to be why
caiman is failing.  We have some work to do there in any case.)

            regards, tom lane