Thread: BUG #18274: Error 'invalid XML content'
The following bug has been logged on the website:

Bug reference:      18274
Logged by:          Dmitry Koval
Email address:      d.koval@postgrespro.ru
PostgreSQL version: 16.1
Operating system:   Ubuntu 22.04
Description:

Hello!

It's easy to get an 'invalid XML content' error when using UTF-8 special
characters:

>select length((repeat('ї', 10 * 1000 * 1000))::xml::text::bytea);
ERROR:  invalid XML content
DETAIL:  line 1: xmlSAX2Characters: huge text node
їїїїїїїїїїїїїїїїїїїїїїїїїїїїїїїїїїїїїїїї

This error is not directly related to UTF-8, since this query is processed
without an error:

>select length((repeat('a', 100 * 1000 * 1000))::xml::text::bytea);
  length
-----------
 100000000
(1 row)

The problem is in the libxml2 library (in xmlParseBalancedChunkMemory
function), which is used in PostgreSQL and does not support the
XML_PARSE_HUGE flag.

There have been attempts to correct this problem [1]. Apparently they were
unsuccessful because libxml2 technical support refused to fix the
xmlParseBalancedChunkMemory function.

I'd like to know what the community's opinion is regarding this error:
1) the error is correct and does not need to be corrected;
2) corrections should be made in the libxml2 library;
3) corrections should be made in PostgreSQL (maybe need to stop using the
   xmlParseBalancedChunkMemory function or make other corrections);
4) ...?

[1] https://gitlab.gnome.org/GNOME/libxml2/-/issues/167

----
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com
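For context, the API asymmetry behind this report is visible in the libxml2
prototypes themselves. The declarations below are paraphrased from
<libxml/parser.h> of a recent libxml2 release; check your installed headers,
as details can vary between versions:

    #include <libxml/parser.h>

    /*
     * No "options" argument, so XML_PARSE_HUGE cannot be passed and the
     * default hard-coded limits (such as the huge-text-node check) apply.
     */
    int xmlParseBalancedChunkMemory(xmlDocPtr doc, xmlSAXHandlerPtr sax,
                                    void *user_data, int depth,
                                    const xmlChar *string, xmlNodePtr *lst);

    /*
     * Takes an "options" bitmask of xmlParserOption flags, so callers can
     * supply XML_PARSE_HUGE here.
     */
    xmlDocPtr xmlReadMemory(const char *buffer, int size, const char *URL,
                            const char *encoding, int options);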
PG Bug reporting form <noreply@postgresql.org> writes:
> The problem is in the libxml2 library (in xmlParseBalancedChunkMemory
> function), which is used in PostgreSQL and does not support the
> XML_PARSE_HUGE flag.
> There have been attempts to correct this problem [1].
> Apparently they were unsuccessful because libxml2 technical support refused
> to fix the xmlParseBalancedChunkMemory function.
> I'd like to know what the community's opinion is regarding this error:
> 1) the error is correct and does not need to be corrected;
> 2) corrections should be made in the libxml2 library;
> 3) corrections should be made in PostgreSQL (maybe need to stop using the
> xmlParseBalancedChunkMemory function or make other corrections);
> 4) ...?

libxml2 seems to be a little moribund, so I'm not sure that waiting for
#2 to happen will yield results.  On the other hand, XML is pretty much
a development backwater in Postgres too, so if you are hoping for
somebody else to do #3 you are unlikely to get anywhere.  If you want
to work on #3 yourself, have at it.

			regards, tom lane
Hi!

Attached a patch that adds the use of XML_PARSE_HUGE flag for libxml2
functions and replaces some functions (that do not support this flag) with
their equivalents.

Using libxml2 library functions with support of XML_PARSE_HUGE flag
increases maximum size allowed for a single text node from 10.000.000 to
1.000.000.000 (see XML_MAX_TEXT_LENGTH macro,
libxml2/include/libxml/parserInternals.h) which in most cases solves the
problem with insufficient memory.

What do you think about the patch?

Maybe it would be a good idea to add a GUC-variable for using of the
XML_PARSE_HUGE flag? (The current behavior without XML_PARSE_HUGE flag is
default).

--
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com
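For reference, the limits mentioned above are plain macros in libxml2's
parserInternals.h. A sketch of the relevant definitions, with values as they
appear in recent libxml2 releases (older versions may differ, and the second
macro may not exist in them at all):

    /* Paraphrased from libxml2/include/libxml/parserInternals.h */

    /* Maximum size allowed for a single text node when building a tree;
     * this is the limit that XML_PARSE_HUGE overrides. */
    #define XML_MAX_TEXT_LENGTH 10000000

    /* Maximum size allowed for a single text node when XML_PARSE_HUGE
     * is set. */
    #define XML_MAX_HUGE_LENGTH 1000000000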
Dmitry Koval <d.koval@postgrespro.ru> writes:
> Attached a patch that adds the use of XML_PARSE_HUGE flag for libxml2
> functions and replaces some functions (that do not support this flag)
> with their equivalents.

Thanks for doing this work!  Please add an entry to the upcoming
commitfest[1] to ensure we don't forget about it.

> Maybe it would be a good idea to add a GUC-variable for using of the
> XML_PARSE_HUGE flag?

That strikes me as a completely awful idea.  Is there some downside
to XML_PARSE_HUGE?

			regards, tom lane

[1] https://commitfest.postgresql.org/47/
Thanks!

> Thanks for doing this work!  Please add an entry to the upcoming
> commitfest[1] to ensure we don't forget about it.

Added.

> Is there some downside to XML_PARSE_HUGE?

I didn't see any problems during simple testing of patch with the
XML_PARSE_HUGE.
Extended testing will be performed soon.
Then (I hope) we will send a trial version of PostgreSQL with a patch to
customers who use XML.

--
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com
On Sun, Jan 14, 2024 at 11:58:07PM +0300, Dmitry Koval wrote:
>> Is there some downside to XML_PARSE_HUGE?
>
> I didn't see any problems during simple testing of patch with the
> XML_PARSE_HUGE.
> Extended testing will be performed soon.
> Then (I hope) we will send a trial version of PostgreSQL with a patch to
> customers who use XML.

If one looks at the libxml2 like this mirror at [1], it is possible to
see that the flag is only used to lift internal hard limits, for stuff
like XML_MAX_TEXT_LENGTH and XML_MAX_NAME_LENGTH for size control, or
max node depths.  Knowing that we have full control of the memory
contexts for the XML nodes, just enforcing the huge flag does not seem
like there's any downside for us.  (Right?)

[1]: https://github.com/GNOME/libxml2
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
>>> Is there some downside to XML_PARSE_HUGE?

> If one looks at the libxml2 like this mirror at [1], it is possible to
> see that the flag is only used to lift internal hard limits, for stuff
> like XML_MAX_TEXT_LENGTH and XML_MAX_NAME_LENGTH for size control, or
> max node depths.

I dug through those sources and I concur that mostly, setting that flag
just results in replacing arbitrary hard-coded limits with higher
arbitrary hard-coded limits.  The one place I found where it looks like
the clamps come off entirely is that the XML_MAX_DICTIONARY_LIMIT on the
number of entries in a dictionary is replaced by "unlimited".  Given
that in our usage the input string will be limited to 1GB, the number of
entries you could possibly create is still pretty finite.

> Knowing that we have full control of the memory contexts for the XML
> nodes, just enforcing the huge flag does not seem like there's any
> downside for us.  (Right?)

Blowing out a backend's memory or CPU consumption is not something
we try hard to prevent, so I'm not terribly worried on that score.
The one thing I'm concerned about is that raising these limits could
make bugs (like integer overflow problems) reachable that were not
otherwise, and that such bugs might rise to the level of security
problems.  They've had such issues before (CVE-2022-40303) and it'd be
foolish to be sure that none remain.  Still, that's clearly their bug
not our bug.

On the whole I'm not too worried, and even if I were, I doubt that an
enabling GUC would be the answer.  We'd have to make it SUSET and
default to off for it to be a credible security defense, and that seems
like an excessive amount of paranoia.  Besides, I believe that
downstream packagers who don't trust libxml2 are already just building
PG without XML support.

			regards, tom lane
On Sun, Jan 14, 2024 at 10:16:33PM -0500, Tom Lane wrote:
> Blowing out a backend's memory or CPU consumption is not something
> we try hard to prevent, so I'm not terribly worried on that score.
> The one thing I'm concerned about is that raising these limits could
> make bugs (like integer overflow problems) reachable that were not
> otherwise, and that such bugs might rise to the level of security
> problems.  They've had such issues before (CVE-2022-40303) and it'd be
> foolish to be sure that none remain.  Still, that's clearly their bug
> not our bug.

Interesting.  We could always keep our coding more defensive, not sure
entirely how.  I am not sure that this is enough to not just use the
upper limit, though.  Being able to manipulate larger XML elements
sounds like a fair argument from the user perspective these days,
especially with memory being cheaper and larger.  1fb2e0dfc631 has added
the huge option back in 2009 in libxml2, so it's been around for some
time.
--
Michael
Hi!

> Knowing that we have full control of the memory contexts for the XML
> nodes, just enforcing the huge flag does not seem like there's any
> downside for us.  (Right?)

I think that's right (the XML_PARSE_HUGE flag shouldn't cause any
problems).

My main doubts are related to the replacement of the
xmlParseBalancedChunkMemory() function (which has no argument for passing
the XML_PARSE_HUGE flag) with the xmlNewNode() + xmlParseInNodeContext()
functions (create a fake root node and pass the XML_PARSE_HUGE flag to
xmlParseInNodeContext()).

I'm not sure if this replacement is 100% equivalent (although simple tests
work the same).

--
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com
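A minimal, self-contained sketch of the substitution being discussed may help
readers follow along. This is not the patch itself: the node name is a
placeholder, and the error handling and encoding setup that PostgreSQL's
xml.c would need are omitted.

    #include <stdio.h>
    #include <string.h>
    #include <libxml/parser.h>
    #include <libxml/tree.h>

    int
    main(void)
    {
        const char     *chunk = "<a>hello</a><b>world</b>";
        xmlDocPtr       doc;
        xmlNodePtr      fakeroot;
        xmlNodePtr      list = NULL;
        xmlParserErrors res;

        /*
         * xmlParseInNodeContext() needs a context node that belongs to a
         * document, so create a throwaway document with a fake root node.
         */
        doc = xmlNewDoc((const xmlChar *) "1.0");
        fakeroot = xmlNewNode(NULL, (const xmlChar *) "fake-root");
        xmlDocSetRootElement(doc, fakeroot);

        /*
         * Unlike xmlParseBalancedChunkMemory(), this entry point accepts an
         * options bitmask, so XML_PARSE_HUGE can be passed.  The parsed
         * nodes come back in "list"; they are not attached to the document.
         */
        res = xmlParseInNodeContext(fakeroot, chunk, (int) strlen(chunk),
                                    XML_PARSE_HUGE, &list);
        printf("result=%d, first node=%s\n", (int) res,
               list ? (const char *) list->name : "(none)");

        xmlFreeNodeList(list);
        xmlFreeDoc(doc);            /* also frees the fake root */
        return 0;
    }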
On Mon, Jan 15, 2024 at 06:47:17PM +0300, Dmitry Koval wrote:
> I think that's right (the XML_PARSE_HUGE flag shouldn't cause any
> problems).
> My main doubts are related to the replacement of the
> xmlParseBalancedChunkMemory() function (which has no argument for passing
> the XML_PARSE_HUGE flag) with the xmlNewNode() + xmlParseInNodeContext()
> functions (create a fake root node and pass the XML_PARSE_HUGE flag to
> xmlParseInNodeContext()).
>
> I'm not sure if this replacement is 100% equivalent (although simple tests
> work the same).

Hmm, it looks like this is actually equivalent in terms of parsing a
well-balanced chunk.  This was suggested in the upstream ticket you have
opened and I find that pretty cool, reusing the trick of a fake root
node to use the other API.  Now, there are two things that we'd better
do here:
- Document in a comment why a fake root node is necessary (aka the
  current routines don't give enough control over the limits you'd
  like to enforce).
- The top comment of xml_parse() still mentions
  xmlParseBalancedChunkMemory() as an effect of 483bdb2afec9, so this
  needs to be updated.

The switch xmlParseMemory() -> xmlReadMemory() is recommended by the
upstream docs and the former is deprecated:
https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-parser.html#xmlParseMemory

Also, it may be worth double-checking if there are other things marked
as deprecated in the upstream doc, and switch to "newer" things.

It seems like anything discussed here should only be done on HEAD.  I
suspect that the buildfarm may get bumpy on that, but let's see.
--
Michael
Hi!

Thanks for the recommendations.

> - Document in a comment why a fake root node is necessary (aka the
>   current routines don't give enough control over the limits you'd
>   like to enforce).

Comment expanded (see src/backend/utils/adt/xml.c).

> - The top comment of xml_parse() still mentions
>   xmlParseBalancedChunkMemory() as an effect of 483bdb2afec9, so this
>   needs to be updated.

Updated.

> Also, it may be worth double-checking if there are other things marked
> as deprecated in the upstream doc, and switch to "newer" things.

I found another deprecated function, xmlSubstituteEntitiesDefault() [1], in
contrib/xml2.  I removed xmlSubstituteEntitiesDefault() and added the
XML_PARSE_NOENT flag to the xmlReadMemory() calls instead.

[1] https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-parser.html#xmlSubstituteEntitiesDefault

--
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com
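A rough sketch of the shape of that contrib/xml2 change, for illustration
only (the "before" pattern is an assumption about the old code, not the
actual diff, and the real sources wire in their own buffers and error
handlers):

    #include <libxml/parser.h>

    /* Old pattern: a deprecated, process-global entity-substitution knob. */
    static xmlDocPtr
    parse_old(const char *buf, int len)
    {
        xmlSubstituteEntitiesDefault(1);    /* deprecated */
        return xmlParseMemory(buf, len);    /* deprecated */
    }

    /* New pattern: request entity substitution per call via an option. */
    static xmlDocPtr
    parse_new(const char *buf, int len)
    {
        return xmlReadMemory(buf, len, NULL, NULL, XML_PARSE_NOENT);
    }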
On Tue, Jan 16, 2024 at 04:55:34PM +0300, Dmitry Koval wrote:
>> - Document in a comment why a fake root node is necessary (aka the
>>   current routines don't give enough control over the limits you'd
>>   like to enforce).
>
> Comment expanded (see src/backend/utils/adt/xml.c).
>
>> - The top comment of xml_parse() still mentions
>>   xmlParseBalancedChunkMemory() as an effect of 483bdb2afec9, so this
>>   needs to be updated.
>
> Updated.

I still need to evaluate a bit more this part, though that looks fine.

>> Also, it may be worth double-checking if there are other things marked
>> as deprecated in the upstream doc, and switch to "newer" things.
>
> I found another deprecated function, xmlSubstituteEntitiesDefault() [1],
> in contrib/xml2.  I removed xmlSubstituteEntitiesDefault() and added the
> XML_PARSE_NOENT flag to the xmlReadMemory() calls instead.

This one had better be done first because it is required by your
original issue, and that's what could make the buildfarm shaky.

I have checked the other XML calls in the tree and did not spot anything
else that ought to be changed, so I have extracted this stuff from your
v2 and applied it on HEAD.  Let's see how it goes.
--
Michael
On Wed, Jan 17, 2024 at 08:59:26AM +0900, Michael Paquier wrote:
> This one had better be done first because it is required by your
> original issue, and that's what could make the buildfarm shaky.
>
> I have checked the other XML calls in the tree and did not spot
> anything else that ought to be changed, so I have extracted this stuff
> from your v2 and applied it on HEAD.  Let's see how it goes.

The security team has discussed 2197d06224a1 after a report from
coverity regarding the effects that issues like [1] would create in the
backend, and concluded that this patch should be reverted because this
could cause the backend to waste plenty of CPU and/or memory even if the
application applied checks on the size of the data given in input, and
libxml2 does not offer guarantees that input limits are respected under
XML_PARSE_HUGE.  So I am planning to do so in the next 24 hours.

Note that this does not impact 65c5864d7fac, as XML_PARSE_NOENT is an
immediate substitute of xmlSubstituteEntitiesDefault().

[1]: https://en.wikipedia.org/wiki/Billion_laughs_attack
--
Michael
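For readers unfamiliar with [1], a sketch of what such a document looks like,
written as a C string literal to stay consistent with the other examples in
this thread (truncated to three entity levels; a real attack nests ten or
more, and the "lol" entity names are just the conventional ones from the
reference):

    /*
     * Each entity expands to ten copies of the previous one, so a few
     * kilobytes of input can balloon into gigabytes of expanded text.
     * Whether libxml2 stops the expansion depends on its version and on
     * the parser options in effect, which is exactly the concern with
     * XML_PARSE_HUGE described above.
     */
    static const char billion_laughs_xml[] =
        "<?xml version=\"1.0\"?>\n"
        "<!DOCTYPE lolz [\n"
        " <!ENTITY lol \"lol\">\n"
        " <!ENTITY lol1 \"&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;\">\n"
        " <!ENTITY lol2 \"&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;\">\n"
        "]>\n"
        "<lolz>&lol2;</lolz>\n";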
> The security team has discussed 2197d06224a1 after a report from
> coverity regarding the effects that issues like [1] would create in
> the backend, and concluded that this patch should be reverted because
> this could cause the backend to waste plenty of CPU and/or memory
> even if the application applied checks on the size of the data given
> in input, and libxml2 does not offer guarantees that input limits are
> respected under XML_PARSE_HUGE.

Thanks for the info!

I agree that reverting a patch is a good idea if there are concerns about
server resources (XML is used by few users and there are even fewer users
who need to parse elements larger than 10Mb).
For such users it is better to create custom PostgreSQL build.

--
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com
On Thu, Jan 25, 2024 at 03:12:07PM +0300, Dmitry Koval wrote:
> I agree that reverting a patch is a good idea if there are concerns about
> server resources (XML is used by few users and there are even fewer users
> who need to parse elements larger than 10Mb).
> For such users it is better to create custom PostgreSQL build.

And done with f2743a7d70e7.
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
> On Thu, Jan 25, 2024 at 03:12:07PM +0300, Dmitry Koval wrote:
>> I agree that reverting a patch is a good idea if there are concerns about
>> server resources (XML is used by few users and there are even fewer users
>> who need to parse elements larger than 10Mb).
>> For such users it is better to create custom PostgreSQL build.

> And done with f2743a7d70e7.

Related to this: I just read some interesting things in libxml2 2.12's
release notes:

    Most of the known issues leading to quadratic behavior in the XML
    parser were fixed.  Internal hash tables were rewritten to reduce
    memory consumption.

    A new API function xmlCtxtSetMaxAmplification was added to allow
    parsing of files that would otherwise trigger the billion laughs
    protection.

Could it be that if we see this new function is available and use it,
we could allow more than we have done historically?  I don't have a
whole lot of faith here, but perhaps this is worth investigation.

(BTW, 2.12 has created some annoying API breaks, which seems to be why
caiman is failing.  We have some work to do there in any case.)

			regards, tom lane
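If someone does look into it, a hypothetical sketch of how the new call could
be combined with XML_PARSE_HUGE is below. The function names are the
libxml2 >= 2.12 API as described in the upstream release notes, but the
amplification factor of 100 is an arbitrary placeholder, and none of this has
been tried against PostgreSQL's xml.c:

    #include <libxml/parser.h>

    /*
     * Idea (untested): lift the size limits with XML_PARSE_HUGE while still
     * capping entity amplification, so billion-laughs-style inputs do not
     * get a free pass.  Requires libxml2 2.12 or newer.
     */
    static xmlDocPtr
    read_doc_with_amplification_cap(const char *buf, int len)
    {
        xmlParserCtxtPtr ctxt = xmlNewParserCtxt();
        xmlDocPtr        doc;

        if (ctxt == NULL)
            return NULL;

        /* Allow expansion up to 100x the input size (placeholder value). */
        xmlCtxtSetMaxAmplification(ctxt, 100);

        doc = xmlCtxtReadMemory(ctxt, buf, len, NULL, NULL, XML_PARSE_HUGE);
        xmlFreeParserCtxt(ctxt);
        return doc;
    }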