Thread: generating a large XML document
Hello, postgresql guru's When developing ticket export mechanism for our ticketing system I thought the wise thing would be to generate the result (XML file) in stored function. This worked fine for small datasets, but for the larger ones I see it takes much longer to generate the document than to select the data needed for it. Here are several "explain alalyze"s : explain analyze SELECT * from tex.fnk_access_control_tickets(8560, 0); "Function Scan on fnk_access_control_tickets (cost=0.25..10.25 rows=1000 width=246) (actual time=479.276..484.429 rows=16292 loops=1)" "Total runtime: 486.774 ms" explain analyze SELECT --xmlagg( XMLELEMENT ( NAME "bar", XMLELEMENT ( NAME "code", tick_barcode), XMLELEMENT ( NAME "stat", status), CASE WHEN type IS NOT NULL THEN XMLELEMENT ( NAME "tip", type) ELSE NULL END, CASE WHEN sec_name IS NOT NULL THEN XMLELEMENT ( NAME "sec", sec_name) ELSE NULL END, CASE WHEN row_name IS NOT NULL THEN XMLELEMENT ( NAME "row", row_name) ELSE NULL END, CASE WHEN seat_name IS NOT NULL THEN XMLELEMENT ( NAME "plc", seat_name) ELSE NULL END, CASE WHEN substr(tick_barcode,length(tick_barcode),1)= '1' THEN XMLELEMENT ( NAME "groups", XMLELEMENT ( NAME "group", 1) ) ELSE NULL END ) -- ) FROM tex.fnk_access_control_tickets(8560, 0); "Function Scan on fnk_access_control_tickets (cost=0.25..17.75 rows=1000 width=238) (actual time=476.446..924.785 rows=16292 loops=1)" "Total runtime: 928.768 ms" explain analyze SELECT xmlagg( XMLELEMENT ( NAME "bar", XMLELEMENT ( NAME "code", tick_barcode), XMLELEMENT ( NAME "stat", status), CASE WHEN type IS NOT NULL THEN XMLELEMENT ( NAME "tip", type) ELSE NULL END, CASE WHEN sec_name IS NOT NULL THEN XMLELEMENT ( NAME "sec", sec_name) ELSE NULL END, CASE WHEN row_name IS NOT NULL THEN XMLELEMENT ( NAME "row", row_name) ELSE NULL END, CASE WHEN seat_name IS NOT NULL THEN XMLELEMENT ( NAME "plc", seat_name) ELSE NULL END, CASE WHEN substr(tick_barcode,length(tick_barcode),1)= '1' THEN XMLELEMENT ( NAME "groups", XMLELEMENT ( NAME "group", 1) ) ELSE NULL END ) ) FROM tex.fnk_access_control_tickets(8560, 0); "Aggregate (cost=12.75..12.77 rows=1 width=238) (actual time=16110.847..16110.848 rows=1 loops=1)" " -> Function Scan on fnk_access_control_tickets (cost=0.25..10.25 rows=1000 width=238) (actual time=500.029..520.974 rows=16292 loops=1)" "Total runtime: 16111.264 ms" It seems the aggregate combining the elements is to blame... What I did next was rewriting it in stored function using FOR loop and combining it in a text variable. Sadly the result was the same (approximately 16 seconds). From that I had to conclude that text := text + some_text operation is an expensive one, but now I have no ideas left how to solve the problem. Any notices, recommendations, advices are very welcome. I've also tried google'ing on XML creation in Postgresql, but found no warnings or even mentioning xmlagg could cause a headache. I have nowhere to turn for help now, so please advice... Not sure if that will help, but anyway: Server: PostgreSQL 9.0.3 on i486-pc-linux-gnu, compiled by GCC gcc-4.4.real (Debian 4.4.5-10) 4.4.5, 32-bit see the result of select * from pg_settings in attachment if needed. Client: Windows XP, pgAdmin 1.12.3 Thank you in advance. -- Julius Tuskenis Programavimo skyriaus vadovas UAB nSoft mob. +37068233050
Attachment
Hello, I'm sorry to write again, but as I received no answer I wonder if there is a better mailing list to address concerning this question? Or is there nothing to be done about the speed of xmlagg ?. Please let me as no answer is the worst answer to get.... -- Julius Tuskenis Programavimo skyriaus vadovas UAB nSoft mob. +37068233050
Hello 2011/6/20 Julius Tuskenis <julius@nsoft.lt>: > Hello, > > I'm sorry to write again, but as I received no answer I wonder if there is a > better mailing list to address concerning this question? Or is there nothing > to be done about the speed of xmlagg ?. Please let me as no answer is the > worst answer to get.... > It's hard to say where is problem - PostgreSQL wraps libxml2 library for xml functionality, so problem can be a) inside libxml2 b) on interface between libxml2 and PostgreSQL c) on PostgreSQL memory management can you send a profile? Regards Pavel Stehule > -- > Julius Tuskenis > Programavimo skyriaus vadovas > UAB nSoft > mob. +37068233050 > > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance >
On Sun, Jun 19, 2011 at 11:36 PM, Julius Tuskenis <julius@nsoft.lt> wrote:
Hello,
I'm sorry to write again, but as I received no answer I wonder if there is a better mailing list to address concerning this question? Or is there nothing to be done about the speed of xmlagg ?. Please let me as no answer is the worst answer to get....
I played around a little after your initial post, but I'm not a postgresql developer and I have no familiarity with the internals, so everything I'm about to write is pure conjecture.
Programming languages that treat strings as immutable often suffer from this kind of problem. WIth each string concatenation, both strings have to be copied to a new location in memory. I suspect that this is happening in this case. The only viable fix would be to use a buffer that is mutable and append into it rather than doing raw string concatenation. If this is truly the problem, I don't see that you have much choice but to re-implement xmlagg in one of the available languages such that it uses a buffer instead of immutable string concatenation. It is probably that the xmlelement function doesn't need to be rewritten, since it is only concatenating relatively short strings. It is less efficient than appending to a buffer, but shouldn't get catastrophically slow. But xmlagg is concatenating many rows. If each concatenation results in a full copy of the already concatenated rows, you can see where performance would drop catastrophically.
Here's the first google result for 'mutable string python' that I found, which addresses this problem in python. http://www.skymind.com/~ocrow/python_string/ You could rewrite the aggregate function in plpython using one of the techniques in that file. I just attempted to find the source to xm_agg in the postgresql source code and it is pretty well obfuscated, so I don't see that being much help. I wasn't even able to determine if the problem actually is immutable string concatenation. So we don't know if xmlagg is building a DOM tree and then serializing it once (which would imply that XMLELEMENT returns a single DOM node, or if it is all workign with strings. Barring answers from someone who actually knows, I can only suggest that you read through the documentation on writing an aggregate function and then do some experimentation to see what you get when you use your own aggregate function instead of xml_agg. Hopefully, such experimentation won't take long to determine if re-implementing xml_agg with a mutable buffer is a viable option.
--sam
Thank you, Pavel for your answer 2011.06.20 09:51, Pavel Stehule rašė: > can you send a profile? Excuse me, but what do you mean by saying "profile"? I've sent content of pg_settings in the first post. Please be more specific as I am more of a programmer than an server administrator. -- Julius Tuskenis Programavimo skyriaus vadovas UAB nSoft mob. +37068233050
Thank You, Samuel for the time it took to investigate the issue. I'll try to use buffer to see what the results are... I'll post results to the list if I succeed. -- Julius Tuskenis Programavimo skyriaus vadovas UAB nSoft mob. +37068233050
2011/6/20 Julius Tuskenis <julius@nsoft.lt>: > Thank you, Pavel for your answer > > 2011.06.20 09:51, Pavel Stehule rašė: >> >> can you send a profile? > > Excuse me, but what do you mean by saying "profile"? I've sent content of > pg_settings in the first post. Please be more specific as I am more of a > programmer than an server administrator. > a result from oprofile profiler Regards Pavel > -- > Julius Tuskenis > Programavimo skyriaus vadovas > UAB nSoft > mob. +37068233050 > > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance >
2011/6/20 Pavel Stehule <pavel.stehule@gmail.com>: > 2011/6/20 Julius Tuskenis <julius@nsoft.lt>: >> Thank you, Pavel for your answer >> >> 2011.06.20 09:51, Pavel Stehule rašė: >>> >>> can you send a profile? >> >> Excuse me, but what do you mean by saying "profile"? I've sent content of >> pg_settings in the first post. Please be more specific as I am more of a >> programmer than an server administrator. >> > > a result from oprofile profiler > I looked into code - probably a implementation of xmlagg is too silly xmlagg use a xmlconcat functions - that means repeated xml parsing and xmlserialization. So it is not effective on larger trees :( string_agg is more effective now. The solution is only radical refactoring of xmlagg function. Pavel > Regards > > Pavel > >> -- >> Julius Tuskenis >> Programavimo skyriaus vadovas >> UAB nSoft >> mob. +37068233050 >> >> >> -- >> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-performance >> >
2011.06.20 10:58, Pavel Stehule rašė: > string_agg is more effective now. The solution is only radical > refactoring of xmlagg function. Thank you, Pavel for letting me know of string_agg. explain analyze SELECT string_agg( XMLELEMENT ( NAME "bar", XMLELEMENT ( NAME "code", tick_barcode), XMLELEMENT ( NAME "stat", status), CASE WHEN type IS NOT NULL THEN XMLELEMENT ( NAME "tip", type) ELSE NULL END, CASE WHEN sec_name IS NOT NULL THEN XMLELEMENT ( NAME "sec", sec_name) ELSE NULL END, CASE WHEN row_name IS NOT NULL THEN XMLELEMENT ( NAME "row", row_name) ELSE NULL END, CASE WHEN seat_name IS NOT NULL THEN XMLELEMENT ( NAME "plc", seat_name) ELSE NULL END, CASE WHEN substr(tick_barcode,length(tick_barcode),1)= '1' THEN XMLELEMENT ( NAME "groups", XMLELEMENT ( NAME "group", 1) ) ELSE NULL END )::text, NULL )::xml FROM tex.fnk_access_control_tickets(8560, 0); "Aggregate (cost=12.75..12.77 rows=1 width=238) (actual time=1025.502..1025.502 rows=1 loops=1)" " -> Function Scan on fnk_access_control_tickets (cost=0.25..10.25 rows=1000 width=238) (actual time=495.703..503.999 rows=16292 loops=1)" "Total runtime: 1036.775 ms" Its over 10 times faster than using xmlagg. -- Julius Tuskenis Programavimo skyriaus vadovas UAB nSoft mob. +37068233050