Thread: Speeding up loops in pl/pgsql function
Hi, everyone. I'm working on a project that's using PostgreSQL 8.3, that requires me to translate strings of octal digits into strings of characters -- so '141142143' should become 'abc', although the database column containing this data (both before and after) is a bytea. While the function I've written is accurate, it turns out that it's also ridiculously slow. I've managed to speed it up a fair amount, to twice what it was previously doing, by folding a helper function into a main one, and appending to an array (which I then join into a string at the end of the function) instead of concatenating a string onto itself time after time. I realize that pl/pgsql is not a good choice for doing this sort of task, and that another language -- say, one with direct support for octal digits, or with built-in, speedy array functions such as pop() and push() -- would be a better choice. But that's not an option at this point. I should also note that I'm not manipulating a huge amount of data here. We're talking about 300 or so rows, each of which contains about 250 KB of data. (Hmm, could the problem be that I'm constantly forcing the system to compress and uncompress the data in TOAST? I hadn't thought of that until just now...) I thus have two basic questions: (1) Are there any good guidelines for what operations in pl/pgsql are optimized for which data structures? For example, it turns out that a great deal of time is being spent in the substring() function, which surprised me. I thought that by switching to an array, it might be faster, but that wasn't the case, at least in my tests. Having a sense of what I should and shouldn't be trying, and which built-in functions are particularly fast or slow, would be useful to know. (2) Is there any configuration setting that would (perhaps) speed things up a bit? I thought that maybe work_mem would help, but the documentation didn't indicate this at all, and sure enough, nothing really changed when I increased it. Of course, any suggestions for how to deal with octal digits in PostgreSQL 8.3, such as an octal equivalent to the x'ff' syntax, would be more than welcome. Thanks in advance, Reuven -- Reuven M. Lerner -- Web development, consulting, and training Mobile: +972-54-496-8405 * US phone: 847-230-9795 Skype/AIM: reuvenlerner
Hi, Alex. You wrote: > Have you tried something like: > SELECT encode(regexp_replace('141142143', '(\d{3})', '\\\1', > 'g')::bytea, 'escape'); Hmm, forgot about regexp_replace. It might do the trick, but without a full-blown eval that I can run on the replacement side, it'll be a bit more challenging. But that's a good direction to consider, for sure. > I think select E'\XXX' is what you are looking for (per the fine > manual: http://www.postgresql.org/docs/current/static/datatype-binary.html) I didn't think that I could (easily) build a string like that from digits in a variable or a column, but I'll poke around and see if it can work. Thanks, Reuven -- Reuven M. Lerner -- Web development, consulting, and training Mobile: +972-54-496-8405 * US phone: 847-230-9795 Skype/AIM: reuvenlerner
On Wed, May 25, 2011 at 12:45, Reuven M. Lerner <reuven@lerner.co.il> wrote: > Hi, Alex. You wrote: >> I think select E'\XXX' is what you are looking for (per the fine >> manual: >> http://www.postgresql.org/docs/current/static/datatype-binary.html) > > I didn't think that I could (easily) build a string like that from digits in > a variable or a column, but I'll poke around and see if it can work. Well, if you build '\XXX' you can call escape(..., 'escape') on it like I did with the regex above.
Hello > > (1) Are there any good guidelines for what operations in pl/pgsql are > optimized for which data structures? For example, it turns out that a great > deal of time is being spent in the substring() function, which surprised me. > I thought that by switching to an array, it might be faster, but that > wasn't the case, at least in my tests. Having a sense of what I should and > shouldn't be trying, and which built-in functions are particularly fast or > slow, would be useful to know. > PL/pgSQL is perfect like glue for SQL. For all other isn't good http://okbob.blogspot.com/2010/04/frequent-mistakes-in-plpgsql-design.html http://www.pgsql.cz/index.php/PL/pgSQL_%28en%29#When_PL.2FpgSQL_is_not_applicable > > (2) Is there any configuration setting that would (perhaps) speed things up > a bit? I thought that maybe work_mem would help, but the documentation > didn't indicate this at all, and sure enough, nothing really changed when I > increased it. > > probably not Just PL/pgSQL is not C, and you cannot do some heavy string or array operations. Regards Pavel Stehule
On Wed, May 25, 2011 at 10:59, Reuven M. Lerner <reuven@lerner.co.il> wrote: > Hi, everyone. I'm working on a project that's using PostgreSQL 8.3, that > requires me to translate strings of octal digits into strings of characters > -- so '141142143' should become 'abc', although the database column > containing this data (both before and after) is a bytea. Have you tried something like: SELECT encode(regexp_replace('141142143', '(\d{3})', '\\\1', 'g')::bytea, 'escape'); > ... > Of course, any suggestions for how to deal with octal digits in PostgreSQL > 8.3, such as an octal equivalent to the x'ff' syntax, would be more than > welcome. I think select E'\XXX' is what you are looking for (per the fine manual: http://www.postgresql.org/docs/current/static/datatype-binary.html)
On Wed, May 25, 2011 at 11:59 AM, Reuven M. Lerner <reuven@lerner.co.il> wrote: > Hi, everyone. I'm working on a project that's using PostgreSQL 8.3, that > requires me to translate strings of octal digits into strings of characters > -- so '141142143' should become 'abc', although the database column > containing this data (both before and after) is a bytea. > > > While the function I've written is accurate, it turns out that it's also > ridiculously slow. I've managed to speed it up a fair amount, to twice what > it was previously doing, by folding a helper function into a main one, and > appending to an array (which I then join into a string at the end of the > function) instead of concatenating a string onto itself time after time. > > > I realize that pl/pgsql is not a good choice for doing this sort of task, > and that another language -- say, one with direct support for octal digits, > or with built-in, speedy array functions such as pop() and push() -- would > be a better choice. But that's not an option at this point. > > > I should also note that I'm not manipulating a huge amount of data here. > We're talking about 300 or so rows, each of which contains about 250 KB of > data. (Hmm, could the problem be that I'm constantly forcing the system to > compress and uncompress the data in TOAST? I hadn't thought of that until > just now...) > > > I thus have two basic questions: > > > (1) Are there any good guidelines for what operations in pl/pgsql are > optimized for which data structures? For example, it turns out that a great > deal of time is being spent in the substring() function, which surprised me. > I thought that by switching to an array, it might be faster, but that > wasn't the case, at least in my tests. Having a sense of what I should and > shouldn't be trying, and which built-in functions are particularly fast or > slow, would be useful to know. > > > (2) Is there any configuration setting that would (perhaps) speed things up > a bit? I thought that maybe work_mem would help, but the documentation > didn't indicate this at all, and sure enough, nothing really changed when I > increased it. > > > Of course, any suggestions for how to deal with octal digits in PostgreSQL > 8.3, such as an octal equivalent to the x'ff' syntax, would be more than > welcome. let's see the source. I bet we can get this figured out. merlin
Hi, everyone. Merlin wrote: > let's see the source. I bet we can get this figured out. Here you go... it looked nicer before I started to make optimizations; I've gotten it to run about 2x as fast as the previous version, but now I'm sorta stuck, looking for further optimizations, including possible use of builtin functions. Thanks for any suggestions you can offer. CREATE OR REPLACE FUNCTION translate_octals_into_decimals(bytea_string BYTEA) RETURNS BYTEA AS $$ DECLARE bytea_string_length INTEGER := length(bytea_string); current_substring TEXT := ''; translated_string_array BYTEA[]; output_number INTEGER := 0; output_number_text TEXT := ''; current_digit TEXT := ''; BEGIN RAISE NOTICE '[translate_octals_into_decimals] start at %, string of length %', clock_timestamp(), pg_size_pretty(length(bytea_string)); FOR i IN 1..length(bytea_string) BY 3 LOOP current_substring := substring(bytea_string from i for 3); output_number := 0; FOR j IN 0..(length(current_substring) - 1) LOOP current_digit := substring(current_substring from (length(current_substring) - j) for 1); output_number := output_number + current_digit::integer * (8 ^ j); END LOOP; output_number_text = lpad(output_number::text, 3, '0'); IF output_number_text::int = 92 THEN translated_string_array := array_append(translated_string_array, E'\\\\'::bytea); ELSIF output_number_text::int = 0 THEN translated_string_array := array_append(translated_string_array, E'\\000'::bytea); ELSE translated_string_array := array_append( translated_string_array, chr(output_number_text::integer)::bytea ); END IF; END LOOP; RETURN array_to_string(translated_string_array, ''); END; $$ LANGUAGE 'plpgsql'; Reuven -- Reuven M. Lerner -- Web development, consulting, and training Mobile: +972-54-496-8405 * US phone: 847-230-9795 Skype/AIM: reuvenlerner
On 05/25/2011 11:45 AM, Reuven M. Lerner wrote: > Hi, Alex. You wrote: >> Have you tried something like: >> SELECT encode(regexp_replace('141142143', '(\d{3})', '\\\1', >> 'g')::bytea, 'escape'); > Hmm, forgot about regexp_replace. It might do the trick, but without > a full-blown eval that I can run on the replacement side, it'll be a > bit more challenging. But that's a good direction to consider, for sure. The function given didn't work exactly as written for me but it is on the right track. See if this works for you (input validation is left as an exercise for the reader...:)): create or replace function octal_string_to_text(someoctal text) returns text as $$ declare binstring text; begin execute 'select E''' || regexp_replace($1, E'(\\d{3})', E'\\\\\\1', 'g') || '''' into binstring; return binstring; end $$ language plpgsql; Cheers, Steve
On Wed, May 25, 2011 at 8:03 PM, Steve Crawford <scrawford@pinpointresearch.com> wrote: > On 05/25/2011 11:45 AM, Reuven M. Lerner wrote: >> >> Hi, Alex. You wrote: >>> >>> Have you tried something like: >>> SELECT encode(regexp_replace('141142143', '(\d{3})', '\\\1', >>> 'g')::bytea, 'escape'); >> >> Hmm, forgot about regexp_replace. It might do the trick, but without a >> full-blown eval that I can run on the replacement side, it'll be a bit more >> challenging. But that's a good direction to consider, for sure. > > The function given didn't work exactly as written for me but it is on the > right track. See if this works for you (input validation is left as an > exercise for the reader...:)): > > create or replace function octal_string_to_text(someoctal text) returns text > as $$ > declare > binstring text; > begin > execute 'select E''' || regexp_replace($1, E'(\\d{3})', E'\\\\\\1', 'g') > || '''' into binstring; > return binstring; > end > $$ language plpgsql; four points (minor suggestions btw): 1. if you are dealing with strings that have backslashes in them, don't escape, but dollar quote. Also try not to use dollar parameter notation if you can help it: ($1, E'(\\d{3})', E'\\\\\\1', 'g') -> (someoctal , $q$(\d{3})$q$, $q$\\\1$q$, 'g') this is particularly true with feeding strings to regex: that way you can use the same string pg as in various validators. 2. there is no need for execute here. execute 'select E''' || regexp_replace($1, E'(\\d{3})', E'\\\\\\1', 'g') becomes: binstring := 'E''' || regexp_replace($1, $q$(\d{3})$q$, $q$\\\1$q$, 'g') /* I *think* I got this right */ 3. if your function does not scribble on tables and has no or is not influenced by any side effects, mark it as IMMUTABLE. always. $$ language plpgsql IMMUTABLE; 4. since all we are doing is generating a variable, prefer sql function vs plpgsql. this is particularly true in pre 8.4 postgres (IIRC) where you can call the function much more flexibly (select func(); vs select * from func();) if that's the case. Putting it all together, create or replace function octal_string_to_text(someoctal text) returns text as $$ SELECT 'E''' || regexp_replace($1, $q$(\d{3})$q$, $q$\\\1$q$, 'g'); $$ sql immutable; Note I didn't actually check to see what your regex is donig (I'm assuming it's correct)... merlin
Dkloskxe Steve Crawford <scrawford@pinpointresearch.com> wrote: >On 05/25/2011 11:45 AM, Reuven M. Lerner wrote: >> Hi, Alex. You wrote: >>> Have you tried something like: >>> SELECT encode(regexp_replace('141142143', '(\d{3})', '\\\1', >>> 'g')::bytea, 'escape'); >> Hmm, forgot about regexp_replace. It might do the trick, but without >> a full-blown eval that I can run on the replacement side, it'll be a >> bit more challenging. But that's a good direction to consider, for sure. > >The function given didn't work exactly as written for me but it is on >the right track. See if this works for you (input validation is left as >an exercise for the reader...:)): > >create or replace function octal_string_to_text(someoctal text) returns >text as $$ >declare > binstring text; >begin > execute 'select E''' || regexp_replace($1, E'(\\d{3})', E'\\\\\\1', >'g') || '''' into binstring; >return binstring; >end >$$ language plpgsql; > >Cheers, >Steve > > >-- >Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) >To make changes to your subscription: >http://www.postgresql.org/mailpref/pgsql-performance
Thursday, May 26, 2011, 12:26:17 AM you wrote: > Here you go... it looked nicer before I started to make optimizations; > I've gotten it to run about 2x as fast as the previous version, but now > I'm sorta stuck, looking for further optimizations, including possible > use of builtin functions. I've got only a 9.0.4 to play with, and bytea's are passed as an hexadecimal string, so I resorted to writing the function with TEXT as parameters, but maybe the following helps a bit, avoiding a few IMHO useless string/int-operations: CREATE OR REPLACE FUNCTION translate_octals_into_decimals(bytea_string text) RETURNS text AS $$ DECLARE bytea_string_length INTEGER := length(bytea_string); translated_string_array BYTEA[]; output_number INTEGER := 0; num1 INTEGER; num2 INTEGER; num3 INTEGER; npos INTEGER; nlen INTEGER; BEGIN RAISE NOTICE '[translate_octals_into_decimals] start at %, string of length %', clock_timestamp(), pg_size_pretty(length(bytea_string)); npos := 1; FOR i IN 1..bytea_string_length BY 3 LOOP num1 := substring(bytea_string from i for 1); num2 := substring(bytea_string from i+1 for 1); num3 := substring(bytea_string from i+2 for 1); output_number := 64*num1 + 8*num2 + num3; IF output_number = 0 THEN translated_string_array[npos] := E'\\000'::bytea; ELSIF output_number = 92 THEN translated_string_array[npos] := E'\\\\'::bytea; ELSE translated_string_array[npos] := chr(output_number)::bytea; END IF; npos := npos+1; END LOOP; RETURN array_to_string(translated_string_array, ''); END; $$ LANGUAGE 'plpgsql'; -- Jochen Erwied | home: jochen@erwied.eu +49-208-38800-18, FAX: -19 Sauerbruchstr. 17 | work: joe@mbs-software.de +49-2151-7294-24, FAX: -50 D-45470 Muelheim | mobile: jochen.erwied@vodafone.de +49-173-5404164
On Wed, May 25, 2011 at 9:20 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Wed, May 25, 2011 at 8:03 PM, Steve Crawford > <scrawford@pinpointresearch.com> wrote: >> On 05/25/2011 11:45 AM, Reuven M. Lerner wrote: >>> >>> Hi, Alex. You wrote: >>>> >>>> Have you tried something like: >>>> SELECT encode(regexp_replace('141142143', '(\d{3})', '\\\1', >>>> 'g')::bytea, 'escape'); >>> >>> Hmm, forgot about regexp_replace. It might do the trick, but without a >>> full-blown eval that I can run on the replacement side, it'll be a bit more >>> challenging. But that's a good direction to consider, for sure. >> >> The function given didn't work exactly as written for me but it is on the >> right track. See if this works for you (input validation is left as an >> exercise for the reader...:)): >> >> create or replace function octal_string_to_text(someoctal text) returns text >> as $$ >> declare >> binstring text; >> begin >> execute 'select E''' || regexp_replace($1, E'(\\d{3})', E'\\\\\\1', 'g') >> || '''' into binstring; >> return binstring; >> end >> $$ language plpgsql; > > four points (minor suggestions btw): > 1. if you are dealing with strings that have backslashes in them, > don't escape, but dollar quote. Also try not to use dollar parameter > notation if you can help it: > ($1, E'(\\d{3})', E'\\\\\\1', 'g') -> (someoctal , $q$(\d{3})$q$, > $q$\\\1$q$, 'g') > > this is particularly true with feeding strings to regex: that way you > can use the same string pg as in various validators. > > 2. there is no need for execute here. > execute 'select E''' || regexp_replace($1, E'(\\d{3})', E'\\\\\\1', 'g') > becomes: > binstring := 'E''' || regexp_replace($1, $q$(\d{3})$q$, $q$\\\1$q$, > 'g') /* I *think* I got this right */ > > 3. if your function does not scribble on tables and has no or is not > influenced by any side effects, mark it as IMMUTABLE. always. > $$ language plpgsql IMMUTABLE; > > 4. since all we are doing is generating a variable, prefer sql > function vs plpgsql. this is particularly true in pre 8.4 postgres > (IIRC) where you can call the function much more flexibly (select > func(); vs select * from func();) if that's the case. Putting it all > together, > > create or replace function octal_string_to_text(someoctal text) > returns text as $$ > SELECT 'E''' || regexp_replace($1, $q$(\d{3})$q$, $q$\\\1$q$, 'g'); > $$ sql immutable; > > Note I didn't actually check to see what your regex is donig (I'm > assuming it's correct)... hm, I slept on this and had the vague unsettling feeling I had said something stupid -- and I did. Double +1 to you for being cleverer than me -- you are using 'execute' to eval the string back in to the string. Only plpgsql can do that, so point 4 is also moot. Still, the above points hold in principle, so if a way could be figured out to do this without execute, that would be nice. merlin
On Thu, May 26, 2011 at 8:11 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Wed, May 25, 2011 at 9:20 PM, Merlin Moncure <mmoncure@gmail.com> wrote: >> On Wed, May 25, 2011 at 8:03 PM, Steve Crawford >> <scrawford@pinpointresearch.com> wrote: >>> On 05/25/2011 11:45 AM, Reuven M. Lerner wrote: >>>> >>>> Hi, Alex. You wrote: >>>>> >>>>> Have you tried something like: >>>>> SELECT encode(regexp_replace('141142143', '(\d{3})', '\\\1', >>>>> 'g')::bytea, 'escape'); >>>> >>>> Hmm, forgot about regexp_replace. It might do the trick, but without a >>>> full-blown eval that I can run on the replacement side, it'll be a bit more >>>> challenging. But that's a good direction to consider, for sure. >>> >>> The function given didn't work exactly as written for me but it is on the >>> right track. See if this works for you (input validation is left as an >>> exercise for the reader...:)): >>> >>> create or replace function octal_string_to_text(someoctal text) returns text >>> as $$ >>> declare >>> binstring text; >>> begin >>> execute 'select E''' || regexp_replace($1, E'(\\d{3})', E'\\\\\\1', 'g') >>> || '''' into binstring; >>> return binstring; >>> end >>> $$ language plpgsql; >> >> four points (minor suggestions btw): >> 1. if you are dealing with strings that have backslashes in them, >> don't escape, but dollar quote. Also try not to use dollar parameter >> notation if you can help it: >> ($1, E'(\\d{3})', E'\\\\\\1', 'g') -> (someoctal , $q$(\d{3})$q$, >> $q$\\\1$q$, 'g') >> >> this is particularly true with feeding strings to regex: that way you >> can use the same string pg as in various validators. >> >> 2. there is no need for execute here. >> execute 'select E''' || regexp_replace($1, E'(\\d{3})', E'\\\\\\1', 'g') >> becomes: >> binstring := 'E''' || regexp_replace($1, $q$(\d{3})$q$, $q$\\\1$q$, >> 'g') /* I *think* I got this right */ >> >> 3. if your function does not scribble on tables and has no or is not >> influenced by any side effects, mark it as IMMUTABLE. always. >> $$ language plpgsql IMMUTABLE; >> >> 4. since all we are doing is generating a variable, prefer sql >> function vs plpgsql. this is particularly true in pre 8.4 postgres >> (IIRC) where you can call the function much more flexibly (select >> func(); vs select * from func();) if that's the case. Putting it all >> together, >> >> create or replace function octal_string_to_text(someoctal text) >> returns text as $$ >> SELECT 'E''' || regexp_replace($1, $q$(\d{3})$q$, $q$\\\1$q$, 'g'); >> $$ sql immutable; >> >> Note I didn't actually check to see what your regex is donig (I'm >> assuming it's correct)... > > hm, I slept on this and had the vague unsettling feeling I had said > something stupid -- and I did. Double +1 to you for being cleverer > than me -- you are using 'execute' to eval the string back in to the > string. Only plpgsql can do that, so point 4 is also moot. Still, > the above points hold in principle, so if a way could be figured out > to do this without execute, that would be nice. got it: select decode(regexp_replace('141142143', '([0-9][0-9][0-9])', $q$\\\1$q$ , 'g'), 'escape'); decode -------- abc (1 row) merlin
Wow. Color me impressed and grateful. I've been working on a different project today, but I'll test these tonight. I'll never underestimate the regexp functionality in PostgreSQL again! Reuven
On 05/26/2011 05:36 AM, Merlin Moncure wrote: > ... > got it: > select decode(regexp_replace('141142143', '([0-9][0-9][0-9])', > $q$\\\1$q$ , 'g'), 'escape'); > decode > -------- > abc > (1 row) > > merlin > Nice. A word of warning, in 9.0 this returns a hex string: select decode(regexp_replace('141142143', '([0-9][0-9][0-9])', $q$\\\1$q$ , 'g'), 'escape'); decode ---------- \x616263 See http://www.postgresql.org/docs/9.0/static/release-9-0.html: E.5.2.3. Data Types bytea output now appears in hex format by default (Peter Eisentraut) The server parameter bytea_output can be used to select the traditional output format if needed for compatibility. Another wrinkle, the function I wrote sort of ignored the bytea issue by using text. But text is subject to character-encoding (for both good and bad) while bytea is not so the ultimate solution will depend on whether the input string is the octal representation of an un-encoded sequence of bytes or represents a string of ASCII/UTF-8/whatever... encoded text. Cheers, Steve
Hi, everyone. First of all, thanks for all of your help several days ago. The improvements to our program were rather dramatic (in a positive sense). Based on the help that everyone gave, I'm working on something similar, trying to use regexp_replace to transform a string into the result of invoking a function on each character. For example, I'd like to do the following: regexp_replace('abc', '(.)', ascii(E'\\1')::text, 'g'); Unfortunately, the above invokes ascii() on the literal string E'\\1', rather than on the value of the backreference, which isn't nearly as useful. I'd like to get '979899' back as a string. And of course, once I can get back the value of ascii(), I figure that it should work for any function that I define. Thanks again for any suggestions everyone might have. (And if this should go to pgsql-general, then I'll understand. If it helps, my alternative to regexp_replace is a super-slow function, akin to the one that I showed here last week.) Reuven
On Wed, Jun 1, 2011 at 4:19 AM, Reuven M. Lerner <reuven@lerner.co.il> wrote: > Hi, everyone. > > First of all, thanks for all of your help several days ago. The > improvements to our program were rather dramatic (in a positive sense). > > Based on the help that everyone gave, I'm working on something similar, > trying to use regexp_replace to transform a string into the result of > invoking a function on each character. For example, I'd like to do the > following: > > regexp_replace('abc', '(.)', ascii(E'\\1')::text, 'g'); > > Unfortunately, the above invokes ascii() on the literal string E'\\1', > rather than on the value of the backreference, which isn't nearly as useful. > I'd like to get '979899' back as a string. And of course, once I can get > back the value of ascii(), I figure that it should work for any function > that I define. > > Thanks again for any suggestions everyone might have. > > (And if this should go to pgsql-general, then I'll understand. If it helps, > my alternative to regexp_replace is a super-slow function, akin to the one > that I showed here last week.) select string_agg(v, '') from (select ascii(regexp_split_to_table('abc', $$\s*$$))::text as v) q; (what about 3 digit ascii codes?) merlin
Hi, Merlin. You wrote: > select string_agg(v, '') from (select > ascii(regexp_split_to_table('abc', $$\s*$$))::text as v) q; Wow. I've been programming with pl/pgsql for a good number of years, but only now do I see the amazing usefulness of regexp_split_to_table and string_agg, neither of which I really used until now. Thanks for both the solution and for opening my eyes. > (what about 3 digit ascii codes?) I have to put the number into a text field anyway, so I've been converting the resulting number to text, and then using lpad to add leading zeroes as necessary. Thanks again, Reuven -- Reuven M. Lerner -- Web development, consulting, and training Mobile: +972-54-496-8405 * US phone: 847-230-9795 Skype/AIM: reuvenlerner