Thread: MAP syntax for arrays
Hello hackers, Recently I was working with sql arrays in postgres and it turned out that postgres doesn't have such very convinient functional constructions as map, reduce and filter. Currently to map function over array user has to make a subquery like: select u.* from my_table, lateral ( select array_agg(lower(elem)) from unnest(arr) as elem ) as u; Which is not only inconvenient but not very efficient as well (see 'Demo' section below). When I dug into the code I found that postgres already has the needed infrastructure for implementing map for arrays; actually array coercing already works that way (it basically maps cast function). In the attached patch there is a simple map implementation which introduces new expression type and syntax: MAP(<func_name> OVER <array_expression>) For example: SELECT MAP(upper OVER array['one', 'two', 'three']::text[]); ?column? ----------------- {ONE,TWO,THREE} (1 row) This is probably not the most useful notation and it would be better to have syntax for mapping arbitrary expressions over array, not just function. I'm struggling to come up with a good idea of how it should look like. It could look something like following: MAP(<expr> FOR <placeholder> IN <array_expressin>) For instance: SELECT MAP(x*2 FOR x IN array[1, 2, 3]::int[]); Looking forward for community's suggestions! Demo ---- Here is a small comparison between map and unnest/aggregate ways for per-element processing of arrays. Given a table with 1K rows which contains single column of text[] type. Each array contains 5/10/100 elements. create table my_table (arr text[]); insert into my_table select array_agg(md5(random()::text)) from generate_series(1, 1000) as rows, generate_series(1, 10) as elements group by rows; There are two scripts for pgbench. One for 'map' syntax: select map(upper over arr) from my_table; And one for unnest/aggregate: select u.* from my_table, lateral ( select array_agg(upper(elem)) from unnest(arr) as elem ) as u; Results are: elements per array | map (tps) | unnest/aggregate (tps) --------------------+------------+------------------------ 5 | 139.105359 | 74.434010 10 | 74.089743 | 43.622554 100 | 7.693000 | 5.325805 Apparently map is more efficient for small arrays. And as the size of array increases the difference decreases. I'll be glad to any input from the community. Thanks! -- Ildar Musin i.musin@postgrespro.ru
Attachment
On Fri, May 4, 2018 at 6:38 PM, Ildar Musin <i.musin@postgrespro.ru> wrote: > Hello hackers, > > Recently I was working with sql arrays in postgres and it turned out > that postgres doesn't have such very convinient functional constructions > as map, reduce and filter. Currently to map function over array user has > to make a subquery like: > > select u.* from > my_table, > lateral ( > select array_agg(lower(elem)) > from unnest(arr) as elem > ) as u; > > Which is not only inconvenient but not very efficient as well (see > 'Demo' section below). Is there a way we can improve unnest() and array_agg() to match the performance you have specified by let's say optimizing the cases specially when those two are used together. Identifying that may be some work, but will not require introducing new syntax. > > When I dug into the code I found that postgres already has the needed > infrastructure for implementing map for arrays; actually array coercing > already works that way (it basically maps cast function). > > In the attached patch there is a simple map implementation which > introduces new expression type and syntax: > > MAP(<func_name> OVER <array_expression>) > > For example: > > SELECT MAP(upper OVER array['one', 'two', 'three']::text[]); > ?column? > ----------------- > {ONE,TWO,THREE} > (1 row) > > This is probably not the most useful notation and it would be better to > have syntax for mapping arbitrary expressions over array, not just > function. I'm struggling to come up with a good idea of how it should > look like. It could look something like following: > > MAP(<expr> FOR <placeholder> IN <array_expressin>) > > For instance: > > SELECT MAP(x*2 FOR x IN array[1, 2, 3]::int[]); > > Looking forward for community's suggestions! What if the expression has more than one variable, each mapping to a different array? What if the arrays have different lengths or worse different dimensions? This looks like the way SRFs used to work. Instead of introducing a new syntax, is it possible to detect that argument to a function is an array of the same type as the argument and apply MAP automatically? In your example, upper(arr) would detect that the input is an array of the same type as the scalar argument type and do array_agg(upper(arr[id1], arr[id2], ...). > > elements per array | map (tps) | unnest/aggregate (tps) > --------------------+------------+------------------------ > 5 | 139.105359 | 74.434010 > 10 | 74.089743 | 43.622554 > 100 | 7.693000 | 5.325805 > > Apparently map is more efficient for small arrays. And as the size of > array increases the difference decreases. I am afraid that the way difference is diminishing with increase in the number of elements, unnest, array_agg combination might win for large number of elements. Have you tried that? If we try to improve unnest, array_agg combination for small array, we will get consistent performance without any additional syntax. Although, I admit that query involving unnest and array_agg is not very readable. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes: > Is there a way we can improve unnest() and array_agg() to match the > performance you have specified by let's say optimizing the cases > specially when those two are used together. Identifying that may be > some work, but will not require introducing new syntax. +1. The first thing I thought on seeing this proposal was "I wonder how long it will be before the SQL committee introduces some syntax that uses the MAP keyword and breaks this". ISTM the planner could be taught to notice the combination of unnest and array_agg and produce a special output plan from that. It is, however, fair to wonder whether this is worth our time to optimize. I've not noticed a lot of people complaining about performance of this sort of thing, at least not since we fixed array_agg to not be O(N^2). regards, tom lane
Hello Tom, Ashutosh, On 07.05.2018 18:16, Tom Lane wrote: > Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes: >> Is there a way we can improve unnest() and array_agg() to match >> the performance you have specified by let's say optimizing the >> cases specially when those two are used together. Identifying that >> may be some work, but will not require introducing new syntax. > > +1. The first thing I thought on seeing this proposal was "I wonder > how long it will be before the SQL committee introduces some syntax > that uses the MAP keyword and breaks this". > > ISTM the planner could be taught to notice the combination of unnest > and array_agg and produce a special output plan from that. > > It is, however, fair to wonder whether this is worth our time to > optimize. I've not noticed a lot of people complaining about > performance of this sort of thing, at least not since we fixed > array_agg to not be O(N^2). The main point of this patch was about convenience; the performance thing came out later just as a side effect :) Many users are familiar with "map/reduce/filter" concept that many languages (not only functional ones) utilized. And my though was that it would be great to have those in postgres too. Apparently there is also a case when unnest/array_agg may not produce the result we're looking for. I mean multidimensional arrays. E.g. select array_agg(x * 2) from unnest(array[[1,2],[3,4],[5,6]]) as x; array_agg ----------------- {2,4,6,8,10,12} (1 row) select map(x * 2 for x in array[[1,2],[3,4],[5,6]]); ?column? ----------------------- {{2,4},{6,8},{10,12}} (1 row) array_agg produces plain arrays no matter what the input was. There is a new version of the patch in the attachment which introduces arbitrary per-element expressions (instead of single function call). So now user can specify a placeholder representing one element of the array and use it in the expressions. Like following: select map (pow(x, 2) - 1 for x in array[1,2,3]); ?column? --------------- {1,3,7,15,31} (1 row) -- Ildar Musin i.musin@postgrespro.ru
Attachment
On 08.05.2018 15:49, Ildar Musin wrote: > select map (pow(x, 2) - 1 for x in array[1,2,3]); Sorry, the example should be: select map (pow(2, x) - 1 for x in array[1,2,3,4,5]); ?column? --------------- {1,3,7,15,31} (1 row) -- Ildar Musin i.musin@postgrespro.ru
On 05/08/2018 08:57 AM, Ildar Musin wrote: > > select map (pow(2, x) - 1 for x in array[1,2,3,4,5]); I wonder how efficient an implementation would be possible strictly as a function, without grammar changes? While PostgreSQL certainly has extensions to and divergences from standard SQL syntax, some historical and some recent, it seems like there ought to be a highish bar for adding new ones; or, looking at it another way, has this feature been proposed to the SQL committee? It doesn't sound like a bad idea, and supporting new syntax for it would be an easier call it if it were known to be in an SQL draft somewhere. -Chap
On 05/08/2018 09:19 AM, Chapman Flack wrote: > > While PostgreSQL certainly has extensions to and divergences from > standard SQL syntax, some historical and some recent, it seems like > there ought to be a highish bar for adding new ones; or, looking at it > another way, has this feature been proposed to the SQL committee? > It doesn't sound like a bad idea, and supporting new syntax for it > would be an easier call it if it were known to be in an SQL draft > somewhere. There seems to have been some work at Databricks adding higher-order function syntax to their SQL; they've chosen the name 'transform' for 'map', and also provided 'exists', 'filter', and 'aggregate'. https://databricks.com/blog/2017/05/24/working-with-nested-data-using-higher-order-functions-in-sql-on-databricks.html -Chap
On 5/8/18 09:19, Chapman Flack wrote: > On 05/08/2018 08:57 AM, Ildar Musin wrote: >> >> select map (pow(2, x) - 1 for x in array[1,2,3,4,5]); > > I wonder how efficient an implementation would be possible strictly > as a function, without grammar changes? Yeah, you can pass a function to another function (using regprocedure or just oid), so this should be possible entirely in user space. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut wrote: > On 5/8/18 09:19, Chapman Flack wrote: > > On 05/08/2018 08:57 AM, Ildar Musin wrote: > >> > >> select map (pow(2, x) - 1 for x in array[1,2,3,4,5]); > > > > I wonder how efficient an implementation would be possible strictly > > as a function, without grammar changes? > > Yeah, you can pass a function to another function (using regprocedure or > just oid), so this should be possible entirely in user space. How would you invoke it? It seems you'd be forced to use EXECUTE in a plpgsql function, or a C function. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 05/08/2018 02:49 PM, Ildar Musin wrote: > The main point of this patch was about convenience; the performance > thing came out later just as a side effect :) Many users are familiar > with "map/reduce/filter" concept that many languages (not only > functional ones) utilized. And my though was that it would be great to > have those in postgres too. Map is a feature I have quite often wished to have, but I am not sure it is quite useful enough to be worth adding our own proprietary syntax. It would be a pain if the SQL committee started using MAP for something. Andreas
>>>>> "Andreas" == Andreas Karlsson <andreas@proxel.se> writes: Andreas> It would be a pain if the SQL committee started using MAP for Andreas> something. They already did - MAP is a non-reserved keyword in sql2016, used at least with <user-defined ordering definition>. Can't see any obvious conflict with use in expressions, but I haven't checked all the references. -- Andrew (irc:RhodiumToad)
On 5/8/18 10:18, Alvaro Herrera wrote: > Peter Eisentraut wrote: >> On 5/8/18 09:19, Chapman Flack wrote: >>> On 05/08/2018 08:57 AM, Ildar Musin wrote: >>>> >>>> select map (pow(2, x) - 1 for x in array[1,2,3,4,5]); >>> >>> I wonder how efficient an implementation would be possible strictly >>> as a function, without grammar changes? >> >> Yeah, you can pass a function to another function (using regprocedure or >> just oid), so this should be possible entirely in user space. > > How would you invoke it? It seems you'd be forced to use EXECUTE in a > plpgsql function, or a C function. Yes, I was thinking about a C function. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: > On 5/8/18 10:18, Alvaro Herrera wrote: >> How would you invoke it? It seems you'd be forced to use EXECUTE in a >> plpgsql function, or a C function. > Yes, I was thinking about a C function. The thing actually implementing MAP would presumably be in C, so this doesn't seem like a problem technically. But having to create a function seems like a big usability stumbling block, probably a big enough one to make the "select array_agg(expression) from unnest(something)" approach more attractive. I do see the usability benefit of a dedicated MAP syntax --- I'm just afraid of getting out in front of the SQL committee on such things. I doubt that it's enough nicer than the sub-select way to justify risking future standards-compliance issues. Realistically, we're talking about this: select a, b, (select array_agg(x*2) from unnest(arraycol) x) from ... versus something on the order of this: select a, b, map(x*2 over x from arraycol) from ... Yeah, it's a bit shorter, but not that much ... and there's a lot more you can do with the sub-select syntax, eg add a WHERE filter. regards, tom lane
On 08.05.2018 17:15, Peter Eisentraut wrote: > On 5/8/18 09:19, Chapman Flack wrote: >> On 05/08/2018 08:57 AM, Ildar Musin wrote: >>> >>> select map (pow(2, x) - 1 for x in array[1,2,3,4,5]); >> >> I wonder how efficient an implementation would be possible >> strictly as a function, without grammar changes? > > Yeah, you can pass a function to another function (using regprocedure > or just oid), so this should be possible entirely in user space. > The problem with this approach is that extension should either have single map() function with input and output type of anyarray which cannot be used when user needs to map int[] to text[] for example. Or the other way there should be a set of map functions for different intput/output types. Another thing is that this approach is not as versatile since user need to create a function before running map, while with the proposed patch they could run arbitrary expression over the array directly. -- Ildar Musin i.musin@postgrespro.ru
Andrew Gierth wrote: > >>>>> "Andreas" == Andreas Karlsson <andreas@proxel.se> writes: > > Andreas> It would be a pain if the SQL committee started using MAP for > Andreas> something. > > They already did - MAP is a non-reserved keyword in sql2016, used at > least with <user-defined ordering definition>. Can't see any obvious > conflict with use in expressions, but I haven't checked all the > references. Ah, so in SQL2011 (and 2016) there already is a designator for routines, which was a thing we were lacking previously, as I recall. <specific routine designator> ::= SPECIFIC <routine type> <specific name> | <routine type> <member name> [ FOR <schema-resolved user-defined type name> ] <routine type> ::= ROUTINE | FUNCTION | PROCEDURE | [ INSTANCE | STATIC | CONSTRUCTOR ] METHOD <member name> ::= <member name alternatives> [ <data type list> ] <member name alternatives> ::= <schema qualified routine name> | <method name> <data type list> ::= <left paren> [ <data type> [ { <comma> <data type> }... ] ] <right paren> [elsewhere] <specific name> ::= <schema qualified name> -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 08/05/18 18:11, Ildar Musin wrote: > On 08.05.2018 17:15, Peter Eisentraut wrote: >> On 5/8/18 09:19, Chapman Flack wrote: >>> On 05/08/2018 08:57 AM, Ildar Musin wrote: >>>> >>>> select map (pow(2, x) - 1 for x in array[1,2,3,4,5]); >>> >>> I wonder how efficient an implementation would be possible >>> strictly as a function, without grammar changes? >> >> Yeah, you can pass a function to another function (using regprocedure >> or just oid), so this should be possible entirely in user space. > > The problem with this approach is that extension should either have > single map() function with input and output type of anyarray which > cannot be used when user needs to map int[] to text[] for example. Or > the other way there should be a set of map functions for different > intput/output types. > > Another thing is that this approach is not as versatile since user need > to create a function before running map, while with the proposed patch > they could run arbitrary expression over the array directly. The consensus on this seems to be that we don't want this. Yeah, it's a handy syntax, but it's not that much better than using a subselect or a writing a user-defined function. And there's risk of future conflicts with the SQL standard. I'll mark this as "Rejected" in the commitfest. - Heikki